Reliability Engineering provides GDS teams with a shared platform of tools to set up and maintain a service. It does this by:
- acquiring tools, such as Logit, and administering them for GDS where appropriate
- running off-the-shelf services, such as Prometheus, for GDS as internal SaaS
- providing patterns and guidance, such as the PaaS incident templates
Reliability Engineering develops, maintains and supports the Amazon Web Services (AWS) and GOV.UK PaaS infrastructure that GDS uses. GDS teams are responsible for the applications and components running on these platforms.
For example, Reliability Engineering is responsible for managing the SSL certificates that handle communications between AWS environments and virtual machines. GDS teams are responsible for managing certificates that protect messages used by their service.
Reliability Engineering helps GDS teams manage their capacity until they have the capability to manage their own resources. GDS teams are responsible for performance testing their applications and fixing related problems at code level.
Reliability Engineering holds quarterly meetings, where GDS teams can:
- discuss scaling their environment
- address issues in service performance
For example, if there’s an incident that could cause a performance spike, GDS teams should notify Reliability Engineering as soon as possible. This allows Reliability Engineering to make any related changes to a team’s environment.
Reliability Engineering provides tools to help GDS teams manage their environment. GDS teams can choose other tools if they develop, maintain and support them.
These tools have been procured for use by GDS teams. We are updating our recommendations on how to use them.
- Amazon EC2 Reserved Instances
- Tech Docs Template
Reliability Engineering helps GDS teams:
- integrate Logit into their environments
- create their logging and usage policies
These guides will help you:
- sign in to Logit
- remove users from Logit
- send logs securely to Logit
- send logs from PaaS to Logit
- respond to an incident with Logit
Metrics and Alerting
Reliability Engineering is running a beta service for GDS using Prometheus for operational metrics.
Reliability Engineering provides client libraries which wrap Prometheus’s own libraries so we can:
- provide an easy metrics choice for GDS teams
- supply consistent metrics and naming across different runtimes
- solve problems such as how to get metrics from all worker processes, not just one
- guard the /metrics API behind HTTP basic auth for GOV.UK PaaS apps
- ease configuration by using framework-specific features such as Railties or Dropwizard bundles
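To illustrate what guarding a /metrics endpoint behind HTTP basic auth involves, here is a minimal Python sketch. It is not the GDS client libraries' implementation (those target Ruby and Java Dropwizard). The function names `is_authorised` and `metrics_response`, the realm string and the `render_metrics` callable are all hypothetical and chosen only for this example.

```python
import base64
import hmac


def is_authorised(authorization_header, expected_user, expected_password):
    """Return True if the header carries matching HTTP basic auth credentials.

    Hypothetical helper for illustration only.
    """
    if not authorization_header:
        return False
    scheme, _, encoded = authorization_header.partition(" ")
    if scheme.lower() != "basic":
        return False
    try:
        decoded = base64.b64decode(encoded.strip(), validate=True).decode("utf-8")
    except ValueError:
        # Covers malformed base64 and undecodable bytes.
        return False
    user, separator, password = decoded.partition(":")
    if not separator:
        return False
    # Constant-time comparison avoids leaking how many characters matched.
    user_ok = hmac.compare_digest(user.encode(), expected_user.encode())
    password_ok = hmac.compare_digest(password.encode(), expected_password.encode())
    return user_ok and password_ok


def metrics_response(authorization_header, expected_user, expected_password,
                     render_metrics):
    """Return (status, headers, body) for a request to /metrics.

    render_metrics is an assumed callable that returns the metrics text.
    """
    if not is_authorised(authorization_header, expected_user, expected_password):
        return 401, {"WWW-Authenticate": 'Basic realm="metrics"'}, ""
    return 200, {"Content-Type": "text/plain; version=0.0.4"}, render_metrics()
```

In a real app the credentials would come from configuration rather than being passed around in code, and the check would normally sit in middleware so every route to /metrics is covered.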
You can set up GDS metrics for your GOV.UK PaaS app using the Ruby and Java Dropwizard guides on GitHub.
Once you’ve set up your GOV.UK PaaS app with GDS metrics, you can:
When using GDS metrics you can create:
Please contact us on the #re-prometheus-support Slack channel to find out more.
GDS Metrics is currently in beta. These instructions are subject to change.
Infrastructure as a Service
Several teams in GDS use Amazon Web Services (AWS) as their infrastructure provider.
GDS teams manage their own AWS accounts, but users must sign in to a shared base AWS account managed by Reliability Engineering. You can find out:
Reliability Engineering supports each GDS team’s existing service levels until standardised support is agreed across GDS.