
This documentation is intended for internal use by the GDS community.

Reliability Engineering

Reliability Engineering provides GDS teams with a shared platform of tools to set up and maintain a service by:

  • acquiring tools such as Logit and, where appropriate, administering them for GDS
  • running off-the-shelf services such as Prometheus as internal SaaS for GDS
  • providing patterns and guidance such as the PaaS incident templates

To understand the context for our decisions and guidance you’ll need to refer to the Service Manual and the GDS Way.

Platforms

Reliability Engineering develops, maintains and supports the Amazon Web Services (AWS) and GOV.UK PaaS infrastructure that GDS uses. GDS teams are responsible for the applications and components running on these platforms.

For example, Reliability Engineering is responsible for managing the SSL certificates that handle communications between AWS environments and virtual machines. GDS teams are responsible for managing certificates that protect messages used by their service.

Monitoring

Reliability Engineering monitors Amazon Web Services (AWS) and the GOV.UK PaaS to ensure their availability. GDS teams must monitor their own applications and respond to alerts.

Capacity

Reliability Engineering helps GDS teams manage their capacity until they have the capability to manage their own resources. GDS teams are responsible for performance testing their applications and fixing related problems at code level.

Reliability Engineering holds quarterly meetings, where GDS teams can:

  • discuss scaling their environment
  • address issues in service performance

For example, if there’s an incident that could cause a performance spike, GDS teams should notify Reliability Engineering as soon as possible. This allows Reliability Engineering to make any related changes to a team’s environment.

Tools

Reliability Engineering provides tools to help GDS teams manage their environment. GDS teams can choose other tools if they develop, maintain and support them.

These tools have been procured for use by GDS teams. We are updating our recommendations on how to use them.

Logging

Reliability Engineering uses Logit to provision, manage and ensure availability of our logging infrastructure and provide ELK (Elasticsearch, Logstash, and Kibana) stacks.

Reliability Engineering helps GDS teams:

  • integrate Logit into their environments
  • create their logging and usage policies

These guides will help you:

Metrics and Alerting

Reliability Engineering is running a beta service for GDS using Prometheus for operational metrics.

Reliability Engineering provides client libraries which wrap Prometheus’s own libraries so we can:

  • provide an easy metrics choice for GDS teams
  • supply consistent metrics and naming across different runtimes
  • solve problems such as how to get metrics from all worker processes, not just one
  • guard the /metrics API behind HTTP basic auth for GOV.UK PaaS apps
  • ease configuration by using framework-specific mechanisms such as Railties or Dropwizard bundles
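
As an illustration of what guarding the /metrics API behind HTTP basic auth can involve, here is a minimal sketch in Python using only the standard library. It is not the GDS metrics client library, which targets Ruby and Java Dropwizard. The names `MetricsGuard`, `check_basic_auth` and `render_metrics` are hypothetical, and the credentials are assumed to be supplied by the application's own configuration.

```python
import base64
import hmac


def check_basic_auth(header_value, expected_user, expected_password):
    """Return True if an Authorization header carries the expected Basic credentials."""
    if not header_value or not header_value.startswith("Basic "):
        return False
    try:
        raw = base64.b64decode(header_value[len("Basic "):], validate=True)
        decoded = raw.decode("utf-8")
    except (ValueError, UnicodeDecodeError):
        # Malformed base64 or non-UTF-8 credentials are treated as a failed login.
        return False
    user, sep, password = decoded.partition(":")
    if not sep:
        return False
    # Constant-time comparison avoids leaking how much of a secret matched.
    user_ok = hmac.compare_digest(user.encode(), expected_user.encode())
    password_ok = hmac.compare_digest(password.encode(), expected_password.encode())
    return user_ok and password_ok


class MetricsGuard:
    """Hypothetical WSGI middleware: serve /metrics only to authenticated callers."""

    def __init__(self, app, render_metrics, user, password):
        self.app = app
        # Assumption: render_metrics is a callable returning metrics as text.
        self.render_metrics = render_metrics
        self.user = user
        self.password = password

    def __call__(self, environ, start_response):
        # Every path other than /metrics goes straight to the wrapped app.
        if environ.get("PATH_INFO") != "/metrics":
            return self.app(environ, start_response)
        auth = environ.get("HTTP_AUTHORIZATION")
        if not check_basic_auth(auth, self.user, self.password):
            start_response(
                "401 Unauthorized",
                [("WWW-Authenticate", 'Basic realm="metrics"')],
            )
            return [b"unauthorised\n"]
        body = self.render_metrics().encode("utf-8")
        start_response("200 OK", [("Content-Type", "text/plain; version=0.0.4")])
        return [body]
```

Wrapping an existing WSGI application would look like `app = MetricsGuard(app, render_metrics, user, password)`. Requests to any other path are unaffected, and unauthenticated requests to /metrics receive a 401 response.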

You can set up GDS metrics for your GOV.UK PaaS app using the Ruby and Java Dropwizard guides on GitHub:

Once you’ve set up your GOV.UK PaaS app with GDS metrics you can:

When using GDS metrics you can create:

Please contact us on the #re-prometheus-support Slack channel to find out more.

GDS Metrics is currently in beta. These instructions are subject to change.

Infrastructure as a Service

Several teams in GDS use Amazon Web Services (AWS) as their infrastructure provider.

GDS teams manage their own AWS accounts, but users must sign in to a shared base AWS account managed by Reliability Engineering. You can find out:

Service Levels

Reliability Engineering supports each GDS team’s existing service levels until standardised support is agreed across GDS.