This documentation is intended for internal use by the GDS community.

About us

Platforms

Reliability Engineering develops, maintains and supports the Amazon Web Services (AWS) and GOV.UK PaaS infrastructure that GDS uses. GDS teams are responsible for the applications and components running on these platforms.

For example, Reliability Engineering is responsible for managing the SSL certificates that secure communications between AWS environments and virtual machines. GDS teams are responsible for managing the certificates that protect messages used by their own services.

Monitoring

Reliability Engineering monitors AWS and the GOV.UK PaaS to ensure their availability. GDS teams must monitor their own applications and respond to alerts.

Capacity

Reliability Engineering helps GDS teams manage their capacity until they have the capability to manage their own resources. GDS teams are responsible for performance testing their applications and fixing related problems at code level.

Reliability Engineering holds quarterly meetings, where GDS teams can:

  • discuss scaling their environment
  • address issues in service performance

For example, if there’s an incident that could cause a performance spike, GDS teams should notify Reliability Engineering as soon as possible. This allows Reliability Engineering to make any related changes to a team’s environment.

Tools

Reliability Engineering provides tools to help GDS teams manage their environment. GDS teams can choose other tools if they develop, maintain and support them.

These tools have been procured for use by GDS teams. We are updating our recommendations on how to use them.

Service levels

Reliability Engineering supports each GDS team’s existing service levels until standardised support is agreed across GDS.