About us
Platforms
Reliability Engineering develops, maintains and supports the Amazon Web Services (AWS) and GOV.UK PaaS infrastructure that GDS uses. GDS teams are responsible for the applications and components running on these platforms.
For example, Reliability Engineering is responsible for managing the SSL certificates that handle communications between AWS environments and virtual machines. GDS teams are responsible for managing certificates that protect messages used by their service.
Monitoring
Reliability Engineering monitors AWS and the GOV.UK PaaS to ensure their availability. GDS teams must monitor their own applications and respond to alerts.
Capacity
Reliability Engineering helps GDS teams manage their capacity until they have the capability to manage their own resources. GDS teams are responsible for performance testing their applications and fixing related problems at code level.
Reliability Engineering holds quarterly meetings, where GDS teams can:
- discuss scaling their environment
- address issues in service performance
For example, if there’s an incident that could cause a performance spike, GDS teams should notify Reliability Engineering as soon as possible. This allows Reliability Engineering to make any related changes to a team’s environment.
Tools
Reliability Engineering provides tools to help GDS teams manage their environment. GDS teams can choose other tools if they develop, maintain and support them.
These tools have been procured for use by GDS teams. We are updating our recommendations on how to use them.
- Confluence for collaboration and shared workspaces
- Jira to plan, track, and release software
- Sentry for open-source error tracking
- PagerDuty for operations management
- Statuspage for incident communication
- Pingdom for website performance monitoring
- Zendesk for customer service and engagement
- Amazon EC2 Reserved Instances for virtual computing
- Tech Docs Template to build technical documentation using a GOV.UK style
Service levels
Reliability Engineering supports each GDS team’s existing service levels until standardised support is agreed across GDS.