Reliability Engineering
Reliability Engineering provides GDS teams with a shared platform of tools to set up and maintain a service. We do this by:
- acquiring tools such as Logit and, where appropriate, administering them
- running off-the-shelf services, such as Prometheus and Concourse, as internal SaaS
- providing patterns and guidance, such as the PaaS incident process
To understand the context for our decisions and guidance, refer to:
- The GDS Technology & Operations Principles
- The Reliability Engineering Strategy
- The GDS Technology & Operations Shared Responsibility Model
- The Service Manual
- The GDS Way
The documentation on this site is intended to help the rest of GDS find out what Reliability Engineering is and what we’re doing. If you’re a member of Reliability Engineering, or just curious about our team processes and ongoing work, take a look at our Team Manual.