This documentation is intended for internal use by the GDS community.

Reliability Engineering

Reliability Engineering provides GDS teams with a shared platform of tools for setting up and maintaining a service. We do this by:

  • acquiring tools and, where appropriate, administering them, for example Logit
  • running off-the-shelf services, such as Prometheus, as internal software as a service (SaaS)
  • providing patterns and guidance, such as the PaaS incident process

To understand the context for our decisions and guidance, refer to the Service Manual and the GDS Way.

The Reliability Engineering documentation on this site is intended to help the rest of GDS understand what Reliability Engineering is and what we're working on. If you're a member of Reliability Engineering, or are curious about our team processes and ongoing work, take a look at our Team Manual.