Technology & Operations Shared Responsibility Model
TechOps (including Reliability Engineering, Cyber Security and User Support) uses a shared responsibility model to provide a supported platform to GDS product teams. Broadly, TechOps builds, runs and maintains the infrastructure and supporting services, while the product teams maintain the applications that they wish to run on the platform.
Summary of responsibilities
|Activity|Product team|Reliability Engineering|Cyber Security|
|---|---|---|---|
|Provide account (e.g. AWS, PagerDuty, Alertmanager)| |✓| |
|Provide account (e.g. Splunk)| | |✓|
|Create build/deploy pipeline|✓|✓*| |
|Respond to incidents|✓|✓**|✓**|
|Respond to other alerts|✓|✓**| |
|Set up infrastructure| |✓| |
|Set up & structure logging|✓|✓*|✓*|
|Provide tooling for logging| |✓|✓|
|Monitor the reliability of key user journeys (SLIs)|✓|✓*| |
|Set shared reliability goals for each SLI (SLOs)|✓|✓| |
|Agree on policy for breaking SLOs|✓|✓| |
* = in a supporting role or through documentation
** = if called upon, or if the main team can’t deal with the situation
If you (a product team) experience problems with the GDS Supported Platform, contact us using the #techops Slack channel.
Contact Cyber Security using the #cyber-security-help Slack channel.
TechOps responsibilities
Provisioning Amazon Web Services (AWS) accounts and infrastructure
When a team has decided to use the GDS Supported Platform for hosting their application or service, Reliability Engineering (RE) will create an AWS account for the team (unless one exists for their wider programme, for example, GOV.UK or GOV.UK Verify) and provision the underlying infrastructure for teams to deploy their app.
Updates and upgrades
When infrastructure components need to be updated or upgraded, RE will do so for each team’s infrastructure. This may happen often and will rarely affect a service or its users, so RE will only notify the service team beforehand when there will be a discernible impact.
Setting up a build/deploy pipeline
Reliability Engineering will help teams to set up a build and deployment pipeline, otherwise known as a continuous integration and deployment (CI/CD) pipeline, as part of helping teams migrate to either the PaaS or the GDS Supported Platform (GSP).
This includes providing the tooling and documentation for product teams, as well as consulting and working with teams to get an initial pipeline set up.
RE will always be available on a consultancy basis, should product teams need help.
RE is also responsible for providing tooling to monitor the pipeline, for example to monitor the progress (and success/failure) of jobs that are running.
Logging, monitoring and alerting
TechOps will provide logging, monitoring and alerting for the GDS Supported Platform, and will work with teams to set it up in the way that is most useful to them.
RE will ship logs to CloudWatch for further distribution (for example, to Splunk for protective monitoring).
Cyber Security will then work with teams to establish use cases for teams’ protective monitoring using Splunk. Splunk will alert service teams when those use cases are triggered. Playbooks will include helpful information as to how the service team should deal with a given situation.
If the service team is unable to respond to the situation themselves or if it is categorised as a security incident, they should contact the Cyber Security team either on Slack or through the out-of-hours procedure, if necessary.
Cyber Security will also provide access to Splunk. Service teams will then be able to use Splunk to maintain their current protective monitoring use cases and develop new ones over time.
If service teams identify infrastructure issues while dealing with service issues or security events and incidents, TechOps (RE or Cyber Security, depending on where the issue occurs) will work with the service team to resolve them.
Because of the way applications are deployed and logs are produced, TechOps will not be responsible for structuring logs or for the consistent identification of log streams (see the Logging section in Product team responsibilities below).
TechOps will provide:
- health and reliability monitoring using Prometheus
- alerting using Prometheus Alertmanager and PagerDuty.
TechOps will work with teams to identify the most valuable things to monitor and alert on (including the protective monitoring offering) and then work together to implement the correct logging, monitoring and alerting to facilitate this.
TechOps will set up health and reliability monitoring and alerting to ensure that we provide reliable and secure infrastructure, in line with our service level objectives (SLOs).
Ensuring the integrity of deployments
TechOps will provide tooling and enforcement of measures that ensure that the right code gets deployed to production in the right way.
Specifically, TechOps will enforce the established best practice of “two-eyes” (having at least two people look at the code before it is merged to the master branch of a repository), which will mean that no deployments can be made without being reviewed and signed by at least two authorised people.
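The two-eyes rule can be illustrated with a minimal sketch. This is not the tooling TechOps provides; the function name `has_two_eyes_approval` and its parameters are hypothetical, and the sketch assumes that approvals are available as a list of user names and that a separate list of authorised people exists.

```python
def has_two_eyes_approval(approvers, authorised, required=2):
    """Return True if at least `required` distinct authorised people approved.

    `approvers` and `authorised` are hypothetical inputs: iterables of user
    names. How approvals are actually recorded is not specified here.
    """
    # Count each person once, and ignore anyone who is not authorised.
    valid_approvers = set(approvers) & set(authorised)
    return len(valid_approvers) >= required


if __name__ == "__main__":
    authorised_people = {"alice", "bob", "carol"}
    print(has_two_eyes_approval(["alice", "bob"], authorised_people))
    print(has_two_eyes_approval(["alice", "alice"], authorised_people))
```

Using a set means a repeated approval from the same person counts only once, which is the property the rule depends on.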
Responding to security events and incidents and acting on vulnerabilities
Reliability Engineering (RE) will monitor CVEs that relate to the technologies that underpin the infrastructure that we provide. RE will then work to mitigate any identified vulnerabilities that affect the platform as quickly as possible.
Cyber Security will provide tooling and advice to teams so that they can identify and act upon security events and incidents. This includes:
- Cloud Security Watch, a tool to detect misconfigurations in AWS
- Splunk, a security information and event management (SIEM) tool to enable protective monitoring
- working with teams to conduct threat modelling
- working with teams to identify use cases to mitigate risks that need to be protectively monitored
- supporting teams to ingest data sources to Splunk
- helping to set up protective monitoring and alerting
Access control
TechOps will control access to the underlying infrastructure using AWS IAM roles. TechOps is therefore responsible for allowing and removing this access in a timely manner.
All changes to user access and to the platform go through a two-eyes process to ensure a robust and secure approach to user access control.
Product team responsibilities
Updates and upgrades
Product teams will be responsible for any updates/upgrades to the application and any upstream dependencies (such as libraries and packages).
TechOps will, however, make this as easy as possible: changes such as these reach production once they are made in the code base and merged to the master branch of the repository.
Setting up and maintaining a build/deploy pipeline
Product teams will need to work with TechOps to set up the initial pipeline, including learning (through documentation that TechOps provides and through working together) how to do so themselves.
Once the initial pipeline is established, the product team will be responsible for making sure that tests and specific procedures (for example, linting, formatting, promoting) are maintained as the application(s) develop(s).
Product teams are responsible for monitoring their pipelines for successful or failing jobs and acting upon that information.
Logging
Product teams are responsible for ensuring that they configure their applications and Docker images to ship logs.
TechOps will develop guidance on logging in the near future. Refer to the Splunk guidelines for now.
Some brief examples can be seen on the Splunk website.
Similarly, product teams must name their containers sensibly and logically for consistent identification of log streams. Again, we will provide more detailed guidance soon.
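Until detailed guidance is available, the sketch below shows one possible way to produce consistently structured logs from a Python application. It assumes that logs are written to standard output as one JSON object per line and collected from there, which is a common container convention but is not confirmed by this document. The field names and the logger name `example-service.web` are illustrative only.

```python
import json
import logging
import sys
from datetime import datetime, timezone


class JsonFormatter(logging.Formatter):
    """Format each log record as a single JSON object on one line."""

    def format(self, record):
        entry = {
            # Field names are illustrative, not a mandated schema.
            "timestamp": datetime.fromtimestamp(
                record.created, tz=timezone.utc
            ).isoformat(),
            "level": record.levelname,
            "logger": record.name,
            "message": record.getMessage(),
        }
        return json.dumps(entry)


def build_logger(name):
    """Return a logger that writes JSON lines to standard output."""
    logger = logging.getLogger(name)
    logger.setLevel(logging.INFO)
    if not logger.handlers:  # avoid adding duplicate handlers on repeat calls
        handler = logging.StreamHandler(sys.stdout)
        handler.setFormatter(JsonFormatter())
        logger.addHandler(handler)
    logger.propagate = False  # keep the root logger from printing it again
    return logger


# A stable, descriptive logger name helps identify the log stream later.
logger = build_logger("example-service.web")
logger.info("request completed")
```

Keeping every entry on one line with the same fields makes the stream easier to parse downstream, whichever tool ends up consuming it.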
Monitoring and responding to alerts
It is the product team’s responsibility to monitor their product/service (using tooling provided by TechOps) and to respond to both monitoring and alerts as required. If the product team is unable to respond to the situation themselves or if it is categorised as a security incident, they should contact the Cyber Security team either on Slack or through the out-of-hours procedure, if necessary.
Product teams are responsible for the development, maintenance and engineering of their use cases in Splunk. As their products develop, they should ensure their protective monitoring stays effective and relevant. Cyber Security will support teams where required.
The product team is also expected to allocate time to defining service level indicators (SLIs) and objectives (SLOs), so that we can collectively set up the correct monitoring and alerting.
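As an illustration of how an SLI relates to an SLO, the sketch below computes a simple availability SLI and the share of the error budget that remains. The function names, the choice of "good events divided by total events" as the SLI, and the example figures are assumptions made for illustration; each team agrees its own indicators and targets with TechOps.

```python
def availability_sli(good_events, total_events):
    """Fraction of events that were good; 1.0 when there was no traffic."""
    if total_events == 0:
        return 1.0
    return good_events / total_events


def error_budget_remaining(sli, slo_target):
    """Fraction of the error budget left (negative once the SLO is breached)."""
    allowed_failure = 1.0 - slo_target
    if allowed_failure <= 0:
        raise ValueError("slo_target must be below 1.0")
    actual_failure = 1.0 - sli
    return 1.0 - actual_failure / allowed_failure


if __name__ == "__main__":
    # Hypothetical figures: 9,990 successful requests out of 10,000,
    # measured against a 99.5% availability objective.
    sli = availability_sli(9990, 10000)
    print(f"SLI: {sli:.4f}")
    print(f"Error budget remaining: {error_budget_remaining(sli, 0.995):.0%}")
```

With these example figures the service has used about a fifth of its error budget. A negative result would mean the SLO has been breached, which is the point at which the agreed policy for breaking SLOs would apply.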
Access control
Product teams must ensure that they control access to GitHub. As any code that is merged to the master branch of a repository will automatically be deployed, it is vital that only trusted individuals are able to merge pull requests.
Third party responsibilities
Third party providers are responsible for making sure their systems are updated and available.
Amazon maintains the AWS infrastructure and is responsible for updating it. Reliability Engineering and product teams do not need to update or upgrade AWS.