This documentation is intended for internal use by the GDS community.

Metrics and Alerting

The Reliability Engineering team is running a beta service for GDS using Prometheus for operational metrics and alerting.

Contact us on the #re-prometheus-support Slack channel to find out more.

Expose app level metrics using client libraries

Reliability Engineering provides client libraries which wrap Prometheus’s own libraries to:

  • provide an easy metrics choice for GDS teams
  • supply consistent metrics and naming across different runtimes
  • solve common problems, such as collecting metrics from all worker processes rather than just one
  • guard the /metrics endpoint behind HTTP basic auth for GOV.UK PaaS apps
  • ease configuration by using framework-specific features such as Railties or Dropwizard bundles

You can set up GDS metrics for your GOV.UK PaaS app by following the Ruby, Python and Java (Dropwizard) instrumentation guides on GitHub.
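
To illustrate what these wrappers build on, here is a minimal sketch of an app exposing a /metrics endpoint with the plain Prometheus Python client. The app and metric names are hypothetical; the GDS libraries add the basic auth and worker-process handling described above for you.

    # Minimal Flask app exposing Prometheus metrics directly.
    # The GDS client libraries wrap this pattern and add basic auth
    # and worker-process aggregation.
    from flask import Flask
    from prometheus_client import Counter, generate_latest, CONTENT_TYPE_LATEST

    app = Flask(__name__)
    HELLO_REQUESTS = Counter('myapp_hello_requests_total',
                             'Total requests to the hello endpoint')

    @app.route('/')
    def hello():
        HELLO_REQUESTS.inc()
        return 'Hello'

    @app.route('/metrics')
    def metrics():
        return generate_latest(), 200, {'Content-Type': CONTENT_TYPE_LATEST}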

Expose container level metrics using paas-prometheus-exporter

Cloud Foundry provides time-series data (metrics) for your PaaS apps.

Currently supported metrics are:

  • CPU
  • RAM
  • disk usage data
  • app crashes
  • app requests
  • app response times

Set up the metrics exporter app

Before you set up the metrics exporter app, you’ll need a live Cloud Foundry account assigned to the spaces you want to receive metrics for.

Your new account should be separate from your primary Cloud Foundry account and use the SpaceAuditor role, because that role can view app data without modifying it.

To set up the metrics exporter app:

  1. Clone the paas-prometheus-exporter GitHub repository.
  2. Push the metrics exporter app to Cloud Foundry (without starting the app) by running:
    • cf push -f manifest.yml --no-start <app-name>
  3. Set the following mandatory environment variables in the metrics exporter app by running:

    • cf set-env <app-name> NAME VALUE

    Use the cf set-env command for these mandatory variables because they contain secret information that should not be stored in your manifest file.

    Name          Value
    API_ENDPOINT  https://api.cloud.service.gov.uk
    USERNAME      Your Cloud Foundry username
    PASSWORD      Your Cloud Foundry password

    You can set optional environment variables that do not contain secret information by amending the manifest file. See the paas-prometheus-exporter GitHub repository for more information.

  4. Run cf start <app-name> to start your app.

  5. Check you’re generating Prometheus metrics at the metrics endpoint:

    • https://<app-name>.cloudapps.digital/metrics
  6. Bind your app to the Prometheus service

    • cf bind-service <app-name> <service-instance-name>
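
Putting the steps above together, a hypothetical end-to-end run might look like this (the app name my-team-metrics-exporter and the service instance name are placeholders):

    # Hypothetical end-to-end run; app and service instance names are placeholders.
    cf push -f manifest.yml --no-start my-team-metrics-exporter
    cf set-env my-team-metrics-exporter API_ENDPOINT https://api.cloud.service.gov.uk
    cf set-env my-team-metrics-exporter USERNAME <cloud-foundry-user>
    cf set-env my-team-metrics-exporter PASSWORD <cloud-foundry-password>
    cf start my-team-metrics-exporter
    curl https://my-team-metrics-exporter.cloudapps.digital/metrics
    cf bind-service my-team-metrics-exporter <service-instance-name>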

IP whitelist your app

IP whitelisting is a security feature often used to limit and control access to an app. It works by only allowing traffic from a list of trusted IP addresses or IP ranges.

By using the re-ip-whitelist-service you will only allow traffic from GDS Prometheus and GDS Office IPs.

  1. Register the IP whitelist route service as a user-provided service in your PaaS space.

    cf create-user-provided-service re-ip-whitelist-service -r https://re-ip-whitelist-service.cloudapps.digital

  2. Register the route service for routes you want to protect.

    cf bind-route-service cloudapps.digital re-ip-whitelist-service --hostname <your paas app route>

    For example, for app-to-protect.cloudapps.digital you would run:

    cf bind-route-service cloudapps.digital re-ip-whitelist-service --hostname app-to-protect

Troubleshooting

If you’re not receiving metrics, check the logs for the metrics exporter app or contact us on the #re-prometheus-support Slack channel.

Further reading

The Service Manual has more information about monitoring the status of your service.

Bind your exporter to Prometheus

Prometheus uses service discovery to decide what it monitors, so for apps running on GOV.UK PaaS you’ll need to:

  1. Grant Prometheus read-only access to your PaaS spaces.
  2. Bind your apps to the Prometheus service.

Grant Prometheus read-only access to your PaaS spaces

By giving the prometheus-for-paas user the SpaceAuditor role, you allow it to monitor each instance of your app and respond to events such as starts, stops and scaling.

cf set-space-role prometheus-for-paas@digital.cabinet-office.gov.uk <org-name> <space-name> SpaceAuditor

Bind your apps to the Prometheus service

You can find Prometheus in the PaaS marketplace.

❯ cf marketplace
service          plans        description
gds-prometheus   prometheus   GDS internal Prometheus monitoring alpha https://reliability-engineering.cloudapps.digital/#metrics

If you’re unable to see gds-prometheus in the output of cf marketplace, please contact us through the #re-prometheus-support Slack channel.

Create a Prometheus service within your PaaS space and allow it to bind to apps running there. Do this by following these steps:

  1. Create a Prometheus service instance in each space where you have Prometheus-instrumented apps deployed:
    • cf create-service gds-prometheus prometheus <service-instance-name>
  2. Either update your app’s manifest.yml to bind your new service (see the sketch after this list) or bind it using the CLI:
    • cf bind-service <app-name> <service-instance-name>
  3. If your app uses a custom domain, make sure the Authorization header and any other headers your app needs are forwarded to the app; otherwise basic auth on the metrics endpoint will fail and your app won’t receive those headers.
    • cf update-service <cdn-route-service> -c '{"domain": "<custom-domain>", "headers": ["Accept", "Authorization", "<other app header values>"]}'
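
A minimal manifest.yml sketch for the binding option in step 2, assuming a hypothetical app named my-app and a service instance named gds-prometheus-instance:

    # Hypothetical manifest.yml excerpt; the app and service instance names are placeholders.
    applications:
    - name: my-app
      services:
        - gds-prometheus-instance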

Within 10 minutes, Prometheus will start scraping your application for metrics. You can validate this by checking Grafana.

When using zero-downtime plugins or a blue-green deployment process

IP whitelist your application’s metrics endpoint

If you’re using a blue-green deployment process with a zero-downtime plugin such as autopilot, you should disable basic auth on the metrics endpoint when using the Ruby gem or Python library, and instead protect the endpoint using IP whitelisting. This minimises gaps in metrics between deployments.

By using the re-ip-whitelist-service you will only allow traffic from GDS Prometheus and GDS Office IPs.

  1. Map the route to the metrics path:
  • Update your manifest.yml:
      routes:
      ...
      - route: app-to-protect.cloudapps.digital/metrics
  • Redeploy your app to map the route and path.
  2. Register the IP whitelist route service as a user-provided service in your PaaS space.

    cf create-user-provided-service re-ip-whitelist-service -r https://re-ip-whitelist-service.cloudapps.digital

  3. Register the IP whitelist route service against the metrics path.

    cf bind-route-service cloudapps.digital re-ip-whitelist-service --hostname app-to-protect --path metrics

Update your Grafana panel to combine metrics for blue-green deployments

You should update your Grafana panels to combine metrics from different deployment states. For example, to show the number of healthy instances you can use a regex:

sum(up{job=~"app-to-protect(-venerable)?"})
  • Note: if you are not using autopilot, substitute -venerable with whatever suffix your zero-downtime plugin uses when renaming the app.

App route configuration

Whether or not you’re using a custom domain, the Prometheus service broker will only scrape your app’s first route.

For example, an app may have multiple routes, especially if it uses custom domains such as custom.domain.gov.uk and an-interesting-app.cloudapps.digital. Only the first route will be picked up.
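
As a hypothetical illustration using the example names above (assuming the app’s first route is the first one listed in its manifest.yml):

    # Hypothetical manifest.yml excerpt: Prometheus scrapes only the first route.
    applications:
    - name: an-interesting-app
      routes:
        - route: custom.domain.gov.uk                  # scraped
        - route: an-interesting-app.cloudapps.digital  # not scraped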

If there are no routes to your app, the Prometheus service will default the route to:

<app-name>.cloudapps.digital

Display, create and edit dashboards using Grafana

Reliability Engineering provides Grafana dashboards for teams to view Prometheus scraped metrics. You can use our example dashboards or create your own based on your team’s needs.

Sign in to Grafana at grafana-paas.cloudapps.digital using your GDS Google account.

Display dashboards

Team dashboards are organised into separate Grafana folders.

Select Home in the top left of the Grafana Home Dashboard to choose your team’s dashboard folder and view your dashboards.

Display using your team TV

You should use your TV’s Google Chromebit user to display your Grafana dashboard. The Google Chromebit user only has read access to Grafana.

You should not use your personal Google account to display your Grafana dashboard on your team’s TV. Your personal Google account may have editing or admin permissions.

Create and edit dashboards

Contact Reliability Engineering using the #re-prometheus-support Slack channel to request:

  • admin permissions to create or edit a dashboard in your team folder
  • a new team folder and the admin rights to manage it
  • copies of our example dashboards to customise

Use our example dashboards

You can use the default and template dashboards in the General folder of the Grafana Home Dashboard. For example, the GDS Application Metrics Default Dashboard and the GDS Container Metrics Default Dashboard.

You can use these dashboards to get started customising your own dashboards, or if you want to check your monitoring works as you expect.

GDS Application Metrics Default Dashboard

The GDS Application Metrics Default Dashboard displays application metrics produced by Ruby, Java with Dropwizard and Python clients.

If you’ve configured your application with one of these libraries, you can view its metrics by selecting your application from the Available Apps dropdown.

If the dropdown does not include your application, check the instructions for setting up metrics and make sure any changes are deployed to GOV.UK Platform as a Service (PaaS).

GDS Container Metrics Default Dashboard

The GDS Container Metrics Default Dashboard displays container metrics produced by the paas-prometheus-exporter. Select your application from the App dropdown to view metrics for your application when you run the exporter.

If the dropdown does not include your application, check the instructions for setting up the metrics exporter app with Prometheus and make sure any changes are deployed to GOV.UK Platform as a Service (PaaS).

Official and community built dashboards

You can import official and community-built Grafana dashboards. For example, you could display backing service metrics such as Elasticsearch or PostgreSQL.

Learn how to import dashboards.

Create and edit alerts using Prometheus

When deciding what alerts you want to receive, consider your alerting priorities, your metrics’ patterns and your Service Level Objectives.

How to create or update alerting rules

You should first read Prometheus’ alerting rules documentation to understand how alerting works.

You will also need to understand how to write an expression in PromQL for your alerting rules.

Finding your metrics

Prometheus contains metrics related to other teams which may not be relevant to you.

To see your team’s available metrics, run the following queries to return a list of metric names available for your PaaS organisation or metric exporter.

sum by(__name__) ({org="<org-name>"}) for example sum by(__name__) ({org="govuk-notify"})

sum by(__name__) ({job="<exporter-app-name>"}) for example sum by(__name__) ({job="notify-paas-postgres-exporter"})

It’s not currently possible to order these results alphabetically.

Writing your alerting rule PromQL expression

Use the Prometheus dashboard to experiment with writing your alert as a PromQL expression.

An example PromQL expression is:

rate(requests{org="gds-tech-ops", job="observe-metric-exporter", status_range="5xx"}[5m])

The above query means “the per-second rate of requests with a 5xx status, averaged over the last 5 minutes, for the org gds-tech-ops and the job observe-metric-exporter”.

To make it into an alert (that is, something that triggers if the data values are higher or lower than expected), the expression requires a threshold to be compared against:

rate(requests{org="gds-tech-ops", job="observe-metric-exporter", status_range="5xx"}[5m]) > 1

Your expression should contain an org label, which refers to your PaaS organisation. This ensures you only use the metrics from your team. Although the job label may serve the same purpose, it is not guaranteed to be unique to your team.

You should only include time series for the PaaS space you wish to alert on, for example only including production by using the space="production" label.
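
For example, the alert expression above restricted to the production space might look like this (labels as in the earlier example):

    rate(requests{org="gds-tech-ops", space="production", job="observe-metric-exporter", status_range="5xx"}[5m]) > 1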

Decide your alerting thresholds

Queries need thresholds added to them to make them into alerts. You can work out an alert’s threshold value from historical data. To do this, use your current monitoring system’s thresholds, averages and spikes for each alert.
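
If the metric is already in Prometheus, you can inspect its history on the Prometheus dashboard to find a starting point. For example, a hypothetical subquery showing the weekly peak of the rate used earlier:

    max_over_time(rate(requests{org="gds-tech-ops", job="observe-metric-exporter", status_range="5xx"}[5m])[7d:5m])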

For new alerts, experiment with different thresholds until you find one that fits your:

  • chosen type of alert
  • alerting priorities
  • metric’s patterns
  • Service Level Objective

Create the alerting rule

Alerting rules are defined in YAML format in a config file in the prometheus-aws-configuration-beta repository. Each product team should use their own file for their alerting rules.

Alerting rules should be prefixed with your team name, for example registers_RequestsExcess5xx or DGU_HighDiskUsage. This makes your alert easier to identify.

You must add a product label to your alerting rule under labels so that, if the alert is triggered, Prometheus alerts the correct team.

You may have to iterate your alerting rules to make them more useful for your team. For example, you may get alerts that do not require any action as the threshold is too low (false positives).

For more information about creating alerts, see the prometheus-aws-configuration-beta README for an explanation of each field’s meaning and an example alert you can customise.
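
As a minimal sketch only (the exact file layout is defined in the prometheus-aws-configuration-beta repository), a standard Prometheus alerting rule combining the conventions above might look like:

    # Hypothetical rule; the group name, alert name and product label are placeholders.
    groups:
      - name: observe-alerts
        rules:
          - alert: observe_RequestsExcess5xx   # prefixed with the team name
            expr: rate(requests{org="gds-tech-ops", job="observe-metric-exporter", status_range="5xx"}[5m]) > 1
            for: 5m                            # the condition must hold for 5 minutes before firing
            labels:
              product: observe                 # used to route the alert to the correct team
            annotations:
              summary: Excess 5xx responses from observe-metric-exporter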

Create a PR with your alerting rule

Create a pull request for changes to your alerting rules file. Your commit should explain what the alert is and why you picked the threshold value. This is so future team members have the context they need to confidently change the alerting rule and other teams can learn from your alerting rules.

Share your pull request in the #re-prometheus-support Slack channel so we can review it. We will try to merge and deploy your pull request as quickly as possible and will let you know when your alerting rule is live.

How to receive alerts

Once Prometheus triggers an alert, it sends the alert to Alertmanager. Alertmanager is then responsible for forwarding alerts to receivers such as PagerDuty or Zendesk.

Alerts are forwarded to the appropriate team and receiver using the Alertmanager config file, which uses the alert labels to route each alert.

If you have not yet set up a receiver or would like to set up additional receivers use the #re-prometheus-support Slack channel.

If you need additional PagerDuty licences contact the Reliability Engineering Autom8 team using the #re-autom8 Slack channel.

Set up custom metrics

Using the Prometheus client libraries (both the Java and Ruby versions), you can create your own custom metrics to measure things specific to your applications, in addition to the generic metrics offered by the libraries. The libraries include the Prometheus simpleclient, which offers four metric types.

Counter

A counter is a cumulative metric that represents a single numerical value that only ever goes up. A counter is typically used to count requests served, tasks completed, errors occurred, etc. Counters should not be used to expose current counts of items whose number can also go down, e.g. the number of currently running threads. Use gauges for this use case.
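
As a minimal sketch with the Prometheus Python client (the metric name is hypothetical; the Ruby and Java clients offer the same type):

    # Counter: a cumulative value that only ever goes up.
    from prometheus_client import Counter

    ERRORS = Counter('myapp_errors_total', 'Total errors raised while serving requests')
    ERRORS.inc()     # increment by 1
    ERRORS.inc(3)    # or by any positive amount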

Gauge

A gauge is a metric that represents a single numerical value that can arbitrarily go up and down.

Gauges are typically used for measured values like temperatures or current memory usage, but also “counts” that can go up and down, like the number of running threads.
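
A corresponding sketch with the Python client (hypothetical metric name):

    # Gauge: a value that can go up and down.
    from prometheus_client import Gauge

    THREADS = Gauge('myapp_running_threads', 'Number of currently running threads')
    THREADS.inc()    # a thread started
    THREADS.dec()    # a thread finished
    THREADS.set(12)  # or set it directly from a measurement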

Histogram

A histogram samples observations (usually things like request durations or response sizes) and counts them in configurable buckets. It also provides a sum of all observed values.

A histogram with a base metric name of <basename> exposes multiple time series during a scrape:

  • cumulative counters for the observation buckets, exposed as <basename>_bucket{le="<upper inclusive bound>"}
  • the total sum of all observed values, exposed as <basename>_sum
  • the count of events that have been observed, exposed as <basename>_count (identical to <basename>_bucket{le="+Inf"} above)
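
A sketch with the Python client (hypothetical metric name and bucket boundaries); each observation updates the _bucket, _sum and _count series listed above:

    # Histogram: counts observations into configurable buckets.
    from prometheus_client import Histogram

    LATENCY = Histogram('myapp_request_duration_seconds',
                        'Request duration in seconds',
                        buckets=[0.1, 0.25, 0.5, 1.0, 2.5, 5.0])
    LATENCY.observe(0.42)  # record one request that took 0.42 seconds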

Summary

Similar to a histogram, a summary samples observations (usually things like request durations and response sizes). While it also provides a total count of observations and a sum of all observed values, it calculates configurable quantiles over a sliding time window.

A summary with a base metric name of <basename> exposes multiple time series during a scrape:

  • streaming φ-quantiles (0 ≤ φ ≤ 1) of observed events, exposed as <basename>{quantile="<φ>"}
  • the total sum of all observed values, exposed as <basename>_sum
  • the count of events that have been observed, exposed as <basename>_count
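
A sketch with the Python client (hypothetical metric name). Note that the Python client’s Summary exposes only the _sum and _count series; quantile support varies between client libraries:

    # Summary: tracks the count and sum of observations.
    from prometheus_client import Summary

    RESPONSE_SIZE = Summary('myapp_response_size_bytes', 'Size of responses in bytes')
    RESPONSE_SIZE.observe(512)  # record one response of 512 bytes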

How to add custom metrics

More detailed explanations of the different metric types can be found in the Prometheus documentation.

Instructions on how to add new metrics using the Java implementation can be found in the library documentation. Ruby users can find similar documentation in the client_ruby documentation.