This documentation is intended for internal use by the GDS community.

Logit incident management

To find out about technical incidents affecting Logit you must:

Reliability Engineering uses these channels to tell GDS Logit users:

  • about incidents such as Kibana or Elasticsearch being unavailable
  • when an incident is resolved and normal service is resumed

Logit will keep in contact with Reliability Engineering during an incident, and ensure their Logit status page is up to date.

How you should report an incident

If you discover that Logit or Kibana have failed, the process you must follow depends on whether you are working during office hours or out of hours.
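
As a rough illustration of that split, the sketch below picks the reporting process from the time of day. The function names are hypothetical, and it makes two assumptions this page does not state: that 9.30am itself counts as office hours and 5.30pm itself counts as out of hours, and that weekends and bank holidays can be ignored.

```python
from datetime import datetime, time

# Office hours as given on this page: 9.30am to 5.30pm.
OFFICE_START = time(9, 30)
OFFICE_END = time(17, 30)


def is_office_hours(now: datetime) -> bool:
    """Return True if `now` is between 9.30am (inclusive) and 5.30pm (exclusive).

    Assumption: weekends and bank holidays are not handled here,
    because this page does not say how they are treated.
    """
    return OFFICE_START <= now.time() < OFFICE_END


def reporting_process(now: datetime) -> str:
    """Name the reporting process that applies at `now` (hypothetical helper)."""
    return "office-hours" if is_office_hours(now) else "out-of-hours"
```

Calling `reporting_process(datetime.now())` returns either `"office-hours"` or `"out-of-hours"`, which corresponds to the two procedures described in the following sections.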

During office hours (9.30am to 5.30pm)

  1. Check the Logit status page to see if Logit know about the incident.
  2. Join the #reliability-eng Slack channel and describe the problem.
  3. Get confirmation that a Reliability Engineering team member has taken over the incident.
  4. Wait for updates from Reliability Engineering announcements or the #reliability-eng Slack channel.

The Reliability Engineering Tools team will work directly with Logit, coordinating communication on behalf of GDS.

Outside office hours (from 5.30pm until 9.30am)

Logit provide an automated support process for out-of-hours incidents. For example, in the event of a platform-wide outage, an on-call Logit engineer is automatically paged. To contact Logit outside office hours, you need the Logit support number and PIN.

How to report a Logit incident out-of-hours:

  1. Use the Logit status page to check if Logit know about the incident. If Logit have updated their status page, check it every half hour for updates until the incident is resolved.
  2. If Logit are unaware of the issue, and it’s urgent, you can telephone Logit to wake an on-call engineer.
  3. Call the Logit support number and enter the PIN to leave your message.
  4. Describe the incident, leaving your name and contact number. Logit will call you back within 30 minutes to acknowledge the incident.

Logit provide updates:

  • directly to you if the incident is stack-related
  • on the Logit status page if the incident is platform-related
  5. Join the #reliability-eng Slack channel and describe the issue so Reliability Engineering can contact you when office hours resume.
  6. When office hours resume, a Reliability Engineering team member will confirm they have taken over the incident.
  7. Wait for updates from Reliability Engineering announcements or the #reliability-eng Slack channel.

The Reliability Engineering Tools team will work directly with Logit, coordinating communication on behalf of GDS.

How to find out about incident reviews

If Logit have a major incident, for example a full Logit outage, Reliability Engineering will announce an internal incident review through the Reliability Engineering announcements Google Group. The announcement will explain how to register your interest in attending.

After the incident review meeting, Reliability Engineering will:

  1. Email the incident report, including a confirmed set of recommendations, to Reliability Engineering announcements.
  2. Follow up on actions arising from the incident review.
  3. Follow up with Logit on any actions on their side, including claiming refunds for the month in question under our Enterprise SLA, which specifies 99.9% uptime.
  4. Email Logit’s own incident report to Reliability Engineering announcements. Logit will send out their incident report within 14 days of the incident.
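
To put the 99.9% uptime figure in context, the sketch below works out how much downtime that allows in a month. It assumes the SLA is measured over a calendar month and that all downtime counts equally. Neither is confirmed on this page, so check the actual SLA terms before relying on the numbers. The function names are hypothetical.

```python
def allowed_downtime_minutes(days_in_month: int, sla_percent: float = 99.9) -> float:
    """Minutes of downtime permitted in a month at the given uptime percentage.

    Assumption: the SLA is measured per calendar month.
    """
    total_minutes = days_in_month * 24 * 60
    return total_minutes * (100 - sla_percent) / 100


def sla_breached(downtime_minutes: float, days_in_month: int,
                 sla_percent: float = 99.9) -> bool:
    """Return True if the recorded downtime exceeds the monthly allowance."""
    return downtime_minutes > allowed_downtime_minutes(days_in_month, sla_percent)
```

For a 30-day month, 99.9% uptime leaves roughly 43.2 minutes of permitted downtime, so a 60-minute full outage would exceed it under these assumptions.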