
Incident management procedure

An incident is a single unplanned event that causes a service disruption.

Clone this repo and document your incident management procedure here:



Content

Incident management is the process of resolving an incident. An incident is resolved when the affected service resumes functioning in its intended state. Incident management therefore covers only the tasks required to mitigate impact and restore functionality.

Incident management depends on a monitoring and observability strategy, because an incident has to be detected before anyone can respond to it.

See Atlassian for a practical incident management guide.

Tips and hints

  • Allow for autonomous decision-making by people and teams involved

  • Use on-call scheduling to position developers and sysadmins as SREs

  • Use an easy-to-remember URL that redirects to the internal Service Desk portal

  • Ensure team members can communicate across the organization with chat tools (see ChatOps)

  • Formalize your optimization strategy
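
As an illustration of the on-call scheduling tip, here is a minimal sketch of a weekly round-robin rotation. It is not part of this template: the function `on_call_for`, its parameters, and the one-week shift length are all assumptions made for the example. A real setup would more likely rely on a dedicated paging tool.

```python
from datetime import date


def on_call_for(day: date, engineers: list[str], rotation_start: date) -> str:
    """Return who is on call on `day` (hypothetical weekly round-robin).

    Assumes shifts last exactly seven days and hand over on the same
    weekday as `rotation_start`.
    """
    if not engineers:
        raise ValueError("the rotation needs at least one engineer")
    if day < rotation_start:
        raise ValueError("day is before the rotation started")
    # Whole weeks elapsed since the rotation began decide whose turn it is.
    weeks_elapsed = (day - rotation_start).days // 7
    return engineers[weeks_elapsed % len(engineers)]
```

For example, with a placeholder rotation `["ana", "ben", "chen"]` starting on a Monday, each person is on call for one week in turn and the cycle repeats every three weeks.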

Stages

The incident management (IM) process has five stages:

  1. Detect
  2. Respond
  3. Recover & clean-up
  4. Learn & postmortem
  5. Improve
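
The five stages above can be modelled as a simple linear state machine. The sketch below is one possible way to track an incident through them, not a prescribed implementation: the names `Stage`, `Incident`, and `advance` are hypothetical, and it assumes stages are passed strictly in order without skipping.

```python
from dataclasses import dataclass, field
from enum import IntEnum


class Stage(IntEnum):
    """The five incident management stages, in order."""
    DETECT = 1
    RESPOND = 2
    RECOVER = 3   # recover & clean-up
    LEARN = 4     # learn & postmortem
    IMPROVE = 5


@dataclass
class Incident:
    """Hypothetical incident record that tracks its current stage."""
    title: str
    stage: Stage = Stage.DETECT
    history: list = field(default_factory=lambda: [Stage.DETECT])

    def advance(self) -> Stage:
        """Move to the next stage; assumes stages are never skipped."""
        if self.stage is Stage.IMPROVE:
            raise ValueError("incident is already in the final stage")
        self.stage = Stage(self.stage + 1)
        self.history.append(self.stage)
        return self.stage
```

Calling `advance()` on a new `Incident` moves it from `DETECT` to `RESPOND`, and so on until `IMPROVE`. The `history` list keeps the stages passed so far, which can feed the postmortem timeline.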

Roles

Typical roles are:

  • Incident manager
  • Tech lead
  • Communications manager

Proximate and root causes

To prevent repetition of the incident, the root cause has to be found. It’s necessary to distinguish between the proximate and root causes.

  • Proximate causes are reasons that directly led to this incident.
  • Root causes sit at the point in the chain of events where making a change would prevent this entire class of incident.

Once the incident is resolved (the recover stage), a postmortem seeks to discover the root causes and decide how best to mitigate them. This often results in stories on the backlog (the improve stage).

Root causes can fall into different categories:

  • Bug
  • Change
  • Scale
  • Architecture
  • Dependency
  • Unknown
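
To make these categories usable in postmortems, each report can record the proximate cause, the root cause, and one category tag. The following is a minimal sketch under assumptions: `RootCauseCategory`, `Postmortem`, and `count_by_category` are hypothetical names, and the field layout is only one way to structure a postmortem record.

```python
from collections import Counter
from dataclasses import dataclass
from enum import Enum


class RootCauseCategory(Enum):
    """The root cause categories listed above."""
    BUG = "bug"
    CHANGE = "change"
    SCALE = "scale"
    ARCHITECTURE = "architecture"
    DEPENDENCY = "dependency"
    UNKNOWN = "unknown"


@dataclass
class Postmortem:
    """Hypothetical postmortem record separating proximate and root cause."""
    incident: str
    proximate_cause: str   # what directly led to this incident
    root_cause: str        # where a change prevents the whole class
    category: RootCauseCategory = RootCauseCategory.UNKNOWN


def count_by_category(postmortems):
    """Count postmortems per category to reveal recurring incident classes."""
    return Counter(pm.category for pm in postmortems)
```

Counting postmortems per category over time can show which class of root cause recurs most often, which helps prioritize the resulting backlog stories.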

SRE

ITIL