An incident is a single unplanned event that causes a service disruption.
Clone this repo and document your incident management procedure here:
Content
Incident management is the process of resolving an incident. An incident is resolved when the affected service resumes functioning in its intended state. The process covers only those tasks required to mitigate impact and restore functionality.
Incident management depends on a monitoring and observability strategy, because an incident cannot be handled until it has been detected.
See Atlassian for a practical incident management guide.
- Allow for autonomous decision-making by people and teams involved
- Utilize on-call scheduling to position developers and sysadmins as SREs
- Use an easy-to-remember URL that redirects to the internal Service Desk portal
- Ensure members of your teams can communicate across the organization with chat tools (see chatops)
- Formalize your optimization strategy
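The on-call scheduling practice above can be illustrated with a simple weekly rotation. This is a minimal sketch, assuming a fixed list of engineers who hand over every seven days; the names, the start date, and the `on_call_for` function are hypothetical and do not belong to any particular scheduling tool.

```python
from datetime import date
from typing import Sequence


def on_call_for(day: date, rotation: Sequence[str], start: date) -> str:
    """Return who is on call on `day`, rotating weekly from `start`."""
    if not rotation:
        raise ValueError("rotation must not be empty")
    if day < start:
        raise ValueError("day is before the rotation start")
    # Whole weeks since the rotation began decide whose turn it is.
    weeks_elapsed = (day - start).days // 7
    return rotation[weeks_elapsed % len(rotation)]


ROTATION = ["alice", "bob", "carol"]  # hypothetical engineers
START = date(2024, 1, 1)              # hypothetical start (a Monday)

print(on_call_for(date(2024, 1, 10), ROTATION, START))  # prints: bob
```

In practice a dedicated on-call tool would also handle overrides, time zones, and escalation, which this sketch deliberately leaves out.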
The incident management (IM) process has five stages:
- Detect
- Respond
- Recover & clean-up
- Learn & postmortem
- Improve
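One way to track an incident through the stages above is a small state machine. The sketch below assumes the stages are strictly sequential; the `Stage` enum, the `Incident` class, and the example incident title are hypothetical names introduced only for illustration.

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import List


class Stage(Enum):
    DETECT = "Detect"
    RESPOND = "Respond"
    RECOVER = "Recover & clean-up"
    LEARN = "Learn & postmortem"
    IMPROVE = "Improve"


STAGE_ORDER = list(Stage)  # Enum iteration preserves definition order


@dataclass
class Incident:
    title: str
    stage: Stage = Stage.DETECT
    history: List[Stage] = field(default_factory=lambda: [Stage.DETECT])

    def advance(self) -> Stage:
        """Move to the next stage; raise if already at the last one."""
        index = STAGE_ORDER.index(self.stage)
        if index + 1 >= len(STAGE_ORDER):
            raise ValueError("incident is already in the final stage")
        self.stage = STAGE_ORDER[index + 1]
        self.history.append(self.stage)
        return self.stage


incident = Incident("checkout API returns 500s")  # hypothetical incident
incident.advance()  # Detect -> Respond
incident.advance()  # Respond -> Recover & clean-up
print(incident.stage.value)  # prints: Recover & clean-up
```

Keeping the `history` list makes it easy to reconstruct the timeline later, during the Learn & postmortem stage.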
Typical roles are:
- Incident manager
- Tech lead
- Communications manager
To prevent repetition of the incident, the root cause has to be found. It’s necessary to distinguish between the proximate and root causes.
- Proximate causes are reasons that directly led to this incident.
- Root causes are reasons at the optimal place in the chain of events where making a change will prevent this entire class of incident.
After the incident is resolved (the Recover stage), a postmortem seeks to discover root causes and decide how best to mitigate them. This often results in stories on the backlog (the Improve stage).
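A postmortem record can keep proximate and root causes apart and turn each root cause into a backlog story. This is only a sketch under assumed names: the `Postmortem` class, its fields, the story wording, and the example causes are all hypothetical.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Postmortem:
    incident: str
    # Reasons that directly led to this incident.
    proximate_causes: List[str] = field(default_factory=list)
    # Reasons where a change would prevent this whole class of incident.
    root_causes: List[str] = field(default_factory=list)

    def backlog_stories(self) -> List[str]:
        """Create one backlog story title per root cause."""
        return [f"Mitigate root cause: {cause}" for cause in self.root_causes]


postmortem = Postmortem(
    incident="database outage",  # hypothetical example
    proximate_causes=["disk filled up on the database host"],
    root_causes=["no log rotation configured for the database"],
)

for story in postmortem.backlog_stories():
    print(story)  # prints: Mitigate root cause: no log rotation configured for the database
```

Stories are derived only from root causes, since fixing a proximate cause alone (freeing disk space, in this example) would not prevent the same class of incident from recurring.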
Root causes can fall into different categories:
- Bug
- Change
- Scale
- Architecture
- Dependency
- Unknown
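The categories above can be modeled as an enumeration so that postmortem findings are tagged consistently and counted over time. This is a minimal sketch; the `RootCauseCategory` name and the sample findings are hypothetical.

```python
from collections import Counter
from enum import Enum


class RootCauseCategory(Enum):
    BUG = "Bug"
    CHANGE = "Change"
    SCALE = "Scale"
    ARCHITECTURE = "Architecture"
    DEPENDENCY = "Dependency"
    UNKNOWN = "Unknown"


# Hypothetical categories assigned in four past postmortems.
findings = [
    RootCauseCategory.CHANGE,
    RootCauseCategory.BUG,
    RootCauseCategory.CHANGE,
    RootCauseCategory.DEPENDENCY,
]

counts = Counter(findings)
most_common_category, occurrences = counts.most_common(1)[0]
print(most_common_category.value, occurrences)  # prints: Change 2
```

A tally like this can show where improvement work is likely to pay off, for example tightening the change process if most root causes fall under Change.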