How and Why to Conduct Incident Postmortems

darius_koohmare · ‎10-21-2022

When major incidents and service degradations occur, your first goal is to restore service levels to an acceptable norm. Modern organizations respond quickly using Site Reliability Engineering principles and modules like Site Reliability Operations on ServiceNow. Once the fire has been put out, the focus and conversations should shift to preventing the issue from happening again. The post mortem, or post incident review, seeks to identify the root cause by asking ‘why’ the hole in the hull appeared in the first place, and to do so in a blameless fashion. An easy way to take a blameless perspective is never to find fault with or punish a person for an error, but rather to look for the underlying process or technology that needs changing.

As Norm Kerth states in Project Retrospectives, “Regardless of what we discover, we understand and truly believe that everyone did the best job they could, given what was known at the time, their skills and abilities, the resources available, and the situation at hand.”

The post mortem process after an incident is meant to help you identify what happened, why it happened, and how it can be mitigated or prevented. Stakeholders, customers, and leadership commonly ask for the contents of post mortems. Post mortems help you make the system more resilient by minimizing or preventing recurrence of the issue, whether that means adding new processes or changing technology.

Writing Post Mortems

A good first step when implementing a post mortem process is to identify what information your organization wants to capture and who the audience is (internal vs. external). While the recommended sections are the same regardless of audience, you can be more informal, and more explicit about internal architecture or data, in documents meant for internal-only audiences. After evaluating numerous post mortems from major tech companies, we’ve come to the following recommendations for content:

Executive summary of what happened and for how long: Begin with a short summary of what went wrong, when the issue was identified, and when it was remediated. This doesn’t need to be overly technical.
The customer / service impact: Clearly state what the impact was in terms of geographies or services. It is common to list a count of customers, products, or environments affected.
The timeline from detection to remediation: Add key events such as when the issue was detected and when it was resolved (the inputs to metrics like mean time to detect and mean time to resolve), along with any incremental change in impact or partial service restoration. Presenting this as a list of timestamps from oldest to newest is common.
Detailed summary of what went wrong, in the context of the company's environment: This is your opportunity to get technical about what went wrong and why. For example, if you ran into a networking backbone issue in a data center, it could be beneficial to give a high-level overview of your data center architecture, why the backbone was in place, and how it failed. You can also provide greater detail about the steps involved in resolution. While common for internal documents, this section may be optional to send externally because it can contain proprietary information.
Any completed or outstanding action items to prevent it from happening again: With a clear description of what went wrong, it’s good to end with a description of the action items that were completed, or are planned, to prevent the issue from recurring.
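The sections above can be captured in a simple structured record. The following Python sketch is one possible way to do it, not a ServiceNow API or a prescribed format: the `PostMortem` and `TimelineEvent` names, every field, and the sample incident data are hypothetical. It also assumes that time to resolve is measured from the start of the incident, which your organization may define differently.

```python
from dataclasses import dataclass, field
from datetime import datetime, timedelta


@dataclass
class TimelineEvent:
    """One timestamped entry in the incident timeline (hypothetical type)."""
    timestamp: datetime
    description: str


@dataclass
class PostMortem:
    """Hypothetical record mirroring the recommended sections."""
    summary: str            # executive summary
    impact: str             # customer / service impact
    started_at: datetime    # when the issue began
    detected_at: datetime   # when the issue was identified
    resolved_at: datetime   # when the issue was remediated
    details: str = ""       # detailed technical summary
    action_items: list = field(default_factory=list)
    events: list = field(default_factory=list)

    def add_event(self, timestamp: datetime, description: str) -> None:
        # Events can be recorded in any order while the incident is ongoing.
        self.events.append(TimelineEvent(timestamp, description))

    def timeline(self) -> list:
        # Present the timeline from oldest to newest.
        return sorted(self.events, key=lambda e: e.timestamp)

    def time_to_detect(self) -> timedelta:
        return self.detected_at - self.started_at

    def time_to_resolve(self) -> timedelta:
        # Assumption: measured from incident start, not from detection.
        return self.resolved_at - self.started_at


# Hypothetical incident data, for illustration only.
pm = PostMortem(
    summary="Checkout API returned errors for some requests.",
    impact="Illustrative only: one region, one service.",
    started_at=datetime(2022, 10, 1, 9, 0),
    detected_at=datetime(2022, 10, 1, 9, 12),
    resolved_at=datetime(2022, 10, 1, 10, 30),
    action_items=["Add alerting on error rate"],
)
pm.add_event(datetime(2022, 10, 1, 10, 30), "Service fully restored")
pm.add_event(datetime(2022, 10, 1, 9, 12), "Alert fired; incident opened")
pm.add_event(datetime(2022, 10, 1, 9, 45), "Partial service restoration")

for event in pm.timeline():
    print(event.timestamp.isoformat(), event.description)
print("Time to detect:", pm.time_to_detect())
print("Time to resolve:", pm.time_to_resolve())
```

These are per-incident durations. Averaging them across many incidents is what yields mean time to detect and mean time to resolve.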

If you want a sample template with similar content, Google offers a public post mortem template.

With a clear idea of what your post mortem should contain, you’ll find it useful to add updates and document key events within your incidents as you work through them. Later, you'll capture each identified action item in your development tool as a formal action item that requires follow-up. Identifying the post mortem owner or scribe during the incident can help ensure this work gets done. Even while the main goal is to restore the service, you can incrementally add the updates that will feed into the post mortem later. Only after the incident is resolved can the full effort be focused on completing the post mortem.

"The incident cannot be closed without a post mortem taking place" - Jon Noble, SRE, Sage Group.

Once the post mortem is completed, the incident can be treated as closed.
