Postmortems

January 5, 2023

Category: SRE

Tags: Incident Management, postmortem, SRE

Introduction

Postmortems are expected after any significant undesirable event. Writing a postmortem is not punishment – it is a learning opportunity for the entire company.

A postmortem is a written record of an incident, its impact, the actions taken to mitigate or resolve it, the root cause(s), and the follow-up actions to prevent the incident from recurring.

The primary goals of writing a postmortem are to ensure that the incident is documented, that all contributing root cause(s) are well understood, and, especially, that effective preventive actions are put in place to reduce the likelihood and/or impact of recurrence.

Definitions

Postmortem: A blame-free analysis of the factors leading to a service disruption and the actions taken to mitigate it. It institutionalizes a culture of continuous improvement for mission-critical systems.

RCA: Root Cause Analysis, the process of discovering the root causes of a problem in order to systematically prevent and resolve underlying issues.

Action Item: A task, fix or project that is part of the remediation plan of an RCA.

Postmortem Process

The postmortem process can be summarized in five steps as follows:

Manage the Incident
Create the First Draft of the Postmortem
Conduct a Postmortem Review
Post the Documentation
Follow Up on Action Items

1. Manage the Incident

Whether a postmortem is required depends on the impact of the issue, which will typically arrive in the queue as a Jira “Incident”. If a customer or a service was impacted by the incident, a Jira “Problem” ticket must be created manually, followed by the full postmortem process and documentation. Any RCA action items (AIs) should initiate subsequent COPS change requests or a SAFE ticket for corrective actions.

Once the “Problem” is resolved and all work completed, the engineer is to close the “Problem” ticket.
In some cases, “Problems” may drive a project to be created in SAFE if the level of effort required exceeds 7 days.
SAFE tickets should be tagged within both the postmortem documentation and “Problem” ticket.
The postmortem should be linked to within the “Problem” ticket.
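
The decision rules in this step can be sketched in a few lines of Python. This is only an illustration of the logic described above, not an integration with Jira or SAFE: the `Incident` class, its field names, and the `required_follow_up` function are hypothetical, and the only value taken from the text is the 7-day effort threshold.

```python
from dataclasses import dataclass

# From the text: efforts exceeding 7 days may drive a SAFE project.
SAFE_PROJECT_EFFORT_DAYS = 7


@dataclass
class Incident:
    """Hypothetical summary of a Jira "Incident" (field names are assumptions)."""
    key: str
    customer_impacted: bool
    service_impacted: bool
    estimated_effort_days: float = 0.0


def required_follow_up(incident: Incident) -> dict:
    """Return which follow-up artifacts the process above calls for."""
    # A "Problem" ticket and a full postmortem are required when a
    # customer or a service was impacted.
    needs_problem = incident.customer_impacted or incident.service_impacted
    # Larger efforts may become a SAFE project; the final call stays with the team.
    safe_candidate = needs_problem and (
        incident.estimated_effort_days > SAFE_PROJECT_EFFORT_DAYS
    )
    return {
        "problem_ticket": needs_problem,
        "postmortem": needs_problem,
        "safe_project_candidate": safe_candidate,
    }


if __name__ == "__main__":
    example = Incident("INC-1", customer_impacted=True,
                       service_impacted=False, estimated_effort_days=10)
    print(required_follow_up(example))
```

A helper like this is most useful as a shared checklist during triage; it does not replace the engineer's judgment about impact.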

2. Create the First Draft of the Postmortem

The first draft of the postmortem should be written by a small group of engineers who can collaborate to answer the questions below. Review criteria might include:

Was key incident data collected for posterity?
Are the impact assessments complete?
Was the root cause sufficiently deep?
Is the action plan appropriate and are resulting bug fixes at appropriate priority?

In practice, teams share the first postmortem draft internally and solicit a group of senior engineers to assess the draft for completeness.

To standardize the documentation and data that CVS and Astra SREs provide for a postmortem at NetApp, please use the Postmortem Template.

Having trouble filling out the template or want to ensure you are providing complete and accurate data? See the Postmortem Template – How To for additional help.

3. Conduct a Postmortem Review

An unreviewed postmortem might as well never have existed. Once the initial review is complete, the postmortem is shared more broadly with the larger engineering team. The goal is to share postmortems with the widest possible audience that would benefit from the knowledge or lessons imparted. The questions asked largely repeat those in the “Create the First Draft of the Postmortem” step above, but the larger audience may offer additional insights, ideas and guidance. In these meetings, it is important to close out any ongoing discussions and comments, to capture ideas, and to finalize the postmortem.

Was key incident data collected for posterity?
Are the impact assessments complete?
Was the root cause sufficiently deep?
Is the action plan appropriate and are resulting bug fixes at appropriate priority?
Did we share the outcome with relevant stakeholders?

4. Post the Documentation

Once those involved are satisfied with the document and its action items, the postmortem is added to a team or organization repository of past incidents.

5. Follow Up on Action Items

The postmortem documentation has been created, reviewed, approved and posted. Are you done? Well, not quite yet. A postmortem with no Action Items (AIs) is ineffective, and so is one whose AIs are not followed up on and completed.

Plan the Work and Work the Plan

Guidelines

Blameless Postmortems

Blameless postmortems are a tenet of SRE culture. For a postmortem to be truly blameless, it must focus on identifying the contributing causes of the incident without indicting any individual or team for bad or inappropriate behavior. A blamelessly written postmortem assumes that everyone involved in an incident had good intentions and did the right thing with the information they had. If a culture of finger pointing and shaming individuals or teams for doing the “wrong” thing prevails, people will not bring issues to light for fear of punishment.

Avoid Blame and Keep It Constructive

Postmortem Triggers

User-visible downtime or degradation beyond a certain threshold
Data loss of any kind
On-call engineer intervention (release rollback, rerouting of traffic, etc.)
A resolution time above some threshold
A monitoring failure (which usually implies manual incident discovery)
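
As a rough sketch, the triggers above can be expressed as a simple check. The list leaves the thresholds open ("a certain threshold", "some threshold"), so the two constants below are placeholders rather than recommended values, and the `IncidentSummary` fields are hypothetical names chosen for this example.

```python
from dataclasses import dataclass
from typing import List

# Placeholder thresholds: the triggers above do not specify values.
DOWNTIME_THRESHOLD_MINUTES = 5.0
RESOLUTION_THRESHOLD_MINUTES = 60.0


@dataclass
class IncidentSummary:
    """Hypothetical incident facts needed to evaluate the triggers."""
    user_visible_minutes: float
    data_lost: bool
    on_call_intervened: bool
    resolution_minutes: float
    detected_by_monitoring: bool


def postmortem_triggers(summary: IncidentSummary) -> List[str]:
    """Return the triggers that fired; an empty list means none did."""
    reasons = []
    if summary.user_visible_minutes > DOWNTIME_THRESHOLD_MINUTES:
        reasons.append("user-visible downtime or degradation")
    if summary.data_lost:
        reasons.append("data loss")
    if summary.on_call_intervened:
        reasons.append("on-call engineer intervention")
    if summary.resolution_minutes > RESOLUTION_THRESHOLD_MINUTES:
        reasons.append("resolution time above threshold")
    if not summary.detected_by_monitoring:
        # Manual discovery usually implies a monitoring failure.
        reasons.append("monitoring failure")
    return reasons
```

Returning the list of reasons, rather than a bare yes or no, lets the postmortem record why it was opened.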

The 5 Whys

Five whys (or 5 whys) is an iterative interrogative technique used to explore the cause-and-effect relationships underlying a particular problem. The primary goal of the technique is to determine the root cause of a defect or problem by repeating the question “Why?”. Each answer forms the basis of the next question.

The model follows a very simple process:

1. Assemble a Team

Gather together people who are familiar with the specifics of the problem, and with the process that you’re trying to fix.

2. Define the Problem

If you can, observe the problem in action. Discuss it with your team and write a brief, clear problem statement that you all agree on. For example, “Team A isn’t meeting its response time targets” or “Software release B resulted in too many rollback failures.”

Then, write your statement.

3. Ask the First “Why?”

Ask your team why the problem is occurring. (For example, “Why isn’t Team A meeting its response time targets?”)

Asking “Why?” sounds simple, but answering it requires serious thought. Search for answers that are grounded in fact: they must be accounts of things that have actually happened, not guesses at what might have happened.

Your team members may come up with one obvious reason why, or several plausible ones. Record their answers as succinct phrases, rather than as single words or lengthy statements, and write them below your problem statement. For example, saying “volume of calls is too high” is better than a vague “overloaded.”

4. Ask “Why?” Four More Times

For each of the answers that you generated in Step 3, ask four further “whys” in succession. Each time, frame the question in response to the answer you’ve just recorded.

Try to move quickly from one question to the next so that you have the full picture before you jump to any conclusions.
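
One way to keep track of a 5 Whys session is a small structure that stores the problem statement and each answer in order, so that every new question is framed against the most recent answer. The sketch below is an assumption about how a team might record this; the class and method names are hypothetical, and the sample strings reuse the examples from the steps above.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class FiveWhys:
    """Records one chain of "Why?" answers for a single problem statement."""
    problem: str
    answers: List[str] = field(default_factory=list)

    def next_question(self) -> str:
        # Each answer forms the basis of the next question.
        subject = self.answers[-1] if self.answers else self.problem
        return f"Why? ({subject})"

    def record(self, answer: str) -> None:
        # Answers should be succinct phrases grounded in fact.
        self.answers.append(answer.strip())

    def is_complete(self, depth: int = 5) -> bool:
        # One first "Why?" plus four more gives a depth of five.
        return len(self.answers) >= depth


if __name__ == "__main__":
    chain = FiveWhys("Team A isn't meeting its response time targets")
    print(chain.next_question())
    chain.record("volume of calls is too high")
    print(chain.next_question())
```

If Step 3 produces several plausible answers, a separate chain can be started for each one.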

Postmortem Action Items

A postmortem with no Action Items (AIs) is ineffective. Each AI must have an assigned owner, who should be clearly documented in the postmortem.

AIs are aimed at improving the system, not improving people.
AIs should focus on preventing the same incident from recurring. Where else might this problem manifest?
AIs should contain measurable and clear success criteria with a verifiable end state.
What automation or ML opportunities exist?
What improvements can be made to the detection system?
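
The ownership and success-criteria requirements above lend themselves to a simple completeness check before a postmortem is posted. This is a minimal sketch under assumed names: the `ActionItem` fields and the `action_item_issues` function are hypothetical and not tied to any particular ticketing system.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class ActionItem:
    """Hypothetical record of one postmortem action item."""
    title: str
    owner: str = ""
    success_criteria: str = ""  # measurable, with a verifiable end state


def action_item_issues(item: ActionItem) -> List[str]:
    """List what is missing from an action item; empty means it passes."""
    issues = []
    if not item.owner.strip():
        issues.append("missing owner")
    if not item.success_criteria.strip():
        issues.append("missing success criteria")
    return issues


def postmortem_issues(items: List[ActionItem]) -> List[str]:
    """Check a whole postmortem's action items."""
    if not items:
        # A postmortem with no action items is ineffective.
        return ["postmortem has no action items"]
    problems = []
    for item in items:
        for issue in action_item_issues(item):
            problems.append(f"{item.title}: {issue}")
    return problems
```

A check like this only confirms that the fields are filled in; whether the criteria are genuinely measurable still needs a human reviewer.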

Include action items such as:

Any fixes required to prevent the contributing factor in the future
Any preparedness tasks that could help mitigate the problem if it came up again
Remaining postmortem steps, such as the internal email, as well as the status-page public post
Any improvements to our incident response process

For more information on Postmortem Action Items best practices, please see the article here: Postmortem Action Items: Plan the Work and Work the Plan
