A post-mortem, sometimes called an incident review or retrospective analysis, is a structured meeting held after a significant incident, outage, or failure. Its purpose is to understand what happened, why it happened, and what the organisation can do to prevent similar events in the future. The defining principle of an effective post-mortem is that it must be blameless.
Blameless does not mean accountability-free. It means the discussion focuses on systemic factors, process gaps, and environmental conditions rather than individual mistakes. Humans make errors; that is a given. The question is why the system allowed that error to reach production, cascade into an outage, or go undetected for hours. When people feel safe to share what actually happened without fear of punishment, the organisation learns far more than it would from a sanitised version of events.
Post-mortems typically run for 60 to 90 minutes and should be held within one to five working days of the incident, while memories are fresh but emotions have had time to settle. The output is a written document that captures the timeline, root causes, contributing factors, and a prioritised list of action items with clear ownership and deadlines.
Not every minor bug or brief service degradation warrants a formal post-mortem. Reserve this format for incidents with significant customer-facing impact, material revenue loss, data loss or corruption, a breached SLA, or a failure pattern that keeps recurring.
| Role | Responsibility |
|---|---|
| Facilitator | Guides the discussion, enforces the blameless norm, manages time, and ensures the conversation stays focused on learning rather than blame. |
| Incident Commander | Provides the operational timeline, explains decisions made during the incident, and shares context on coordination challenges. |
| Engineers involved in resolution | Share technical details of what they observed, diagnosed, and fixed. Offer perspective on tooling gaps or monitoring blind spots. |
| On-call responders | Describe how the alert reached them, initial triage steps, and any delays in escalation or handoff. |
| Product / Customer-facing representative | Provides context on customer impact, communication gaps, and downstream effects on other teams. |
| Note-taker | Documents the timeline, root causes, action items, and key discussion points in the post-mortem template. |
Invite everyone who was directly involved in detecting, responding to, or resolving the incident. Exclude managers who were not directly involved unless they can add specific context without shifting the tone towards blame.
| Duration | Activity | Notes |
|---|---|---|
| 5 min | Set the tone | Facilitator restates the blameless norm. Remind the group that the goal is system improvement, not fault-finding. |
| 15 min | Timeline reconstruction | Walk through the incident chronologically. Use logs, alerts, and chat transcripts to build an accurate, shared timeline. |
| 20 min | Root cause analysis | Apply the 5 Whys or a contributing factors tree. Identify technical, process, and organisational factors. For complex issues, consider running a dedicated problem-solving workshop. |
| 10 min | What went well | Acknowledge effective responses, good judgement calls, and processes that worked as intended. This context matters for learning. |
| 10 min | What could be improved | Identify gaps in monitoring, alerting, documentation, communication, and response procedures. |
| 15 min | Action items | Generate specific, measurable actions. Assign a single owner and a deadline to each. Prioritise by impact and effort. |
| 5 min | Wrap-up and distribution | Confirm action item owners. Agree on where the post-mortem document will be published and who will review it. |
An engineering team at an e-commerce platform experienced a 4-hour production outage on a Tuesday afternoon, caused by a database migration that locked a critical table. The incident affected all checkout transactions, resulting in an estimated revenue loss of $180,000 and roughly 2,400 failed customer orders.
Three days after the incident, the team gathers for a 90-minute post-mortem. The facilitator, a senior engineer who was not directly involved, begins by restating the blameless principle and asking the incident commander to walk through the timeline. The group reconstructs events using PagerDuty alerts, Slack messages, and database logs: the migration was deployed at 14:12 via the standard CI/CD pipeline, the first customer error alerts fired at 14:17, the on-call engineer acknowledged at 14:23, and the incident was declared at 14:31 after initial triage ruled out a transient issue.
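When reconstructing a timeline like this, it helps to make the response gaps explicit, since detection and acknowledgement lag often become action items in their own right. A minimal sketch, using the illustrative timestamps from the example above (the helper name and event labels are assumptions, not part of any tool):

```python
from datetime import datetime

# Milestones taken from the example timeline above (hypothetical incident).
events = {
    "deploy": "14:12",
    "first_alert": "14:17",
    "acknowledged": "14:23",
    "declared": "14:31",
}

def minutes_between(start: str, end: str) -> int:
    """Whole minutes elapsed between two HH:MM timestamps on the same day."""
    fmt = "%H:%M"
    delta = datetime.strptime(end, fmt) - datetime.strptime(start, fmt)
    return int(delta.total_seconds() // 60)

time_to_detect = minutes_between(events["deploy"], events["first_alert"])       # 5 minutes
time_to_ack = minutes_between(events["first_alert"], events["acknowledged"])    # 6 minutes
time_to_declare = minutes_between(events["deploy"], events["declared"])         # 19 minutes
```

Laying the numbers out this way turns a vague "the alert took a while to reach on-call" into a measurable gap the team can target.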
The root cause analysis reveals a chain of contributing factors. The migration script acquired an exclusive lock on the orders table, which was expected to complete in seconds on the staging environment but took over four hours on the production database due to a significantly larger dataset. The staging environment contained only 50,000 rows compared to production's 12 million. There was no migration review process that accounted for table size, and the CI/CD pipeline had no safeguard to prevent long-running locks during business hours. The team identifies six action items:

- Implement a migration pre-check that compares table sizes between staging and production.
- Add a circuit breaker that cancels migrations exceeding a configurable lock duration.
- Require peer review for any migration touching tables with more than one million rows.
- Update the staging environment to use production-scale data volumes for critical tables.
- Add a runbook for database lock incidents.
- Schedule migrations outside peak traffic hours by default.

Each action is assigned a single owner and given a two-week deadline.
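The first action item above, a pre-check that flags migrations against large tables, could be sketched roughly as follows. This is an illustrative sketch, not the team's actual tooling: the function name, threshold constant, and row counts are assumptions drawn from the example, and in a real pipeline the counts would come from the database rather than a hard-coded dict.

```python
# Hypothetical pre-check: flag any touched table whose production row count
# exceeds the review threshold, so the migration gets peer review and a
# bounded lock (e.g. PostgreSQL's `SET lock_timeout = '5s';`).
LARGE_TABLE_THRESHOLD = 1_000_000  # rows, per the peer-review rule above

def migration_precheck(prod_row_counts: dict, tables_touched: set) -> list:
    """Return the touched tables that exceed the review threshold, sorted."""
    return sorted(
        table for table in tables_touched
        if prod_row_counts.get(table, 0) > LARGE_TABLE_THRESHOLD
    )

# Illustrative numbers from the incident: staging had 50k rows, production 12M.
prod_counts = {"orders": 12_000_000, "coupons": 50_000}
flagged = migration_precheck(prod_counts, {"orders", "coupons"})
# flagged == ["orders"] → block auto-deploy, require review and a lock timeout
```

The design choice worth noting is that the check runs against production row counts, not staging: the incident happened precisely because staging's 50,000-row table masked production-scale lock behaviour.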