A post-mortem, sometimes called an incident review or retrospective analysis, is a structured meeting held after a significant incident, outage, or failure. Its purpose is to understand what happened, why it happened, and what the organisation can do to prevent similar events in the future. The defining principle of an effective post-mortem is that it must be blameless.
Blameless does not mean accountability-free. It means the discussion focuses on systemic factors, process gaps, and environmental conditions rather than individual mistakes. Humans make errors; that is a given. The question is why the system allowed that error to reach production, cascade into an outage, or go undetected for hours. When people feel safe to share what actually happened without fear of punishment, the organisation learns far more than it would from a sanitised version of events.
Post-mortems typically run for 60 to 90 minutes and should be held within one to five working days of the incident, while memories are fresh but emotions have had time to settle. The output is a written document that captures the timeline, root causes, contributing factors, and a prioritised list of action items with clear ownership and deadlines.
Not every minor bug or brief service degradation warrants a formal post-mortem. Reserve this format for incidents with significant customer-facing impact, material revenue loss, data loss or corruption, a breached SLA, or a failure pattern that keeps recurring.
| Role | Responsibility |
|---|---|
| Facilitator | Guides the discussion, enforces the blameless norm, manages time, and ensures the conversation stays focused on learning rather than blame. |
| Incident Commander | Provides the operational timeline, explains decisions made during the incident, and shares context on coordination challenges. |
| Engineers involved in resolution | Share technical details of what they observed, diagnosed, and fixed. Offer perspective on tooling gaps or monitoring blind spots. |
| On-call responders | Describe how the alert reached them, initial triage steps, and any delays in escalation or handoff. |
| Product / Customer-facing representative | Provides context on customer impact, communication gaps, and downstream effects on other teams. |
| Note-taker | Documents the timeline, root causes, action items, and key discussion points in the post-mortem template. |
Invite everyone who was directly involved in detecting, responding to, or resolving the incident. Exclude managers who were not directly involved unless they can add specific context without shifting the tone towards blame.
| Duration | Activity | Notes |
|---|---|---|
| 5 min | Set the tone | Facilitator restates the blameless norm. Remind the group that the goal is system improvement, not fault-finding. |
| 15 min | Timeline reconstruction | Walk through the incident chronologically. Use logs, alerts, and chat transcripts to build an accurate, shared timeline. |
| 20 min | Root cause analysis | Apply the 5 Whys or a contributing factors tree. Identify technical, process, and organisational factors. For complex issues, consider running a dedicated problem-solving workshop. |
| 10 min | What went well | Acknowledge effective responses, good judgement calls, and processes that worked as intended. This context matters for learning. |
| 10 min | What could be improved | Identify gaps in monitoring, alerting, documentation, communication, and response procedures. |
| 15 min | Action items | Generate specific, measurable actions. Assign a single owner and a deadline to each. Prioritise by impact and effort. |
| 5 min | Wrap-up and distribution | Confirm action item owners. Agree on where the post-mortem document will be published and who will review it. |
An engineering team at an e-commerce platform experienced a 4-hour production outage on a Tuesday afternoon, caused by a database migration that locked a critical table. The incident affected all checkout transactions, resulting in an estimated revenue loss of $180,000 and roughly 2,400 failed customer orders.
Three days after the incident, the team gathers for a 90-minute post-mortem. The facilitator, a senior engineer who was not directly involved, begins by restating the blameless principle and asking the incident commander to walk through the timeline. The group reconstructs events using PagerDuty alerts, Slack messages, and database logs: the migration was deployed at 14:12 via the standard CI/CD pipeline, the first customer error alerts fired at 14:17, the on-call engineer acknowledged at 14:23, and the incident was declared at 14:31 after initial triage ruled out a transient issue.
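When reconstructing a timeline like this, it helps to make the response gaps explicit, since detection and acknowledgement lag often become action items in their own right. A minimal sketch, using the illustrative timestamps from the example above (the helper name and event labels are assumptions, not part of any tool):

```python
from datetime import datetime

# Milestones taken from the example timeline above (hypothetical incident).
events = {
    "deploy": "14:12",
    "first_alert": "14:17",
    "acknowledged": "14:23",
    "declared": "14:31",
}

def minutes_between(start: str, end: str) -> int:
    """Whole minutes elapsed between two HH:MM timestamps on the same day."""
    fmt = "%H:%M"
    delta = datetime.strptime(end, fmt) - datetime.strptime(start, fmt)
    return int(delta.total_seconds() // 60)

time_to_detect = minutes_between(events["deploy"], events["first_alert"])       # 5 minutes
time_to_ack = minutes_between(events["first_alert"], events["acknowledged"])    # 6 minutes
time_to_declare = minutes_between(events["deploy"], events["declared"])         # 19 minutes
```

Laying the numbers out this way turns a vague "the alert took a while to reach on-call" into a measurable gap the team can target.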
The root cause analysis reveals a chain of contributing factors. The migration script acquired an exclusive lock on the orders table, which was expected to complete in seconds on the staging environment but took over four hours on the production database due to a significantly larger dataset. The staging environment contained only 50,000 rows compared to production's 12 million. There was no migration review process that accounted for table size, and the CI/CD pipeline had no safeguard to prevent long-running locks during business hours. The team identifies six action items:

- Implement a migration pre-check that compares table sizes between staging and production.
- Add a circuit breaker that cancels migrations exceeding a configurable lock duration.
- Require peer review for any migration touching tables with more than one million rows.
- Update the staging environment to use production-scale data volumes for critical tables.
- Add a runbook for database lock incidents.
- Schedule migrations outside peak traffic hours by default.

Each action is assigned a single owner and given a two-week deadline.
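The first action item above, a pre-check that flags migrations against large tables, could be sketched roughly as follows. This is an illustrative sketch, not the team's actual tooling: the function name, threshold constant, and row counts are assumptions drawn from the example, and in a real pipeline the counts would come from the database rather than a hard-coded dict.

```python
# Hypothetical pre-check: flag any touched table whose production row count
# exceeds the review threshold, so the migration gets peer review and a
# bounded lock (e.g. PostgreSQL's `SET lock_timeout = '5s';`).
LARGE_TABLE_THRESHOLD = 1_000_000  # rows, per the peer-review rule above

def migration_precheck(prod_row_counts: dict, tables_touched: set) -> list:
    """Return the touched tables that exceed the review threshold, sorted."""
    return sorted(
        table for table in tables_touched
        if prod_row_counts.get(table, 0) > LARGE_TABLE_THRESHOLD
    )

# Illustrative numbers from the incident: staging had 50k rows, production 12M.
prod_counts = {"orders": 12_000_000, "coupons": 50_000}
flagged = migration_precheck(prod_counts, {"orders", "coupons"})
# flagged == ["orders"] → block auto-deploy, require review and a lock timeout
```

The design choice worth noting is that the check runs against production row counts, not staging: the incident happened precisely because staging's 50,000-row table masked production-scale lock behaviour.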