# The Art of the Postmortem

Tags: lesson, blameless-postmortems, contributing-factors, action-items, incident-culture, examples

**Level:** L1 (Foundations — everyone should learn this)
**Time:** 45–60 minutes
**Prerequisites:** None
## The Mission
Your team just had an outage. The VP asks for a postmortem. You write one. It says "root cause: human error" and lists action items like "be more careful" and "improve processes." Six weeks later, the same incident happens again. The postmortem failed — not because nobody wrote it, but because it blamed a person instead of fixing a system.
This lesson teaches how to write postmortems that actually prevent repeat incidents.
## The Blameless Foundation
Engineers don't make mistakes because they're careless. They make mistakes because the systems they work in create conditions where that mistake was the most likely outcome given the available information.
Fix the engineer → one less engineer. Fix the system → prevent the entire class of future incidents.
### The local rationality test
For every action in the timeline, ask: "Was this decision locally rational given the information this person actually had?" Not what you know now, after the incident. What they knew then, in the moment, with the tools and information available to them.
In almost all serious incidents, the answer is yes. The operator made the best decision they could with what they had. The failure was in what they didn't have — missing monitoring, misleading alerts, outdated runbooks, absent guardrails.
Name Origin: The blameless postmortem was popularized by John Allspaw (Etsy CTO) and Sidney Dekker (human factors researcher). Allspaw's 2012 blog post "Blameless PostMortems and a Just Culture" became the template for the industry. Dekker's "Field Guide to Understanding Human Error" provides the academic foundation.
## Postmortem Structure That Works

### 1. Timeline (the spine)
Minute-by-minute, factual, no judgment. This is the most important section because it forces precision and prevents revisionist history.
- 14:00 — Deploy v2.3.1 pushed to production (contained a DB migration)
- 14:00 — Migration starts: `CREATE INDEX` on `users` table (50M rows)
- 14:00 — Table acquires `ACCESS EXCLUSIVE` lock
- 14:02 — First query timeouts (queries blocked by the lock)
- 14:05 — Error rate crosses 5%
- 14:12 — PagerDuty alert fires (`for: 10m` plus scrape delay)
- 14:12 — On-call acknowledges. Opens Grafana.
- 14:15 — On-call suspects the deploy. Checks deploy log.
- 14:20 — On-call runs: `SELECT * FROM pg_stat_activity WHERE state = 'active'`
- 14:22 — Identifies `CREATE INDEX` holding the lock for 22 minutes
- 14:23 — Runs: `SELECT pg_terminate_backend(pid)` — kills the migration
- 14:25 — Queries resume. Error rate normalizes.
- 14:30 — All clear. Incident channel closed.

Good timeline language: "At 14:12, the on-call concluded from available metrics that..."
Bad timeline language: "The engineer failed to check the migration before deploying."
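The triage steps at 14:20–14:23 can be sketched as a pure function over rows shaped like `pg_stat_activity` output. This is an illustration only: the dict keys mirror a few real columns (`pid`, `state`, `query_start`, `query`), but in a live incident you would run the SQL from the timeline, not Python.

```python
from datetime import datetime, timedelta

def find_long_running(rows, now, threshold=timedelta(minutes=5)):
    """Return active sessions whose query has run longer than `threshold`,
    longest-running first. `rows` mimics a few pg_stat_activity columns."""
    suspects = [
        r for r in rows
        if r["state"] == "active" and now - r["query_start"] > threshold
    ]
    return sorted(suspects, key=lambda r: r["query_start"])

now = datetime(2024, 3, 20, 14, 22)
rows = [
    {"pid": 101, "state": "active",
     "query_start": datetime(2024, 3, 20, 14, 0),
     "query": "CREATE INDEX idx_users_email ON users (email)"},
    {"pid": 102, "state": "active",
     "query_start": datetime(2024, 3, 20, 14, 21),
     "query": "SELECT ..."},
    {"pid": 103, "state": "idle",
     "query_start": datetime(2024, 3, 20, 13, 0),
     "query": "COMMIT"},
]

# The CREATE INDEX session (pid 101, running 22 minutes) surfaces as
# the only suspect; the idle session and the fresh SELECT are filtered out.
suspects = find_long_running(rows, now)
```

The same filter-and-sort is what the `WHERE state = 'active'` query plus an `ORDER BY query_start` does server-side.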
### 2. Contributing factors (not "root cause")
Multiple systemic factors, not one human error:

1. **Migration not tested against production-sized data**
   - Staging has 10K rows; production has 50M
   - Migration took 3 seconds in staging, 25+ minutes in production
2. **No timeout protection on migration jobs**
   - Migration ran indefinitely
   - No CI check for expected migration duration
3. **`CREATE INDEX` without `CONCURRENTLY`**
   - Default `CREATE INDEX` acquires an `ACCESS EXCLUSIVE` lock
   - `CREATE INDEX CONCURRENTLY` avoids this but wasn't used
4. **Alert took 12 minutes to fire**
   - `for: 10m` required 10 continuous minutes above threshold
   - Plus scrape interval delay
   - Total detection: 12 minutes
5. **No pre-deploy migration safety check**
   - No automated check of table size before running migrations
   - No runbook entry for "large table migration"
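Factors 3 and 5 suggest a guardrail that is easy to automate. A minimal sketch, assuming a CI hook that can see the migration SQL and approximate row counts (the `migration_warnings` helper, its regex, and the 1M-row threshold are all illustrative, not a production-ready linter):

```python
import re

# Illustrative threshold: warn on non-concurrent index builds on big tables.
LARGE_TABLE_ROWS = 1_000_000

def migration_warnings(sql, table_rows):
    """Return warnings for statements that would take an ACCESS EXCLUSIVE
    lock on a large table. `table_rows` maps table name to approximate
    row count (in Postgres this could come from pg_class.reltuples)."""
    warnings = []
    # Plain CREATE INDEX locks the table; CREATE INDEX CONCURRENTLY does not.
    pattern = r"CREATE\s+(?:UNIQUE\s+)?INDEX\s+(?!CONCURRENTLY)\S+\s+ON\s+(\w+)"
    for m in re.finditer(pattern, sql, re.IGNORECASE):
        table = m.group(1)
        if table_rows.get(table, 0) > LARGE_TABLE_ROWS:
            warnings.append(
                f"CREATE INDEX on {table} ({table_rows[table]:,} rows) "
                "without CONCURRENTLY will block reads and writes")
    return warnings

# The incident's migration trips the check; the concurrent form passes.
warns = migration_warnings(
    "CREATE INDEX idx_users_email ON users (email);",
    {"users": 50_000_000},
)
```

A check like this turns "be more careful with migrations" into a deploy-time gate that fails loudly before the lock is ever taken.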
### 3. What went well
This section is critical for morale and for reinforcing good behavior:
- On-call acknowledged within 30 seconds
- pg_stat_activity correctly identified the blocking query
- pg_terminate_backend safely stopped the migration
- Incident channel opened immediately with clear communication
- Rollback to v2.3.0 was available as backup option
### 4. What went poorly
- Migration not tested against production data volume
- 12-minute detection delay
- No automated guardrail for table size × migration type
- Runbook had no entry for "migration locks table"
### 5. Action items (specific, assigned, time-bounded)
```
✗ BAD:  "Improve migration testing"
✓ GOOD: "Add CI check that runs migrations against production-sized
         synthetic data (50M rows). Assigned: @alice. Due: April 5."

✗ BAD:  "Better monitoring"
✓ GOOD: "Reduce error rate alert for: from 10m to 3m. Assigned: @bob.
         Due: March 28."

✗ BAD:  "Be more careful with migrations"
✓ GOOD: "Add pre-migration hook that warns if target table > 1M rows
         and migration contains non-concurrent CREATE INDEX.
         Assigned: @carol. Due: April 10."
```
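The `for:` reduction in the second action item can be sanity-checked with back-of-envelope arithmetic: worst-case detection is the `for:` window plus up to one scrape interval and one rule-evaluation interval of lag. The 60-second intervals below are assumptions about this team's Prometheus config, not facts from the incident:

```python
def worst_case_detection(for_minutes, scrape_interval_s=60, eval_interval_s=60):
    """Worst-case minutes from 'threshold first crossed' to page:
    the condition must hold for the full for: window, plus up to one
    scrape interval and one rule-evaluation interval of lag."""
    return for_minutes + (scrape_interval_s + eval_interval_s) / 60

# With for: 10m and 60s scrape/eval intervals, the worst case is 12.0
# minutes, consistent with the incident's observed detection delay.
delay_before = worst_case_detection(10)   # 12.0
delay_after = worst_case_detection(3)     # 5.0
```

Under these assumptions the action item cuts worst-case detection from roughly 12 minutes to roughly 5, which is the kind of quantified rationale a good action item can carry.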
## The Action Item Problem
War Story: A team wrote 73 postmortems over 2 years with 219 action items. Completion rate: 15.5%. Eleven incidents repeated. Three repeated three times.
The action items were in Confluence. Nobody checked Confluence. Nobody tracked completion. Nobody was accountable.
After a Redis split-brain incident repeated (same root cause, same fix documented 3 months prior), the team moved all action items to Jira with owners, due dates, and sprint assignments. Completion rate went from 15.5% to 89% in one quarter.
### Making action items stick
- Move to your issue tracker (Jira, Linear, GitHub Issues) — not a doc
- Assign an owner (not a team — one person)
- Set a due date (not "soon" — a specific date)
- Track in sprint planning (treat like customer-facing bugs)
- Review in retrospectives ("did we close last month's postmortem actions?")
If it's not in the tracker with a deadline, it won't happen.
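The checklist above is mechanical enough to lint before a postmortem is closed out. A sketch, assuming action items are represented as plain dicts; the field names and the vague-verb list are invented for illustration:

```python
import re

def lint_action_item(item):
    """Return the problems with an action item dict: it must name one
    individual owner (@handle), a concrete due date, and a specific
    action (no vague verbs)."""
    problems = []
    if not re.fullmatch(r"@\w+", item.get("owner", "")):
        problems.append("needs exactly one individual owner (@handle)")
    if not re.fullmatch(r"\d{4}-\d{2}-\d{2}", item.get("due", "")):
        problems.append("needs a specific due date (YYYY-MM-DD)")
    vague = ("improve", "be more careful", "better")
    if any(v in item.get("action", "").lower() for v in vague):
        problems.append("action is vague: name the concrete change")
    return problems

# Fails all three checks: vague verb, team owner, no real date.
bad = lint_action_item(
    {"action": "Improve monitoring", "owner": "platform-team", "due": "soon"})

# Passes: specific change, one owner, concrete date.
good = lint_action_item({
    "action": "Reduce error rate alert for: from 10m to 3m",
    "owner": "@bob", "due": "2024-03-28",
})
```

Running something like this as a CI step on the postmortem repo makes "specific, assigned, time-bounded" a gate rather than a guideline.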
## Common Postmortem Failures

### "Root cause: human error"
This is never the root cause. The human acted rationally given their information. The root causes are the systemic gaps that put the human in a position to make that mistake:

- Why was the dangerous action possible without a guardrail?
- Why didn't monitoring catch it earlier?
- Why was the runbook missing this scenario?
### The hindsight trap
Knowing the outcome makes the path to failure look obvious. But at decision time, the operator had 25% signal and 75% noise. The postmortem reader has 87% "obvious" and 13% "how did they miss it?" because they know the answer.
Counter-measure: For each decision point in the timeline, list what the operator actually had available: which metrics, which logs, which alerts, which runbooks. Then ask: "Given only this information, would I have done differently?"
### The blame spiral
Even "blameless" postmortems can feel blameful if the language is wrong:
```
✗ "The engineer incorrectly assumed the database was healthy."

✓ "At 14:15, available metrics (CPU 12%, connections 45/100, no error
   alerts) indicated the database was healthy. The table lock was not
   visible in the standard monitoring dashboard."
```
## Flashcard Check

**Q1: "Root cause: human error" — why is this wrong?**

It blames a person instead of fixing the system. The human acted rationally given available information. The real causes are missing guardrails, monitoring gaps, and runbook deficiencies.

**Q2: Action item: "improve monitoring." Is this good enough?**

No. It is not specific, not assigned, and not time-bounded. "Add alert for table lock duration > 5 minutes, assigned to @bob, due April 1" is actionable.

**Q3: Why track action items in Jira instead of a postmortem doc?**

Docs aren't tracked, and nobody reviews them in sprint planning. Completion rates for doc-based action items are typically below 20%; with an issue tracker, owners, and deadlines they are typically above 80%.

**Q4: What is local rationality?**

The test: was the operator's decision rational given the information they actually had at decision time, not what you know now with hindsight? The answer is almost always yes.
## Cheat Sheet

### Postmortem Template

```markdown
# Incident: [Title]

**Date:** YYYY-MM-DD  **Duration:** Xm  **Severity:** SEV-N

## Timeline (UTC)
- HH:MM — Event
- HH:MM — Event

## Contributing Factors
1. Factor (systemic, not personal)
2. Factor
3. Factor

## What Went Well
-

## What Went Poorly
-

## Action Items
| # | Action | Owner | Due | Status |
|---|--------|-------|-----|--------|
| 1 | Specific action | @name | Date | Open |
```
### Language Guide
| Instead of | Write |
|---|---|
| "Engineer failed to..." | "At HH:MM, available information indicated..." |
| "Should have known..." | "The monitoring did not surface..." |
| "Human error" | "Contributing factors: [systemic gaps]" |
| "Be more careful" | "Add guardrail: [specific check]" |
## Takeaways

- **Blameless means systemic.** Fix the system, not the person. The operator acted rationally; the system failed to support them.
- **The timeline is the most important section.** It forces precision, prevents revisionism, and shows the information landscape at each decision point.
- **Action items must be tracked like bugs.** Issue tracker, owner, due date, sprint assignment. If it's in a doc, it's dead.
- **"Root cause: human error" is always wrong.** The root causes are the missing guardrails, monitoring gaps, and runbook deficiencies that enabled the error.
- **Hindsight bias makes everything look obvious.** At decision time, the operator had 25% signal. The postmortem reader has 87% "obvious." Account for this when writing.
## Related Lessons
- How Incident Response Actually Works — the incident that leads to the postmortem
- Prometheus and the Art of Not Alerting — monitoring gaps that contribute to incidents
- The Rollback That Wasn't — when the fix itself becomes an incident