The Postmortem Nobody Read

Category: The Hard Lesson | Domains: incident-command, sre-practices | Read time: ~5 min

Setting the Scene

We wrote excellent postmortems. Blameless, thorough, with root cause analysis and action items. They lived in a Confluence space called "Incident Reviews" with 73 documents spanning two years. Each one ended with a tidy table of action items, owners, and due dates. They were, by any measure, a model of engineering maturity.

There was just one problem: nobody ever did the action items.

What Happened

In January, our Redis cluster experienced a split-brain during a network partition. The Sentinel quorum was set to 1 instead of 2, so a single Sentinel could declare the primary down and start a failover on its own, even when it was merely cut off by the partition. Both sides of the partition promoted a new primary, and clients on each side kept writing to their own. When the partition healed, we had divergent datasets and lost about 40 minutes of session data. In total, 15,000 users were logged out simultaneously.
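
A misconfiguration like this can be caught by a lint-style check on the Sentinel config. The sketch below is only an illustration: it parses the standard `sentinel monitor <name> <ip> <port> <quorum>` directive and flags any quorum below a strict majority of the deployed Sentinels. The story does not say how many Sentinels were running, so the count of three, the master name `sessions`, and the address are assumptions.

```python
import re

# Matches the standard directive: sentinel monitor <name> <ip> <port> <quorum>
MONITOR_RE = re.compile(r"^\s*sentinel\s+monitor\s+(\S+)\s+(\S+)\s+(\d+)\s+(\d+)\s*$")


def parse_quorums(conf_text):
    """Return {master_name: quorum} for every 'sentinel monitor' line."""
    quorums = {}
    for line in conf_text.splitlines():
        match = MONITOR_RE.match(line)
        if match:
            name, _ip, _port, quorum = match.groups()
            quorums[name] = int(quorum)
    return quorums


def quorum_warnings(conf_text, sentinel_count):
    """Flag any master whose quorum is below a strict majority of Sentinels."""
    majority = sentinel_count // 2 + 1
    return [
        f"{name}: quorum {q} is below majority {majority} of {sentinel_count} Sentinels"
        for name, q in parse_quorums(conf_text).items()
        if q < majority
    ]


# Hypothetical config resembling the January state (name and address are made up).
JANUARY_CONF = "sentinel monitor sessions 10.0.0.10 6379 1\n"
# Assumption: three Sentinels; the original story does not give the count.
print(quorum_warnings(JANUARY_CONF, sentinel_count=3))
```

Run against every Sentinel's config in CI, a check of this kind turns a silent setting into a failing build.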

The postmortem was textbook. We identified the root cause, documented the timeline, and created three action items: (1) raise the Sentinel quorum to 2, (2) add a monitoring check for split-brain conditions, (3) run a chaos engineering exercise to validate the fix. Owner: platform team. Due date: February 15.
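
The second action item, the split-brain monitoring check, might look roughly like the sketch below. It assumes something else has already collected the text of `INFO replication` from each node in one replication group, and it raises a flag when more than one node reports `role:master`. The node names and sample payloads are hypothetical, and this is not the check the team eventually built.

```python
def parse_role(info_text):
    """Extract the 'role' field from the text of Redis INFO replication."""
    for line in info_text.splitlines():
        key, sep, value = line.strip().partition(":")
        if sep and key == "role":
            return value
    return None


def find_primaries(info_by_node):
    """Return the sorted nodes that report themselves as primary (role:master)."""
    return sorted(
        node for node, info in info_by_node.items() if parse_role(info) == "master"
    )


def is_split_brain(info_by_node):
    """More than one self-declared primary in one replication group is a red flag."""
    return len(find_primaries(info_by_node)) > 1


# Hypothetical samples; a real check would collect these from each node.
samples = {
    "redis-a": "# Replication\r\nrole:master\r\nconnected_slaves:1\r\n",
    "redis-b": "# Replication\r\nrole:master\r\nconnected_slaves:0\r\n",
    "redis-c": "# Replication\r\nrole:slave\r\nmaster_host:redis-a\r\n",
}
print(find_primaries(samples), is_split_brain(samples))
```

Keeping the parsing separate from the collection step means the logic can be unit-tested without a live cluster.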

February 15 came and went. The action items sat in Confluence. The platform team was heads-down on a product launch. Nobody followed up.

In April, during a datacenter network maintenance window, the exact same network partition occurred. The exact same split-brain happened. The exact same data divergence. This time it was worse: 90 minutes of data loss, 42,000 users logged out, and a payments reconciliation that took 5 days.

I pulled up the January postmortem on the April incident bridge call and read the action items out loud. The silence on the Zoom was deafening. Somebody unmuted and said, "I thought we fixed that."

The Moment of Truth

After the second incident, I audited every postmortem from the past two years. Of 73 postmortems containing 219 action items, 34 had been completed. That's a 15.5% completion rate. Eleven incidents had repeated at least once. Three had repeated three times. We weren't learning from our failures; we were documenting them for posterity and moving on.
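
For anyone running the same audit, the arithmetic is short enough to script. This snippet uses only the figures quoted above; the variable names are illustrative.

```python
postmortems = 73
action_items = 219
completed = 34

completion_rate = completed / action_items
open_items = action_items - completed

print(f"completion rate: {completion_rate:.1%}")                        # 15.5%
print(f"still open: {open_items} of {action_items}")                    # 185 of 219
print(f"actions per postmortem: {action_items / postmortems:.1f}")      # 3.0
```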

The Aftermath

We moved action items out of Confluence and into Jira with a dedicated "Postmortem Action" issue type. Each action got an owner, a due date, and a weekly automated reminder. Incomplete actions appeared in the engineering leadership weekly review. We also started "postmortem action review" as a standing agenda item in team retros.
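
As a rough illustration of what each tracked action carries, the sketch below builds a request body in the general shape of Jira's REST v2 create-issue call. Only the "Postmortem Action" issue type comes from the story. The project key `OPS`, the incident label, and the year in the due date are placeholders, and the assignee is left out on purpose because the way it is specified differs between Jira deployments.

```python
import json
from dataclasses import dataclass
from datetime import date


@dataclass
class PostmortemAction:
    summary: str
    incident: str  # label tying the action back to its postmortem
    due: date


def to_issue_payload(action, project_key="OPS"):
    """Build a create-issue body; 'OPS' is a placeholder project key."""
    return {
        "fields": {
            "project": {"key": project_key},
            "issuetype": {"name": "Postmortem Action"},
            "summary": action.summary,
            "duedate": action.due.isoformat(),  # Jira expects YYYY-MM-DD
            "labels": [action.incident],
        }
    }


# The year is a placeholder; the story only gives the date as February 15.
action = PostmortemAction(
    summary="Raise Sentinel quorum to 2",
    incident="redis-split-brain",
    due=date(2024, 2, 15),
)
payload = to_issue_payload(action)
print(json.dumps(payload, indent=2))
```

The payload is only built and printed here; sending it, authenticating, and setting the owner are left to whatever client the team already uses.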

The completion rate went from 15% to 89% in three months. The repeated incidents stopped. It wasn't magic; it was just treating postmortem actions with the same seriousness as customer-facing bugs.

The Lessons

Postmortem actions need owners and deadlines: An action item without an owner is a wish. An action item without a deadline is a suggestion. Neither prevents the next incident.
Track postmortem actions like bugs: If it's not in your issue tracker with a priority and an SLA, it's not going to happen. Confluence documents don't send reminders.
If you don't follow through, you'll repeat the incident: A postmortem without completed actions is just a well-formatted prediction of your next outage.

What I'd Do Differently

I'd automate the pipeline from postmortem to ticket from day one. The postmortem template would include a section that auto-creates Jira tickets when the document is published. I'd also institute a "postmortem debt" metric, defined as the count of open postmortem actions past their due date, and display it on the team's dashboard right next to uptime and error rate.
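
The "postmortem debt" metric is simple enough to sketch. The version below assumes each action has an owner, a due date, and a done flag, and it treats an action as overdue only once its due date has passed. The record shape, titles, and the year in the dates are assumptions for illustration.

```python
from collections import Counter
from dataclasses import dataclass
from datetime import date


@dataclass
class Action:
    title: str
    owner: str
    due: date
    done: bool = False


def is_overdue(action, today):
    """An action is overdue when it is still open after its due date."""
    return not action.done and action.due < today


def postmortem_debt(actions, today):
    """Count open actions whose due date has already passed."""
    return sum(1 for a in actions if is_overdue(a, today))


def overdue_by_owner(actions, today):
    """Break the debt down per owner, e.g. for a weekly reminder."""
    return Counter(a.owner for a in actions if is_overdue(a, today))


# The three January action items; the year is a placeholder.
due = date(2024, 2, 15)
actions = [
    Action("Raise Sentinel quorum to 2", "platform team", due),
    Action("Add split-brain monitoring check", "platform team", due),
    Action("Run chaos exercise to validate the fix", "platform team", due),
]
print(postmortem_debt(actions, today=date(2024, 4, 1)))          # 3
print(dict(overdue_by_owner(actions, today=date(2024, 4, 1))))   # {'platform team': 3}
```

Had this number been on a dashboard, it would have read 3 for the Redis incident from mid-February until the April outage.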

The Quote

"We didn't have 73 postmortems. We had 73 predictions of future incidents, and we ignored 85% of them."

Cross-References

Topic Packs: Incident Command, SRE Practices
Case Studies: Alert Storm Flapping Healthchecks