Postmortem Anthology
Realistic postmortem documents as they would appear in a company's
internal wiki. Use these to learn postmortem writing, practice incident
analysis, and understand how different types of failures are documented.

| ID | Title | Severity | Duration | Root Cause Category | Domains |
|---|---|---|---|---|---|
| PM-001 | Production Database Deleted by Terraform Apply on Wrong Workspace | SEV-1 | 4h 12m | Human Error | k8s, terraform, cloud |
| PM-002 | Expired Wildcard TLS Certificate Causes Full API Gateway Outage | SEV-1 | 2h 37m | Configuration | k8s, security, networking |
| PM-003 | Race Condition in Distributed Lock Manager Corrupts Shared State | SEV-1 | 6h 15m | Software Bug | k8s, storage |
| PM-004 | Core Switch Firmware Bug Causes Cascading Network Partition | SEV-1 | 3h 42m | Infrastructure | datacenter, networking |
| PM-005 | Unbounded Retry Storm Takes Down Payment Processing | SEV-1 | 1h 54m | Design Flaw | k8s, networking |
| PM-006 | Debug Build Deployed to Production via Copy-Paste Error | SEV-2 | 47m | Human Error | ci-cd |
| PM-007 | Helm Values Mismatch Routes Staging Traffic to Production DB | SEV-2 | 38m | Configuration | k8s, networking |
| PM-008 | Memory Leak in Log Shipping Agent Causes Fleet-Wide OOM Kills | SEV-2 | 52m | Software Bug | linux, k8s |
| PM-009 | AWS AZ Network Degradation Triggers Cascading Health Check Failures | SEV-2 | 1h 15m | Infrastructure | cloud, networking |
| PM-010 | Missing Runbook Extends CrashLoopBackOff Recovery by 45 Minutes | SEV-2 | 58m | Process Gap | k8s |
| PM-011 | Missing Circuit Breaker Lets Redis Failure Cascade to All Services | SEV-2 | 33m | Design Flaw | k8s, observability |
| PM-012 | No Review Gate on Terraform Destroy Leads to Wrong Account Teardown | SEV-2 | 2h 8m | Process Gap | terraform, cloud |
| PM-013 | BGP Route Leak Sends Customer Traffic Through Monitoring VLAN | SEV-2 | 28m | Configuration | networking |
| PM-014 | Unbounded Kafka Topic Exhausts Broker Disk | SEV-2 | 41m | Design Flaw | messaging, storage |
| PM-015 | Custom Controller Missing Backoff Overwhelms API Server | SEV-2 | 22m | Design Flaw | k8s |
| PM-016 | Ansible Playbook Targets Production Instead of Staging | SEV-3 | 15m | Human Error | ansible |
| PM-017 | Resource Quota Misconfiguration Blocks All Deployments | SEV-3 | 2h 10m | Configuration | k8s |
| PM-018 | Prometheus Cardinality Explosion from Debug Labels | SEV-3 | 1h 30m | Software Bug | k8s, observability |
| PM-019 | SSD Firmware Bug Causes Silent Bit Corruption | SEV-3 | 3h 0m | Infrastructure | storage, datacenter |
| PM-020 | Stale Docker Base Image Ships Known CVE to Production | SEV-3 | 14d exposure | Process Gap | ci-cd, security |
| PM-021 | Single etcd Member Disk Full Degrades Control Plane | SEV-3 | 45m | Design Flaw | k8s |
| PM-022 | DNS CNAME Chain Breaks After Load Balancer Rename | SEV-3 | 25m | Human Error | dns, k8s |
| PM-023 | On-Call Handoff Gap Leaves Alerts Unacknowledged for 3 Hours | SEV-3 | 3h gap | Process Gap | ops |
| PM-024 | Kernel TCP Regression After Security Patch | SEV-3 | 1h 20m | Software Bug | linux |
| PM-025 | UPS Battery Degradation Causes Rack Power Loss During Utility Blip | SEV-3 | 35m | Infrastructure | datacenter |
| PM-026 | AWS Credentials Committed to Public Repo — Caught by Pre-Commit Hook | Near-Miss | 0m | Human Error | ci-cd, security |
| PM-027 | Wildcard Ingress Rule Nearly Exposes Internal Admin Panel | Near-Miss | 0m | Configuration | k8s, security |
| PM-028 | Go Dependency Update Silently Changes Default Timeout — Caught in Canary | Near-Miss | 0m | Software Bug | k8s |
| PM-029 | S3 Bucket Policy Change Nearly Deletes All Backup Archives | Near-Miss | 0m | Infrastructure | cloud, storage |
| PM-030 | Alert Routing Sends All Pages to Decommissioned Channel | Near-Miss | 0m | Process Gap | observability, ops |

Severity Distribution

| Severity | Count | Description |
|---|---|---|
| SEV-1 | 5 | Customer-facing outage, revenue impact, >1 hour |
| SEV-2 | 10 | Partial degradation, customer notices, <1 hour |
| SEV-3 | 10 | Internal impact, caught before customers noticed |
| Near-Miss | 5 | No customer impact, but would have been severe |

Root Cause Distribution

| Category | Count | IDs |
|---|---|---|
| Human Error | 5 | PM-001, PM-006, PM-016, PM-022, PM-026 |
| Configuration | 5 | PM-002, PM-007, PM-013, PM-017, PM-027 |
| Software Bug | 5 | PM-003, PM-008, PM-018, PM-024, PM-028 |
| Infrastructure | 5 | PM-004, PM-009, PM-019, PM-025, PM-029 |
| Process Gap | 5 | PM-010, PM-012, PM-020, PM-023, PM-030 |
| Design Flaw | 5 | PM-005, PM-011, PM-014, PM-015, PM-021 |

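
The two distribution tables are plain tallies over the index, so they can be recomputed whenever a postmortem is added. The following is a minimal sketch, not part of the anthology's tooling: the `INDEX` list is transcribed by hand from the index above, and the `tally` helper is a hypothetical name introduced for illustration.

```python
from collections import Counter

# (ID, severity, root cause category), transcribed by hand from the index.
INDEX = [
    ("PM-001", "SEV-1", "Human Error"),
    ("PM-002", "SEV-1", "Configuration"),
    ("PM-003", "SEV-1", "Software Bug"),
    ("PM-004", "SEV-1", "Infrastructure"),
    ("PM-005", "SEV-1", "Design Flaw"),
    ("PM-006", "SEV-2", "Human Error"),
    ("PM-007", "SEV-2", "Configuration"),
    ("PM-008", "SEV-2", "Software Bug"),
    ("PM-009", "SEV-2", "Infrastructure"),
    ("PM-010", "SEV-2", "Process Gap"),
    ("PM-011", "SEV-2", "Design Flaw"),
    ("PM-012", "SEV-2", "Process Gap"),
    ("PM-013", "SEV-2", "Configuration"),
    ("PM-014", "SEV-2", "Design Flaw"),
    ("PM-015", "SEV-2", "Design Flaw"),
    ("PM-016", "SEV-3", "Human Error"),
    ("PM-017", "SEV-3", "Configuration"),
    ("PM-018", "SEV-3", "Software Bug"),
    ("PM-019", "SEV-3", "Infrastructure"),
    ("PM-020", "SEV-3", "Process Gap"),
    ("PM-021", "SEV-3", "Design Flaw"),
    ("PM-022", "SEV-3", "Human Error"),
    ("PM-023", "SEV-3", "Process Gap"),
    ("PM-024", "SEV-3", "Software Bug"),
    ("PM-025", "SEV-3", "Infrastructure"),
    ("PM-026", "Near-Miss", "Human Error"),
    ("PM-027", "Near-Miss", "Configuration"),
    ("PM-028", "Near-Miss", "Software Bug"),
    ("PM-029", "Near-Miss", "Infrastructure"),
    ("PM-030", "Near-Miss", "Process Gap"),
]


def tally(index):
    """Return (severity counts, root-cause counts, IDs grouped by root cause)."""
    by_severity = Counter(severity for _, severity, _ in index)
    by_cause = Counter(cause for _, _, cause in index)
    ids_by_cause = {}
    for pm_id, _, cause in index:
        ids_by_cause.setdefault(cause, []).append(pm_id)
    return by_severity, by_cause, ids_by_cause


by_severity, by_cause, ids_by_cause = tally(INDEX)
for cause, ids in ids_by_cause.items():
    print(f"{cause}: {len(ids)} ({', '.join(ids)})")
```

Keeping the index as data and deriving the counts from it avoids the two tables drifting apart from the index as entries are added.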
How to Use
Learning postmortem writing: Read 5 postmortems across different severity levels, then write one for a case study from training/library/case-studies/.
Incident analysis practice: Read only the Timeline section and try to identify the root cause before reading the Root Cause section.
Action item quality: Compare action items across postmortems. Which are specific and measurable, and which are vague and aspirational?
Detection gap analysis: For each postmortem, ask: "What monitoring would have caught this sooner?"
Contributing factors exercise: Cover the "Contributing Factors" section, read the timeline, and list what you think made the incident worse.
Lucky breaks awareness: Read the "What We Got Lucky About" sections; these reveal latent risks still present in most systems.
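
For the incident analysis and contributing factors exercises, it can help to hide the answer sections mechanically instead of relying on willpower. The sketch below assumes each postmortem is a Markdown file whose sections start with `## ` headings named as above; that layout, the `split_sections` and `practice_view` helpers, and the `SAMPLE` text are all hypothetical and made up for illustration.

```python
import re

# Sections to hide during practice; names follow the sections mentioned above.
HIDDEN = {"Root Cause", "Contributing Factors"}


def split_sections(text):
    """Split Markdown text into {heading: body} on '## ' headings (assumed layout)."""
    sections = {}
    current = None
    for line in text.splitlines():
        match = re.match(r"^##\s+(.*\S)\s*$", line)
        if match:
            current = match.group(1)
            sections[current] = []
        elif current is not None:
            sections[current].append(line)
    return {name: "\n".join(body).strip() for name, body in sections.items()}


def practice_view(text, hidden=HIDDEN):
    """Return the postmortem with hidden sections replaced by a placeholder."""
    parts = []
    for name, body in split_sections(text).items():
        shown = "[hidden for practice]" if name in hidden else body
        parts.append(f"## {name}\n{shown}")
    return "\n\n".join(parts)


# Made-up sample text, not taken from any postmortem in the anthology.
SAMPLE = """## Timeline
14:02 Alert fires.
## Root Cause
Placeholder text.
"""

print(practice_view(SAMPLE))
```

Text that appears before the first `## ` heading is dropped by this sketch, and deeper headings such as `### ` stay inside their parent section's body.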
March 27, 2026 01:19:38
March 19, 2026 23:53:02