Postmortem Anthology

A collection of realistic postmortem documents, written as they would appear in a company's internal wiki. Use them to learn postmortem writing, practice incident analysis, and understand how different types of failures are documented.

| ID | Title | Severity | Duration | Root Cause Category | Domains |
|---|---|---|---|---|---|
| PM-001 | Production Database Deleted by Terraform Apply on Wrong Workspace | SEV-1 | 4h 12m | Human Error | k8s, terraform, cloud |
| PM-002 | Expired Wildcard TLS Certificate Causes Full API Gateway Outage | SEV-1 | 2h 37m | Configuration | k8s, security, networking |
| PM-003 | Race Condition in Distributed Lock Manager Corrupts Shared State | SEV-1 | 6h 15m | Software Bug | k8s, storage |
| PM-004 | Core Switch Firmware Bug Causes Cascading Network Partition | SEV-1 | 3h 42m | Infrastructure | datacenter, networking |
| PM-005 | Unbounded Retry Storm Takes Down Payment Processing | SEV-1 | 1h 54m | Design Flaw | k8s, networking |
| PM-006 | Debug Build Deployed to Production via Copy-Paste Error | SEV-2 | 47m | Human Error | ci-cd |
| PM-007 | Helm Values Mismatch Routes Staging Traffic to Production DB | SEV-2 | 38m | Configuration | k8s, networking |
| PM-008 | Memory Leak in Log Shipping Agent Causes Fleet-Wide OOM Kills | SEV-2 | 52m | Software Bug | linux, k8s |
| PM-009 | AWS AZ Network Degradation Triggers Cascading Health Check Failures | SEV-2 | 1h 15m | Infrastructure | cloud, networking |
| PM-010 | Missing Runbook Extends CrashLoopBackOff Recovery by 45 Minutes | SEV-2 | 58m | Process Gap | k8s |
| PM-011 | Missing Circuit Breaker Lets Redis Failure Cascade to All Services | SEV-2 | 33m | Design Flaw | k8s, observability |
| PM-012 | No Review Gate on Terraform Destroy Leads to Wrong Account Teardown | SEV-2 | 2h 8m | Process Gap | terraform, cloud |
| PM-013 | BGP Route Leak Sends Customer Traffic Through Monitoring VLAN | SEV-2 | 28m | Configuration | networking |
| PM-014 | Unbounded Kafka Topic Exhausts Broker Disk | SEV-2 | 41m | Design Flaw | messaging, storage |
| PM-015 | Custom Controller Missing Backoff Overwhelms API Server | SEV-2 | 22m | Design Flaw | k8s |
| PM-016 | Ansible Playbook Targets Production Instead of Staging | SEV-3 | 15m | Human Error | ansible |
| PM-017 | Resource Quota Misconfiguration Blocks All Deployments | SEV-3 | 2h 10m | Configuration | k8s |
| PM-018 | Prometheus Cardinality Explosion from Debug Labels | SEV-3 | 1h 30m | Software Bug | k8s, observability |
| PM-019 | SSD Firmware Bug Causes Silent Bit Corruption | SEV-3 | 3h 0m | Infrastructure | storage, datacenter |
| PM-020 | Stale Docker Base Image Ships Known CVE to Production | SEV-3 | 14d exposure | Process Gap | ci-cd, security |
| PM-021 | Single etcd Member Disk Full Degrades Control Plane | SEV-3 | 45m | Design Flaw | k8s |
| PM-022 | DNS CNAME Chain Breaks After Load Balancer Rename | SEV-3 | 25m | Human Error | dns, k8s |
| PM-023 | On-Call Handoff Gap Leaves Alerts Unacknowledged for 3 Hours | SEV-3 | 3h gap | Process Gap | ops |
| PM-024 | Kernel TCP Regression After Security Patch | SEV-3 | 1h 20m | Software Bug | linux |
| PM-025 | UPS Battery Degradation Causes Rack Power Loss During Utility Blip | SEV-3 | 35m | Infrastructure | datacenter |
| PM-026 | AWS Credentials Committed to Public Repo — Caught by Pre-Commit Hook | Near-Miss | 0m | Human Error | ci-cd, security |
| PM-027 | Wildcard Ingress Rule Nearly Exposes Internal Admin Panel | Near-Miss | 0m | Configuration | k8s, security |
| PM-028 | Go Dependency Update Silently Changes Default Timeout — Caught in Canary | Near-Miss | 0m | Software Bug | k8s |
| PM-029 | S3 Bucket Policy Change Nearly Deletes All Backup Archives | Near-Miss | 0m | Infrastructure | cloud, storage |
| PM-030 | Alert Routing Sends All Pages to Decommissioned Channel | Near-Miss | 0m | Process Gap | observability, ops |

Severity Distribution

| Severity | Count | Description |
|---|---|---|
| SEV-1 | 5 | Customer-facing outage, revenue impact, >1 hour |
| SEV-2 | 10 | Partial degradation, customer notices, typically <1 hour (PM-009 and PM-012 ran longer) |
| SEV-3 | 10 | Internal impact, caught before customers noticed |
| Near-Miss | 5 | No customer impact, but would have been severe |

Root Cause Distribution

| Category | Count | IDs |
|---|---|---|
| Human Error | 5 | PM-001, PM-006, PM-016, PM-022, PM-026 |
| Configuration | 5 | PM-002, PM-007, PM-013, PM-017, PM-027 |
| Software Bug | 5 | PM-003, PM-008, PM-018, PM-024, PM-028 |
| Infrastructure | 5 | PM-004, PM-009, PM-019, PM-025, PM-029 |
| Process Gap | 5 | PM-010, PM-012, PM-020, PM-023, PM-030 |
| Design Flaw | 5 | PM-005, PM-011, PM-014, PM-015, PM-021 |
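
Both distribution tables can be recomputed from the index, which is a handy check when a postmortem is added or recategorized. The following is a minimal sketch, not tooling that ships with the anthology: the `INDEX` list is transcribed by hand from the index table above, and the names `INDEX` and `tally` are hypothetical.

```python
from collections import Counter

# (id, severity, root-cause category), transcribed from the index table.
# INDEX and tally() are illustrative names, not part of the anthology.
INDEX = [
    ("PM-001", "SEV-1", "Human Error"),
    ("PM-002", "SEV-1", "Configuration"),
    ("PM-003", "SEV-1", "Software Bug"),
    ("PM-004", "SEV-1", "Infrastructure"),
    ("PM-005", "SEV-1", "Design Flaw"),
    ("PM-006", "SEV-2", "Human Error"),
    ("PM-007", "SEV-2", "Configuration"),
    ("PM-008", "SEV-2", "Software Bug"),
    ("PM-009", "SEV-2", "Infrastructure"),
    ("PM-010", "SEV-2", "Process Gap"),
    ("PM-011", "SEV-2", "Design Flaw"),
    ("PM-012", "SEV-2", "Process Gap"),
    ("PM-013", "SEV-2", "Configuration"),
    ("PM-014", "SEV-2", "Design Flaw"),
    ("PM-015", "SEV-2", "Design Flaw"),
    ("PM-016", "SEV-3", "Human Error"),
    ("PM-017", "SEV-3", "Configuration"),
    ("PM-018", "SEV-3", "Software Bug"),
    ("PM-019", "SEV-3", "Infrastructure"),
    ("PM-020", "SEV-3", "Process Gap"),
    ("PM-021", "SEV-3", "Design Flaw"),
    ("PM-022", "SEV-3", "Human Error"),
    ("PM-023", "SEV-3", "Process Gap"),
    ("PM-024", "SEV-3", "Software Bug"),
    ("PM-025", "SEV-3", "Infrastructure"),
    ("PM-026", "Near-Miss", "Human Error"),
    ("PM-027", "Near-Miss", "Configuration"),
    ("PM-028", "Near-Miss", "Software Bug"),
    ("PM-029", "Near-Miss", "Infrastructure"),
    ("PM-030", "Near-Miss", "Process Gap"),
]


def tally(rows):
    """Return (count per severity, list of IDs per root-cause category)."""
    severity_counts = Counter(severity for _, severity, _ in rows)
    ids_by_category = {}
    for pm_id, _, category in rows:
        ids_by_category.setdefault(category, []).append(pm_id)
    return severity_counts, ids_by_category


severity_counts, ids_by_category = tally(INDEX)

for severity, count in severity_counts.items():
    print(severity, count)
for category, ids in ids_by_category.items():
    print(category, len(ids), ", ".join(ids))
```

If the printed counts ever disagree with the two tables, the index and the summaries have drifted apart and one of them needs updating.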

How to Use

  • Learning postmortem writing: Read 5 postmortems across different severity levels, then write one for a case study from training/library/case-studies/
  • Incident analysis practice: Read only the Timeline section, then try to identify the root cause before reading the Root Cause section
  • Action item quality: Compare action items across postmortems and ask which are specific and measurable, and which are vague and aspirational
  • Detection gap analysis: For each postmortem, ask: "What monitoring would have caught this sooner?"
  • Contributing factors exercise: Cover the "Contributing Factors" section, read the timeline, and list what you think made the incident worse
  • Lucky breaks awareness: Read the "What We Got Lucky About" sections; they reveal latent risks that are still present in most systems
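
For the first exercise, one way to choose five postmortems across severity levels is to take one at random from each level and fill the remaining slot from the rest. This is only a sketch under assumptions: the ID ranges per severity are read off the index table, "across different severity levels" is interpreted here as covering every level at least once, and `SEVERITY_RANGES` and `pick_reading_list` are hypothetical names.

```python
import random

# ID ranges per severity, read off the index table (PM-001..PM-030).
# SEVERITY_RANGES and pick_reading_list() are illustrative names.
SEVERITY_RANGES = {
    "SEV-1": range(1, 6),
    "SEV-2": range(6, 16),
    "SEV-3": range(16, 26),
    "Near-Miss": range(26, 31),
}


def pick_reading_list(total=5, seed=None):
    """Pick `total` postmortems, covering every severity level at least once."""
    if total < len(SEVERITY_RANGES):
        raise ValueError("total must be at least the number of severity levels")
    rng = random.Random(seed)

    # First pass: one postmortem from each severity level.
    picks = {}
    for severity, numbers in SEVERITY_RANGES.items():
        picks[rng.choice(list(numbers))] = severity

    # Second pass: fill the remaining slots from everything not yet picked.
    remaining = [
        (number, severity)
        for severity, numbers in SEVERITY_RANGES.items()
        for number in numbers
        if number not in picks
    ]
    for number, severity in rng.sample(remaining, total - len(picks)):
        picks[number] = severity

    return [(f"PM-{number:03d}", picks[number]) for number in sorted(picks)]


for pm_id, severity in pick_reading_list(seed=42):
    print(pm_id, severity)
```

Passing a fixed `seed` makes the selection reproducible, which helps when a study group wants everyone to read the same set.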