War Stories Collection

First-person accounts of production incidents, migrations, mysteries, close calls, and hard lessons. Each story teaches something you can only learn from experience — or from someone else's experience.

| # | Title | Category | Domains | Key Lesson |
|---|-------|----------|---------|------------|
| 1 | The 3 AM Cert Expiry | The Incident | tls, monitoring | Cert monitoring must check the actual endpoint users hit |
| 2 | The Deploy That Ate Prod | The Incident | kubernetes, ci-cd | Diff configs before deploy; memory limits are not optional |
| 3 | DNS: The Eternal Enemy | The Incident | dns, networking | Lower TTL before migration, verify from outside your network |
| 4 | The Database That Wasn't Backing Up | The Incident | database-ops, backup-restore | A backup is not a backup until you've restored from it |
| 5 | The Load Balancer Lie | The Incident | load-balancing, monitoring | Health checks should test real functionality |
| 6 | When the Queue Backed Up | The Incident | message-queues, capacity-planning | Queue depth monitoring with alerts is non-negotiable |
| 7 | The Permissions Avalanche | The Incident | iam, security | Test IAM changes in staging, have break-glass procedures |
| 8 | The Log That Filled the Disk | The Incident | logging, linux-ops | Log rotation is not optional; use separate log partitions |
| 9 | The Clock Skew Catastrophe | The Incident | linux-ops, distributed-systems | NTP is critical infrastructure; monitor clock skew |
| 10 | The Firewall Rule That Blocked Itself | The Incident | firewalls, networking | Never test firewall rules on your management interface first |
| 11 | The Rollback That Wasn't | The Incident | ci-cd, database-ops | Migrations must be backward-compatible; test rollbacks |
| 12 | The Zombie Cron Job | The Incident | cron, linux-ops | Audit your cron jobs; every one needs an owner |
| 13 | The Memory Leak Marathon | The Incident | linux-performance, containers | Long-running soak tests catch what staging won't |
| 14 | The Split-Brain Nightmare | The Incident | distributed-systems, database-ops | Understand your consensus protocol before you need it |
| 15 | The Cascading Timeout | The Incident | microservices, networking | Circuit breakers are mandatory in microservice architectures |
| 16 | The Kubernetes Migration That Took a Year | The Migration | kubernetes, containers | Stateful workloads triple your timeline |
| 17 | From Monolith to Misery | The Migration | microservices, architecture | Start with 3-5 services, not 40 |
| 18 | The Cloud Bill Surprise | The Migration | cloud-ops, finops | Model costs before migrating; egress charges are real |
| 19 | The Database Migration Weekend | The Migration | database-ops, postgresql | Do a full trial migration first; data types don't map cleanly |
| 20 | The CI/CD Pipeline Rewrite | The Migration | ci-cd, jenkins | Standardize before you migrate; don't port bad patterns |
| 21 | The Datacenter Exit | The Migration | datacenter, cloud-ops | Unknown dependencies are the real risk |
| 22 | The Terraform State Disaster | The Migration | terraform, infrastructure-as-code | Import one resource at a time, lock your state file |
| 23 | The Observability Migration | The Migration | monitoring, observability | Migrate alerts gradually by service |
| 24 | The DNS Provider Switch | The Migration | dns, networking | Audit ALL zones before migrating |
| 25 | The Auth System Swap | The Migration | security, identity | Plan the token transition; support both formats |
| 26 | It Was Always DNS | The Mystery | dns, networking | Check DNS first; it's always DNS |
| 27 | The Phantom Latency Spike | The Mystery | linux-performance, networking | Correlate with time patterns; shared resources cause shared pain |
| 28 | The Container That Worked on My Machine | The Mystery | containers, linux-ops | Security profiles differ between environments |
| 29 | The Metrics That Lied | The Mystery | monitoring, observability | Averages hide reality; use percentiles |
| 30 | The Leap Second Incident | The Mystery | linux-ops, distributed-systems | Keep systems patched; leap seconds are real edge cases |
| 31 | The Case of the Missing Packets | The Mystery | networking, firewalls | Monitor conntrack usage; kernel defaults aren't production-ready |
| 32 | The Git Deploy That Deployed Nothing | The Mystery | ci-cd, git | Shallow clones can surprise you; verify what you deployed |
| 33 | The SSL Handshake Timeout | The Mystery | tls, networking | MTU mismatches cause bizarre symptoms |
| 34 | The Intern and the DROP TABLE | The Close Call | database-ops, security | Principle of least privilege: production access controls save lives |
| 35 | One Character from Disaster | The Close Call | ansible, infrastructure-as-code | Code review for infrastructure; limit blast radius |
| 36 | The Monitoring Save | The Close Call | monitoring, disk-and-storage | Proactive monitoring pays for itself |
| 37 | The Terraform Plan That Would Have Destroyed Prod | The Close Call | terraform, cloud-ops | Always read the plan; never auto-apply |
| 38 | The Secrets in the Repo | The Close Call | security, git | Defense in depth: don't rely on a single control |
| 39 | The Network Change Window | The Close Call | networking, change-management | Change review processes exist for a reason |
| 40 | The Autoscaler That Almost Bankrupted Us | The Close Call | kubernetes, finops | Always set max replicas; cost monitoring matters |
| 41 | The Test We Never Wrote | The Hard Lesson | ci-cd, testing | Integration tests catch what unit tests can't |
| 42 | The Documentation That Didn't Exist | The Hard Lesson | runbooks, incident-command | Documentation is an investment, not overhead |
| 43 | The Technical Debt Interest Payment | The Hard Lesson | architecture, sre-practices | Tech debt compounds like financial debt |
| 44 | The Monitoring We Ignored | The Hard Lesson | monitoring, alerting | Alert fatigue kills; fix or remove noisy alerts |
| 45 | The Single Point of Failure | The Hard Lesson | architecture, high-availability | HA is insurance, not a luxury |
| 46 | The Config Management Lie | The Hard Lesson | ansible, configuration-management | Config drift is a silent killer; enforce immutability |
| 47 | The Postmortem Nobody Read | The Hard Lesson | incident-command, sre-practices | Postmortem actions need owners and deadlines |
| 48 | The Cost of No Staging | The Hard Lesson | environments, ci-cd | Staging environments aren't a luxury; they're insurance |
| 49 | The Secret Rotation We Postponed | The Hard Lesson | secrets-management, security | Rotate secrets regularly; hardcoded secrets are a time bomb |
| 50 | The Backup We Never Tested | The Hard Lesson | backup-restore, disaster-recovery | Backup success != restore success |

How to Use

  • Interview prep: Read 5 stories before an interview — you'll have better "tell me about a time" answers
  • On-call prep: Read the Incident and Close Call stories before your first on-call rotation
  • Team discussion: Pick a story for a team brown-bag and discuss what you'd have done differently
  • Pattern recognition: After reading several stories, you'll start recognizing the warning signs before they become incidents

Categories

The Incident (15 stories)

Something broke in production. The 3 AM page, the wrong assumption, the fix, the aftermath.

The Migration (10 stories)

Moving from one system to another. The plan that didn't survive contact with reality.

The Mystery (8 stories)

Baffling behavior that took days to diagnose. Misleading evidence and lucky breaks.

The Close Call (7 stories)

Disasters narrowly avoided. The safety net that held.

The Hard Lesson (10 stories)

Learning the hard way. Shortcuts that compounded, debt that came due.