War Stories Collection

First-person accounts of production incidents, migrations, mysteries, close calls, and hard lessons. Each story teaches something you can only learn from experience — or from someone else's experience.

#	Title	Category	Domains	Key Lesson
1	The 3 AM Cert Expiry	The Incident	tls, monitoring	Cert monitoring must check the actual endpoint users hit
2	The Deploy That Ate Prod	The Incident	kubernetes, ci-cd	Diff configs before deploy — memory limits are not optional
3	DNS: The Eternal Enemy	The Incident	dns, networking	Lower TTLs before a migration and verify from outside your network
4	The Database That Wasn't Backing Up	The Incident	database-ops, backup-restore	Backup is not a backup until you've restored from it
5	The Load Balancer Lie	The Incident	load-balancing, monitoring	Health checks should test real functionality
6	When the Queue Backed Up	The Incident	message-queues, capacity-planning	Queue depth monitoring with alerts is non-negotiable
7	The Permissions Avalanche	The Incident	iam, security	Test IAM changes in staging and have break-glass procedures
8	The Log That Filled the Disk	The Incident	logging, linux-ops	Log rotation is not optional; put logs on a separate partition
9	The Clock Skew Catastrophe	The Incident	linux-ops, distributed-systems	NTP is critical infrastructure — monitor clock skew
10	The Firewall Rule That Blocked Itself	The Incident	firewalls, networking	Never test firewall rules on your management interface first
11	The Rollback That Wasn't	The Incident	ci-cd, database-ops	Migrations must be backward-compatible — test rollbacks
12	The Zombie Cron Job	The Incident	cron, linux-ops	Audit your cron jobs — every one needs an owner
13	The Memory Leak Marathon	The Incident	linux-performance, containers	Long-running soak tests catch what staging won't
14	The Split-Brain Nightmare	The Incident	distributed-systems, database-ops	Understand your consensus protocol before you need it
15	The Cascading Timeout	The Incident	microservices, networking	Circuit breakers are mandatory in microservice architectures
16	The Kubernetes Migration That Took a Year	The Migration	kubernetes, containers	Stateful workloads triple your timeline
17	From Monolith to Misery	The Migration	microservices, architecture	Start with 3-5 services, not 40
18	The Cloud Bill Surprise	The Migration	cloud-ops, finops	Model costs before migrating — egress charges are real
19	The Database Migration Weekend	The Migration	database-ops, postgresql	Do a full trial migration first — data types don't map cleanly
20	The CI/CD Pipeline Rewrite	The Migration	ci-cd, jenkins	Standardize before you migrate — don't port bad patterns
21	The Datacenter Exit	The Migration	datacenter, cloud-ops	Unknown dependencies are the real risk
22	The Terraform State Disaster	The Migration	terraform, infrastructure-as-code	Import one resource at a time, lock your state file
23	The Observability Migration	The Migration	monitoring, observability	Migrate alerts gradually by service
24	The DNS Provider Switch	The Migration	dns, networking	Audit ALL zones before migrating
25	The Auth System Swap	The Migration	security, identity	Plan the token transition — support both formats
26	It Was Always DNS	The Mystery	dns, networking	Check DNS first — it's always DNS
27	The Phantom Latency Spike	The Mystery	linux-performance, networking	Correlate with time patterns — shared resources cause shared pain
28	The Container That Worked on My Machine	The Mystery	containers, linux-ops	Security profiles differ between environments
29	The Metrics That Lied	The Mystery	monitoring, observability	Averages hide reality — use percentiles
30	The Leap Second Incident	The Mystery	linux-ops, distributed-systems	Keep systems patched — leap seconds are real edge cases
31	The Case of the Missing Packets	The Mystery	networking, firewalls	Monitor conntrack usage — kernel defaults aren't production-ready
32	The Git Deploy That Deployed Nothing	The Mystery	ci-cd, git	Shallow clones can surprise you — verify what you deployed
33	The SSL Handshake Timeout	The Mystery	tls, networking	MTU mismatches cause bizarre symptoms
34	The Intern and the DROP TABLE	The Close Call	database-ops, security	Principle of least privilege — production access controls save lives
35	One Character from Disaster	The Close Call	ansible, infrastructure-as-code	Code review for infrastructure — limit blast radius
36	The Monitoring Save	The Close Call	monitoring, disk-and-storage	Proactive monitoring pays for itself
37	The Terraform Plan That Would Have Destroyed Prod	The Close Call	terraform, cloud-ops	Always read the plan — never auto-apply
38	The Secrets in the Repo	The Close Call	security, git	Defense in depth — don't rely on a single control
39	The Network Change Window	The Close Call	networking, change-management	Change review processes exist for a reason
40	The Autoscaler That Almost Bankrupted Us	The Close Call	kubernetes, finops	Always set max replicas — cost monitoring matters
41	The Test We Never Wrote	The Hard Lesson	ci-cd, testing	Integration tests catch what unit tests can't
42	The Documentation That Didn't Exist	The Hard Lesson	runbooks, incident-command	Documentation is an investment, not overhead
43	The Technical Debt Interest Payment	The Hard Lesson	architecture, sre-practices	Tech debt compounds like financial debt
44	The Monitoring We Ignored	The Hard Lesson	monitoring, alerting	Alert fatigue kills — fix or remove noisy alerts
45	The Single Point of Failure	The Hard Lesson	architecture, high-availability	HA is insurance, not a luxury
46	The Config Management Lie	The Hard Lesson	ansible, configuration-management	Config drift is a silent killer — enforce immutability
47	The Postmortem Nobody Read	The Hard Lesson	incident-command, sre-practices	Postmortem actions need owners and deadlines
48	The Cost of No Staging	The Hard Lesson	environments, ci-cd	Staging environments aren't a luxury; they're insurance
49	The Secret Rotation We Postponed	The Hard Lesson	secrets-management, security	Rotate secrets regularly — hardcoded secrets are a time bomb
50	The Backup We Never Tested	The Hard Lesson	backup-restore, disaster-recovery	Backup success != restore success
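
As a small illustration of one lesson from the table (story 29, "Averages hide reality"), the sketch below compares the mean of a latency sample with its 50th and 99th percentiles. The sample data and the `summarize` helper are hypothetical and are not taken from the story itself; they only show why a mean can look acceptable while a tail percentile does not.

```python
import statistics


def summarize(latencies_ms):
    """Return the mean, p50, and p99 of a list of latency samples (ms)."""
    # quantiles(n=100) returns 99 cut points: index 49 is p50, index 98 is p99.
    cuts = statistics.quantiles(latencies_ms, n=100, method="inclusive")
    return {
        "mean": statistics.fmean(latencies_ms),
        "p50": cuts[49],
        "p99": cuts[98],
    }


# Hypothetical sample: 98 fast requests and 2 very slow ones.
samples = [20.0] * 98 + [5000.0] * 2
stats = summarize(samples)
print(stats)
```

With this made-up sample the mean sits between the median and the p99, so it describes neither the typical request nor the slow ones. Alerting on a tail percentile is one way to surface the slow requests.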

How to Use

Interview prep: Read 5 stories before an interview — you'll have better "tell me about a time" answers
On-call prep: Read the Incident and Close Call stories before your first on-call rotation
Team discussion: Pick a story for a team brown-bag and discuss what you'd have done differently
Pattern recognition: After reading several stories, you'll start recognizing the warning signs before they become incidents

Categories

The Incident (15 stories)

Something broke in production. The 3 AM page, the wrong assumption, the fix, the aftermath.

The Migration (10 stories)

Moving from one system to another. The plan that didn't survive contact with reality.

The Mystery (8 stories)

Baffling behavior that took days to diagnose. Misleading evidence and lucky breaks.

The Close Call (7 stories)

The Hard Lesson (10 stories)
