War Stories Collection
First-person accounts of production incidents, migrations, mysteries, close calls, and hard lessons. Each story teaches something you can only learn from experience — or from someone else's experience.
| # | Title | Category | Domains | Key Lesson |
|---|---|---|---|---|
| 1 | The 3 AM Cert Expiry | The Incident | tls, monitoring | Cert monitoring must check the actual endpoint users hit |
| 2 | The Deploy That Ate Prod | The Incident | kubernetes, ci-cd | Diff configs before deploy — memory limits are not optional |
| 3 | DNS: The Eternal Enemy | The Incident | dns, networking | Lower TTL before migration, verify from outside your network |
| 4 | The Database That Wasn't Backing Up | The Incident | database-ops, backup-restore | Backup is not a backup until you've restored from it |
| 5 | The Load Balancer Lie | The Incident | load-balancing, monitoring | Health checks should test real functionality |
| 6 | When the Queue Backed Up | The Incident | message-queues, capacity-planning | Queue depth monitoring with alerts is non-negotiable |
| 7 | The Permissions Avalanche | The Incident | iam, security | Test IAM changes in staging, have break-glass procedures |
| 8 | The Log That Filled the Disk | The Incident | logging, linux-ops | Log rotation is not optional; keep logs on a separate partition |
| 9 | The Clock Skew Catastrophe | The Incident | linux-ops, distributed-systems | NTP is critical infrastructure — monitor clock skew |
| 10 | The Firewall Rule That Blocked Itself | The Incident | firewalls, networking | Never test firewall rules on your management interface first |
| 11 | The Rollback That Wasn't | The Incident | ci-cd, database-ops | Migrations must be backward-compatible — test rollbacks |
| 12 | The Zombie Cron Job | The Incident | cron, linux-ops | Audit your cron jobs — every one needs an owner |
| 13 | The Memory Leak Marathon | The Incident | linux-performance, containers | Long-running soak tests catch what staging won't |
| 14 | The Split-Brain Nightmare | The Incident | distributed-systems, database-ops | Understand your consensus protocol before you need it |
| 15 | The Cascading Timeout | The Incident | microservices, networking | Circuit breakers are mandatory in microservice architectures |
| 16 | The Kubernetes Migration That Took a Year | The Migration | kubernetes, containers | Stateful workloads triple your timeline |
| 17 | From Monolith to Misery | The Migration | microservices, architecture | Start with 3-5 services, not 40 |
| 18 | The Cloud Bill Surprise | The Migration | cloud-ops, finops | Model costs before migrating — egress charges are real |
| 19 | The Database Migration Weekend | The Migration | database-ops, postgresql | Do a full trial migration first — data types don't map cleanly |
| 20 | The CI/CD Pipeline Rewrite | The Migration | ci-cd, jenkins | Standardize before you migrate — don't port bad patterns |
| 21 | The Datacenter Exit | The Migration | datacenter, cloud-ops | Unknown dependencies are the real risk |
| 22 | The Terraform State Disaster | The Migration | terraform, infrastructure-as-code | Import one resource at a time, lock your state file |
| 23 | The Observability Migration | The Migration | monitoring, observability | Migrate alerts gradually by service |
| 24 | The DNS Provider Switch | The Migration | dns, networking | Audit ALL zones before migrating |
| 25 | The Auth System Swap | The Migration | security, identity | Plan the token transition — support both formats |
| 26 | It Was Always DNS | The Mystery | dns, networking | Check DNS first — it's always DNS |
| 27 | The Phantom Latency Spike | The Mystery | linux-performance, networking | Correlate with time patterns — shared resources cause shared pain |
| 28 | The Container That Worked on My Machine | The Mystery | containers, linux-ops | Security profiles differ between environments |
| 29 | The Metrics That Lied | The Mystery | monitoring, observability | Averages hide reality — use percentiles |
| 30 | The Leap Second Incident | The Mystery | linux-ops, distributed-systems | Keep systems patched — leap seconds are real edge cases |
| 31 | The Case of the Missing Packets | The Mystery | networking, firewalls | Monitor conntrack usage — kernel defaults aren't production-ready |
| 32 | The Git Deploy That Deployed Nothing | The Mystery | ci-cd, git | Shallow clones can surprise you — verify what you deployed |
| 33 | The SSL Handshake Timeout | The Mystery | tls, networking | MTU mismatches cause bizarre symptoms |
| 34 | The Intern and the DROP TABLE | The Close Call | database-ops, security | Principle of least privilege — production access controls save lives |
| 35 | One Character from Disaster | The Close Call | ansible, infrastructure-as-code | Code review for infrastructure — limit blast radius |
| 36 | The Monitoring Save | The Close Call | monitoring, disk-and-storage | Proactive monitoring pays for itself |
| 37 | The Terraform Plan That Would Have Destroyed Prod | The Close Call | terraform, cloud-ops | Always read the plan — never auto-apply |
| 38 | The Secrets in the Repo | The Close Call | security, git | Defense in depth — don't rely on a single control |
| 39 | The Network Change Window | The Close Call | networking, change-management | Change review processes exist for a reason |
| 40 | The Autoscaler That Almost Bankrupted Us | The Close Call | kubernetes, finops | Always set max replicas — cost monitoring matters |
| 41 | The Test We Never Wrote | The Hard Lesson | ci-cd, testing | Integration tests catch what unit tests can't |
| 42 | The Documentation That Didn't Exist | The Hard Lesson | runbooks, incident-command | Documentation is an investment, not overhead |
| 43 | The Technical Debt Interest Payment | The Hard Lesson | architecture, sre-practices | Tech debt compounds like financial debt |
| 44 | The Monitoring We Ignored | The Hard Lesson | monitoring, alerting | Alert fatigue kills — fix or remove noisy alerts |
| 45 | The Single Point of Failure | The Hard Lesson | architecture, high-availability | HA is insurance, not a luxury |
| 46 | The Config Management Lie | The Hard Lesson | ansible, configuration-management | Config drift is a silent killer — enforce immutability |
| 47 | The Postmortem Nobody Read | The Hard Lesson | incident-command, sre-practices | Postmortem actions need owners and deadlines |
| 48 | The Cost of No Staging | The Hard Lesson | environments, ci-cd | Staging environments aren't a luxury; they're insurance |
| 49 | The Secret Rotation We Postponed | The Hard Lesson | secrets-management, security | Rotate secrets regularly — hardcoded secrets are a time bomb |
| 50 | The Backup We Never Tested | The Hard Lesson | backup-restore, disaster-recovery | Backup success != restore success |
How to Use
- Interview prep: Read 5 stories before an interview — you'll have better "tell me about a time" answers
- On-call prep: Read the Incident and Close Call stories before your first on-call rotation
- Team discussion: Pick a story for a team brown-bag and discuss what you'd have done differently
- Pattern recognition: After reading several stories, you'll start recognizing the warning signs before they become incidents
Categories
The Incident (15 stories)
Something broke in production. The 3 AM page, the wrong assumption, the fix, the aftermath.
The Migration (10 stories)
Moving from one system to another. The plan that didn't survive contact with reality.
The Mystery (8 stories)
Baffling behavior that took days to diagnose. Misleading evidence and lucky breaks.
The Close Call (7 stories)
Disasters narrowly avoided. The safety net that held.
The Hard Lesson (10 stories)
Learning the hard way. Shortcuts that compounded, debt that came due.