Ops War Stories & Pattern Recognition Footguns

Mistakes that turn investigations into wild goose chases, incidents into marathons, and hard-won experience into overconfidence.


1. Anchoring on the first theory

The error rate spiked. Someone deployed 30 minutes ago. You spend an hour dissecting the deploy. The deploy is fine. The actual cause was a dependency that started returning errors 25 minutes ago — 5 minutes before the deploy. But you were so locked onto "it's the deploy" that you never checked dependencies.

Fix: Write down three hypotheses before investigating any of them. Spend 2 minutes on a quick check of each before going deep on one. If your first theory doesn't pan out in 15 minutes, step back and reconsider the list.
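That triage loop can be sketched as a tiny harness: name the hypotheses up front and give each a timeboxed quick check before committing to any one of them. This is a sketch; "true" and "false" stand in for real checks, and the 120-second timeout mirrors the 2-minute budget above.

```shell
# Three-hypothesis triage: a cheap check per theory before going deep.
# The check commands are placeholders; swap in real ones per incident.
triage() {
  local name="$1" check="$2"
  # Timebox each quick check at ~2 minutes
  if timeout 120 sh -c "$check" >/dev/null 2>&1; then
    echo "PASS  $name (hypothesis weakened, move on)"
  else
    echo "FAIL  $name (promising, dig here)"
  fi
}

triage "recent deploy broke it"      "true"   # e.g. diff deploy SHA vs last good
triage "dependency returning errors" "false"  # e.g. curl the dependency's endpoint
triage "infra change (DNS, certs)"   "true"   # e.g. dig the name, check cert dates
```

Reviewing three PASS/FAIL lines takes seconds and keeps the whole hypothesis list in view, which is exactly what anchoring destroys.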


2. Trusting monitoring over user reports

Users report the site is slow. Monitoring shows green across the board. You tell users "everything looks fine on our end." Actually, monitoring is checking health endpoints that return a static 200. The real API is timing out for 30% of requests, but monitoring doesn't test real user flows.

Fix: When users say it's broken and monitoring says it's fine, assume the monitoring is wrong until you've proven otherwise. Validate user reports with real requests: curl -w "%{time_total}" https://api.example.com/real-endpoint. Then fix your monitoring to cover what users actually experience.
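One way to run that validation, sketched in shell: sample the real endpoint the way a user would and summarize the timings. The api.example.com URL is the article's placeholder, and the demo feeds summarize fixed numbers so you can see the shape of the output; the p95 index is a rough nearest-rank cut, not an interpolated percentile.

```shell
# Take N samples of total request time against a real user-facing route
measure() {
  for i in $(seq 1 20); do
    curl -s -o /dev/null -w '%{time_total}\n' "$1"
  done
}

# Summarize timings on stdin: count, max, rough p95 (nearest-rank)
summarize() {
  sort -n | awk '{t[NR]=$1} END {
    printf "samples=%d max=%.3f p95=%.3f\n", NR, t[NR], t[int(NR*0.95)]
  }'
}

# Real run (placeholder URL):
# measure https://api.example.com/real-endpoint | summarize
printf '0.12\n0.15\n2.90\n0.11\n' | summarize   # demo with fixed timings
```

A single slow sample is noise; a fat p95 across twenty samples is the users being right.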


3. Restarting instead of diagnosing

Process is consuming 8GB of memory. You restart it. Memory drops to 500MB. Two days later, it's at 8GB again. You restart it again. This becomes a weekly ritual. You've accepted toil instead of finding the memory leak.

Fix: A restart that fixes something temporarily is a clue, not a solution. Before restarting: capture a heap dump, thread dump, or at minimum the process state (/proc/PID/status, pmap PID). Open a ticket for the root cause. Track "restart as remediation" as toil.
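The capture step can be a one-function habit, sketched here for Linux: snapshot the cheap evidence into a directory, then restart with a clear conscience. The paths are standard procfs; smaps_rollup and pmap may be absent on older systems, which the sketch tolerates.

```shell
# Snapshot a (possibly leaking) process's state before restarting it.
snapshot() {
  local pid="$1" out="/tmp/leak-$pid-$(date +%s)"
  mkdir -p "$out"
  cp "/proc/$pid/status" "$out/status" 2>/dev/null        # VmRSS, thread count
  cat "/proc/$pid/smaps_rollup" > "$out/smaps" 2>/dev/null # memory breakdown
  ls -l "/proc/$pid/fd" > "$out/fds" 2>/dev/null           # fd leak check
  pmap "$pid" > "$out/pmap" 2>/dev/null || true            # mappings, if installed
  echo "$out"
}

dir=$(snapshot $$)   # demo on this shell's own PID
ls "$dir"
```

Two snapshots taken a day apart tell you what grew (heap, anonymous mappings, file descriptors), which usually points straight at the leak.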


4. Investigating the symptom, not the cause chain

Alert says: "CPU at 95%." You investigate CPU-intensive processes. But the CPU is high because the application is retrying failed database queries in a tight loop. The database queries are failing because the connection pool is exhausted. The pool is exhausted because a long-running query is holding all connections. You spent 30 minutes on CPU when the problem was a database query.

Fix: Follow the dependency chain. "CPU high" is a symptom. Ask: "Why is CPU high?" Then: "Why is that process doing so much work?" Keep asking until you find something you can fix. The first alert is almost never the root cause.
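Walking that chain is mostly three commands deep, sketched below. The psql query at the end assumes PostgreSQL and is a placeholder for whatever datastore sits at the bottom of your chain.

```shell
# 1. Why is CPU high? Find the top consumers.
ps -eo pid,pcpu,comm --sort=-pcpu | head -5

# 2. Why is that process doing so much work? Look at its syscalls:
#    strace -c -p <PID> -f   (run ~10s, Ctrl-C, read the summary table;
#    a retry loop shows up as thousands of connect/poll calls)

# 3. If it's hammering a datastore, ask the datastore who holds connections:
#    psql -c "SELECT pid, state, now() - query_start AS age, query
#             FROM pg_stat_activity ORDER BY age DESC LIMIT 5;"
```

Each step answers one "why" and hands you the next one; you stop when you reach something you can actually fix, like the long-running query holding the pool.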


5. "It was fine last time" — pattern-matching to the wrong incident

Two months ago, high latency was caused by DNS. Today, you see high latency and immediately check DNS. DNS is fine. You check it three more times because "it's always DNS." Meanwhile, the actual cause — a saturated network link — goes undiagnosed for an hour.

Fix: Past experience is a probability guide, not a certainty. Use it to order your investigation (check DNS first because it's often DNS), but don't let it become tunnel vision. If your top hypothesis is clean after 5 minutes, move on.
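Timeboxing the favorite hypothesis is easier when the next check is already queued up. A sketch: confirm the resolver answers, then sample network byte counters over a short window to see whether the link, not DNS, is the bottleneck (localhost stands in for your real service name; the sample interval is shortened for the demo).

```shell
# Quick DNS check: does the resolver answer at all?
getent hosts localhost >/dev/null && echo "DNS resolver answering"

# Next hypothesis: link saturation. Sample total RX/TX bytes, wait, resample.
net_bytes() { awk 'NR>2 {rx+=$2; tx+=$10} END {print rx, tx}' /proc/net/dev; }
before=$(net_bytes)
sleep 2
after=$(net_bytes)
echo "before: $before"
echo "after:  $after"
# Subtract the samples and divide by the interval; sustained throughput
# near the NIC's rated capacity means the link is the problem, not DNS.
```

If the DNS check is clean in the first pass, you have already started on hypothesis two instead of re-running hypothesis one a third time.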


6. Investigating solo for too long

You've been debugging for 90 minutes. You're deep in logs, switching between five terminals, and you're sure you're close. You're not close. You're anchored, fatigued, and missing obvious things. A fresh pair of eyes would have caught it in 10 minutes.

Fix: Hard rule: if you haven't identified root cause in 15 minutes, escalate or pull in a second person. This is not a failure — it's the fastest path to resolution. Two people debugging for 15 minutes beats one person debugging for 90.


7. Not checking the simple things first

Service is unreachable. You start tracing network paths, checking firewall rules, analyzing packet captures. Forty-five minutes in, someone asks "is the disk full?" Yes. The disk is full. The application couldn't write to its socket file and crashed. The disk was full because a log file grew to 50GB overnight.

Fix: Always check the boring things first: disk space, memory, process status, DNS, and time sync. These take 30 seconds and explain the majority of incidents. Exotic debugging tools are for exotic problems, and most problems aren't exotic.
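The boring checklist fits in a handful of lines, sketched here for a typical Linux host (timedatectl assumes systemd; example.com is a stand-in for a name your service actually resolves).

```shell
# The 30-second boring-things check: run before any exotic tooling.
df --output=target,pcent | sort -k2 -nr | head -5        # disk space
df -i | awk 'NR==1 || $5+0 > 80'                         # inode exhaustion
free -m | awk '/^Mem:/ {print "mem used:", $3 "M of", $2 "M"}'
uptime                                                   # load, and how long up
timedatectl 2>/dev/null | grep -i sync || true           # clock sync (systemd)
getent hosts example.com >/dev/null && echo "DNS ok" || echo "DNS broken"
```

Paste it into a shared runbook so the 45-minute packet-capture detour never starts before the 30-second pass finishes.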


8. Fixing the wrong thing because the timeline is off

You see an error in the logs at 14:05. You see a config change at 14:03. You blame the config change. But the error started at 13:55 — you just found the first occurrence you noticed, not the actual first occurrence. The config change was unrelated.

Fix: Establish the true timeline before assigning cause. Use precise timestamps. "When did the error rate actually start increasing?" Check monitoring graphs, not log grep. A graph shows the inflection point; log searches show the first line you happened to find.
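When you do have to fall back on logs, at least find the true first occurrence rather than the first line you happened to scroll past. A sketch, using an inline sample log shaped like the story above; point first_seen at your real log in practice.

```shell
# First occurrence of a pattern in a (time-ordered) log file.
first_seen() {  # usage: first_seen PATTERN FILE
  grep "$1" "$2" | head -1
}

cat > /tmp/demo.log <<'EOF'
2024-05-01T13:55:02Z ERROR upstream timeout
2024-05-01T14:03:10Z INFO  config reloaded
2024-05-01T14:05:41Z ERROR upstream timeout
EOF

first_seen ERROR /tmp/demo.log
```

Here the first ERROR predates the 14:03 config change, which exonerates the change in one command instead of one hour.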


9. Assuming the monitoring system is healthy

Fifty alerts fire simultaneously. You triage them as a massive outage. Actually, Prometheus crashed and restarted. On restart, it re-evaluated all alerting rules against stale data and fired alerts for conditions that already resolved. The "outage" is a monitoring system artifact.

Fix: During an alert storm, check monitoring system health first. Is Prometheus up? Is Alertmanager up? When did they last restart? Are the data sources current? If the monitoring system itself is unhealthy, its alerts are unreliable.
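Those health checks are one curl each, sketched below against the default Prometheus and Alertmanager ports on localhost (both expose a /-/healthy endpoint; adjust hosts and ports for your deployment).

```shell
# Check the monitors before trusting their alerts.
check() {
  if curl -sf --max-time 2 "$2" >/dev/null; then
    echo "$1: up"
  else
    echo "$1: DOWN or unreachable"
  fi
}

check prometheus   http://localhost:9090/-/healthy
check alertmanager http://localhost:9093/-/healthy

# When did Prometheus last (re)start? A recent restart explains a storm of
# re-fired alerts:
# curl -s 'http://localhost:9090/api/v1/query?query=process_start_time_seconds'
```

Thirty seconds here can collapse "fifty simultaneous outages" into "one monitoring restart."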


10. Never writing it down

You diagnose a tricky issue involving filesystem inode exhaustion combined with a cron job that created millions of tiny temp files. Brilliant debugging. You fix it. You move on. Six months later, the same thing happens. You vaguely remember this but can't remember the solution. You spend another hour rediscovering it.

Fix: After every non-trivial diagnosis, write a one-paragraph note: symptom, cause, fix. Put it in a runbook, a wiki, or even a text file. Your future self (and your team) will thank you. The knowledge that stays in your head leaves when you do.
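Lowering the friction helps the habit stick. A sketch of the smallest possible version: a function that appends a dated symptom/cause/fix entry to a flat-file runbook (the /tmp path is a placeholder; a wiki page works just as well).

```shell
# Append a one-paragraph incident note to a flat-file runbook.
note() {  # usage: note SYMPTOM CAUSE FIX
  {
    echo "## $(date -u +%Y-%m-%d) $1"
    echo "Cause: $2"
    echo "Fix: $3"
    echo
  } >> "${RUNBOOK:-/tmp/runbook.md}"
}

RUNBOOK=/tmp/runbook.md
note "host unreachable, disk not full" \
     "inode exhaustion: cron job created millions of tiny temp files" \
     "removed temp files; added a find -mtime cleanup step to the cron job"
tail -4 "$RUNBOOK"
```

Six months later, grep for the symptom and the hour of rediscovery becomes thirty seconds of reading.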