Portal | Level: L2: Operations | Topics: Ops War Stories, Incident Response | Domain: DevOps & Tooling
Ops War Stories & Pattern Recognition - Primer¶
Why This Matters¶
After twenty years in infrastructure, you learn that incidents are not random. They follow patterns. The same failure modes repeat across different systems, different companies, and different decades. A senior engineer isn't someone who's seen everything — they're someone who recognizes the pattern faster because they've been burned by a variant of it before.
This pack is not a technical reference. It's a pattern library — a collection of symptom-to-diagnosis heuristics that transfer 20 years of firefighting intuition into teachable frameworks. The goal is to make you faster at diagnosis, not by memorizing answers, but by recognizing shapes.
Core Principles¶
1. The Differential Diagnosis Model¶
Doctors don't guess a disease and then check if they're right. They generate a list of possible causes ranked by probability, then systematically rule them out. Infrastructure debugging should work the same way.
The Diagnostic Sequence:
1. Observe symptoms (what is the user/system experiencing?)
2. Generate hypotheses (what could cause this?)
3. Rank by probability (what's most likely given the evidence?)
4. Test the most likely hypothesis first (cheapest/fastest check)
5. If confirmed: remediate
6. If not: eliminate and move to next hypothesis
7. If all hypotheses eliminated: you're missing information. Widen your view.
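The sequence above can be sketched as a loop over an explicit, ordered hypothesis list. The hypothesis names below are illustrative only; the point is that the list is written down and walked in order:

```shell
# Differential-diagnosis loop: keep the hypothesis list explicit and ordered,
# test the cheapest/most-likely first, and say out loud when the list is empty.
triage() {
  for hypothesis in "$@"; do
    printf 'TESTING: %s\n' "$hypothesis"
    # ...run the cheapest check for this hypothesis here; return on confirm...
  done
  echo "All hypotheses eliminated: you are missing information -- widen your view"
}

# Example run with three illustrative hypotheses, ranked by prior probability:
triage "recent deploy" "disk full" "upstream dependency slow"
```

In a real incident the body of the loop is a human running a check, but keeping the list as an artifact (whiteboard, incident channel) is what prevents anchoring on hypothesis one.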
2. The Five Most Common Causes of Almost Everything¶
When something breaks, these five categories explain 80% of incidents:
| Rank | Cause Category | Frequency | How to Check |
|---|---|---|---|
| 1 | Recent change (deploy, config, infra) | ~40% | git log, deploy history, change log |
| 2 | Resource exhaustion (disk, memory, CPU, connections) | ~25% | df, free, top, ss, ulimit |
| 3 | Dependency failure (upstream service, DNS, database) | ~15% | health checks, connectivity tests |
| 4 | Time-based trigger (cron, cert expiry, log rotation) | ~10% | crontab, cert dates, last rotation |
| 5 | Traffic/load spike (organic growth or event) | ~10% | request rate graphs, LB metrics |
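A first pass over all five categories can be bundled into one read-only scan. This is a sketch: every command is guarded so it degrades gracefully, and `example.com` is a placeholder dependency hostname:

```shell
# Read-only first-pass over the five cause categories; guarded so it
# degrades gracefully where a tool is missing. example.com is a placeholder.
five_causes_scan() {
  echo "=== 1. Recent change ==="
  echo "(check deploy history / change log for the last 4 hours)"
  echo "=== 2. Resource exhaustion ==="
  df -h 2>/dev/null | tail -n +2 | awk '$5+0 > 90 {print "disk above 90%:", $6}'
  { command -v free >/dev/null 2>&1 && free -m | awk '/Mem:/ {print "mem used MB:", $3 "/" $2}'; } || :
  echo "=== 3. Dependency failure ==="
  getent hosts example.com >/dev/null 2>&1 && echo "DNS ok" || echo "DNS lookup failed"
  echo "=== 4. Time-based trigger ==="
  [ -r /etc/crontab ] && echo "crontab entries: $(grep -c '' /etc/crontab)" || echo "no readable /etc/crontab"
  echo "=== 5. Traffic/load spike ==="
  echo "(compare request-rate graphs against baseline)"
}
five_causes_scan
```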
The First Question Rule:
Before you investigate ANYTHING, ask:
"Was anything deployed or changed in the last 4 hours?"
If yes: strong correlation with the incident.
Rollback first, investigate second.
This single question resolves ~40% of incidents within minutes.
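A minimal version of the first question, assuming the service is deployed from a git checkout. `$APP_DIR` is a placeholder path; adapt the check to your actual deploy tooling (deploy API, change log, CI history):

```shell
# "Was anything deployed or changed in the last 4 hours?" -- git edition.
recent_change_check() {
  APP_DIR=${APP_DIR:-/srv/app}   # placeholder path, point at your deploy checkout
  if git -C "$APP_DIR" log --oneline --since='4 hours ago' 2>/dev/null | grep -q .; then
    echo "RECENT CHANGE FOUND -- rollback first, investigate second"
  else
    echo "No recent commits (also check config management and infra change logs)"
  fi
}
recent_change_check
```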
3. Pattern Library — Disk¶
"The disk is full but du says it's not"¶
Symptom: df shows 100% disk usage. du -sh /* totals to 60%.
40% of the disk is "missing."
Differential:
1. Deleted files still held open by a process (MOST COMMON)
→ lsof +L1 (open files with link count 0), or lsof | grep '(deleted)'
→ Fix: restart the process holding the file descriptor
→ Why: Unix doesn't free disk space until ALL file descriptors are closed
2. Filesystem reserved blocks (ext4 reserves 5% for root by default)
→ tune2fs -l /dev/sda1 | grep "Reserved block count"
→ Fix: tune2fs -m 1 /dev/sda1 (reduce to 1%, careful)
3. Hidden mount covering existing files
→ Something is mounted over a directory that has data under it
→ umount /mnt/data, check what's underneath
→ Fix: move data or change mount point
4. Filesystem metadata or journal
→ Large journals on ext4/xfs consume space du doesn't report
→ Check: dumpe2fs /dev/sda1 | grep "Journal size"
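Hypothesis 1 (deleted-but-open files) can be quantified directly; `lsof +L1` lists open files whose link count is zero, i.e. unlinked but still held by a process:

```shell
# Sum the space held by deleted-but-still-open files (differential item 1).
# Approximate: lsof's SIZE/OFF column is not a file size for every FD type.
deleted_space() {
  if command -v lsof >/dev/null 2>&1; then
    lsof +L1 2>/dev/null | awk 'NR>1 {sum += $7} END {printf "held by deleted-but-open files: %d bytes\n", sum+0}'
  else
    echo "lsof not installed"
  fi
}
deleted_space
```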
"Disk I/O is slow but iostat looks fine"¶
Symptom: Application reports slow writes. iostat shows < 50% utilization.
Differential:
1. I/O scheduler contention (multiple processes competing)
→ iotop (shows per-process I/O)
→ One process may be doing sequential I/O that preempts your app's random I/O
2. Filesystem journaling overhead
→ Writes go to journal first, then data blocks
→ ext4 with data=journal writes every data block twice (journal first, then its final location)
→ Check: mount | grep "data=journal"
3. RAID rebuild in progress
→ cat /proc/mdstat
→ Rebuild steals I/O bandwidth from production workload
4. Thin-provisioned storage doing copy-on-write
→ LVM thin pools, ZFS, cloud EBS volumes
→ First write to a block is slower than subsequent writes
→ Check: lvs -a (look for thin pool utilization)
5. NFS or network filesystem masquerading as local
→ mount | grep nfs
→ Network latency appears as I/O latency
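A few of the checks above bundled into one read-only pass, guarded for systems where a tool or file is missing:

```shell
# Read-only pass over the "iostat looks fine but I/O is slow" suspects.
io_suspects() {
  echo "--- RAID rebuild in progress? ---"
  { [ -r /proc/mdstat ] && grep -E 'resync|recover' /proc/mdstat; } || echo "none detected"
  echo "--- full data journaling? ---"
  mount 2>/dev/null | grep 'data=journal' || echo "no data=journal mounts"
  echo "--- network filesystems? ---"
  mount 2>/dev/null | grep -E ' type (nfs|nfs4|cifs)' || echo "no NFS/CIFS mounts"
}
io_suspects
```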
4. Pattern Library — Network¶
"Latency is high but CPU is idle"¶
Symptom: API response time is 2s. Server CPU at 10%. Memory fine. Disk fine.
Differential:
1. DNS resolution delays (EXTREMELY COMMON)
→ time nslookup api.dependency.com
→ If > 100ms: DNS is the bottleneck
→ Fix: add DNS caching (nscd, systemd-resolved, dnsmasq)
2. Connection pool exhaustion — waiting for connections
→ Application waiting for a database/Redis/HTTP connection
→ Check: connection pool metrics, or ss -tn state established | wc -l (note: ss prints "ESTAB", so grepping for "ESTABLISHED" matches nothing)
→ Fix: increase pool size or fix connection leaks
3. Upstream service is slow
→ Your service is fast but waits on a dependency
→ Check: trace outbound request latency per dependency
→ Fix: add timeouts, circuit breakers, or cache the dependency
4. TCP retransmissions (packet loss)
→ netstat -s | grep -i retrans (case-insensitive; kernel spelling varies)
→ Even 0.5% packet loss causes massive TCP latency
→ Fix: check for congested links, MTU mismatches, bad cables
5. Garbage collection pauses (JVM, Go, .NET)
→ Application thread is frozen during GC
→ CPU looks idle because GC is a Stop-The-World pause
→ Check: GC logs, GC pause metrics
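The cheap end of this differential (DNS, connection counts, retransmissions) fits in one function. `api.example.com` is a placeholder dependency hostname:

```shell
# Quick checks for "latency high, CPU idle": DNS, connections, retransmits.
latency_suspects() {
  host=${1:-api.example.com}   # placeholder dependency hostname
  echo "--- DNS ---"
  getent hosts "$host" >/dev/null 2>&1 && echo "$host resolves" \
    || echo "$host lookup failed -- check resolver latency and caching"
  echo "--- established connections ---"
  { command -v ss >/dev/null 2>&1 \
    && echo "established: $(ss -tn state established 2>/dev/null | tail -n +2 | wc -l)"; } || :
  echo "--- TCP retransmissions ---"
  { command -v netstat >/dev/null 2>&1 && netstat -s 2>/dev/null | grep -i retrans; } \
    || echo "no retransmit counters readable (try nstat or exporter metrics)"
}
latency_suspects
```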
"Connections are refused but the service is running"¶
Symptom: curl gets "Connection refused." Service is UP in systemctl.
Differential:
1. Service is listening on wrong interface
→ ss -tlnp | grep <port>
→ Listening on 127.0.0.1 but client connecting to 10.0.1.50
→ Fix: bind to 0.0.0.0 or the correct interface
2. Firewall/security group blocking
→ iptables -L -n | grep <port>
→ Cloud security group rules
→ Fix: add the allow rule
3. Service listen backlog full
→ Under heavy load, the TCP listen backlog overflows
→ By default Linux silently drops the SYN (clients see timeouts or slow connects); clients only see "refused" with net.ipv4.tcp_abort_on_overflow=1
→ ss -tlnp still shows the listener; netstat -s | grep -i overflow counts the drops
→ Fix: increase net.core.somaxconn and the application's listen() backlog
4. File descriptor limit reached
→ ulimit -n (check per-process limit)
→ cat /proc/$(pgrep myapp)/limits | grep "Max open files"
→ Service can't accept new connections because it can't open new sockets
→ Fix: increase limits in systemd unit or /etc/security/limits.conf
5. Service crashed and systemd restarted it on a different port
→ journalctl -u myservice --since "10 minutes ago"
→ Check for port conflicts or config changes
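Items 1 and 3 share one entry point: look at the listener itself. The port below is a placeholder:

```shell
# Is anything listening, and on which address? (differential items 1/3)
listen_check() {
  port=${1:-8080}   # placeholder port
  command -v ss >/dev/null 2>&1 || { echo "ss not installed"; return 0; }
  line=$(ss -tln 2>/dev/null | awk -v p=":$port" '$4 ~ p"$"')
  if [ -z "$line" ]; then
    echo "nothing listening on :$port -- wrong port, crashed, or still starting"
  elif printf '%s\n' "$line" | grep -q '127\.0\.0\.1'; then
    echo "loopback-only listener -- remote clients will be refused"
  else
    printf 'listener found: %s\n' "$line"
  fi
}
listen_check 8080
```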
5. Pattern Library — System¶
"The server rebooted but nobody touched it"¶
Symptom: Server was down for 3 minutes. Nobody did maintenance.
Differential:
1. OOM killer invoked, took out a critical process
→ dmesg | grep -i "out of memory"
→ journalctl --since "1 hour ago" | grep -i oom
→ The OOM kill itself doesn't reboot, but if it kills systemd, PID 1, or
the watchdog process, the result can be a reboot or hang
2. Kernel panic
→ Check serial console logs if available
→ cat /var/log/kern.log or journalctl -k -b -1 (previous boot)
→ Often caused by hardware failure or buggy kernel module
3. Hardware watchdog timeout
→ If the system stops feeding the hardware watchdog, it reboots
→ dmesg | grep watchdog
→ Common with heavy I/O load that starves the watchdog timer
4. Unattended OS updates with auto-reboot
→ Check: /var/log/unattended-upgrades/ or dnf history
→ needrestart or kexec may have triggered a reboot
5. Power event (UPS, PDU, cloud host maintenance)
→ Check UPS logs, cloud console event log
→ AWS: aws ec2 describe-instance-status --include-all-instances
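When the box comes back, the forensics live in the previous boot's logs. This sketch assumes systemd; `-b -1` only works with a persistent journal:

```shell
# Post-reboot forensics: assumes systemd; persistent journal needed for -b -1.
reboot_forensics() {
  echo "--- last kernel messages before the reboot ---"
  if command -v journalctl >/dev/null 2>&1 \
     && journalctl -k -b -1 --no-pager >/tmp/prevboot.$$ 2>/dev/null; then
    tail -n 20 /tmp/prevboot.$$
  else
    echo "no previous-boot journal (journalctl missing, or Storage=persistent not set)"
  fi
  rm -f /tmp/prevboot.$$
  echo "--- reboot/shutdown history ---"
  { command -v last >/dev/null 2>&1 && last -x reboot shutdown 2>/dev/null | head -n 5; } \
    || echo "'last' not available"
}
reboot_forensics
```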
"Process keeps dying and nobody knows why"¶
Symptom: Application crashes every few hours. No error in app logs.
Differential:
1. OOM killer (most common for silent kills)
→ dmesg | grep "Killed process"
→ Shows PID and memory usage at kill time
→ Fix: increase memory limits, fix memory leak, or add swap
2. Segfault or signal
→ dmesg | grep segfault
→ coredumpctl list (systemd core dump journal)
→ Fix: update the binary, check shared library compatibility
3. Resource limit (file descriptors, processes, memory)
→ cat /proc/$(pgrep myapp)/limits
→ If limits are hit, the process dies without logging
→ Fix: increase limits in systemd unit file
4. Systemd killing it (TimeoutStopSec, WatchdogSec)
→ journalctl -u myservice | grep -i timeout
→ Systemd sends SIGKILL if the service doesn't stop in time
→ Fix: increase TimeoutStopSec or fix graceful shutdown
5. Another process killing it
→ auditctl -a always,exit -F arch=b64 -S kill -S tkill -S tgkill
→ Then: ausearch -sc kill (find who sent the signal)
→ Monitoring agents, deployment tools, or cron jobs may be the culprit
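The first two stops in this differential (OOM kill, core dump) can be checked in seconds. `myapp` is a placeholder process name:

```shell
# First two stops for a silently dying process: OOM kills and core dumps.
why_did_it_die() {
  name=${1:-myapp}   # placeholder process name
  echo "--- OOM kills in kernel log ---"
  { dmesg 2>/dev/null | grep -i 'killed process'; } \
    || echo "none visible (dmesg may be restricted; try journalctl -k)"
  echo "--- recorded core dumps ---"
  if command -v coredumpctl >/dev/null 2>&1; then
    coredumpctl list "$name" 2>/dev/null || echo "no core dumps recorded for $name"
  else
    echo "coredumpctl not available"
  fi
}
why_did_it_die myapp
```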
6. The Investigation Anti-Patterns¶
| Anti-Pattern | What Happens | Better Approach |
|---|---|---|
| Anchoring | First theory becomes the only theory | Write down 3 hypotheses before testing any |
| Confirmation bias | Only looking for evidence that supports your theory | Actively try to disprove your theory |
| Recency bias | "This is the same issue as last week" (but it isn't) | Verify with evidence before assuming |
| Tunnel vision | Deep-diving one component while ignoring others | Set a 15-minute timebox, then step back |
| Heroics | One person debugging for 2 hours solo | Escalate at 15 minutes if no progress |
Common Pitfalls¶
- Starting with "what changed?" and stopping there — It covers 40% of incidents, but the other 60% require systematic diagnosis. Don't let the first question become the only question.
- Ignoring the simple stuff — 90% of "mysterious" problems are disk full, memory exhausted, or DNS broken. Check the boring things first.
- Not writing down your hypotheses — In a war room, hypotheses get lost. Write them on a whiteboard or in the incident channel. Cross them off as you eliminate them.
- Assuming the monitoring is correct — If monitoring says everything is fine but users say it's broken, trust the users. Your monitoring has blind spots.
- Investigating the alert instead of the symptom — The alert says "CPU high." But the actual problem is disk full, which caused a retry storm, which caused high CPU. Follow the chain to the root.
- Not capturing diagnostic data before remediating — You restart the service and the problem goes away. Now you can't investigate. Capture logs, thread dumps, connection state, and core dumps BEFORE you fix it.
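The last pitfall is the easiest one to script ahead of time. A minimal read-only snapshot to take before any restart; `$SNAP_DIR` is a placeholder, and you should extend the list with app-specific artifacts (thread dumps, heap dumps) for your stack:

```shell
# Capture cheap, read-only state BEFORE remediating. $SNAP_DIR is a
# placeholder; extend with thread/heap dumps for your application.
snapshot() {
  SNAP_DIR=${SNAP_DIR:-/tmp/incident-$(date +%Y%m%d-%H%M%S)}
  mkdir -p "$SNAP_DIR"
  df -h  > "$SNAP_DIR/df.txt"  2>&1
  ps aux > "$SNAP_DIR/ps.txt"  2>&1
  { command -v ss   >/dev/null 2>&1 && ss -tan > "$SNAP_DIR/sockets.txt" 2>&1; } || :
  { command -v free >/dev/null 2>&1 && free -m > "$SNAP_DIR/mem.txt"     2>&1; } || :
  dmesg > "$SNAP_DIR/dmesg.txt" 2>&1 || :
  echo "snapshot written to $SNAP_DIR"
}
snapshot
```

Run it from the incident channel the moment the pager fires; the snapshot costs seconds and preserves the evidence a restart would destroy.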
Wiki Navigation¶
Prerequisites¶
- Linux Ops (Topic Pack, L0)
Related Content¶
- Change Management (Topic Pack, L1) — Incident Response
- Chaos Engineering Scripts (CLI) (Exercise Set, L2) — Incident Response
- Debugging Methodology (Topic Pack, L1) — Incident Response
- Incident Command & On-Call (Topic Pack, L2) — Incident Response
- Incident Response Flashcards (CLI) (flashcard_deck, L1) — Incident Response
- Incident Simulator (18 scenarios) (CLI) (Exercise Set, L2) — Incident Response
- Investigation Engine (CLI) (Exercise Set, L2) — Incident Response
- Operations War Stories Flashcards (CLI) (flashcard_deck, L1) — Ops War Stories
- Postmortems & SLOs (Topic Pack, L2) — Incident Response
- Runbook Craft (Topic Pack, L1) — Incident Response