
Ops War Stories & Pattern Recognition - Street-Level Ops

Hard-won diagnostic heuristics from 20+ years of infrastructure firefighting. These are the patterns that save hours.

Quick Diagnosis Commands

# The "something is wrong but I don't know what" starter kit
uptime                                          # Load average: is it higher than core count?
free -h                                         # Memory: is swap in use? Is available low?
df -h                                           # Disk: is anything >85%?
dmesg -T | tail -30                             # Kernel messages: OOM kills? Hardware errors?
journalctl -p err --since "1 hour ago" | head -50  # Recent errors across all services
ss -s                                           # TCP connection summary: connection leaks?
ps aux --sort=-%mem | head -10                  # Top memory consumers
ps aux --sort=-%cpu | head -10                  # Top CPU consumers

# The "was anything changed recently?" check
last reboot | head -5                           # Recent reboots
rpm -qa --last 2>/dev/null | head -10           # Recent package installs (RHEL)
tail -10 /var/log/dpkg.log 2>/dev/null          # Recent package changes (Debian)
find /etc -mtime -1 -type f 2>/dev/null | head -20  # Config files changed in last 24h

# The "is it DNS?" check (it's always DNS)
time nslookup google.com                        # Should resolve in <50ms
cat /etc/resolv.conf                            # What nameservers are configured?
resolvectl status 2>/dev/null | head -20        # systemd-resolved status (older systems: systemd-resolve --status)

Gotcha: "It's Always DNS" — Except When It's Not

You assume it's DNS because it's always DNS. You spend 30 minutes investigating DNS. It's not DNS. It's a certificate that expired 20 minutes ago. By the time you check certs, the incident is 50 minutes old.

Fix: "It's always DNS" is a useful heuristic, not a religion. Spend 2 minutes checking DNS (nslookup, dig, check /etc/resolv.conf). If DNS is clean, move on. The quick check list: DNS, then certs, then recent deploys, then resource exhaustion. Each gets 2 minutes before you move to the next.
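The cert step of that checklist is scriptable. A minimal sketch, assuming GNU date and openssl are installed; the host and port arguments are placeholders for your own endpoint:

```shell
# How many days until an openssl "notAfter" date? (string format as printed
# by `openssl x509 -enddate`, e.g. "Jan  1 00:00:00 2099 GMT")
days_left() {
  expiry_epoch=$(date -d "$1" +%s)
  now_epoch=$(date +%s)
  echo $(( (expiry_epoch - now_epoch) / 86400 ))
}

# Pull the cert off a live endpoint and report days remaining.
# Usage: cert_days_left example.com 443
cert_days_left() {
  not_after=$(echo | openssl s_client -servername "$1" -connect "$1:$2" 2>/dev/null \
    | openssl x509 -noout -enddate | cut -d= -f2)
  days_left "$not_after"
}
```

If the number is small or negative, you have your answer in under two minutes, before DNS spelunking starts.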

Gotcha: Restarting Before Capturing Evidence

Service is down. Your muscle memory says "restart it." You restart it. Service comes back. Everyone celebrates. Two hours later, it dies again. You still don't know why because you didn't capture the state before restarting.

Fix: Before restarting anything, capture:

# Capture process state (pgrep -o picks the oldest matching PID; adjust if several match)
ps aux > /tmp/incident-$(date +%s)-ps.txt
cat /proc/$(pgrep -o myapp)/status > /tmp/incident-$(date +%s)-procstatus.txt
ls /proc/$(pgrep -o myapp)/fd 2>/dev/null | wc -l > /tmp/incident-$(date +%s)-fdcount.txt

# Capture memory and resource state
free -h > /tmp/incident-$(date +%s)-mem.txt
df -h > /tmp/incident-$(date +%s)-disk.txt
ss -tnp > /tmp/incident-$(date +%s)-connections.txt

# Capture kernel messages
dmesg -T > /tmp/incident-$(date +%s)-dmesg.txt

# Capture application logs (last 500 lines)
journalctl -u myservice --no-pager -n 500 > /tmp/incident-$(date +%s)-logs.txt

# THEN restart
systemctl restart myservice

Gotcha: The Log File That Ate the Disk

Disk is at 100%. You run du -sh /* and it shows 60GB used on a 100GB disk. You can't find the missing 40GB. Something is very wrong... or something deleted a log file while the process still has it open.

Fix:

# Find deleted files still held open (this is almost always the answer)
lsof +L1 | grep deleted
# Output: nginx 1234 root 5w REG 8,1 42949672960 (deleted)
# That's a 40GB deleted file still held open by nginx

# To reclaim space without restarting:
# Option 1: Truncate the deleted file via /proc (PID and fd number come from the lsof output;
# ": >" truncates to zero bytes without writing even a newline)
: > /proc/1234/fd/5

# Option 2: Restart the process (releases the file descriptor)
systemctl restart nginx

# Prevention: configure logrotate with copytruncate
# This truncates the log in place instead of moving and re-creating
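The copytruncate prevention is a one-line logrotate directive. An illustrative fragment; the path and retention settings are placeholders:

```
# /etc/logrotate.d/myapp -- illustrative; adjust path and retention
/var/log/myapp/*.log {
    daily
    rotate 14
    compress
    missingok
    notifempty
    # copy the log, then truncate the original in place, so the writer's
    # open fd stays valid and no "deleted but open" file can accumulate
    copytruncate
}
```

One caveat: copytruncate can lose lines written between the copy and the truncate, so prefer teaching the app to reopen its log on SIGHUP when you can.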

Gotcha: The Mystery Reboot

Server was down for 3 minutes at 2:47 AM. Nobody was on-call. Nobody did maintenance.

Fix: Work through the checklist in order:

# 1. Was it an OOM kill cascade?
dmesg | grep -i "out of memory"
journalctl -k -b -1 | grep -i oom     # Previous boot kernel messages

# 2. Was it a kernel panic?
journalctl -k -b -1 | tail -50        # Last kernel messages before reboot

# 3. Was it a hardware watchdog?
dmesg | grep -i watchdog

# 4. Was it an automatic update + reboot?
cat /var/log/unattended-upgrades/unattended-upgrades.log 2>/dev/null
dnf history 2>/dev/null | head -10              # RHEL/Fedora update history
tail -20 /var/log/apt/history.log 2>/dev/null   # Debian/Ubuntu update history

# 5. Was it a power event?
ipmitool sel list 2>/dev/null | tail -10    # IPMI system event log
# Or check cloud provider:
# AWS: aws ec2 describe-instance-status
# Check for "instance-reboot" or "system-reboot" events

# 6. Who was the last to log in?
last -n 10                                   # Recent login history

Pattern: The "It Works on My Machine" Ladder

When something works in dev/staging but fails in production:

Check in this order:

  1. Environment variables
     → Different config, different secrets, different endpoints
     → diff <(env | sort) <(ssh prod-server env | sort)

  2. DNS resolution
     → Dev resolves api.internal to 10.0.1.50
     → Prod resolves to 10.0.2.50 (different version of the service)

  3. Network path
     → Dev talks to the database directly
     → Prod goes through a proxy/load balancer/firewall

  4. Data volume
     → Dev has 100 rows. Prod has 10 million.
     → Query is fine on 100 rows, times out on 10M.

  5. Concurrency
     → Dev has 1 user. Prod has 10,000.
     → Race conditions and lock contention only appear at scale.

  6. Kernel/OS version
     → Dev runs Ubuntu 22.04. Prod runs RHEL 9.
     → Different syscall behavior, different library versions.
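For rung 2, getent is the right probe because it follows the same nsswitch path the application does (dig and nslookup go straight to DNS and can miss /etc/hosts overrides). A sketch; run it on both dev and prod and compare the output:

```shell
# Print the first address a name resolves to via the system resolver path
resolve() {
  getent hosts "$1" | awk '{ print $1; exit }'
}

resolve localhost       # sanity check against a name that always resolves
```

Pair it with the env diff from rung 1: `diff <(resolve api.internal) <(ssh prod-server "getent hosts api.internal | awk '{print \$1}'")`.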

Pattern: The Slowness Onion

When everything is "a little slow" but nothing is obviously broken:

Layer 1: Is the host overloaded?
  uptime → load average > CPU core count = overloaded
  If yes: find the hog (top, ps aux --sort=-%cpu)

Layer 2: Is memory under pressure?
  free -h → if available < 10% of total, or swap is in use
  If yes: find the memory hog (ps aux --sort=-%mem)
  Swap in use = page faults = everything slow

Layer 3: Is I/O saturated?
  iostat -x 1 3 → %util > 80% on any device?
  If yes: iotop to find the I/O hog

Layer 4: Is the network the bottleneck?
  time curl -o /dev/null -s <dependency-url>
  If slow: check DNS, then check TCP retransmits, then check dependency

Layer 5: Is it an application-level issue?
  If layers 1-4 are clean: the slowness is in application code
  → Connection pools, query performance, GC pauses, lock contention
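Layer 4's retransmit check can be read straight from /proc on Linux. A sketch; the 1% rule of thumb is a heuristic, not a hard threshold:

```shell
# Read TCP retransmit counters from /proc/net/snmp (Linux-only).
# Retransmits above ~1% of OutSegs usually mean a lossy or congested path.
tcp_retrans() {
  awk '/^Tcp:/ {
         if (!header_seen) { for (i = 1; i <= NF; i++) col[$i] = i; header_seen = 1 }
         else print $col["RetransSegs"], $col["OutSegs"]
       }' /proc/net/snmp
}

tcp_retrans    # prints: <retransmitted segments> <total segments sent>
```

Sample it twice a few seconds apart; the delta matters more than the absolute counters, which count since boot.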

Pattern: The Five Whys (Done Right)

"The server went down" is a symptom. Five Whys finds the systemic cause:

1. Why did the server go down?
   → The application ran out of memory and the OOM killer killed it.

2. Why did the application run out of memory?
   → A memory leak accumulated over 5 days since the last deploy.

3. Why wasn't the memory leak caught before production?
   → Staging environment only runs for 2 hours before being torn down.

4. Why doesn't staging run long enough to catch slow leaks?
   → Staging is rebuilt on every deploy to save costs.

5. Why isn't there a long-running staging environment?
   → It was never prioritized because "staging works fine."

Root cause: Lack of long-running test environment for leak detection.
Action item: Maintain one staging instance that runs for 7+ days between rebuilds.

Emergency: Everything Is Broken and You Don't Know Where to Start

The "house is on fire" triage protocol:

  Minute 0-2: SCOPE
    - What's the user impact? (Service down? Degraded? Specific users?)
    - When did it start? (Narrow the time window)
    - What changed since then? (Deploy, config, infra, dependency)

  Minute 2-5: QUICK CHECKS
    - Is the host alive? (ping, SSH)
    - Is the process running? (systemctl status, docker ps)
    - Is the dependency alive? (database, cache, upstream API)
    - Is the disk full? (df -h)
    - Is the DNS working? (nslookup)

  Minute 5-10: CORRELATE
    - If recent change: rollback
    - If resource exhaustion: identify the consumer
    - If dependency failure: confirm and route to that team
    - If none of the above: escalate with what you know

  After minute 10:
    - If no progress: escalate. Do not keep debugging alone.
    - Two fresh eyes beat one tired pair.
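The minute 2-5 quick checks are worth scripting so they run as one pass and leave evidence behind. A minimal sketch; "myservice" is a placeholder for your own unit name:

```shell
# One-shot quick-check pass: capture everything into a per-incident directory.
# "|| true" keeps the pass going even if a tool is missing on this host.
out=$(mktemp -d)                                  # one evidence dir per incident
date    > "$out/when.txt"
uptime  > "$out/load.txt" 2>&1 || true
df -h   > "$out/disk.txt" 2>&1 || true
free -h > "$out/mem.txt"  2>&1 || true
systemctl status myservice --no-pager > "$out/service.txt" 2>&1 || true
echo "evidence captured in $out"
```

This doubles as the "capture before restarting" habit from earlier: by the time you decide to restart, the evidence already exists.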

Emergency: The Alert Storm

Fifty alerts fire in two minutes. PagerDuty is a wall of red. You don't know which to look at first.

1. Don't chase individual alerts. Find the common thread.
   → Are they all from the same host? Same service? Same dependency?
   → Group alerts by source. The group with the most alerts is your starting point.

2. Check for dependency failure (the "tree fell" pattern)
   → Database goes down → 20 services report errors → 50 alerts fire
   → Fix the database, not the 20 services.

3. Check for network partition
   → Monitoring server loses connectivity to a rack/zone
   → Every host in that zone fires "unreachable" alerts
   → Check from a different vantage point before assuming all hosts are down.

4. Check for monitoring system failure
   → Sometimes the alert storm IS the monitoring system breaking
   → Prometheus ran out of memory and restarted
   → On restart, it evaluates all rules and fires stale alerts
   → Check Prometheus/Alertmanager health before trusting the alerts.

5. Mute the noise, focus on the root:
   → Acknowledge the symptom alerts
   → Focus on the one cause alert
   → Fix the cause; symptoms resolve automatically
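Step 1's "group alerts by source" is a one-liner if you can export alerts as text. A sketch that assumes one alert per line with the source (host or service) in the first column; the input format is illustrative, not any particular pager's export:

```shell
# Count alerts per source and sort the noisiest to the top.
group_alerts() {
  awk '{ count[$1]++ } END { for (src in count) print count[src], src }' | sort -rn
}

# Example: three alerts from db1 put it at the top -- start there.
printf 'db1 down\ndb1 slow\nweb1 5xx\ndb1 conn\n' | group_alerts
```

The top line of the output is your starting point per the "tree fell" pattern above.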