Debugging Methodology - Street-Level Ops¶

Quick Diagnosis Commands¶

# Establish a timeline — when did things start going wrong?
journalctl --since "1 hour ago" --priority=err
dmesg -T | tail -50
kubectl get events --sort-by='.lastTimestamp' -A | tail -30

# What changed recently?
# Git: last 10 commits
git log --oneline -10
# Kubernetes: recent rollouts (each rollout creates a new ReplicaSet)
kubectl get replicasets -A --sort-by=.metadata.creationTimestamp | tail -20
# System: recently modified config files
find /etc -mmin -60 -type f 2>/dev/null

# Resource state snapshot
uptime                           # Load average
free -h                          # Memory
df -h                            # Disk
ss -s                            # Socket summary
cat /proc/sys/fs/file-nr         # File descriptor usage

# Network connectivity quick-check
curl -o /dev/null -s -w "%{http_code} %{time_total}s\n" http://service:8080/health
nc -zv database-host 5432 2>&1
dig +short service.namespace.svc.cluster.local

# Process state
ps aux --sort=-%cpu | head -10
ps aux --sort=-rss | head -10

# Recent errors in application logs
kubectl logs deployment/myapp --since=10m | grep -i -E "(error|fatal|panic|exception)" | tail -20
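
The three numbers in /proc/sys/fs/file-nr are allocated handles, allocated-but-unused handles, and the system-wide maximum. A minimal sketch that turns them into a percentage follows. The helper name `fd_usage_pct` is made up for illustration and is not part of any tool:

```shell
# fd_usage_pct: read one file-nr style line ("allocated unused max") on stdin
# and print the share of the system-wide file handle limit currently in use.
# Hypothetical helper; only the three-field layout comes from the kernel.
fd_usage_pct() {
  awk 'NF >= 3 && $3 > 0 { printf "%.1f%%\n", ($1 - $2) / $3 * 100 }'
}
```

Usage: `fd_usage_pct < /proc/sys/fs/file-nr`. This covers the system-wide limit only. A single process can still hit its own `ulimit -n` long before this number looks alarming.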

Pattern: Structured Triage Checklist¶

When paged at 3 AM, do not freestyle. Follow the checklist:

TRIAGE CHECKLIST (work top to bottom, skip nothing)

1. SCOPE
   □ What is the symptom? (Write it down — one sentence)
   □ Who reported it? (Monitoring, customer, internal team)
   □ When did it start? (Check monitoring for the inflection point)
   □ Who is affected? (All users, subset, one customer, one region)
   □ Is it getting worse, stable, or recovering?

2. CHANGES
   □ Deployments in the last 2 hours?
   □ Config changes in the last 2 hours?
   □ Infrastructure changes (scaling, migration, cert rotation)?
   □ Upstream/dependency changes?
   □ Traffic pattern changes (spike, drop)?

3. RESOURCES
   □ CPU: overloaded nodes/pods?
   □ Memory: OOMKills, swap usage?
   □ Disk: full filesystems, slow I/O?
   □ Network: packet loss, latency spikes?
   □ Connections: pool exhaustion, socket limits?

4. DEPENDENCIES
   □ Database: reachable, responsive, query latency?
   □ Cache: reachable, hit rates normal?
   □ External APIs: responding within SLA?
   □ DNS: resolving correctly?
   □ Certificates: valid, not expired?

5. DECISION
   □ Can we rollback? Should we?
   □ Is there a workaround?
   □ Do we need to escalate (vendor, other team)?
   □ What is our communication to stakeholders?

Gotcha: Fixing Symptoms Instead of Causes¶

The API is slow. You add more replicas. It gets faster. You declare victory. Next week it is slow again. You add more replicas. This cycle repeats until you are running 50 replicas of a service that should need 5.

The symptom was: slow API responses. The fix was: more replicas. The cause was: an N+1 query pattern that issues 500 separate queries per request instead of 1.

# Before scaling, check if the problem is resource-bound or logic-bound

# Is CPU the bottleneck?
kubectl top pods -n myapp
# If CPU is low but latency is high → not CPU-bound, scaling won't help

# Is it waiting on I/O?
kubectl exec -n myapp myapp-pod -- cat /proc/1/status | grep voluntary_ctxt_switches
# Counters are cumulative since process start: sample twice and compare
# Voluntary switches climbing fast = process is waiting (I/O, locks, network)

# Check database query patterns
# Look for slow query log
kubectl logs deployment/myapp --since=10m | grep -i "slow\|query\|duration"

Rule: scaling is a valid fix for load problems. It is a band-aid for logic problems. Know which one you have before reaching for the replica count.

One-liner: If doubling the replicas halves the problem, it is a load issue. If doubling the replicas makes no difference, it is a logic issue.
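
The one-liner can be turned into a rough check. This is a sketch under stated assumptions: you measure the same latency figure (for example p95 in milliseconds) before and after doubling the replica count, and the 0.6 and 0.9 cut-offs are arbitrary thresholds to tune, not established values. The function name `scaling_effect` is hypothetical.

```shell
# scaling_effect BEFORE_MS AFTER_MS
# Compare a latency figure measured before and after doubling replicas.
# The 0.6 / 0.9 thresholds are assumptions chosen for illustration.
scaling_effect() {
  awk -v before="$1" -v after="$2" 'BEGIN {
    if (before <= 0) { print "invalid"; exit 1 }
    ratio = after / before
    if (ratio <= 0.6)      print "load"    # roughly halved: capacity-bound
    else if (ratio >= 0.9) print "logic"   # barely moved: scaling is not the fix
    else                   print "mixed"   # partial gain: keep investigating
  }'
}
```

For example, `scaling_effect 800 410` prints `load`, while `scaling_effect 800 780` prints `logic`.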

Pattern: Timeline Reconstruction¶

The most powerful debugging technique for incidents is rebuilding the timeline:

INCIDENT TIMELINE — 2026-03-15

13:45 UTC  Deployment v2.4.1 rolled out to prod (CI/CD)
13:47 UTC  Health checks passing, deployment complete
14:00 UTC  Error rate increases from 0.1% to 2% (Grafana)
14:05 UTC  PagerDuty alert fires: error rate > 1%
14:07 UTC  On-call acknowledges alert
14:08 UTC  Checked recent deployments — v2.4.1 deployed 23 min ago
14:10 UTC  Checked v2.4.1 changelog — new payment endpoint added
14:12 UTC  Error logs show: "connection refused to payments-db:5432"
14:13 UTC  Checked payments-db — running, accepting connections
14:15 UTC  Checked network policy — new deployment needs port 5432 egress
14:15 UTC  HYPOTHESIS: deployment v2.4.1 calls payments-db but has no
           network policy allowing egress to it
14:17 UTC  Applied network policy allowing egress to payments-db
14:18 UTC  Error rate drops to 0.1%
14:20 UTC  Confirmed resolution. Root cause: missing network policy for
           new service dependency.

How to build a timeline:

# Deployment history
kubectl rollout history deployment/myapp -n prod

# Event stream
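# (custom-columns prints absolute timestamps; the default output shows relative ages)
kubectl get events -n prod --sort-by='.lastTimestamp' \
  -o custom-columns='TIME:.lastTimestamp,REASON:.reason,OBJECT:.involvedObject.name,MESSAGE:.message' | \
  awk -v since="$(date -u -d '2 hours ago' +%Y-%m-%dT%H:%M:%SZ)" 'NR == 1 || $1 >= since'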

# Log timestamps for errors
kubectl logs deployment/myapp -n prod --since=2h --timestamps | \
  grep -i error | head -20

# Git commits around the incident time
git log --after="2026-03-15T12:00" --before="2026-03-15T16:00" --oneline

# Infrastructure changes (Terraform, Ansible)
# Check CI/CD pipeline history for the relevant time window
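
Once each source has been dumped to a file, the timeline is just a merge sorted by timestamp. A minimal sketch, assuming every line starts with an ISO 8601 UTC timestamp in one consistent format (such as 2026-03-15T13:45:00Z) so that string order equals time order. The function name `merge_timeline` and the file names in the usage note are hypothetical.

```shell
# merge_timeline FILE...
# Tag each line with the file it came from, then sort everything by the
# leading timestamp. Assumes one consistent ISO 8601 UTC format throughout.
merge_timeline() {
  awk 'NF {
    ts = $1
    sub(/^[ \t]*[^ \t]+[ \t]*/, "")   # strip the timestamp, keep the message as-is
    print ts, "[" FILENAME "]", $0
  }' "$@" | LC_ALL=C sort -s -k1,1
}
```

Usage: `merge_timeline deploys.txt events.txt app-errors.txt > timeline.txt`. The source tag makes it obvious which system reported each event, which matters when clocks or log delays disagree.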

Gotcha: Changing Multiple Variables at Once¶

The service is broken. You simultaneously: restart it, increase memory limits, change a config value, and update the database connection string. It works now. But you do not know which change fixed it. Next time one of those things drifts, you will not know which one to check.

Discipline:

1. Change ONE thing
2. Test
3. If fixed → document what fixed it and why
4. If not fixed → revert the change, move to next hypothesis

If time pressure makes this impossible (Sev1):
1. Apply all suspected fixes to restore service
2. AFTER restoration, revert them ONE AT A TIME
3. Identify which one was actually necessary
4. Document in the postmortem
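
The one-change-at-a-time loop can be sketched as a small wrapper. This assumes you can express the health check, the change, and its revert as three commands or shell functions. All names here (`try_fix` and the example functions in the usage note) are hypothetical.

```shell
# try_fix CHECK APPLY REVERT
# Apply one change, run the check, and revert the change if the check fails.
# Each argument is the name of a single command or shell function.
# Returns 0 if the change fixed it, 1 if it was reverted, 2 if apply failed.
try_fix() {
  check=$1
  apply=$2
  revert=$3
  "$apply" || return 2
  if "$check"; then
    echo "fixed by: $apply"
    return 0
  fi
  "$revert"
  echo "not fixed, reverted: $apply"
  return 1
}
```

Usage: `try_fix health_ok raise_memory_limit restore_memory_limit`, then move to the next hypothesis only if it returns 1. The echoed lines double as notes for the incident channel.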

Pattern: Change Correlation¶

When something breaks, the first question is always: what changed?

# System-level changes
rpm -qa --last | head -20                    # Recent package installs (RPM-based distros)
find /etc -mmin -120 -type f 2>/dev/null     # Config files changed in last 2h

# Kubernetes changes
kubectl get replicasets -A --sort-by=.metadata.creationTimestamp | tail -20   # Recent rollouts (one ReplicaSet per rollout)
kubectl get events -A --sort-by='.lastTimestamp' | tail -20

# Cloud infrastructure changes (AWS example)
aws cloudtrail lookup-events \
  --start-time "2026-03-15T12:00:00Z" \
  --end-time "2026-03-15T16:00:00Z" \
  --query 'Events[].{Time:EventTime,Name:EventName,User:Username}'

# DNS changes
dig +noall +answer myservice.example.com
# Compare the address against the expected value
# The second column is the TTL: if low, the record may have just changed

# Certificate expiry
echo | openssl s_client -connect myservice:443 2>/dev/null | \
  openssl x509 -noout -dates
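
With changes collected from several of these sources, the useful view is everything inside the incident window, most recent first, since the latest change before the breakage is the first suspect. A minimal sketch, assuming you have already normalized each change to a line of the form "ISO-8601-UTC-timestamp description". The function name `changes_between` and the input file name are hypothetical.

```shell
# changes_between START END
# Read "timestamp description" lines on stdin and print those inside
# [START, END], most recent first. All timestamps must share one format
# (e.g. 2026-03-15T13:45:00Z) so that string order equals time order.
changes_between() {
  LC_ALL=C awk -v lo="$1" -v hi="$2" '$1 >= lo && $1 <= hi' | LC_ALL=C sort -r
}
```

Usage: `changes_between 2026-03-15T12:00:00Z 2026-03-15T16:00:00Z < all-changes.txt`.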

Pattern: Blast Radius Assessment¶

Before diving into root cause, understand the impact:

BLAST RADIUS QUICK ASSESSMENT

□ How many users affected?
  └── Check error rate as percentage of total requests

□ Which services affected?
  └── Check service dependency map, trace downstream

□ Is data at risk?
  └── Check write paths — are writes failing silently?

□ Is it spreading?
  └── Compare error rates across services/regions over time

□ What is the business impact?
  └── Revenue, SLA credits, customer trust

□ Can we contain it?
  └── Feature flag, traffic shift, circuit breaker, rollback

# Response count by HTTP status code (assumes log lines contain status=NNN)
kubectl logs deployment/myapp --since=30m | \
  grep -oP 'status=\K\d+' | sort | uniq -c | sort -rn

# Affected services (if you have distributed tracing)
# Check your Jaeger/Tempo UI for traces with errors

# Rollback readiness
kubectl rollout undo deployment/myapp -n prod --dry-run=client
# If dry-run looks clean, you know rollback is an option
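
The checklist asks for error rate as a percentage of total requests, which the status-code count above does not give directly. A minimal sketch that computes it, assuming the same `status=NNN` log format and treating only 5xx responses as errors. The function name `error_rate` is hypothetical.

```shell
# error_rate: read log lines on stdin, count those carrying a status=NNN
# field, and print the share that are 5xx. Lines without a status field
# are ignored. Adjust the pattern for your own log format.
error_rate() {
  awk '
    match($0, /status=[0-9][0-9][0-9]/) {
      total++
      code = substr($0, RSTART + 7, 3)
      if (code + 0 >= 500) errors++
    }
    END {
      if (total == 0) { print "no requests found"; exit 1 }
      printf "%d/%d requests failed (%.1f%%)\n", errors, total, errors / total * 100
    }'
}
```

Usage: `kubectl logs deployment/myapp --since=30m | error_rate`. Run it against two time windows to see whether the rate is climbing or recovering.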

Gotcha: Blaming the Network Without Evidence¶

"It must be the network" is the most common misdirection in debugging. Every team blames the network because it is shared, opaque, and hard to disprove.

Before blaming the network:

# Test basic connectivity
ping -c 5 target-host
traceroute target-host

# Test specific port
nc -zv target-host 8080

# Check for packet loss
mtr -c 100 --report target-host

# Check DNS resolution
dig +short target-host
# Is it returning the right IP?

# Check from INSIDE the pod (Kubernetes)
kubectl exec debug-pod -- wget -qO- http://target-service:8080/health

# If all of this works, the network is unlikely to be the problem

The network is the problem when: ping fails (and ICMP is not simply being filtered), traceroute or mtr shows loss that persists through to the final hop, or nc cannot connect to a port that is confirmed listening on the target.

Remember: "Blaming the network" without evidence is the ops equivalent of "it works on my machine." Before opening a ticket with the network team, have concrete data: packet captures, traceroute output, or mtr reports. Network teams receive so many false blame reports that they will ignore you without evidence.

Pattern: Rubber Duck Debugging¶

When you are stuck, explain the problem out loud (or in writing). The act of articulating the problem often reveals the gap in your understanding.

Structured version for ops:

Write in your incident channel:

"The symptom is [X].
I believe the cause is [Y] because [evidence].
I tested this by [action] and observed [result].
This is inconsistent with my hypothesis because [contradiction].
What I have not yet checked is [gap]."

The gap you identify in the last line is usually
where the answer is hiding.

This works because debugging failures are almost always reasoning failures, not knowledge failures. You already have the information — you just have not connected it yet. Forcing yourself to structure the narrative exposes the broken link.

Gotcha: No Timeline, No Postmortem¶

The incident is resolved. Everyone goes back to work. Nobody writes down what happened. Three months later, the same thing happens. Nobody remembers the resolution because it was only in Slack messages that have scrolled off the screen.

Minimum viable postmortem:

INCIDENT: [one-line description]
DATE: [date]
DURATION: [start to resolution]
IMPACT: [who/what was affected]
TIMELINE: [timestamped events]
ROOT CAUSE: [the actual cause, not the symptom]
FIX: [what resolved it]
PREVENTION: [what systemic change prevents recurrence]
ACTION ITEMS:
  □ [specific task] — [owner] — [due date]
  □ [specific task] — [owner] — [due date]

No postmortem means no organizational learning. The same incident will repeat, and the same engineer will debug it from scratch.
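
The DURATION field is easy to get wrong by hand at 3 AM. A minimal sketch that computes it from the first and last timeline entries, assuming GNU date (the `-d` option). The function name `incident_duration` is hypothetical.

```shell
# incident_duration START END
# Print the whole minutes between two timestamps. Relies on GNU date (-d);
# BSD/macOS date uses different flags.
incident_duration() {
  start=$(date -u -d "$1" +%s) || return 1
  end=$(date -u -d "$2" +%s) || return 1
  echo "$(( (end - start) / 60 )) minutes"
}
```

Usage: `incident_duration 2026-03-15T14:00:00Z 2026-03-15T14:20:00Z`. Decide up front whether the start is when the impact began or when the alert fired, and use the same definition in every postmortem.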

Quick Reference¶

Cheatsheet: Troubleshooting Flows