Portal | Level: L1: Foundations | Topics: Debugging Methodology, Incident Response, Systems Thinking | Domain: DevOps & Tooling
Debugging Methodology - Primer¶
Why This Matters¶
You will spend more time debugging than building. That is not cynicism — it is arithmetic. Systems fail. Configs drift. Dependencies break. Services interact in ways nobody predicted. The difference between a senior engineer and a junior one is not that the senior encounters fewer problems. It is that the senior resolves them faster because they have a methodology.
Most engineers debug by instinct: poke at things, read logs, change stuff, hope it works. Sometimes it does. Often it does not, and the thrashing makes things worse. A systematic approach — hypothesis, test, eliminate, repeat — turns debugging from gambling into engineering.
The Scientific Method, Applied to Ops¶
Debugging is the scientific method applied to broken systems:
┌────────────────┐
│  1. Observe    │  What is actually happening?
└───────┬────────┘  (Symptoms, not assumptions)
        │
┌───────▼────────┐
│ 2. Hypothesize │  What could cause this?
└───────┬────────┘  (Generate multiple hypotheses)
        │
┌───────▼────────┐
│  3. Predict    │  If hypothesis X is true,
└───────┬────────┘  what else should be true?
        │
┌───────▼────────┐
│   4. Test      │  Check the prediction.
└───────┬────────┘  Change ONE variable.
        │
┌───────▼────────┐
│  5. Conclude   │  Confirmed or eliminated?
└───────┬────────┘  If eliminated, next hypothesis.
        │
        └─────────▶ Repeat until root cause found
In Practice¶
Observation: "The API is returning 502 errors."
Hypotheses (generate multiple before testing any):
1. The upstream service is down
2. The load balancer health check is failing
3. A recent deployment broke something
4. The upstream service is running but overloaded
5. DNS resolution changed
Prediction (for hypothesis 1): If the upstream is down, I should see it as not running in process list / container status.
Test: kubectl get pods -n api — pods are running, 1/1 Ready.
Conclusion: Hypothesis 1 eliminated. Move to hypothesis 2.
This feels slow at first. It is faster than thrashing.
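The loop above can be sketched as a script: each hypothesis gets a cheap check, the checks run in order, and the first one that confirms stops the loop. The check_* functions below are stand-ins for the real probes (kubectl get pods, curl against the load balancer's health endpoint, reading the deploy log); their return codes are hard-coded so only the structure is on display.

```shell
#!/bin/sh
# Sketch of the eliminate-hypotheses loop. The check_* functions are
# stand-ins for real probes; return codes are simulated for the demo.

check_upstream_down()  { return 1; }  # pods were 1/1 Ready -> not confirmed
check_lb_healthcheck() { return 1; }  # health checks passing -> not confirmed
check_recent_deploy()  { return 0; }  # deploy log shows a change -> confirmed

for hypothesis in check_upstream_down check_lb_healthcheck check_recent_deploy; do
    if "$hypothesis"; then
        echo "confirmed:  $hypothesis"
        break
    fi
    echo "eliminated: $hypothesis"
done
```

In a real incident the payoff is the written record: every "eliminated" line is a hypothesis you never have to revisit.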
Name origin: The word "debugging" is attributed to Grace Hopper, who in 1947 found an actual moth stuck in a relay of the Harvard Mark II computer. The moth was taped into the logbook with the note "First actual case of bug being found." The term "bug" for a fault predates computers — Thomas Edison used it in 1878 — but Hopper's story cemented it in computing lore.
Divide and Conquer¶
Complex systems have many components. You do not debug the whole system at once. You bisect it.
Request flow:
Client → DNS → Load Balancer → Ingress → Service A → Service B → Database

Where is it broken?

Step 1: Test the midpoint
        Can Service A reach Service B?
        ├── YES → Problem is between Client and Service A
        └── NO  → Problem is between Service A and Database

Step 2: Test the new midpoint
        Can Service B reach the Database?
        ├── YES → Problem is between Service A and B
        └── NO  → Problem is between Service B and Database

Step 3: Continue bisecting until isolated
This is binary search applied to infrastructure. Each test cuts the problem space in half. For a 10-component pipeline, you need at most 4 tests instead of 10.
Remember: Mnemonic for divide-and-conquer debugging: HALVE — Hypothesis, Assess midpoint, Left or right, Verify boundary, Eliminate half. Each iteration removes 50% of the search space.
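The bisection above can be simulated in a few lines of shell. Here probe N stands in for a real end-to-end check of the path up to hop N (a curl or nc at each boundary), and the broken hop is hard-coded so the loop's behavior is visible; the invariant is that probe(lo) always works and probe(hi) is always broken.

```shell
#!/bin/sh
# Binary search over a 7-hop pipeline:
#   1 Client  2 DNS  3 LB  4 Ingress  5 SvcA  6 SvcB  7 DB
BROKEN=5                              # simulated: failure at hop 5
probe() { [ "$1" -lt "$BROKEN" ]; }   # "does the path up to hop N work?"

lo=0; hi=7; tests=0                   # invariant: probe(lo) ok, probe(hi) broken
while [ $((hi - lo)) -gt 1 ]; do
    mid=$(( (lo + hi) / 2 ))
    tests=$((tests + 1))
    if probe "$mid"; then lo=$mid; else hi=$mid; fi
done
echo "first broken hop: $hi (found in $tests probes)"
```

With seven hops the loop needs at most three probes, matching ceil(log2(7)) = 3.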
The Layer Model¶
When you do not know where to start, work the layers:
| Layer | Test |
|---|---|
| Layer 7: Application | curl -v http://service/health |
| Layer 4: Transport | telnet service 8080 / nc -zv service 8080 |
| Layer 3: Network | ping service / traceroute service |
| Layer 2: Data Link | arp -a / ip neigh show |
| Layer 1: Physical | ethtool eth0 / cable check |
If Layer 3 works (ping succeeds) but Layer 4 fails (cannot connect to the port), the fault sits between those two layers: a firewall or security group is blocking the port, the service is not listening, or you are probing the wrong port. You have narrowed "the network" down to a specific set of possibilities.
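One way to drive the layer checks is bottom-up, stopping at the first failure; everything below the failing layer is then known-good. The functions below are stand-ins for the real commands (ethtool, ping, nc, curl), with a simulated fault at Layer 4:

```shell
#!/bin/sh
# Walk the layers bottom-up; the first failure localizes the fault.
# Return codes are simulated -- here Layer 4 is the one that fails.

check_l1_physical()  { return 0; }   # link is up
check_l3_network()   { return 0; }   # ping succeeds
check_l4_transport() { return 1; }   # port not reachable (simulated fault)
check_l7_app()       { return 0; }

for layer in check_l1_physical check_l3_network check_l4_transport check_l7_app; do
    if "$layer"; then
        echo "$layer: OK"
    else
        echo "$layer: FAILED -- fault sits between this layer and the one below"
        break
    fi
done
```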
Hypothesis-Driven Debugging¶
The key discipline: generate hypotheses before you start testing. Write them down. Rank by likelihood and ease of testing.
Hypothesis Generation Framework¶
For any symptom, ask:
1. What changed recently?
└── Deployments, config changes, infra changes, traffic patterns
2. What is different about the failing cases?
└── Specific users, regions, endpoints, time of day
3. What resources could be exhausted?
└── CPU, memory, disk, file descriptors, connections, rate limits
4. What dependencies could be failing?
└── Databases, caches, external APIs, DNS, certificates
5. What has failed like this before?
└── Check incident history, postmortems, known issues
Ranking Hypotheses¶
Test the most likely and easiest-to-verify hypotheses first:
| Hypothesis | Likelihood | Test Difficulty | Priority |
|---|---|---|---|
| Recent deployment broke it | High | Easy (check deploy log) | 1st |
| Database connection pool exhausted | Medium | Easy (check metrics) | 2nd |
| DNS resolution changed | Low | Easy (dig/nslookup) | 3rd |
| Kernel bug triggered by traffic | Low | Hard (core dump analysis) | Last |
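The ranking can be made mechanical. In this sketch likelihood is scored 3 (High) down to 1 (Low) and test difficulty 1 (Easy) up to 3 (Hard); the scale is ours, not a standard. A single sort orders by likelihood descending, then by difficulty ascending:

```shell
#!/bin/sh
# Rank hypotheses: field 1 = likelihood (3=High..1=Low),
# field 2 = test difficulty (1=Easy..3=Hard). Scores are illustrative.
ranked=$(printf '%s\n' \
    '1 1 dns resolution changed' \
    '3 1 recent deployment broke it' \
    '1 3 kernel bug triggered by traffic' \
    '2 1 db connection pool exhausted' \
  | sort -k1,1nr -k2,2n)   # most likely first; among ties, easiest test first
echo "$ranked"
```

The easy-and-likely hypothesis surfaces at the top; the hard-and-unlikely kernel bug sinks to the bottom, exactly the order you want to spend your time in.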
Binary Search for Root Cause¶
When a change caused the problem but you do not know which change:
Scenario: Something broke between Monday and Friday.
25 commits were merged.
Don't: Read all 25 commits hoping one looks suspicious.
Do: Binary search.
Commits: 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25
                                    ▲
                             Test commit 13
                             ├── Works?  → Bug is in 14-25
                             └── Broken? → Bug is in 1-13
This takes at most ceil(log2(25)) = 5 tests instead of 25.
Under the hood:
git bisect uses an actual binary search algorithm. For N commits, it takes at most ceil(log2(N)) steps. With 1,000 commits, that is 10 tests. You can even automate it: git bisect run ./test.sh will run your script at each midpoint and mark good/bad automatically — no human interaction needed.
Git has this built in:
git bisect start
git bisect bad HEAD # Current state is broken
git bisect good abc123 # This commit was known good
# Git checks out the midpoint — test it
git bisect good # or: git bisect bad
# Repeat until git identifies the exact commit
git bisect reset # Return to original state
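To automate the loop with git bisect run, you supply a script whose exit status classifies each commit. The contract is git's: exit 0 marks the commit good, 125 tells bisect to skip an untestable commit, and any other status from 1 through 127 marks it bad. The build and smoke-test commands below are stand-ins (replace them with your own); wrapping the logic in a function keeps the exit-code decisions in one place.

```shell
#!/bin/sh
# Skeleton of a script for `git bisect run ./test.sh`.
# Exit status contract (git's, not ours):
#   0 = good, 125 = skip this commit, 1-124/126-127 = bad.

build()      { true; }   # stand-in for your real build command
smoke_test() { true; }   # stand-in for your real regression test

bisect_step() {
    build      || return 125   # cannot build: skip, do not call it bad
    smoke_test || return 1     # test fails: this commit is bad
    return 0                   # test passes: this commit is good
}

bisect_step
echo "bisect verdict: $?"
```

The 125 case matters in practice: old commits often fail to build for unrelated reasons, and marking them "bad" would send the bisection down the wrong half.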
Correlation vs Causation¶
The most dangerous debugging trap: mistaking correlation for causation.
"The service started failing at 14:00."
"A deployment happened at 13:55."
Therefore: "The deployment caused the failure."
Maybe. Or maybe:
- A certificate expired at 14:00
- Traffic doubled at 14:00 (daily pattern)
- An upstream service deployed at 13:50
- A cronjob runs at 14:00 that consumes resources
Rule: A change correlates with a failure if they happen near each other in time. A change causes a failure only if reverting the change fixes the failure, or you can explain the mechanism.
How to test causation:¶
1. Revert the suspected change
└── If the problem goes away: strong evidence of causation
└── If the problem persists: correlation only
2. Reproduce in isolation
└── Can you trigger the failure by making only that change?
3. Explain the mechanism
└── Can you trace from the change to the symptom step by step?
The Five Whys¶
A root cause analysis technique. Keep asking "why" until you reach the systemic cause, not just the proximate one.
Problem: Production outage lasted 4 hours.
Why 1: Why was there an outage?
→ The database ran out of connections.
Why 2: Why did it run out of connections?
→ A query was running for 30 minutes and holding connections.
Why 3: Why was the query running for 30 minutes?
→ A missing index caused a full table scan on a 500M row table.
Why 4: Why was the index missing?
→ A migration dropped and recreated the table but forgot the index.
Why 5: Why did the migration miss the index?
→ There is no automated check that compares indexes before and after migrations.
Root cause: Missing automated index validation in the migration pipeline.
Fix: Add a CI step that compares schema (including indexes) before and after migrations.
Notice: the first "why" gives you the symptom fix (kill the query, increase connection limit). The fifth "why" gives you the systemic fix (prevent it from recurring). Both matter, but only the systemic fix prevents recurrence.
Who made it: The Five Whys was developed by Sakichi Toyoda and used within Toyota Motor Corporation during the 1930s. It became a core component of the Toyota Production System. Taiichi Ohno described it as "the basis of Toyota's scientific approach." The technique migrated to software via Lean manufacturing and is now standard practice in SRE postmortems at Google, Meta, and most tech companies.
Gotcha: The Five Whys can mislead if you follow a single causal chain. Real incidents usually have multiple contributing causes. If "Why 3" has two plausible answers, explore both branches. A better framing: "Five Whys, Multiple Branches" — draw a fault tree, not a single line.
Common Debugging Traps¶
| Trap | Description | Counter |
|---|---|---|
| Tunnel vision | Fixating on one hypothesis, ignoring evidence that contradicts it | Write down 3+ hypotheses before testing any |
| Shotgun debugging | Changing multiple things at once, hoping one helps | Change ONE variable at a time |
| Recency bias | Blaming the most recent change without evidence | Check correlation AND causation |
| Anchoring | First piece of information dominates your thinking | Re-evaluate hypotheses as new evidence emerges |
| Confirmation bias | Only looking for evidence that supports your theory | Actively seek disconfirming evidence |
| Blame routing | "It must be the network / the cloud / the other team" | Test the boundary between teams systematically |
The Debugging Checklist¶
Before you start thrashing, walk through this:
□ What is the actual symptom? (Not what you think the problem is)
□ When did it start? (Exact timestamp if possible)
□ What changed around that time? (Deploys, configs, traffic, externals)
□ Who/what is affected? (All users? Some? One region? One endpoint?)
□ Is it consistent or intermittent?
□ What have you already tried? (Document EVERY attempt)
□ What are your hypotheses? (List at least 3)
□ What is the fastest test to eliminate a hypothesis?
Key Takeaways¶
- Debugging is the scientific method: observe, hypothesize, predict, test, conclude
- Divide and conquer cuts the problem space in half with each test — do not search linearly
- Generate multiple hypotheses before testing any — avoid tunnel vision
- Correlation is not causation — a nearby change is a suspect, not a conviction
- Change one variable at a time — if you change three things and it works, you do not know which one fixed it
- The Five Whys finds systemic causes — the first answer is almost never the root cause
- Document everything you try — the attempt that did not work is data too
Wiki Navigation¶
Related Content¶
- Systems Thinking for Engineers (Topic Pack, L1) — Incident Response, Systems Thinking
- Change Management (Topic Pack, L1) — Incident Response
- Chaos Engineering Scripts (CLI) (Exercise Set, L2) — Incident Response
- Debugging Methodology Flashcards (CLI) (flashcard_deck, L1) — Debugging Methodology
- Incident Command & On-Call (Topic Pack, L2) — Incident Response
- Incident Response Flashcards (CLI) (flashcard_deck, L1) — Incident Response
- Incident Simulator (18 scenarios) (CLI) (Exercise Set, L2) — Incident Response
- Investigation Engine (CLI) (Exercise Set, L2) — Incident Response
- Ops War Stories & Pattern Recognition (Topic Pack, L2) — Incident Response
- Postmortems & SLOs (Topic Pack, L2) — Incident Response