Portal | Level: L1: Foundations | Topics: Debugging Methodology, Incident Response, Systems Thinking | Domain: DevOps & Tooling
Debugging Methodology - Primer¶
Why This Matters¶
You will spend more time debugging than building. That is not cynicism — it is arithmetic. Systems fail. Configs drift. Dependencies break. Services interact in ways nobody predicted. The difference between a senior engineer and a junior one is not that the senior encounters fewer problems. It is that the senior resolves them faster because they have a methodology.
Most engineers debug by instinct: poke at things, read logs, change stuff, hope it works. Sometimes it does. Often it does not, and the thrashing makes things worse. A systematic approach — hypothesis, test, eliminate, repeat — turns debugging from gambling into engineering.
The Scientific Method, Applied to Ops¶
Debugging is the scientific method applied to broken systems:
┌────────────────┐
│  1. Observe    │  What is actually happening?
└───────┬────────┘  (Symptoms, not assumptions)
        │
┌───────▼────────┐
│ 2. Hypothesize │  What could cause this?
└───────┬────────┘  (Generate multiple hypotheses)
        │
┌───────▼────────┐
│  3. Predict    │  If hypothesis X is true,
└───────┬────────┘  what else should be true?
        │
┌───────▼────────┐
│   4. Test      │  Check the prediction.
└───────┬────────┘  Change ONE variable.
        │
┌───────▼────────┐
│  5. Conclude   │  Confirmed or eliminated?
└───────┬────────┘  If eliminated, next hypothesis.
        │
        └─────────▶ Repeat until root cause found
In Practice¶
Observation: "The API is returning 502 errors."
Hypotheses (generate multiple before testing any):
1. The upstream service is down
2. The load balancer health check is failing
3. A recent deployment broke something
4. The upstream service is running but overloaded
5. DNS resolution changed
Prediction (for hypothesis 1): If the upstream is down, I should see it as not running in process list / container status.
Test: kubectl get pods -n api — pods are running, 1/1 Ready.
Conclusion: Hypothesis 1 eliminated. Move to hypothesis 2.
This feels slow at first. It is faster than thrashing.
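The loop above can be sketched as a script: each hypothesis gets a cheap check, the checks run in order, and the first one that confirms stops the loop. The check_* functions below are stand-ins for the real probes (kubectl get pods, curl against the load balancer's health endpoint, reading the deploy log); their return codes are hard-coded so only the structure is on display.

```shell
#!/bin/sh
# Sketch of the eliminate-hypotheses loop. The check_* functions are
# stand-ins for real probes; return codes are simulated for the demo.

check_upstream_down()  { return 1; }  # pods were 1/1 Ready -> not confirmed
check_lb_healthcheck() { return 1; }  # health checks passing -> not confirmed
check_recent_deploy()  { return 0; }  # deploy log shows a change -> confirmed

for hypothesis in check_upstream_down check_lb_healthcheck check_recent_deploy; do
    if "$hypothesis"; then
        echo "confirmed:  $hypothesis"
        break
    fi
    echo "eliminated: $hypothesis"
done
```

In a real incident the payoff is the written record: every "eliminated" line is a hypothesis you never have to revisit.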
Name origin: The word "debugging" is attributed to Grace Hopper, who in 1947 found an actual moth stuck in a relay of the Harvard Mark II computer. The moth was taped into the logbook with the note "First actual case of bug being found." The term "bug" for a fault predates computers — Thomas Edison used it in 1878 — but Hopper's story cemented it in computing lore.
Divide and Conquer¶
Complex systems have many components. You do not debug the whole system at once. You bisect it.
Request flow:
Client → DNS → Load Balancer → Ingress → Service A → Service B → Database

Where is it broken?

Step 1: Test the midpoint
        Can Service A reach Service B?
        ├── YES → Problem is between Client and Service A
        └── NO  → Problem is between Service A and Database

Step 2: Test the new midpoint
        Can Service B reach the Database?
        ├── YES → Problem is between Service A and B
        └── NO  → Problem is between Service B and Database

Step 3: Continue bisecting until isolated
This is binary search applied to infrastructure. Each test cuts the problem space in half. For a 10-component pipeline, you need at most 4 tests instead of 10.
Remember: Mnemonic for divide-and-conquer debugging: HALVE — Hypothesis, Assess midpoint, Left or right, Verify boundary, Eliminate half. Each iteration removes 50% of the search space.
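The bisection above can be simulated in a few lines of shell. Here probe N stands in for a real end-to-end check of the path up to hop N (a curl or nc at each boundary), and the broken hop is hard-coded so the loop's behavior is visible; the invariant is that probe(lo) always works and probe(hi) is always broken.

```shell
#!/bin/sh
# Binary search over a 7-hop pipeline:
#   1 Client  2 DNS  3 LB  4 Ingress  5 SvcA  6 SvcB  7 DB
BROKEN=5                              # simulated: failure at hop 5
probe() { [ "$1" -lt "$BROKEN" ]; }   # "does the path up to hop N work?"

lo=0; hi=7; tests=0                   # invariant: probe(lo) ok, probe(hi) broken
while [ $((hi - lo)) -gt 1 ]; do
    mid=$(( (lo + hi) / 2 ))
    tests=$((tests + 1))
    if probe "$mid"; then lo=$mid; else hi=$mid; fi
done
echo "first broken hop: $hi (found in $tests probes)"
```

With seven hops the loop needs at most three probes, matching ceil(log2(7)) = 3.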
The Layer Model¶
When you do not know where to start, work the layers:
| Layer | Test |
|---|---|
| Layer 7: Application | curl -v http://service/health |
| Layer 4: Transport | telnet service 8080 / nc -zv service 8080 |
| Layer 3: Network | ping service / traceroute service |
| Layer 2: Data Link | arp -a / ip neigh show |
| Layer 1: Physical | ethtool eth0 / cable check |
If Layer 3 works (ping succeeds) but Layer 4 fails (cannot connect to the port), the fault sits between those two layers: a firewall or security group is blocking the port, the service is not listening, or you are probing the wrong port. You have narrowed "the network" down to a specific set of possibilities.
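One way to drive the layer checks is bottom-up, stopping at the first failure; everything below the failing layer is then known-good. The functions below are stand-ins for the real commands (ethtool, ping, nc, curl), with a simulated fault at Layer 4:

```shell
#!/bin/sh
# Walk the layers bottom-up; the first failure localizes the fault.
# Return codes are simulated -- here Layer 4 is the one that fails.

check_l1_physical()  { return 0; }   # link is up
check_l3_network()   { return 0; }   # ping succeeds
check_l4_transport() { return 1; }   # port not reachable (simulated fault)
check_l7_app()       { return 0; }

for layer in check_l1_physical check_l3_network check_l4_transport check_l7_app; do
    if "$layer"; then
        echo "$layer: OK"
    else
        echo "$layer: FAILED -- fault sits between this layer and the one below"
        break
    fi
done
```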
Hypothesis-Driven Debugging¶
The key discipline: generate hypotheses before you start testing. Write them down. Rank by likelihood and ease of testing.
Hypothesis Generation Framework¶
For any symptom, ask:
1. What changed recently?
└── Deployments, config changes, infra changes, traffic patterns
2. What is different about the failing cases?
└── Specific users, regions, endpoints, time of day
3. What resources could be exhausted?
└── CPU, memory, disk, file descriptors, connections, rate limits
4. What dependencies could be failing?
└── Databases, caches, external APIs, DNS, certificates
5. What has failed like this before?
└── Check incident history, postmortems, known issues
Ranking Hypotheses¶
Test the most likely and easiest-to-verify hypotheses first:
| Hypothesis | Likelihood | Test Difficulty | Priority |
|---|---|---|---|
| Recent deployment broke it | High | Easy (check deploy log) | 1st |
| Database connection pool exhausted | Medium | Easy (check metrics) | 2nd |
| DNS resolution changed | Low | Easy (dig/nslookup) | 3rd |
| Kernel bug triggered by traffic | Low | Hard (core dump analysis) | Last |
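The ranking can be made mechanical. In this sketch likelihood is scored 3 (High) down to 1 (Low) and test difficulty 1 (Easy) up to 3 (Hard); the scale is ours, not a standard. A single sort orders by likelihood descending, then by difficulty ascending:

```shell
#!/bin/sh
# Rank hypotheses: field 1 = likelihood (3=High..1=Low),
# field 2 = test difficulty (1=Easy..3=Hard). Scores are illustrative.
ranked=$(printf '%s\n' \
    '1 1 dns resolution changed' \
    '3 1 recent deployment broke it' \
    '1 3 kernel bug triggered by traffic' \
    '2 1 db connection pool exhausted' \
  | sort -k1,1nr -k2,2n)   # most likely first; among ties, easiest test first
echo "$ranked"
```

The easy-and-likely hypothesis surfaces at the top; the hard-and-unlikely kernel bug sinks to the bottom, exactly the order you want to spend your time in.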
Binary Search for Root Cause¶
When a change caused the problem but you do not know which change:
Scenario: Something broke between Monday and Friday.
25 commits were merged.
Don't: Read all 25 commits hoping one looks suspicious.
Do: Binary search.
Commits: 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25
                                    ▲
                             Test commit 13
                             ├── Works?  → Bug is in 14-25
                             └── Broken? → Bug is in 1-13
This takes at most ceil(log2(25)) = 5 tests instead of 25.
Under the hood:
git bisect uses an actual binary search algorithm. For N commits, it takes at most ceil(log2(N)) steps. With 1,000 commits, that is 10 tests. You can even automate it: git bisect run ./test.sh will run your script at each midpoint and mark good/bad automatically — no human interaction needed.
Git has this built in:
git bisect start
git bisect bad HEAD # Current state is broken
git bisect good abc123 # This commit was known good
# Git checks out the midpoint — test it
git bisect good # or: git bisect bad
# Repeat until git identifies the exact commit
git bisect reset # Return to original state
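To automate the loop with git bisect run, you supply a script whose exit status classifies each commit. The contract is git's: exit 0 marks the commit good, 125 tells bisect to skip an untestable commit, and any other status from 1 through 127 marks it bad. The build and smoke-test commands below are stand-ins (replace them with your own); wrapping the logic in a function keeps the exit-code decisions in one place.

```shell
#!/bin/sh
# Skeleton of a script for `git bisect run ./test.sh`.
# Exit status contract (git's, not ours):
#   0 = good, 125 = skip this commit, 1-124/126-127 = bad.

build()      { true; }   # stand-in for your real build command
smoke_test() { true; }   # stand-in for your real regression test

bisect_step() {
    build      || return 125   # cannot build: skip, do not call it bad
    smoke_test || return 1     # test fails: this commit is bad
    return 0                   # test passes: this commit is good
}

bisect_step
echo "bisect verdict: $?"
```

The 125 case matters in practice: old commits often fail to build for unrelated reasons, and marking them "bad" would send the bisection down the wrong half.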
Correlation vs Causation¶
The most dangerous debugging trap: mistaking correlation for causation.
"The service started failing at 14:00."
"A deployment happened at 13:55."
Therefore: "The deployment caused the failure."
Maybe. Or maybe:
- A certificate expired at 14:00
- Traffic doubled at 14:00 (daily pattern)
- An upstream service deployed at 13:50
- A cronjob runs at 14:00 that consumes resources
Rule: A change correlates with a failure if they happen near each other in time. A change causes a failure only if reverting the change fixes the failure, or you can explain the mechanism.
How to test causation:¶
1. Revert the suspected change
└── If the problem goes away: strong evidence of causation
└── If the problem persists: correlation only
2. Reproduce in isolation
└── Can you trigger the failure by making only that change?
3. Explain the mechanism
└── Can you trace from the change to the symptom step by step?
The Five Whys¶
A root cause analysis technique. Keep asking "why" until you reach the systemic cause, not just the proximate one.
Problem: Production outage lasted 4 hours.
Why 1: Why was there an outage?
→ The database ran out of connections.
Why 2: Why did it run out of connections?
→ A query was running for 30 minutes and holding connections.
Why 3: Why was the query running for 30 minutes?
→ A missing index caused a full table scan on a 500M row table.
Why 4: Why was the index missing?
→ A migration dropped and recreated the table but forgot the index.
Why 5: Why did the migration miss the index?
→ There is no automated check that compares indexes before and after migrations.
Root cause: Missing automated index validation in the migration pipeline.
Fix: Add a CI step that compares schema (including indexes) before and after migrations.
Notice: the first "why" gives you the symptom fix (kill the query, increase connection limit). The fifth "why" gives you the systemic fix (prevent it from recurring). Both matter, but only the systemic fix prevents recurrence.
Who made it: The Five Whys was developed by Sakichi Toyoda and used within Toyota Motor Corporation during the 1930s. It became a core component of the Toyota Production System. Taiichi Ohno described it as "the basis of Toyota's scientific approach." The technique migrated to software via Lean manufacturing and is now standard practice in SRE postmortems at Google, Meta, and most tech companies.
Gotcha: The Five Whys can mislead if you follow a single causal chain. Real incidents usually have multiple contributing causes. If "Why 3" has two plausible answers, explore both branches. A better framing: "Five Whys, Multiple Branches" — draw a fault tree, not a single line.
Common Debugging Traps¶
| Trap | Description | Counter |
|---|---|---|
| Tunnel vision | Fixating on one hypothesis, ignoring evidence that contradicts it | Write down 3+ hypotheses before testing any |
| Shotgun debugging | Changing multiple things at once, hoping one helps | Change ONE variable at a time |
| Recency bias | Blaming the most recent change without evidence | Check correlation AND causation |
| Anchoring | First piece of information dominates your thinking | Re-evaluate hypotheses as new evidence emerges |
| Confirmation bias | Only looking for evidence that supports your theory | Actively seek disconfirming evidence |
| Blame routing | "It must be the network / the cloud / the other team" | Test the boundary between teams systematically |
The Debugging Checklist¶
Before you start thrashing, walk through this:
□ What is the actual symptom? (Not what you think the problem is)
□ When did it start? (Exact timestamp if possible)
□ What changed around that time? (Deploys, configs, traffic, externals)
□ Who/what is affected? (All users? Some? One region? One endpoint?)
□ Is it consistent or intermittent?
□ What have you already tried? (Document EVERY attempt)
□ What are your hypotheses? (List at least 3)
□ What is the fastest test to eliminate a hypothesis?
Key Takeaways¶
- Debugging is the scientific method: observe, hypothesize, predict, test, conclude
- Divide and conquer cuts the problem space in half with each test — do not search linearly
- Generate multiple hypotheses before testing any — avoid tunnel vision
- Correlation is not causation — a nearby change is a suspect, not a conviction
- Change one variable at a time — if you change three things and it works, you do not know which one fixed it
- The Five Whys finds systemic causes — the first answer is almost never the root cause
- Document everything you try — the attempt that did not work is data too
Wiki Navigation¶
Related Content¶
- Systems Thinking for Engineers (Topic Pack, L1) — Incident Response, Systems Thinking
- Change Management (Topic Pack, L1) — Incident Response
- Chaos Engineering Scripts (CLI) (Exercise Set, L2) — Incident Response
- Debugging Methodology Flashcards (CLI) (flashcard_deck, L1) — Debugging Methodology
- Incident Command & On-Call (Topic Pack, L2) — Incident Response
- Incident Response Flashcards (CLI) (flashcard_deck, L1) — Incident Response
- Incident Simulator (18 scenarios) (CLI) (Exercise Set, L2) — Incident Response
- Investigation Engine (CLI) (Exercise Set, L2) — Incident Response
- Ops War Stories & Pattern Recognition (Topic Pack, L2) — Incident Response
- Postmortems & SLOs (Topic Pack, L2) — Incident Response