
Debugging Methodology


32 cards — 🟢 10 easy | 🟡 15 medium | 🔴 7 hard

🟢 Easy (10)

1. What are the five steps of the scientific method applied to debugging?

Show answer 1) Observe: what is actually happening (symptoms, not assumptions). 2) Hypothesize: what could cause this. 3) Predict: if hypothesis X is true, what else should be true. 4) Test: check the prediction, change one variable. 5) Conclude: confirmed or eliminated, then repeat.

Remember: mnemonic OH-PTC — Observe, Hypothesize, Predict, Test, Conclude. The same method finds bugs in DNA and bugs in code.

2. What is divide-and-conquer debugging and why is it more efficient than linear search?

Show answer Divide-and-conquer bisects the system at the midpoint and tests there, cutting the problem space in half with each test. For a 10-component pipeline, you need at most ⌈log2 10⌉ = 4 tests instead of 10. It is binary search applied to infrastructure.

Example: 10-stage pipeline broken? Test stage 5. Works? Bug is in 6-10. Half the problem eliminated in one test.

Remember: binary search applied to systems — log2(N) tests instead of N sequential checks.
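The bisection loop above can be sketched in plain shell. Here `stage_ok` is a hypothetical probe (everything in this sketch is fabricated for the demo) that reports whether the pipeline is healthy through stage N; we simulate a bug entering at stage 7:

```shell
#!/bin/sh
# Hypothetical probe: stage_ok N succeeds if the pipeline is healthy
# through stage N. For this sketch we simulate a bug landing at stage 7.
BUG=7
stage_ok() { [ "$1" -lt "$BUG" ]; }

lo=1    # known-good stage
hi=10   # known-bad stage
while [ $((hi - lo)) -gt 1 ]; do
  mid=$(( (lo + hi) / 2 ))
  if stage_ok "$mid"; then
    lo=$mid   # healthy through mid: bug is later
  else
    hi=$mid   # broken at mid: bug is mid or earlier
  fi
done
echo "first broken stage: $hi"
```

Three probes instead of ten sequential checks; in a real pipeline `stage_ok` would be a curl against each stage's health endpoint.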

3. What is the Five Whys technique and why does the first answer rarely give the root cause?

Show answer The Five Whys is a root cause analysis technique where you keep asking "why" until you reach the systemic cause. The first "why" typically gives the symptom fix (e.g., kill the slow query). The fifth "why" gives the systemic fix (e.g., add automated index validation in migration pipeline) that prevents recurrence.

Example: Why down? Bad deploy. Why? Wrong config. Why? No validation. Why? No CI gate. Why? Never built one. Fix: add config validation to CI.

4. Why should you preserve evidence before attempting a fix?

Show answer Restarting services, clearing logs, or redeploying destroys the information needed to understand root cause. Save logs, record versions, snapshot config, and keep failing input before touching anything.

Remember: CSI rule — Capture State Immediately. Restarting destroys the crime scene. Save logs, snapshots, and config before touching anything.

Example: kubectl logs <pod> > /tmp/crash.log && kubectl describe pod <pod> > /tmp/describe.txt — THEN restart.

5. Why should you read an error message twice?

Show answer The first reading is emotional (panic, frustration). The second reading is analytical: extract the exact text, file/line/function, component name, timing, and whether the error is primary or secondary fallout.

Remember: first reading is emotional (panic). Second is analytical: extract file, line, component, timing, and whether the error is primary or secondary.

6. Why are print statements still an effective debugging technique?

Show answer They are cheap, local, require no setup, and brutally effective at revealing values, branches taken, timing, and request flow. They can be added anywhere instantly and removed just as easily.

Gotcha: use structured logging in production. But locally, print is zero-setup instant insight. Remove before committing.
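In shell scripts, `set -x` plays the role of print statements: it traces every command to stderr with variables expanded. A minimal sketch (`sum_to` is a made-up function standing in for code under investigation):

```shell
#!/bin/sh
# set -x is the shell's built-in "print statement": it echoes each
# command, with variables expanded, to stderr before running it.
sum_to() {            # hypothetical function under investigation
  total=0
  for n in $(seq 1 "$1"); do
    total=$((total + n))
  done
  echo "total=$total"
}

set -x                # start tracing
sum_to 3
set +x                # stop tracing -- like removing prints before commit
```

The trace shows every loop iteration and the value of `total` at each step, with zero setup.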

7. Why is dstat a good first tool when a machine is misbehaving?

Show answer dstat shows CPU, disk, network, and memory stats updating in real time in a single view. It quickly answers "is the machine CPU-bound, disk-bound, memory-bound, or network-bound?" without switching between tools.

Remember: DCNM — dstat shows CPU, Disk, Network, Memory in one real-time view. Instantly answers "what kind of bottleneck?"

See also: modern alternatives include glances (Python), btop (interactive), dool (dstat fork on newer distros).

8. What three questions can lsof answer?

Show answer 1. What files is this process holding open? (lsof -p <pid>)
2. Who is listening on this port? (lsof -i :<port>)
3. Why won't this filesystem unmount? (lsof /mount/point)

Remember: lsof = LiSt Open Files. On Linux everything is a file — sockets, pipes, devices. lsof sees them all.

9. How do you list all TCP connections with process info using ss?

Show answer ss -tp shows established TCP connections with the owning process. Add -l for listening sockets (ss -tlnp), -u for UDP (ss -unp).

Remember: ss = Socket Statistics. ss -tlnp = TCP Listening Numeric Process. Replaced netstat on modern Linux.

10. When should you check dmesg during debugging and what would you look for?

Show answer Check dmesg when processes are killed unexpectedly (OOM killer messages), hardware errors are suspected (disk I/O errors, NIC failures), or containers crash without application logs. Look for: "Out of memory: Killed process", "I/O error", "segfault at", or "hardware error".

Example: dmesg -T for human timestamps. Grep: "Out of memory", "I/O error", "segfault" — kernel evidence invisible to application logs.

🟡 Medium (15)

1. What five questions should you ask to generate debugging hypotheses?

Show answer 1) What changed recently? (deploys, config, infra, traffic). 2) What is different about the failing cases? (users, regions, endpoints, time). 3) What resources could be exhausted? (CPU, memory, disk, FDs, connections). 4) What dependencies could be failing? (DBs, caches, APIs, DNS, certs). 5) What has failed like this before? (incident history, postmortems).

Remember: WIRED — What changed, Is different about failing cases, Resources exhausted, External deps failing, Done this before? Five hypothesis generators.

2. What is "shotgun debugging" and why is changing one variable at a time critical?

Show answer Shotgun debugging is changing multiple things at once hoping one helps. The problem is that if the issue resolves, you do not know which change actually fixed it, so you cannot prevent recurrence. Always change one variable at a time so you know which change had the effect.

Gotcha: changing 3 things and seeing a fix means 3 possible causes and zero understanding. Always change one variable at a time.

3. How do you distinguish correlation from causation when a deployment and a failure happen near the same time?

Show answer Three tests: 1) Revert the suspected change — if the problem goes away, strong evidence of causation. 2) Reproduce in isolation — can you trigger the failure by making only that change? 3) Explain the mechanism — can you trace from the change to the symptom step by step? Correlation is temporal proximity; causation requires a verifiable mechanism.

Example: CPU spikes at 3 PM when the cron runs, but the real cause is the database backup that also starts at 3 PM. Temporal proximity is not causation.

4. How does the network layer model help narrow down connectivity problems?

Show answer Test from L7 down: curl for application (L7), telnet/nc for transport (L4), ping for network (L3). If L3 works (ping succeeds) but L4 fails (cannot connect to port), the problem is narrowed to: firewall, security group, service not listening, or wrong port. Each layer test eliminates a class of causes.

Remember: test top-down — L7 curl, L4 nc/telnet, L3 ping. Each success eliminates a class of problems.

5. Why is reproduction the most important step in debugging?

Show answer Reproduction creates leverage: you can test hypotheses, validate fixes, and write regression tests. Without reproduction, debugging is guesswork and you cannot confirm the fix actually works.

Gotcha: without reproduction, debugging is guesswork. Reproduction turns debugging from art into science — you can verify fixes and write regression tests.

6. What categories of causes should you brainstorm when debugging?

Show answer Config changes, dependency behavior changes, recent code/deploy changes, race conditions, stale data or caches, time zone issues, permission changes, and bad assumptions in your mental model.

Remember: CCRD-CTPB — Config, Code/deploy, Race conditions, Dependencies, Caches/stale data, Time zones, Permission changes, Bad assumptions.

7. Why is writing a tiny reproducer valuable during debugging?

Show answer It strips away irrelevant complexity, making the bug's mechanism visible. A minimal reproducer also serves as the basis for a regression test and is easier to share when asking for help.

Example: reduce 500 lines to 10 that still fail. Now you can share it, understand it, and write a regression test from it.
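One mechanical way to shrink a reproducer is to bisect the failing input itself, keeping whichever half still fails. A crude sketch of delta debugging, where `fails` is a stand-in for your real failure check and the input lines are fabricated for the demo:

```shell
#!/bin/sh
# Crude input bisection: keep whichever half of the input still fails.
# "fails" stands in for the real check (here: input contains BOOM).
fails() { grep -q BOOM "$1"; }

work=$(mktemp -d)
printf '%s\n' line1 line2 BOOM line4 line5 line6 line7 line8 > "$work/case"

while [ "$(wc -l < "$work/case")" -gt 1 ]; do
  n=$(wc -l < "$work/case")
  head -n $((n / 2))      "$work/case" > "$work/top"
  tail -n $((n - n / 2))  "$work/case" > "$work/bottom"
  if   fails "$work/top";    then cp "$work/top"    "$work/case"
  elif fails "$work/bottom"; then cp "$work/bottom" "$work/case"
  else break   # bug needs lines from both halves; real delta debugging recurses
  fi
done
cat "$work/case"    # the shrunken reproducer
```

Eight lines become one in three iterations; the surviving file is exactly the regression-test input.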

8. How does finding a version that works help debugging?

Show answer A known-good baseline lets you narrow the search to what changed between working and broken. Use git bisect, deploy history, or version comparison to find the exact change that introduced the failure.

Example: "Worked yesterday" → git bisect between yesterday's and today's deploy. "Works on A not B" → diff their configs.

9. Why should you change only one variable at a time when debugging?

Show answer If you change multiple things and the problem resolves, you do not know which change fixed it. You cannot prevent recurrence, write a targeted test, or explain the root cause to others.

Remember: the scientific method demands controlled experiments. Multi-variable changes create mystery fixes you cannot explain or reproduce.

10. What kinds of problems is strace best at revealing?

Show answer File access errors (ENOENT, EACCES), network connection failures, missing libraries, permission problems, signal delivery, and slow syscalls. It shows every syscall a process makes with arguments and return values.

Example: strace -e trace=file reveals ENOENT. strace -e trace=network reveals connection failures. strace -c shows syscall time stats.

11. What does perf help you understand that top does not?

Show answer perf shows WHERE CPU time is going at the function level (hot functions, call stacks) and can trace kernel and user-space events. top only shows per-process CPU percentage.

Example: perf top = live function-level hotspots. perf record -g = call graph capture. perf stat = hardware counters (cache misses, mispredicts).

12. A process is leaking file descriptors. How would you confirm this and find what it is opening?

Show answer Check /proc/<pid>/fd — each entry is a symlink to an open file or socket. Count them over time (ls /proc/<pid>/fd | wc -l) to confirm growth. Read the symlink targets to see what is being leaked (ls -la /proc/<pid>/fd). lsof -p <pid> gives the same data with more detail.

Gotcha: default fd limit is often 1024 (ulimit -n). FD leaks hit this limit causing "Too many open files" even with free disk and memory.

13. A service is slow but top shows low CPU usage. What does that tell you and what do you check next?

Show answer Low CPU with high latency means the process is waiting, not computing. It is likely I/O-bound or blocked on a lock/network call. Check: iostat (disk I/O saturation), ss -tp (connection state — many CLOSE_WAIT or SYN_SENT?), strace -p <pid> -e trace=network (what syscall is it stuck on?). If strace shows futex or poll, the process is idle waiting for something external.

Remember: low CPU + high latency = WAITING, not computing. Check I/O (iostat), connections (ss), blocked syscalls (strace). Process is stuck externally.

14. What do vmstat and iostat show that top does not?

Show answer vmstat shows memory, swap, I/O, and CPU stats per interval (reveals swapping and I/O wait). iostat shows per-device disk I/O statistics (throughput, queue depth, utilization per disk).

Example: vmstat 1 per-second: si/so columns reveal swapping. iostat -x 1 shows per-device %util and await (I/O latency ms).

15. How do you find which process is preventing a port from being reused?

Show answer lsof -i :<port> shows all processes with that port open. This reveals zombie listeners, processes in TIME_WAIT, or unexpected services that grabbed the port first.

🔴 Hard (7)

1. How does git bisect use binary search to find a breaking commit, and what is its worst-case efficiency?

Show answer git bisect start, mark HEAD as bad and a known-good commit as good. Git checks out the midpoint; you test and mark it good or bad. Repeat until the exact breaking commit is found. For N commits, worst case is log2(N) tests — 25 commits need at most 5 tests instead of 25.

Example: 1000 commits? git bisect finds the breaking one in ~10 tests. Automate: git bisect run ./test.sh for fully hands-off binary search.

2. What are tunnel vision and confirmation bias in debugging, and how do you counter them?

Show answer Tunnel vision is fixating on one hypothesis and ignoring contradictory evidence. Confirmation bias is only looking for evidence that supports your theory. Counter both by: writing down at least 3 hypotheses before testing any, and actively seeking disconfirming evidence for your leading theory.

Remember: write THREE hypotheses before investigating ANY. Forces broader thinking and prevents latching onto the first idea.

3. What eight questions should you walk through before starting to debug, according to the debugging checklist?

Show answer 1) What is the actual symptom? 2) When did it start (exact timestamp)? 3) What changed around that time? 4) Who/what is affected (scope)? 5) Is it consistent or intermittent? 6) What have you already tried? 7) What are your hypotheses (list at least 3)? 8) What is the fastest test to eliminate a hypothesis?

Remember: SWITCH-HF — Symptom, When, whIch changed, sCope, Half-intermittent, tried, Hypotheses (3+), Fastest test.

4. What five things should you do after fixing a bug?

Show answer 1. Write a regression test.
2. Document the root cause.
3. Improve observability (logging, metrics, alerts).
4. Remove misleading logs or dead code.
5. Ask: what would have made this easier to diagnose?

Remember: RIDOC — Regression test, Improve observability, Document root cause, Obsolete misleading code, Consider what would have helped diagnose faster.
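Step 1 in miniature: pin the exact input that exposed the bug as a test that runs forever after. `slugify` is a hypothetical just-fixed helper that once mangled consecutive spaces:

```shell
#!/bin/sh
# Hypothetical fixed helper: lowercase, turn runs of spaces into one dash.
slugify() { printf '%s' "$1" | tr '[:upper:]' '[:lower:]' | tr -s ' ' '-'; }

# Regression test: the exact input that triggered the bug, pinned forever.
if [ "$(slugify 'Hello  World')" = "hello-world" ]; then
  echo "regression test: PASS"
else
  echo "regression test: FAIL"
fi
```

The failing reproducer from the debugging session converts directly into this test; if the bug ever returns, it is caught in CI rather than in production.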

5. Name five strategies for getting unstuck during debugging.

Show answer Take a break (diffuse thinking), pair with someone, timebox the rabbit hole, explain the bug out loud (rubber duck), and verify the code running is actually the code you changed (stale deploys, wrong branch).

Remember: BPTED — Break (rest), Pair (fresh eyes), Timebox (15 min), Explain (rubber duck), Deploy check (right code running?).

6. What are common debugging mistakes that Linux tools can prevent?

Show answer Reaching for application logs only (instead of OS-level data), assuming "slow" means CPU (could be disk or network), assuming "network issue" without packet capture, ignoring /proc, and restarting services before gathering evidence.

Remember: ALARM — App logs only, Latency assumed CPU, Assuming network, Restart before evidence, Missing /proc.

7. You need to strace a production service handling 10K requests/sec. What precautions do you take?

Show answer strace uses ptrace which stops the process on every traced syscall — at 10K req/s the overhead is severe. Precautions: (1) Filter aggressively with -e trace=network or -e trace=file to minimize intercepted calls. (2) Write to file with -o, never to terminal. (3) Limit duration to 10-30 seconds max. (4) Consider perf trace instead (kernel tracepoints, ~5x less overhead). (5) If possible, attach to a single worker thread with -p <tid> rather than following every thread with -f.

Remember: strace overhead = ptrace per syscall x request rate. At 10K req/s severe slowdown. Always filter (-e), limit duration, prefer perf trace.