Debugging Methodology Footguns
- Fixing symptoms instead of root causes. The disk is full. You delete old logs. It fills up again in a week. You delete logs again. You do this monthly for a year. The root cause — a logging misconfiguration writing debug-level output to a production log — never gets addressed. You have turned an engineering problem into a recurring operations task.
Fix: After every fix, ask: "Will this problem come back?" If yes, you fixed the symptom. Apply the Five Whys to find the systemic cause. Fix both: the immediate symptom (clear disk) AND the root cause (fix log levels, add rotation).
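On Linux, the usual mechanism for the "add rotation" half of that fix is logrotate. A minimal sketch of a rule, assuming a hypothetical application log at /var/log/myapp/app.log (the path, retention, and size limits here are placeholders to adapt, not recommendations):

```conf
# Hypothetical logrotate rule for an application log.
# Path, rotate count, and maxsize are assumptions -- tune for your system.
/var/log/myapp/app.log {
    daily
    rotate 7          # keep one week of rotated logs
    maxsize 500M      # rotate early if the file grows past this
    compress
    delaycompress
    missingok
    notifempty
    copytruncate      # for apps that keep the file handle open
}
```

Note that rotation caps the symptom; the log-level misconfiguration still needs its own fix, or you are just compressing the waste.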
- Tunnel vision — fixating on one hypothesis. You see a deployment happened 10 minutes before the error spike. You spend 3 hours analyzing that deployment. You roll it back. The problem persists. You wasted 3 hours because you never considered that a certificate also expired at the same time. You had one hypothesis and refused to let go.
Fix: Before testing anything, write down at least three hypotheses. Rank them by likelihood and ease of testing. When your current hypothesis survives two tests without confirmation, step back and reconsider the others. Actively seek evidence that disproves your favorite theory.
Remember: Richard Feynman's first principle of scientific inquiry: "You must not fool yourself, and you are the easiest person to fool." In debugging, confirmation bias is the default mode. You see evidence that supports your hypothesis and ignore evidence that contradicts it. The antidote is to explicitly try to disprove your theory before acting on it. If you can't disprove it after two honest attempts, confidence increases.
- Changing multiple variables at once. Under time pressure, you restart the service, increase memory limits, update the config, and rotate credentials — all at once. The service recovers. In the postmortem, you cannot explain what the root cause was because you changed four things simultaneously. You also cannot tell which changes are safe to revert.
Fix: Change one variable at a time. Test after each change. If time pressure makes this impossible during an incident, apply all fixes to restore service, then revert them one at a time afterward to identify which one was necessary. Document the finding.
- Not building a timeline. You jump straight into log analysis without establishing when the problem started. You grep through hours of logs looking for anything suspicious. You find errors that are actually normal. You miss the real error because it happened at a timestamp you did not think to check. Everything takes 3x longer.
Fix: First action: establish the timeline. When did the symptom start? (Check monitoring graphs for the inflection point.) What changed in the 30 minutes before that? (Deployments, config changes, cron jobs, traffic patterns.) Then search logs only in the relevant time window.
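Finding the inflection point can be automated if you can export the metric as timestamped samples. A minimal sketch in Python (the function name and threshold are mine, not a standard API): flag the first sample that jumps well past a rolling baseline.

```python
from datetime import datetime, timedelta

def find_inflection(samples, baseline_window=5, factor=3.0):
    """Return the timestamp where a metric first exceeds `factor` times
    its rolling baseline, or None if it never does.

    `samples` is a list of (datetime, value) pairs, oldest first.
    """
    for i in range(baseline_window, len(samples)):
        window = [v for _, v in samples[i - baseline_window:i]]
        baseline = sum(window) / baseline_window
        ts, value = samples[i]
        if baseline > 0 and value > factor * baseline:
            return ts
    return None

# Illustrative data: error counts per minute; the spike begins at 14:00.
start = datetime(2024, 5, 1, 13, 50)
series = [(start + timedelta(minutes=m), 2) for m in range(10)]
series += [(start + timedelta(minutes=10 + m), 40) for m in range(5)]
print(find_inflection(series))  # -> 2024-05-01 14:00:00
```

With the inflection timestamp in hand, the "what changed in the 30 minutes before" question has a concrete window to search.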
- Blaming the network without testing it. The application is failing. "Must be the network." You open a ticket with the network team. They investigate for 2 hours and find nothing. Meanwhile, the actual problem — a misconfigured environment variable pointing to the wrong database host — sits untouched. You wasted your time and the network team's time.
Fix: Test the network before blaming it: ping, traceroute, nc, curl. If packets reach the destination and ports are open, the network is working. Check the application layer: DNS resolution, connection strings, TLS handshakes, auth. The network is the problem far less often than people assume.
War story: A team spent 3 days debugging "network issues" between two microservices. Packet captures showed clean TCP handshakes and successful connections. The actual problem: the application was making DNS lookups for the service name, which resolved to a stale IP from a previous deployment. A single nslookup would have revealed the problem in 5 seconds. The lesson: always verify DNS resolution as step 1 when services can't communicate.
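The layered check described above can be scripted so it runs the same way every time. A sketch in Python using only the standard library (the function names and messages are mine): resolve the name first, then attempt a TCP handshake, and report the first layer that fails.

```python
import socket

def dns_resolves(host):
    """Layer 1: does the name resolve, and to which IP?"""
    try:
        return socket.gethostbyname(host)
    except socket.gaierror:
        return None

def port_open(host, port, timeout=2.0):
    """Layer 2: can we complete a TCP handshake to host:port?"""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False

def diagnose(host, port):
    """Walk the layers in order; report the first one that fails."""
    ip = dns_resolves(host)
    if ip is None:
        return f"DNS: {host} does not resolve"
    if not port_open(ip, port):
        return f"TCP: {host} ({ip}) resolves but port {port} is closed/filtered"
    return f"OK: {host} -> {ip}:{port} reachable; check the application layer"
```

Because the DNS step runs first and prints the resolved IP, a stale address like the one in the war story surfaces immediately, before anyone opens a ticket with the network team.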
- Searching logs without knowing what you are looking for. You open the log viewer and start scrolling. Or you grep for "error" and get 50,000 results. You read through them hoping something jumps out. This is not debugging — it is hoping. It works occasionally, which makes it feel productive, but it is the least efficient approach.
Fix: Start with a hypothesis. "If the database connection pool is exhausted, I expect to see connection timeout errors in the application log between 14:00 and 14:15." Then search specifically for that evidence. Confirm or deny the hypothesis. Move to the next one. Targeted searches beat aimless scrolling every time.
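A targeted search like that is easy to express in code. A minimal sketch in Python, assuming log lines that begin with a `YYYY-MM-DD HH:MM:SS` timestamp (the function name and sample lines are illustrative):

```python
from datetime import datetime

def find_evidence(lines, pattern, start, end):
    """Return log lines containing `pattern` whose timestamp falls in
    [start, end). Lines without a leading timestamp are skipped."""
    hits = []
    for line in lines:
        try:
            ts = datetime.strptime(line[:19], "%Y-%m-%d %H:%M:%S")
        except ValueError:
            continue  # not a timestamped line
        if start <= ts < end and pattern in line:
            hits.append(line)
    return hits

log = [
    "2024-05-01 13:58:12 INFO request served in 45ms",
    "2024-05-01 14:03:09 ERROR connection timeout: pool exhausted",
    "2024-05-01 14:21:40 ERROR connection timeout: pool exhausted",
]
window = (datetime(2024, 5, 1, 14, 0), datetime(2024, 5, 1, 14, 15))
print(find_evidence(log, "connection timeout", *window))
# -> the single 14:03:09 line; the 14:21 hit is outside the window
```

An empty result is just as useful as a hit: it cleanly denies the hypothesis and tells you to move to the next one.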
- Confusing correlation with causation. A deployment happened at 13:55. Errors started at 14:00. You roll back the deployment. Errors persist. The deployment was not the cause — a background cron job that runs at 14:00 was the actual trigger. You wasted time rolling back and now you are running old code and still have the problem.
Fix: Correlation is a clue, not a conclusion. Before acting on a correlation, test causation: can you explain the mechanism from the change to the symptom? Does reverting the change fix the symptom? Can you reproduce the failure by making the change in isolation? Only act when you have at least two of these three.
- Not documenting what you have already tried. You try five things. None work. A colleague joins to help. They suggest something. You cannot remember if you already tried it. You try it again, wasting 20 minutes. Or worse, you skip it because you think you tried it, but you actually tried a subtly different variation.
Fix: Keep a running log during debugging. For each attempt: what you changed, what you expected, what actually happened. Share this log when handing off to another engineer. This is not bureaucracy — it is efficiency. A shared debugging log prevents duplicate work and preserves institutional memory.
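One minimal shape for that running log, with the three fields from the fix as columns (the entries below are invented purely to show the format):

```markdown
| # | Time  | Change tried                   | Expected          | Actual               |
|---|-------|--------------------------------|-------------------|----------------------|
| 1 | 14:05 | Restarted app service          | Errors stop       | Errors continue      |
| 2 | 14:12 | Raised DB pool size 10 -> 50   | Timeouts stop     | Timeouts drop, not 0 |
```

Paste this into the incident channel as you go; the handoff then takes one link instead of a 20-minute recap.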
- Rebooting as a first resort. Something is wrong. Reboot. It works. Problem "solved." Except: you destroyed all the evidence (process state, memory contents, network connections, kernel logs). If the problem was a memory leak, it will return in exactly the same amount of time. If it was a deadlock, you will never know what caused it because the state is gone.
Fix: Before rebooting, capture the state: take a heap dump, save /proc information, copy logs, run diagnostic commands. Reboot only after you have collected evidence OR the business impact makes immediate restoration mandatory. After rebooting, investigate using the captured data. Rebooting without diagnostics is not a fix — it is a delay.
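The capture step is worth scripting in advance so it takes seconds under pressure. A sketch for a Linux-ish host (the output path and command list are assumptions; dmesg and ss may need privileges, hence the `|| true` so one failure does not abort the rest):

```shell
#!/bin/sh
# Minimal pre-reboot evidence capture -- adapt paths/commands to your platform.
OUTDIR="/tmp/evidence-$(date +%Y%m%d-%H%M%S)"
mkdir -p "$OUTDIR"

ps auxww          > "$OUTDIR/processes.txt" 2>/dev/null || true
cat /proc/meminfo > "$OUTDIR/meminfo.txt"   2>/dev/null || true
cat /proc/loadavg > "$OUTDIR/loadavg.txt"   2>/dev/null || true
ss -tanp          > "$OUTDIR/sockets.txt"   2>/dev/null || true
dmesg             > "$OUTDIR/dmesg.txt"     2>/dev/null || true
df -h             > "$OUTDIR/disk.txt"      2>/dev/null || true

echo "Evidence saved to $OUTDIR"
```

Application-specific captures (heap dumps, thread dumps, copies of the service's own logs) belong in the same script, so nobody has to remember them at 3 a.m.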
- Not writing a postmortem. The incident is resolved. Everyone is tired. Nobody wants to write up what happened. "We'll remember." You will not. Three months later, the same failure mode occurs. A new on-call engineer spends 4 hours rediscovering what the team already learned. No systemic improvements were made because nobody documented what needed to change.
Fix: Write a postmortem for every incident that exceeds 30 minutes or affects customers. It does not need to be long: timeline, root cause, fix, prevention items with owners and deadlines. Make postmortem review a team ritual. The postmortem is not punishment — it is the mechanism by which the organization learns.
Gotcha: The biggest postmortem failure mode is "action items with no owners and no deadlines." A postmortem that says "we should add monitoring" but doesn't assign a person and a date is just documentation of regret. Track action items in your project management tool with the same priority as feature work. Google's SRE book recommends that postmortem action items be treated as bugs — triaged, assigned, and tracked to completion.
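A skeleton that covers the four items from the fix, with owners and deadlines built into the action-item table so the failure mode in the gotcha is hard to commit (all names, dates, and ticket IDs are placeholders):

```markdown
# Postmortem: <incident title> (<date>)

## Timeline
- 14:00 Errors begin (first alert 14:04)
- 14:25 Root cause identified
- 14:40 Service restored

## Root cause
One paragraph: the mechanism from trigger to symptom.

## Fix
What restored service, and what fixed the root cause.

## Action items
| Item                      | Owner   | Due        | Ticket  |
|---------------------------|---------|------------|---------|
| Add connection-pool alert | <name>  | <date>     | <id>    |
```

An action item without all four columns filled in is not done being written.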