Mental Model: Differential Diagnosis¶
Category: Debugging & Diagnosis

Origin: Clinical medicine (formalized in the 19th century); adopted into engineering debugging culture through DevOps and SRE practices, popularized by the analogy to medical diagnosis in "The Practice of Cloud System Administration" and similar texts

One-liner: Enumerate all plausible causes, rank them by likelihood and testability, then systematically eliminate candidates — ruling things out is as valuable as confirming them.
The Model¶
Differential diagnosis (diff-dx in medical shorthand) is the discipline of treating a problem as a hypothesis space rather than a single guess. When a system is broken, the instinct is to form a theory and test it. If the test is negative, form a new theory and test it again. This is random walk debugging. Differential diagnosis makes the hypothesis space explicit upfront: list every plausible cause, then work through the list efficiently.
The method has four stages. First, enumerate: generate all the hypotheses that could explain the observed symptoms. Do this without judgment — the goal is completeness. A rare cause you dismiss immediately is less dangerous than a rare cause you never thought of. Second, rank: order the list by the combination of likelihood (how probable is this cause given everything you know?) and testability (how cheaply can you confirm or rule this out?). "Common things are common" — start with high-probability, cheap-to-test hypotheses. Third, test: design a minimally invasive test for each hypothesis. A good test is one that, if negative, definitively rules out the cause — not one that merely makes it less likely. Fourth, eliminate: as tests return results, cross candidates off the list. The process ends either when you confirm a cause (positive test + understanding of mechanism) or when you've exhausted the list (which means your initial enumeration was incomplete — return to step one with broader creativity).
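The four-stage loop can be sketched in a few lines of Python. The `Hypothesis` record and `diagnose` driver are hypothetical names, and the likelihood and cost values a caller supplies are subjective priors, not measurements:

```python
from dataclasses import dataclass

@dataclass
class Hypothesis:
    name: str
    likelihood: float   # subjective prior that this is the cause
    test_cost: float    # e.g. minutes to run a definitive test
    ruled_out: bool = False

def diagnose(hypotheses, run_test):
    """Enumerate -> rank -> test -> eliminate.

    run_test(h) must be definitive: True confirms the hypothesis,
    False rules it out entirely (not merely "makes it less likely").
    """
    # Rank: high likelihood and cheap tests first.
    ranked = sorted(hypotheses, key=lambda h: h.likelihood / h.test_cost,
                    reverse=True)
    for h in ranked:
        if run_test(h):
            return h            # confirmed cause
        h.ruled_out = True      # a negative result still shrinks the space
    return None                 # list exhausted: the enumeration was incomplete
```

Sorting by the likelihood-to-cost ratio is one reasonable ranking; the ranking criteria discussed later refine it.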
The "negative space" insight is underappreciated: ruling out a cause is not failure. Every eliminated hypothesis reduces the search space and increases confidence in the remaining candidates. An engineer who systematically rules out 9 of 10 hypotheses and identifies the 10th as the confirmed cause has done excellent diagnostic work, even though 9 of their 10 tests returned "no."
Differential diagnosis is most powerful in complex, interconnected systems where the symptom (packet loss between two hosts) could have many unrelated causes (firewall rule, ARP table corruption, MTU mismatch, NIC driver bug, switch VLAN misconfiguration, routing asymmetry). Without a structured approach, engineers jump to the first familiar cause, spend hours eliminating it, and only then move to the next — a linear walk through the hypothesis space whose cost depends entirely on the order in which causes happen to occur to them. Structured diff-dx with ranking tests high-probability, cheap-to-test causes first, so the expected number of tests is usually a small fraction of the length of the list.
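A back-of-the-envelope calculation shows why ranking pays off. The priors below are invented for illustration, with the skew typical of real incidents (a few common causes carry most of the probability mass):

```python
def expected_tests(probs):
    """Expected number of tests before hitting the true cause,
    testing in the given order; probs must sum to 1."""
    return sum((i + 1) * p for i, p in enumerate(probs))

# Invented priors for 8 candidate causes, most likely first.
priors = [0.4, 0.25, 0.15, 0.1, 0.05, 0.03, 0.015, 0.005]

ranked   = expected_tests(sorted(priors, reverse=True))  # most likely first
unranked = expected_tests(sorted(priors))                # worst-case order

print(ranked, unranked)   # about 2.3 tests vs about 6.7 tests
```

With ranking, the engineer expects to run roughly 2.3 of the 8 tests; in the worst ordering, nearly 7. The gap widens as the list grows and the priors get more skewed.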
The model's limits: it assumes you can enumerate the hypothesis space usefully. For truly novel failures involving unknown interactions between systems, you may not know what to list. In those cases, use observability tools to gather more signals before applying diff-dx. Also, diff-dx can be slow if tests are expensive or time-consuming — in that case, prioritize the cheapest tests even if they test lower-probability causes.
Visual¶
Differential Diagnosis Workflow
────────────────────────────────────────────────────────────────────
SYMPTOM: TCP connections from Pod A to Service B timing out
Step 1: Enumerate (all plausible causes)
┌──────────────────────────────────────────────────────────────────┐
│ H1: kube-proxy rules missing for Service B │
│ H2: NetworkPolicy blocking Pod A → Service B │
│ H3: iptables rule (firewall) dropping packets │
│ H4: Service B pods are not ready / not running │
│ H5: DNS resolution failing (connecting to wrong IP) │
│ H6: MTU mismatch on the overlay network causing fragmentation │
│ H7: Service B application listening on wrong port │
│ H8: Node-level conntrack table full — dropping new connections │
└──────────────────────────────────────────────────────────────────┘
Step 2: Rank (likelihood × testability)
Priority │ Hypothesis │ Likelihood │ Test cost
──────────┼─────────────────┼────────────┼──────────
1 │ H4 (pods down) │ High │ Trivial (kubectl)
2 │ H2 (NetPolicy) │ High │ Low (kubectl get netpol)
3 │ H5 (DNS) │ Medium │ Low (nslookup from pod)
4 │ H1 (kube-proxy) │ Medium │ Medium (iptables-save)
5 │ H3 (iptables) │ Medium │ Medium (iptables -L)
6 │ H7 (wrong port) │ Low │ Low (kubectl describe svc)
7 │ H8 (conntrack) │ Low │ Low (sysctl check)
8 │ H6 (MTU) │ Low │ High (packet capture)
Step 3–4: Test and eliminate
✓ H4: kubectl get pods → Service B pods Running/Ready. RULED OUT.
✓ H2: kubectl get netpol → no NetworkPolicy in namespace. RULED OUT.
✓ H5: nslookup service-b → returns correct ClusterIP. RULED OUT.
✗ H1: iptables-save | grep service-b-clusterip → NO RULES FOUND. CONFIRMED.
Root cause: kube-proxy failed to program iptables rules for Service B.
When to Reach for This¶
- Network connectivity failures: "Host A cannot reach Host B" — the cause space is large and varied; explicit enumeration prevents thrashing
- Intermittent failures that don't reproduce cleanly — rule out the cheap candidates systematically rather than chasing each occurrence
- After a complex change (infrastructure migration, major version upgrade) with multiple things changing simultaneously — the symptom might have N possible causes from the change; enumerate them all
- When an incident has been open for more than an hour without clear progress — stop, write down all remaining hypotheses, rank, and work the list
- When you've ruled out the "obvious" causes and don't know what to try next — returning to the enumeration step with fresh hypotheses resets the search
- Security incidents: enumerate all possible vectors before committing to a theory of how the breach occurred
When NOT to Use This¶
- When a single obvious, high-confidence cause presents itself: if kubectl get pods immediately shows CrashLoopBackOff with an OOM kill event in the logs, differential diagnosis is unnecessary overhead — act on the clear signal
- When you need to act immediately to mitigate ongoing impact: diff-dx is a diagnostic method, not a mitigation method — restore service first, diagnose second
- When the hypothesis space is completely unknown (early in a novel incident): spend time gathering signals and building observability before trying to enumerate hypotheses — half-formed hypotheses waste test effort
- For routine operational tasks that have a known procedure: if runbooks cover the failure mode, follow the runbook instead of re-deriving the hypothesis space each time
Building the Hypothesis List¶
The quality of differential diagnosis depends heavily on the quality of the initial hypothesis list. A list that omits the actual cause will never find it, regardless of how efficiently the ranking and testing proceed. Techniques for generating a complete list:
Decompose by layer. For network problems: physical (cable, NIC), data link (MAC, ARP, VLAN), network (IP, routing, NAT), transport (TCP, firewall, conntrack), application (TLS, DNS, protocol). For a connection failure, each layer is a candidate source. For storage problems: hardware (disk, controller), kernel driver, filesystem, volume manager, application I/O path.
Use the diff. What changed recently? Every change introduces a finite set of hypotheses: the thing that was changed, the things that depend on it, and the things that depend on those. A Kubernetes upgrade touches the API server, scheduler, kubelet, kube-proxy, CoreDNS, and the container runtime — all of these are candidates when something breaks after an upgrade.
Consider failure modes, not just components. Instead of listing components ("the database," "the network"), list failure modes: "the database is unavailable," "the database is available but returning errors," "the database is available but slow," "the database is returning stale data." Each failure mode suggests different tests.
Consult prior incidents. The most likely causes in your system are the ones that have caused problems before. A running list of post-mortem root causes, organized by symptom type, is the most valuable input to a new diff-dx session. "Last time we saw connection timeouts to that service, it was a conntrack table exhaustion" is a strong prior.
Don't omit low-probability causes if they are cheap to test. A misconfigured /etc/hosts entry is an unlikely cause of most connectivity problems, but it takes 2 seconds to check with getent hosts <hostname>. Running the cheap test first eliminates it quickly even if the prior probability is low.
Ranking Criteria in Detail¶
The ranking step (likelihood × testability) is where experienced engineers have a strong advantage. The judgment comes from exposure to many failure patterns. Some calibrating principles:
"Common things are common" from medicine. In a general Linux server environment, the most common causes of service degradation are: misconfigured firewall rules, resource exhaustion (memory, disk, file descriptors), DNS resolution failures, TLS certificate issues, and misconfigured or stale service configuration. These are high-prior candidates for any symptom.
Test cost includes blast radius. A test that could cause additional damage must be weighted as more expensive, regardless of its time cost. Restarting a service to see if it recovers is a test, but it's a disruptive one — do it only after cheaper, non-disruptive tests are exhausted.
A test that rules out multiple hypotheses simultaneously is worth extra weight. If you can design a test that rules out H2, H3, and H5 in one step, prioritize it over three separate cheaper tests that each eliminate one hypothesis. tcpdump on the affected host captures traffic at the network level — it simultaneously rules out (or confirms) hypotheses about DNS, TLS, firewall drops, and connection state.
Test in the order that maximizes information gain. If you have high confidence in H1 and H1 is fast to test, test it first. If you have low confidence in everything, test the hypothesis whose answer would most narrow the remaining space — usually the most differentiating test.
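Under two simplifying assumptions — exactly one hypothesis is true, and every test is definitive — the likelihood × testability heuristic has a precise optimal form: test in descending order of probability-to-cost ratio. A small sketch with invented numbers, brute-force checked against every possible ordering:

```python
from itertools import permutations

def expected_cost(order):
    """Expected total test time when hypotheses are (prob, cost) pairs,
    exactly one is true, and you pay for every test up to the hit."""
    total, elapsed = 0.0, 0.0
    for prob, cost in order:
        elapsed += cost
        total += prob * elapsed
    return total

# (prior probability, test cost in minutes) -- invented numbers.
hyps = [(0.50, 10.0),   # likely, but expensive to test
        (0.30, 1.0),
        (0.15, 2.0),
        (0.05, 0.5)]    # unlikely, but nearly free to check

ratio_order = sorted(hyps, key=lambda h: h[0] / h[1], reverse=True)
optimal = min(expected_cost(p) for p in permutations(hyps))
```

Note that the nearly free, low-probability test lands second, ahead of a more likely but much more expensive one — exactly the "cheap tests early, even for unlikely causes" advice above.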
Applied Examples¶
Example 1: Intermittent packet drops — firewall shadow rule¶
A microservice is experiencing ~2% packet loss to one specific external endpoint. All other external endpoints are healthy. The loss is intermittent — not every connection, not on every host.
Enumerate hypotheses:

- H1: Firewall rule blocking traffic to that specific IP or CIDR
- H2: Network congestion on the path to that endpoint
- H3: MTU mismatch causing fragmentation and drop
- H4: Source IP SNAT pool exhaustion causing connection failures
- H5: The remote endpoint rate-limiting the service's source IP
- H6: A shadow/duplicate firewall rule allowing some traffic but blocking other traffic based on connection state
Rank: H1 (high likelihood — specific destination affected), H3 (medium — MTU issues are common on cloud overlays), H2 (medium — easy to test with ping), H6 (lower probability but cheap to check), H4 (medium — testable with connection tracking), H5 (low — requires contacting the remote party).
Test H1: iptables -L -n -v | grep <remote-ip> — no rule found. Broadening the search, iptables-save | grep <remote-cidr> finds a REJECT rule for the /24 CIDR that includes the remote IP. Confirmed: H1, in the shadow-rule form anticipated by H6.
Root cause: A broad CIDR-based REJECT rule in the OUTPUT chain, installed during a security hardening pass, targeted a cloud provider CIDR that happened to include the legitimate endpoint. A permissive rule in the FORWARD chain let forwarded traffic through, masking the problem for most hosts, but the OUTPUT chain rule caught traffic originating on the service host itself. Removing the overly broad rule restores connectivity.
Example 2: iptables blocking unexpected traffic — Linux service host¶
An internal monitoring agent cannot reach the local Prometheus node-exporter on port 9100. The exporter is running and binding on 0.0.0.0:9100. Connections from localhost work fine; connections from the monitoring server (10.0.0.50) fail.
Enumerate hypotheses:

- H1: iptables INPUT rule blocking traffic from 10.0.0.50
- H2: The exporter is binding on localhost (127.0.0.1) not on all interfaces
- H3: A firewall on the monitoring server is blocking the connection
- H4: A cloud security group (if applicable) is blocking port 9100
- H5: The exporter is listening but returning connection resets (TLS mismatch, auth issue)
- H6: Network route to/from 10.0.0.50 is missing or wrong
Rank: H2 is trivial and cheap. H1 is highly likely (iptables misconfiguration is common on hardened hosts). H4 is likely if on cloud.
Test H2: ss -tlnp | grep 9100 → LISTEN 0.0.0.0:9100. Not H2.
Test H1: iptables -L INPUT -n -v → Chain has a broad ACCEPT for established connections and a final DROP for everything not explicitly allowed. Port 9100 is not in the allowed list. Confirmed H1.
Fix: Add iptables -I INPUT -p tcp --dport 9100 -s 10.0.0.0/8 -j ACCEPT and persist with iptables-save. Monitoring agent connects successfully.
The Junior vs Senior Gap¶
| Junior | Senior |
|---|---|
| Forms one hypothesis and pursues it exhaustively before trying anything else | Lists all plausible hypotheses before testing any of them |
| Tests in the order hypotheses occur to them (recency bias, familiarity bias) | Explicitly ranks by likelihood × test cost before running the first test |
| Treats a negative test as "wasted time" | Treats a negative test as valuable information — it eliminates a candidate and increases confidence in the remainder |
| Stops when they find a cause that explains the symptom | Asks whether the confirmed cause fully explains all observed symptoms, or whether a second root cause is also present |
| Digs deeper into a familiar technology stack even when evidence points elsewhere | Follows the evidence even into unfamiliar territory |
| Spends 3 hours debugging the application before checking a firewall rule that would have taken 30 seconds | Starts with the cheapest tests regardless of familiarity |
Documenting and Reusing Hypothesis Lists¶
One of the highest-value habits for teams that apply differential diagnosis regularly is documenting their hypothesis lists as institutional knowledge. After an incident:
- Record the complete hypothesis list — not just the confirmed cause
- Note which hypotheses were tested first, how long each test took, and what the result was
- Note if the correct hypothesis was low on the initial ranking (it will help you recalibrate future priors)
Over time, this produces a hypothesis library organized by symptom type. "TCP connection timeouts to a Kubernetes service" should accumulate a list: kube-proxy rules missing, NetworkPolicy blocking, pod not ready, DNS wrong, conntrack table full, MTU mismatch, node network plugin issue. A new engineer inheriting an on-call rotation with this library can apply differential diagnosis as effectively as a senior engineer with years of pattern-matching experience.
Pre-built hypothesis libraries by symptom class:
Service not reachable (within Kubernetes):

- Pod not running / not ready
- Service selector does not match pod labels
- kube-proxy / iptables rules not programmed
- NetworkPolicy blocking
- DNS not resolving to correct ClusterIP
- Port mismatch (service port vs container port)

High latency (service response slow):

- Downstream dependency slow (apply RED recursively)
- Database query slow or connection pool exhausted
- CPU throttling (cgroup limits)
- Network congestion or packet loss on path
- GC pause (JVM, Go, etc.)
- Lock contention in application

High error rate:

- Config change caused application bug
- Dependency returning errors (apply RED to dependency)
- Resource exhaustion (memory → OOM kill, disk → write failures)
- TLS certificate expired
- Rate limiting by upstream or downstream
Disk full:
- Log accumulation (check /var/log, journal)
- Core dumps
- Large temporary files from failed operations
- Docker image layers accumulation
- Database WAL files not being archived
These libraries are the encoded experience of the team. They make differential diagnosis fast even under incident pressure.
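Such a library can start as nothing more than a mapping from symptom class to hypothesis list, checked into a team repository. A minimal sketch; the key names and the `seed_hypotheses` helper are illustrative, with entries taken from the lists above:

```python
# Hypothesis library keyed by symptom class; entries mirror the
# pre-built lists in this section, key names are invented.
HYPOTHESIS_LIBRARY = {
    "k8s-service-unreachable": [
        "Pod not running / not ready",
        "Service selector does not match pod labels",
        "kube-proxy / iptables rules not programmed",
        "NetworkPolicy blocking",
        "DNS not resolving to correct ClusterIP",
        "Port mismatch (service port vs container port)",
    ],
    "disk-full": [
        "Log accumulation (/var/log, journal)",
        "Core dumps",
        "Large temporary files from failed operations",
        "Docker image layer accumulation",
        "Database WAL files not being archived",
    ],
}

def seed_hypotheses(symptom_class):
    """Start a diff-dx session seeded from prior incidents."""
    return list(HYPOTHESIS_LIBRARY.get(symptom_class, []))
```

After each incident, append any hypothesis that was missing from the seeded list, so the library's priors keep improving.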
Connections¶
- Complements: USE Method (USE generates the candidate list for resource-based failures; Differential Diagnosis is the structured framework for eliminating them — USE is the enumeration step, diff-dx is the full workflow)
- Complements: Five Whys (Differential Diagnosis finds the immediate cause; Five Whys traces the causal chain back to root — use diff-dx first to confirm the cause, then Five Whys to understand the systemic failure)
- Tensions: Bisect (Bisect is optimal when you have an ordered change sequence and a binary test; Differential Diagnosis is better when the cause is unknown and there is no clear sequence to search — they are alternatives for different problem shapes)
- Topic Packs: networking, linux-networking
- Case Studies: firewall-shadow-rule (diff-dx enumeration catches the shadow iptables rule that a focused single-hypothesis approach would miss), iptables-blocking-unexpected (ranking by test cost leads to a 30-second iptables check confirming the hypothesis before any deep application debugging)