Mental Model: Correlation vs Causation¶
Category: Debugging & Diagnosis
Origin: Classical statistics and scientific method (Francis Bacon, 17th century); the specific formulation "correlation does not imply causation" is attributed to various statisticians in the early 20th century; directly applied to incident analysis in modern SRE practice
One-liner: Two events happening together — or one following the other — does not mean one caused the other; establishing causation requires reproducibility, controlled variables, and a plausible mechanism.
The Model¶
The most common debugging mistake is not a technical one — it is a logical one. A deployment happened at 14:00. The error rate spiked at 14:03. Therefore the deployment caused the error spike. This inference feels ironclad, but it is not. Correlation (two events near each other in time) does not establish causation. The deployment might be entirely unrelated; the error spike might have a different cause that happened to occur at the same time. Or the deployment might be one of two causes, both required together, and understanding only one of them will result in an incomplete fix.
Three conditions are required to establish causation, not just correlation:
1. Covariation: The cause and effect must vary together (correlation is necessary, not sufficient). If X causes Y, then when X is present, Y should follow.
2. Temporal precedence: The cause must precede the effect. A symptom that appeared before the proposed cause rules out that cause immediately. Always establish the precise timeline.
3. Elimination of alternative explanations: All other plausible causes must be ruled out. This is where Differential Diagnosis becomes essential — the correlation tells you where to look, but elimination is what establishes causation.
The fourth, informal condition is mechanism: you should be able to describe how the proposed cause produces the effect. "The deploy caused the outage" is weaker than "The deploy introduced a database migration that held a table lock for 45 seconds, blocking all writes, causing request queuing and timeouts." The mechanism makes the causal claim falsifiable and testable.
In operational environments, pressure to minimize MTTR (mean time to recovery) is the enemy of this discipline. When a production outage is ongoing, teams want a cause now so they can fix it. This pressure leads to premature attribution: the first plausible correlation becomes the accepted cause. The fix is applied, the system recovers (often because of the fix, sometimes spontaneously), and the attributed cause is recorded in the post-mortem. If the attribution was wrong, the actual cause remains unaddressed and the incident recurs.
Clock skew is a specific, dangerous form of spurious correlation in distributed systems. If host A's clock is 5 minutes ahead and host B's is accurate, log correlation will show events on host A stamped 5 minutes later than they actually occurred. What looks like "A caused B" may be simultaneous events, or even B causing A. Always verify NTP synchronization before doing timeline analysis across multiple systems.
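The correction can be sketched in a few lines of Python. The host names, timestamps, and offset table below are hypothetical; in practice the offsets would come from NTP measurements against a trusted reference:

```python
from datetime import datetime, timedelta

# Hypothetical measured clock offsets (host clock minus true time).
# A positive offset means the host's clock runs fast, so its log
# timestamps are LATER than the true event times.
OFFSETS = {"host-a": timedelta(minutes=5), "host-b": timedelta(0)}

def corrected_timeline(events):
    """Subtract each host's known offset, then sort by true event time."""
    fixed = [(ts - OFFSETS[host], host, msg) for host, ts, msg in events]
    return sorted(fixed)

raw_events = [
    ("host-b", datetime(2024, 1, 1, 14, 2), "error spike begins"),
    ("host-a", datetime(2024, 1, 1, 14, 4), "deploy finished"),
]

for true_ts, host, msg in corrected_timeline(raw_events):
    print(true_ts, host, msg)
# host-a's "deploy finished" actually happened at 13:59 true time,
# BEFORE the error spike -- the raw log order was reversed by skew.
```

The raw log makes the deploy look like it followed the errors; correcting for the 5-minute skew restores the true order.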
Visual¶
Establishing Causation — Three Lenses
LENS 1: Timeline (Temporal Precedence)
────────────────────────────────────────────────
✓ Cause before effect: [Deploy 14:00] → [Errors 14:03] plausible
✗ Effect before cause: [Errors 13:57] → [Deploy 14:00] deploy is NOT the cause
✗ Simultaneous: [Deploy 14:00] = [Errors 14:00] suspect a third factor
LENS 2: Reproducibility
────────────────────────────────────────────────
Does re-running the deploy (in staging) reproduce the error?
✓ YES → strong causal evidence
✗ NO → deploy may be coincidental; look for other causes
LENS 3: Mechanism
────────────────────────────────────────────────
Can you describe the physical/logical path from cause to effect?
Weak: "The deploy caused the cert failure"
Strong: [Deploy changed node hostnames]
→ [hostnames no longer match cert SAN]
→ [TLS handshakes fail]
→ [cert errors cascade to auth failures]
Common false correlation patterns:
┌──────────────────┬────────────────────────────────┬────────────────┐
│ Pattern          │ Example                        │ Risk           │
├──────────────────┼────────────────────────────────┼────────────────┤
│ Temporal overlap │ Deploy + unrelated CDN outage  │ Blame deploy   │
│ Clock skew       │ NTP drift makes A appear first │ Wrong order    │
│ Common cause     │ Traffic spike causes both      │ A "causes" B   │
│                  │ high CPU AND high errors       │ (both effects) │
│ Confirmation     │ Only look for evidence that    │ Miss the real  │
│ bias             │ confirms deploy caused it      │ cause          │
└──────────────────┴────────────────────────────────┴────────────────┘
When to Reach for This¶
- Any time a deploy, config change, or infrastructure change precedes a degradation — correlation is a starting hypothesis, not a conclusion
- During post-mortem analysis when a cause has been assumed but not proven — ask explicitly: what is the evidence for causation, not just correlation?
- When two metrics rise together and the team assumes one is causing the other — both may be effects of a third cause
- Timeline analysis across multiple systems with separate log sources — verify clock synchronization before drawing causal conclusions
- When an incident has resolved spontaneously after a fix was applied — did the fix cause the resolution, or did the system self-heal while the fix was being applied? (Post hoc ergo propter hoc fallacy)
- When an established "fact" about a system's failure mode is based on a single incident — one data point is correlation, not causation
When NOT to Use This¶
- As a reason to delay acting on a strong causal hypothesis during an active incident: if the evidence strongly points to a deploy and rolling it back is low-risk, act — you can refine the causal analysis afterward
- When the mechanism is already established by code inspection: if you read the diff and can see exactly how the change causes the failure, you don't need to apply the full correlation-vs-causation framework — the mechanism is your proof
- As a philosophical blocker: "we can't be certain it's causal" should not prevent action when the evidence is strong and the action is safe to take; perfect causal certainty is rare in production systems
- For trivially obvious causes: if a server's power cable is unplugged and the server is off, you don't need to establish causation through the three-lens framework
Causal Inference Techniques for Engineers¶
Statisticians have developed rigorous formal methods for establishing causation from observational data. Engineers rarely need the full apparatus, but the principles are useful:
Controlled experiment (A/B test): The gold standard. Hold everything constant, change only the proposed cause, observe the effect. In infrastructure: deploy the change to 5% of servers (not 100%), observe whether those servers show the effect and others do not. If yes, causation is well-supported. If the effect appears on both cohorts, the cause is something else affecting all servers.
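The canary-vs-control comparison can be quantified with a standard two-proportion z-test; a minimal sketch (all cohort sizes and error counts below are invented):

```python
import math

def two_proportion_z(errors_a, n_a, errors_b, n_b):
    """Z statistic for the difference between two error rates.

    A large |z| (conventionally > 3 for ops work) means the canary's
    error rate differs from the control's by more than chance allows.
    """
    p_a, p_b = errors_a / n_a, errors_b / n_b
    p_pool = (errors_a + errors_b) / (n_a + n_b)
    se = math.sqrt(p_pool * (1 - p_pool) * (1 / n_a + 1 / n_b))
    return (p_a - p_b) / se

# Hypothetical canary: 5% of servers got the deploy.
z = two_proportion_z(errors_a=40, n_a=500,    # canary: 8% errors
                     errors_b=50, n_b=5000)   # control: 1% errors
print(f"z = {z:.1f}")  # well above 3: the deploy is strongly implicated
```

If the canary and control rates were similar, z would be near zero, pointing at a cause affecting all servers.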
Natural experiment: Sometimes the environment provides a controlled comparison without you designing it. Two identical services deployed to different regions, one region upgraded and one not — if the effect appears only in the upgraded region, that's strong causal evidence. Look for these natural experiments in your infrastructure before designing artificial ones.
Interrupted time series: If you have a metric measured continuously over time, and a clear intervention at time T, causal inference looks at whether the level or trend of the metric changes at T in a way that exceeds the variance expected from normal fluctuation. This is more rigorous than eyeballing a graph, and it accounts for pre-existing trends.
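A minimal interrupted-time-series check, under an invented latency metric: fit the pre-intervention trend by least squares, then flag post-intervention points that deviate from the extrapolated trend by more than the pre-period residual variance allows:

```python
import statistics

def level_shift_detected(pre, post, sigmas=3.0):
    """Fit a linear trend to the pre-intervention series, then test
    whether any post-intervention point deviates from the extrapolated
    trend by more than `sigmas` times the pre-period residual stdev."""
    n = len(pre)
    xs = range(n)
    x_mean = (n - 1) / 2
    y_mean = statistics.fmean(pre)
    slope = sum((x - x_mean) * (y - y_mean) for x, y in zip(xs, pre)) \
            / sum((x - x_mean) ** 2 for x in xs)
    intercept = y_mean - slope * x_mean
    residuals = [y - (intercept + slope * x) for x, y in zip(xs, pre)]
    resid_sd = statistics.pstdev(residuals)
    return any(abs(y - (intercept + slope * (n + i))) > sigmas * resid_sd
               for i, y in enumerate(post))

# Hypothetical metric: steady upward trend, then an intervention at T.
pre  = [100, 103, 103, 106, 109, 109, 112, 115, 115, 118]
post = [150, 152, 154, 156, 158]   # level jumps ~30 above the trend
print(level_shift_detected(pre, post))                      # True
print(level_shift_detected(pre, [120, 122, 124, 126, 128])) # False: trend continues
```

The second call shows why this beats eyeballing: values that keep rising but stay on the pre-existing trend are not evidence that the intervention changed anything.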
Dose-response relationship: If the proposed cause is quantitative, does more of it produce more of the effect? More traffic → more errors (consistent with causation). More traffic → same errors (suggests traffic is not causal). A dose-response relationship is a strong causal indicator even from observational data.
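A dose-response check can be as simple as bucketing the effect by dose and testing for monotone increase. The bucket values here are invented; real data is noisy, so a rank correlation (e.g. Spearman's) is more robust in practice:

```python
def dose_response(buckets):
    """Given (dose, effect) pairs, check whether the effect strictly
    increases with the dose -- a strong causal indicator."""
    effects = [effect for _, effect in sorted(buckets)]
    return all(b > a for a, b in zip(effects, effects[1:]))

# Hypothetical: error rate (%) bucketed by traffic level (requests/s).
print(dose_response([(100, 0.5), (200, 1.1), (400, 2.3), (800, 4.8)]))  # True
print(dose_response([(100, 1.0), (200, 1.1), (400, 1.0), (800, 1.1)]))  # False
```

The first dataset (more traffic, more errors) is consistent with causation; the flat second dataset suggests traffic is not the cause.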
Mechanism as evidence: The presence of a plausible, specific, verifiable mechanism is the most practical causal argument in engineering. "The config change set max_connections=10 which caused connection pool exhaustion at the observed traffic rate" is a mechanistic argument. You can verify it: check the config value, check the connection pool metric, check whether the connection exhaustion coincides with the error spike.
Common Confounders in Production Systems¶
A confounder is a third variable that causes both the apparent cause and the apparent effect, creating the illusion of a direct relationship:
Traffic volume is the most common confounder. Traffic spikes cause CPU utilization to rise (signal A) and error rates to rise (signal B). Engineers see CPU up + errors up and conclude "high CPU causes errors." In reality, both are effects of the traffic spike. The test: does error rate stay elevated when CPU drops back to normal? If yes, they're independently caused by something else. If no, traffic was the confounder.
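The confounder test generalizes to partial correlation: the correlation of CPU and errors with traffic held fixed. In the invented dataset below, traffic drives both signals and their residual noise is independent, so the raw correlation is high but the partial correlation collapses to roughly zero:

```python
import math

def pearson(xs, ys):
    """Plain Pearson correlation coefficient."""
    mx, my = sum(xs) / len(xs), sum(ys) / len(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    return cov / math.sqrt(vx * vy)

def partial_corr(xs, ys, zs):
    """Correlation of x and y with the confounder z held fixed."""
    rxy, rxz, ryz = pearson(xs, ys), pearson(xs, zs), pearson(ys, zs)
    return (rxy - rxz * ryz) / math.sqrt((1 - rxz**2) * (1 - ryz**2))

# Hypothetical: traffic drives BOTH cpu and errors; cpu and errors
# have no direct link (their noise terms are independent).
traffic = [10, 20, 30, 40, 50, 60, 70, 80]
cpu     = [13, 17, 27, 43, 53, 57, 67, 83]   # traffic + noise
errors  = [9, 14, 11, 16, 21, 26, 39, 44]    # 0.5*traffic + noise

print(f"raw r(cpu, errors) = {pearson(cpu, errors):.2f}")                # ~0.94
print(f"partial r, traffic fixed = {partial_corr(cpu, errors, traffic):.2f}")
```

The near-zero partial correlation is the quantitative version of the test in the paragraph above: once traffic is accounted for, CPU tells you nothing about errors.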
Maintenance windows create temporal correlation between all changes made during the window. If five changes were deployed in a maintenance window and the system failed afterward, all five are temporally correlated with the failure. Only one (or some combination) is causal. Diff-dx across all five changes is required.
Time-of-day patterns create spurious correlations. "Our errors always spike at 3am" might be caused by a scheduled batch job that runs at 3am — the time is a confounder, not a cause. The actual cause is the batch job.
Software dependencies create shared-cause scenarios. Service A and Service B both depend on a database. The database slows down. Both A and B show high duration. Engineers working on A conclude "B is slow and B is upstream of A, so B is causing our slowness." Actually both are effects of the database. The fix is in the database, not in B.
Clock skew across systems is a confounder that operates at the observability layer, not the system layer. When log timestamps are unreliable, the apparent temporal ordering of events is unreliable, and any causal inference based on that ordering is suspect.
Applied Examples¶
Example 1: BMC clock skew causes spurious cert failure correlation — datacenter¶
A Kubernetes cluster in a datacenter starts rejecting mTLS connections between the API server and worker nodes. The event log shows certificate errors appearing at 09:47. A firmware update was applied to the BMC (Baseboard Management Controller) at 09:51, four minutes after the cert errors.
Initial (wrong) hypothesis: Someone on the team notes "there was a BMC firmware update this morning" and assumes it caused the cert failure. But temporal precedence check fails — the cert errors appear before the firmware update in the logs.
Clock skew investigation: The team checks NTP status on the BMC vs the main hosts. The BMC clock was running 4 minutes fast, so events logged with BMC-provided timestamps appear 4 minutes later than their actual occurrence. The firmware update actually happened at 09:47 (logged as 09:51); the cert errors actually happened at 09:51.
Corrected timeline: Firmware update → cert errors. Temporal precedence is satisfied. Mechanism investigation: the firmware update reset the BMC's RTC (real-time clock), causing a 4-minute backwards jump in system time. Certificates with a NotBefore field in the future (from the node's perspective) were rejected by the API server.
Lesson: Log timestamps from systems without verified NTP sync can reverse the apparent causal order of events. Always verify clock sync before establishing a timeline. Here, the BMC clock skew inverted the apparent order, making the cert failure (the effect) appear to precede the firmware update (its cause).
Example 2: Time sync loss breaks application authentication — Linux host¶
An application on a Linux host starts failing authentication with an external OAuth provider at 02:15. An engineer notes that a cron job runs at 02:00 that updates application configuration and suspects it caused the auth failure. Both events are in the same 30-minute window.
Correlation noted: Config update at 02:00, auth failure at 02:15. Temporal precedence is satisfied.
Reproducibility test: Rolling back the config change in a staging environment does not reproduce the auth failure. The config is not the cause.
Mechanism check: What mechanism would a config change have on OAuth authentication? The changed config keys are for UI theme and logging verbosity — no authentication-related fields were modified. The mechanism is implausible.
Alternative hypothesis search: If the config change didn't cause it, what else happened at 02:15? Check /var/log/syslog for that period: chronyd[1234]: System clock was stepped by -8.3 seconds. The NTP daemon performed a step correction, moving the system clock backward by 8.3 seconds. OAuth token validation uses timestamp comparison — a backward clock step caused currently-valid tokens to appear issued in the future, failing the iat (issued-at) claim check.
True causal chain: NTP step correction → clock moved backward → token iat claim appears in the future → OAuth validation rejects tokens → auth failure. The config update was coincidental.
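The mechanism can be sketched directly. The `leeway` parameter mirrors the clock-tolerance option many JWT libraries expose; the timestamps and parameter name here are illustrative:

```python
def iat_valid(token_iat, now, leeway=0.0):
    """Reject tokens whose issued-at claim lies in the validator's future."""
    return token_iat <= now + leeway

issue_time = 1_700_000_000.0          # token minted at this epoch second
clock_after_step = issue_time - 8.3   # NTP stepped the clock back 8.3 s

print(iat_valid(issue_time, clock_after_step))              # False: rejected
print(iat_valid(issue_time, clock_after_step, leeway=10.0)) # True: leeway absorbs the step
```

A backward step smaller than the validator's leeway would have been invisible, which is why the failure appeared only after an 8.3-second correction.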
The Junior vs Senior Gap¶
| Junior | Senior |
|---|---|
| Sees temporal proximity and assumes causation: "it happened right after the deploy" | Treats temporal proximity as a starting hypothesis requiring three-lens validation |
| Rolls back the deploy and, when things improve, confirms it was the cause | Recognizes that improvement after rollback is consistent with causation but doesn't prove it — the system may have self-healed |
| Skips timeline verification across multiple log sources | Validates NTP synchronization before doing cross-system timeline analysis |
| Looks for evidence that confirms the initial hypothesis (confirmation bias) | Actively looks for evidence that contradicts the initial hypothesis |
| Records "deploy caused outage" in the post-mortem because it was the most recent change | Asks: what is the mechanism? Can we reproduce it? What alternatives have we eliminated? |
| Assumes "A and B both rose at the same time, so A causes B" | Considers that A and B may both be effects of a third factor C (confounding variable) |
Practical Causal Claim Checklist¶
Use this checklist when evaluating a proposed causal explanation during or after an incident:
Causal Claim Evaluation Checklist
──────────────────────────────────────────────────────────────────
Claim: "<proposed cause> caused <observed effect>"
[ ] 1. TEMPORAL PRECEDENCE
Did the proposed cause occur BEFORE the observed effect?
Evidence: ___________________________________________
Clock sync verified across all log sources? [ ] Yes [ ] No
[ ] 2. COVARIATION
When the cause was present, did the effect occur?
When the cause was absent, did the effect NOT occur?
Evidence: ___________________________________________
[ ] 3. MECHANISM
Can you describe the specific path by which the cause
produced the effect?
Mechanism: __________________________________________
Is the mechanism verified (code read, metric confirmed)?
[ ] Yes [ ] Plausible but not verified [ ] No
[ ] 4. ALTERNATIVE EXPLANATIONS ELIMINATED
What other causes could explain the same effect?
List: _______________________________________________
Were they tested and ruled out? [ ] Yes [ ] Partial [ ] No
[ ] 5. REPRODUCIBILITY (if safe to test)
Can the effect be reproduced by introducing the cause
in a controlled environment?
Result: ____________________________________________
[ ] 6. CONFOUNDER CHECK
Is there a third variable that could cause both the
proposed cause and the effect to occur simultaneously?
Candidates: ________________________________________
Score: All 6 checked = strong causal claim
4-5 checked = reasonable working hypothesis, act cautiously
<4 checked = correlation only — continue investigation
During an active incident, you may only have time to check 1-3. That is acceptable — act on a reasonable hypothesis. But mark it as unverified in the incident record, and complete the checklist during the post-mortem.
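The scoring rule at the bottom of the checklist can be encoded as a small helper for incident tooling (the check names and function are invented for this sketch):

```python
CHECKS = [
    "temporal_precedence", "covariation", "mechanism",
    "alternatives_eliminated", "reproducibility", "confounder_check",
]

def causal_strength(claim):
    """Score a causal claim against the six checklist items.

    `claim` maps each check name to True (verified); missing or False
    entries count as unchecked.
    """
    passed = sum(claim.get(check, False) for check in CHECKS)
    if passed == 6:
        return "strong causal claim"
    if passed >= 4:
        return "reasonable working hypothesis, act cautiously"
    return "correlation only -- continue investigation"

# Mid-incident, only the first three lenses have been checked:
print(causal_strength({
    "temporal_precedence": True,
    "covariation": True,
    "mechanism": True,
}))  # correlation only -- continue investigation
```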
Connections¶
- Complements: Differential Diagnosis (Differential Diagnosis is the structured method for eliminating alternative explanations — the third condition for establishing causation; combine them: correlation points to the candidate, diff-dx eliminates alternatives, mechanism confirms causation)
- Complements: Five Whys (Five Whys builds the causal chain — but each link in the chain is a causal claim that should be validated, not just asserted; applying the correlation-vs-causation lens to each "why" answer prevents constructing a plausible-sounding but incorrect chain)
- Tensions: Bisect (Bisect finds the change temporally correlated with the regression, but bisect identifies correlation, not causation — always inspect the bisect result's diff and understand the mechanism before declaring root cause)
- Topic Packs: observability, incident-management
- Case Studies: bmc-clock-skew-cert-failure (BMC clock skew inverts the apparent order of events, making the cert failure appear to precede its actual cause; only clock sync verification restores the correct causal order), time-sync-skew-breaks-app (an NTP step correction is the true cause of an auth failure, while a coincidental config change is the false correlate — mechanism check and reproducibility test distinguish them)