Mental Model: Normalization of Deviance¶
Category: Human Factors

Origin: Diane Vaughan, sociologist, who coined the term in her exhaustive study of the 1986 Space Shuttle Challenger disaster, published as The Challenger Launch Decision (1996)

One-liner: Small, tolerated violations of standards become invisible over time because they never immediately cause harm — until one day they do.
The Model¶
Normalization of deviance describes a gradual process by which behaviors or conditions that were once recognized as risks become accepted as normal. The mechanism is simple: a team deviates from a standard (a skipped step, an ignored warning, a known-bad config), nothing bad happens, and the deviation is tacitly endorsed. Repeat enough times, and the deviation stops being perceived as a deviation at all. It is just how things are done here.
The critical word is gradual. No individual on the team decides to lower their standards. The team does not hold a meeting and vote to start ignoring alerts. Instead, each small exception is context-dependent and locally reasonable — the runbook step "doesn't apply to this environment," the alert "always fires on Mondays," the flaky test "never catches anything real." Over months or years, these local judgments accumulate into a culture that has silently drifted far from its stated safety standards. When a reviewer from outside looks at the practices, they are alarmed. Insiders no longer see what the outsider sees.
The other critical word is deviance — not in the moral sense, but in the statistical sense: deviation from the norm. Vaughan's insight is that organizations continuously produce deviant signals (warnings, near-misses, anomalies) but learn to reinterpret them as normal. The O-ring concern on the Space Shuttle was raised repeatedly and repeatedly absorbed. Engineers knew the O-rings sealed less reliably at low temperatures, and O-ring damage had been observed on earlier flights. Across 24 prior flights without catastrophe, each new instance of damage was rationalized as acceptable. The final launch decision was not made by reckless people — it was made by engineers who had normalized a known-bad signal through years of survived flights.
In ops and SRE contexts, the pattern is ubiquitous. Flaky tests that "everyone knows" don't catch real bugs get commented out or marked expected-fail. Disk usage alerts that fire every weekend because of batch jobs get silenced rather than fixed. Runbook steps like "verify the backup is current" get skipped because restores "never happen." Deployment pipelines accumulate bypasses — --force, --skip-checks, --ignore-warnings — each justified in the moment, collectively representing a system that no longer has the safeguards it says it has. The difference between this organization and a safe one is not visible in any single decision; it's only visible in the pattern across decisions over time.
The boundary conditions for this model: it applies most strongly in complex sociotechnical systems with infrequent feedback cycles — where failures are rare, consequences are delayed, and teams work under sustained pressure to ship. It applies less in systems where feedback is fast and direct (a bug that crashes immediately will not be tolerated). It also applies less when organizations have active mechanisms for surfacing near-misses — psychological safety to raise concerns, blameless postmortems, and explicit near-miss reporting. The antidote is not willpower but structure: making deviance visible before it becomes normal.
Visual¶
```
Time ─────────────────────────────────────────────────────────────────→

Stated standard   ████████████████████████████████████████████████████
Actual practice   ████████████████████████████████  (begins drifting)
                  ████████████████████              (further drift)
                  ████████████                      (drift continues)
                  ███████                           (where the team actually operates)
                  ██
                                                    ▲
                                           Catastrophic failure
                                           (team surprises itself)

Each step down:
  deviation occurs → nothing bad happens → deviation becomes "normal"
  → new deviation occurs on top of old one → repeat

Near-miss
signals:   ↑    ↑    ↑    ↑    ↑    ↑    ↑    ← each one absorbed/rationalized

Each signal is visible but not acted on — "that's just noise here"
```
Drift Scorecard (what it looks like in an ops team):

| Signal / Practice | Stated Standard | Normalized Reality |
|---|---|---|
| Disk usage alert | Page at 80% | Silenced; "fires daily" |
| Flaky CI test | Must pass green | Expected-fail for 8 months |
| Change freeze period | No deploys Friday | "Small fix" exceptions |
| Backup verification step | Always run | "We haven't needed it" |
| Terraform plan review | Two approvals | One approval "when busy" |
When to Reach for This¶
- After a postmortem when the team asks "how did we get here?" — the incident did not come from nowhere; trace the accumulated deviations
- When auditing on-call runbooks and finding steps that "everyone knows" to skip
- When reviewing alert noise: if an alert always fires and is always silenced, it has been normalized
- When new team members express surprise at a practice that veterans no longer notice
- When a near-miss is reported and the first reaction is "oh, that happens all the time"
- When an outage postmortem reveals that the warning signal existed for months
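The alert-noise check above can be made mechanical. A minimal sketch, assuming a hypothetical history export in the shape `{alert_name: (times_fired, times_silenced_without_action)}` — the function name, thresholds, and data shape are illustrative, not any real monitoring API:

```python
# Flag alerts that have been normalized: they fire regularly and are
# (almost) always silenced without any remediating action.
def normalized_alerts(history, min_fires=10, silence_ratio=0.9):
    """history: {alert_name: (times_fired, times_silenced_without_action)}
    Returns alert names whose silence rate suggests normalization."""
    flagged = []
    for name, (fired, silenced) in history.items():
        if fired >= min_fires and silenced / fired >= silence_ratio:
            flagged.append(name)
    return flagged

history = {
    "disk-usage-sunday": (52, 52),  # fires weekly, always silenced
    "oom-killer": (3, 0),           # rare, and acted on when it fires
}
print(normalized_alerts(history))  # → ['disk-usage-sunday']
```

The thresholds are judgment calls; the point is that "always fires, always silenced" becomes a queryable property instead of tribal knowledge.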
When NOT to Use This¶
- Do not use it to retroactively blame individuals for decisions that were locally reasonable at the time — the model explains a systemic drift, not individual failures of character; applying it to blame a person is itself a form of hindsight bias
- Do not apply it in systems with fast, tight feedback loops where the "drift" hypothesis doesn't hold — a service that crashes on every bad deploy self-corrects quickly without normalization accumulating
- Do not treat "this is how we've always done it" as automatically a normalization risk — some practices are genuinely low-risk and the team's judgment that they don't require the standard procedure is correct; the model is about risk-bearing deviations, not all procedural flexibility
- Do not use it as an excuse to never accept any pragmatic exception — some deviations are conscious risk trade-offs, not normalization; the difference is whether the deviation is visible, deliberate, and revisited or invisible, habitual, and forgotten
Applied Examples¶
Example 1: The Silenced Disk Alert¶
A team runs a batch analytics job every Sunday night that generates large temporary files. The disk-usage alert (threshold: 80%) fires every Sunday at 2 AM. The on-call engineer silences it and goes back to sleep. After six months, the alert has an auto-silence rule in PagerDuty. After twelve months, no one remembers why the rule exists.
In month fifteen, a runaway log rotation bug causes the root filesystem — not the data volume — to fill at 2 AM on a Sunday. The alert fires. The auto-silence rule triggers. The root filesystem hits 100% at 4 AM. sshd cannot write its pid file. The team cannot log in. Services that write to /var/log begin failing. The outage lasts four hours.
Applying the model: trace backward. The signal (disk alert at 2 AM) was real every time it fired. The team's response (silence) was locally reasonable each time. The accumulated result was a system with no disk alerting — the team believed they had alerting, but the stated standard and the operational reality had diverged completely. The runbook said "disk alerts page on-call." The operational reality was "disk alerts are silenced."
Fix: whenever a signal is silenced or suppressed, treat the suppression as a tracked deviation with an owner, an expiry date, and a review trigger. Suppressing a signal should cost effort, not save it.
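A minimal sketch of that fix, assuming a hypothetical in-process registry (the `Suppression` class and field names are illustrative, not PagerDuty's API): every suppression carries a reason, an owner, and an expiry, and expired suppressions simply stop applying, which forces the review the team would otherwise skip.

```python
from dataclasses import dataclass
from datetime import date

# Hypothetical sketch: every silenced signal is a tracked, expiring
# deviation rather than a permanent auto-silence rule.
@dataclass
class Suppression:
    alert_name: str
    reason: str   # why the signal is being suppressed
    owner: str    # who is accountable for revisiting it
    expires: date # suppression is void after this date

def active_suppressions(suppressions, today):
    """Return only suppressions still in force; anything expired
    pages again and triggers a review."""
    return [s for s in suppressions if s.expires >= today]

registry = [
    Suppression("disk-usage-80pct", "Sunday batch job fills /data",
                "alice", date(2024, 3, 31)),
]

# Before expiry the alert stays silenced; after expiry it pages again.
print(len(active_suppressions(registry, date(2024, 3, 1))))  # → 1
print(len(active_suppressions(registry, date(2024, 6, 1))))  # → 0
```

The design choice is the expiry date: it converts "silence forever by default" into "silence until a named person renews the decision."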
Example 2: The --force-merge Habit¶
A small platform team manages a monorepo. Early on, the CI pipeline has a flaky test that occasionally fails the merge-gate check. In the first week, a senior engineer shows a junior how to re-run the check: "if it's the flaky one, just re-run." In the second week, someone shows them --force-merge in the GitHub UI when re-running is too slow. Within two months, --force-merge is the default for anyone in a hurry.
Eighteen months later, the team has grown. New engineers see --force-merge used by senior engineers and adopt it. The original flaky test was fixed in month three; no one noticed that the justification for the behavior had disappeared. Now the merge gate is bypassed for all reasons, including real failures. A critical security-lint check fails in CI; the engineer, in a hurry, --force-merges. The vulnerability ships to production.
Applying the model: the original deviation (bypass for a known-flaky test) was reasonable. The normalization occurred when the bypass became the default path rather than a deliberate exception. The test of whether a deviation is normalized: can any team member articulate why this exception exists, when it applies, and who decides? If the answer is vague or "we just do it this way," normalization has occurred.
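That articulation test can be enforced in tooling. A minimal sketch, assuming a hypothetical wrapper around the merge-gate bypass — `force_merge`, its parameters, and the audit log are illustrative names, not a real GitHub API:

```python
import datetime

# Hypothetical sketch: if a bypass must exist, make it a deliberate,
# attributable exception rather than a silent default path.
def force_merge(pr_id, reason, approver, audit_log):
    """Permit a gate bypass only with an articulated reason and a named
    approver, and record it so the pattern of bypasses can be reviewed."""
    if not reason.strip():
        raise ValueError("bypass requires an articulated reason")
    audit_log.append({
        "pr": pr_id,
        "reason": reason,
        "approver": approver,
        "at": datetime.datetime.now(datetime.timezone.utc).isoformat(),
    })
    return True

log = []
force_merge("PR-1234", "known-flaky test; fix tracked in TICKET-9",
            "senior-oncall", log)
# A reasonless bypass raises instead of silently succeeding:
# force_merge("PR-1235", "", "anyone", log)  -> ValueError
```

Reviewing the audit log periodically answers the normalization question directly: if most entries cite reasons that no longer exist, the exception has outlived its justification.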
The Junior vs Senior Gap¶
| Junior | Senior |
|---|---|
| Treats each alert noise, flaky test, or skipped step as an isolated nuisance to work around | Treats the same signals as evidence of accumulated drift — asks "why does this keep happening?" |
| Silences the alert because it's easier than fixing the underlying cause | Tracks suppressed alerts as technical debt; refuses to let suppressions go stale |
| Inherits a practice ("everyone skips this step") without asking why it's skipped | Questions inherited exceptions: "was this a deliberate decision or did we just drift here?" |
| Perceives the team's tolerance of known-bad practices as evidence those practices are safe | Knows that survival so far is not evidence of safety; distinguishes "we haven't been hurt yet" from "this is safe" |
| Focuses on the immediate task; drift is invisible at the individual decision level | Maintains a longer time horizon; periodically audits stated standards against operational reality |
| Accepts that "things are always messy in practice" without distinguishing productive pragmatism from erosion | Distinguishes between conscious trade-offs (risk accepted, documented, owned) and normalized deviance (risk invisible) |
Connections¶
- Complements: Alert Fatigue (see alert-fatigue.md) — alert fatigue is the mechanism by which ops teams normalize signal suppression at scale; normalization of deviance explains why the suppression stops feeling like a problem
- Complements: Hindsight Bias (see hindsight-bias.md) — after a normalized-deviance failure, hindsight bias causes investigators to believe the drift was obvious; understanding both models together produces better postmortems
- Tensions: Chesterton's Fence — the principle that you shouldn't remove a rule until you understand why it exists; normalization of deviance is partly caused by forgetting why a rule exists, so Chesterton's Fence is the structural antidote; tension arises when teams invoke Chesterton to resist legitimate relaxation of outdated rules
- Topic Packs: incident-management, alerting-rules
- Case Studies: disk-full-root-services-down (normalized disk alert suppression led to detection failure), link-flaps-bad-optic (repeated link-flap events tolerated until physical failure; this model explains why the signal was not acted on)