Portal | Level: L2: Operations | Topics: Incident Psychology, Incident Response, On-Call & Incident Command | Domain: DevOps & Tooling

The Psychology of Incidents - Primer

Why This Matters

The hardest part of incident response is not technical. It's human. Cognitive biases distort your thinking under pressure. Stress narrows your attention to the wrong things. Team dynamics turn a war room into a blame session or a silent standoff. The best technical skills in the world don't help if your brain is sabotaging your decision-making at 3am.

In the Navy, we trained for high-stress decision-making because lives depended on it. The same principles apply to incident response — the stakes are different, but the cognitive traps are identical. Understanding how your brain fails under pressure is the difference between a 15-minute incident and a 3-hour disaster.

Core Concepts

1. How Stress Affects Decision-Making

The Stress-Performance Curve (Yerkes-Dodson):

  Performance
       │         ╱╲
       │        ╱  ╲
       │       ╱    ╲
       │      ╱      ╲
       │     ╱        ╲
       │    ╱          ╲
       │   ╱            ╲
       │──╱──────────────╲──────
       └─────────────────────────
       Low     Optimal     High
              Stress Level

  Low stress:   Bored, unfocused, slow to respond
  Optimal:      Alert, creative, fast decision-making
  High stress:  Tunnel vision, cognitive rigidity, poor decisions

  At 3am after 2 hours of debugging:
  → You are past optimal. Your decisions are getting worse.
  → This is when escalation saves the incident.

Under high stress, your brain does predictable things:

  Stress Effect           What Happens                                 Incident Impact
  ──────────────────────  ───────────────────────────────────────────  ───────────────────────────────────────────────
  Tunnel vision           Focus narrows to one thing                   Miss the actual root cause
  Working memory shrinks  Can hold fewer facts simultaneously          Lose track of what you've checked
  Default to familiar     Reach for what you know, not what's needed   Apply the wrong fix from a different incident
  Time distortion         Minutes feel like seconds                    Spend 45 minutes on one theory without noticing
  Communication degrades  Shorter, vaguer, more assumptive             Team members work at cross purposes

2. Cognitive Biases During Outages

Anchoring Bias

What it is:
  The first piece of information you encounter becomes an anchor
  that distorts all subsequent analysis.

Example:
  Someone says "I think it's the database" at minute 2 of the incident.
  For the next hour, every investigation is database-focused.
  The actual cause is a DNS change that happened 30 minutes ago.
  But once "database" was anchored, nobody checked DNS.

Military parallel:
  In threat assessment, the first report from the field shapes all
  subsequent analysis — even when later reports contradict it.
  We trained to explicitly re-evaluate initial assumptions every 15 minutes.

Counter-measure:
  Write down three hypotheses before investigating any.
  Set a 15-minute timer. When it rings, ask: "Are we still pursuing
  the right theory? What evidence do we have against it?"
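
In tooling terms, the hypothesis-list-plus-timer counter-measure is just a small data structure. A minimal sketch (illustrative Python; the class and method names are invented here, not from any incident tool):

```python
import time

class HypothesisLog:
    """Illustrative sketch: keep several hypotheses on the board and
    flag when the periodic re-evaluation (15 min by default) is due."""

    def __init__(self, interval_s=900):
        self.interval_s = interval_s
        self.hypotheses = {}          # name -> {"for": [...], "against": [...]}
        self.last_review = time.time()

    def add(self, name):
        self.hypotheses[name] = {"for": [], "against": []}

    def note(self, name, kind, evidence):
        # kind is "for" or "against"; recording evidence AGAINST a
        # theory is the whole point of the exercise
        self.hypotheses[name][kind].append(evidence)

    def review_due(self):
        return time.time() - self.last_review >= self.interval_s

    def reviewed(self):
        self.last_review = time.time()
```

In practice this is usually a pinned message in the incident channel rather than code; what matters is that the structure — three theories, evidence against each, a recurring review — is explicit rather than held in one person's head.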

Confirmation Bias

What it is:
  You search for evidence that supports your theory and
  unconsciously ignore evidence that contradicts it.

Example:
  Theory: "The deploy broke something."
  You find one error log entry near the deploy time → "See! Confirmed!"
  You ignore that the error rate actually started 10 minutes before the deploy.
  You ignore that the same error exists in yesterday's logs.

Counter-measure:
  Actively try to disprove your theory. Ask:
  "What evidence would prove me WRONG?"
  "If this theory is correct, what else should I see?"
  "If this theory is wrong, what should I check instead?"

Sunk Cost Fallacy

What it is:
  You've invested 45 minutes investigating a theory.
  Abandoning it feels like wasting that time.
  So you keep going even when the evidence doesn't support it.

Example:
  "I've been tracing this network path for 40 minutes.
   I can't just throw that away. Let me check one more thing..."
  Meanwhile, the actual cause is obvious to someone who just joined.

Counter-measure:
  Time-box investigations. Hard 15-minute rule.
  After 15 minutes: "What have I learned? Is this still the best path?"
  Past time spent is irrelevant to the correct next action.

Plan Continuation Bias

What it is:
  Once you've started a remediation, you're reluctant to abort
  even when signals indicate it's not working or making things worse.

Example:
  You start a database failover. Midway through, you realize the
  replica is also degraded. Continuing the failover will make things
  worse. But the failover is 60% complete. You continue because
  "we're already this far in" instead of aborting.

Military parallel:
  In aviation, plan continuation bias kills pilots. The runway
  is too short. The weather is worse than forecast. But you're
  already on approach, so you continue instead of going around.

Counter-measure:
  Define abort criteria BEFORE starting any remediation.
  "If error rate doesn't decrease within 5 minutes of rollback,
  we abort the rollback and try a different approach."
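
One way to make the abort decision mechanical, instead of a judgment call made mid-remediation, is to wrap the remediation in a watchdog. A minimal sketch (Python; the health-check and abort callbacks are placeholders you would supply):

```python
import time

def run_with_abort(check_recovered, abort, timeout_s=300, poll_s=15):
    """Poll a health check after starting a remediation; if it has not
    recovered within the window agreed BEFORE starting, trigger the
    abort path instead of pressing on. Both callbacks are placeholders:
    check_recovered() returns True once metrics look healthy, and
    abort() runs the pre-agreed bail-out."""
    deadline = time.monotonic() + timeout_s
    while time.monotonic() < deadline:
        if check_recovered():
            return "recovered"
        time.sleep(poll_s)
    abort()  # plan continuation bias never gets a vote
    return "aborted"
```

Usage might look like `run_with_abort(lambda: error_rate() < baseline, abort_rollback, timeout_s=300)` — where `error_rate`, `baseline`, and `abort_rollback` are hypothetical names for your own metrics and rollback hooks. The design point is that the abort criterion is encoded before the remediation starts.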

3. The "Just Reboot It" Impulse

Why it's tempting:
  - Rebooting is a known action with a predictable outcome
  - It feels like "doing something" (action bias)
  - It has worked before (availability heuristic)
  - The uncertainty of diagnosis is uncomfortable

Why it's dangerous:
  - Destroys evidence (process state, memory contents, connection state)
  - Masks the root cause (it'll come back)
  - May not help (if the cause is external)
  - May make it worse (if state is corrupted, reboot may not recover)

When rebooting IS correct:
  - Kernel panic or unresponsive kernel
  - Hardware error requiring reinitialization
  - Known issue with documented reboot-as-fix (with ticket for root cause)

The rule:
  Capture state FIRST. Then reboot if needed.
  A restart that fixes something temporarily is a clue, not a solution.
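
"Capture state first" works best as a one-command script kept in the runbook, so nobody has to remember diagnostics at 3am. A minimal sketch (Python; which diagnostic commands exist varies by platform, so each one is best-effort and the command list here is illustrative):

```python
import datetime
import pathlib
import subprocess

# Illustrative command set; substitute whatever your platform provides
# (dmesg, journalctl, jstack, a core dump, etc.).
SNAPSHOT_CMDS = {
    "processes.txt": ["ps", "aux"],
    "sockets.txt":   ["ss", "-tanp"],
    "disk.txt":      ["df", "-h"],
    "uptime.txt":    ["uptime"],
}

def capture_state(outdir="/tmp"):
    """Best-effort pre-reboot snapshot. Missing or hung commands are
    recorded as unavailable rather than blocking the reboot."""
    stamp = datetime.datetime.now().strftime("%Y%m%dT%H%M%S")
    dest = pathlib.Path(outdir) / f"incident-{stamp}"
    dest.mkdir(parents=True, exist_ok=True)
    for fname, cmd in SNAPSHOT_CMDS.items():
        try:
            out = subprocess.run(cmd, capture_output=True, text=True,
                                 timeout=10).stdout
        except (OSError, subprocess.TimeoutExpired):
            out = f"<{cmd[0]} unavailable>\n"
        (dest / fname).write_text(out)
    return dest
```

The timeout and the broad except clause are deliberate: a snapshot script that hangs on a sick host defeats its own purpose.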

4. Team Dynamics in War Rooms

The HiPPO Effect (Highest Paid Person's Opinion)

What it is:
  The most senior person in the room states a theory.
  Nobody contradicts them — even when they're wrong.
  The investigation follows the senior person's theory
  regardless of evidence.

Example:
  VP of Engineering joins the war room. Says "This looks like
  the same issue we had in Q3 — check the CDN."
  Three engineers stop what they're doing and check the CDN.
  The CDN is fine. But nobody says so because the VP suggested it.

Counter-measure:
  IC runs the investigation, not the highest-ranking person.
  The VP's role is to provide resources and remove blockers,
  not to direct the technical investigation.
  "Thank you for the suggestion — we'll add it to the hypothesis list."

The Silent War Room

What it is:
  Ten people in a channel. Nobody talks. Everyone is
  individually investigating. Nobody shares findings.
  Three people check the same thing. Two people check nothing
  because they assume someone else is on it.

Why it happens:
  - Diffusion of responsibility ("someone else will speak up")
  - Fear of saying something wrong under pressure
  - Unclear roles and expectations

Counter-measure:
  IC actively polls: "Bob, what did you find on the database?"
  IC assigns specific tasks: "Carol, check the deploy history and report back in 5."
  Encourage partial findings: "I don't have the answer yet, but I can rule out X."

The Blame Spiral

What it is:
  During the incident (not after), people start assigning blame.
  "Who pushed this code?" "Why didn't QA catch this?" "Who changed the config?"

What it causes:
  - People stop sharing information (self-protection)
  - The person who caused the issue hides relevant context
  - Investigation stalls because people are defending, not diagnosing

Counter-measure:
  IC immediately shuts down blame language:
  "We're focused on fixing this, not finding fault.
   We'll do a blameless postmortem after resolution."
  Model the behavior: "What happened?" not "Who did this?"

5. Building Psychological Safety

Psychological Safety = people believe they can:
  - Ask questions without looking stupid
  - Admit mistakes without being punished
  - Raise concerns without being dismissed
  - Offer ideas without being ridiculed

Why it matters for incidents:
  - The person who caused the incident often has the most context
  - If they're afraid to speak up, you lose critical information
  - Junior engineers may notice things seniors miss
  - Near-miss reporting only happens when people feel safe reporting

How to build it:
  1. Leaders go first: share their own mistakes publicly
  2. Thank people for speaking up, especially with bad news
  3. Blameless postmortems (consistently, not just when it's easy)
  4. Celebrate near-miss reports as valuable intelligence
  5. Never punish someone for causing an incident if they were
     following the process (fix the process, not the person)

6. Post-Incident Emotional Processing

After a major incident, people feel:
  - Relief (it's over)
  - Guilt ("I caused this" or "I should have caught it sooner")
  - Anger ("Why wasn't this prevented?")
  - Exhaustion (physical and emotional)
  - Anxiety ("Will it happen again?")

What NOT to do:
  - Jump straight into the postmortem while emotions are raw
  - Dismiss feelings ("It's just production, relax")
  - Publicly identify the person who caused the incident
  - Schedule the person for on-call the next day

What TO do:
  - Acknowledge the stress: "That was a tough one. How's everyone doing?"
  - Give people time to decompress (at minimum, let them end the workday early)
  - Schedule the postmortem 24-48 hours later, not immediately
  - Check in with the person closest to the cause — privately, supportively
  - Normalize the experience: "Incidents happen. That's why we have this process."

7. Decision Fatigue and Handoffs

Decision fatigue:
  After 2+ hours of continuous decision-making, quality degrades.
  You start defaulting to the easiest option, not the best one.
  You start avoiding decisions entirely ("let's wait and see").

Signs you're fatigued:
  - Repeating the same checks you already did
  - Staring at dashboards without processing the information
  - Agreeing to actions you'd normally question
  - Getting irritable with teammates

Counter-measure:
  Rotate the IC role during long incidents.
  IC handoff every 2 hours (or sooner if fatigued).
  The handoff is explicit: current state, hypotheses tested,
  next steps, what hasn't been checked yet.
  The outgoing IC takes a real break — not "I'll keep watching."
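
The explicit handoff can be enforced with a template, so nothing is left to the fatigued outgoing IC's memory. A sketch (Python; the template fields are invented here to mirror the list above):

```python
HANDOFF_TEMPLATE = """\
IC HANDOFF — {time}
Outgoing: {outgoing}   Incoming: {incoming}
Current state:     {state}
Hypotheses tested: {tested}
Next steps:        {next_steps}
NOT yet checked:   {unchecked}
"""

def handoff_note(**fields):
    """Refuse to produce a handoff with any field blank — an empty
    'NOT yet checked' line usually means nobody thought about it."""
    missing = [k for k, v in fields.items() if not str(v).strip()]
    if missing:
        raise ValueError(f"handoff incomplete, fill in: {missing}")
    return HANDOFF_TEMPLATE.format(**fields)
```

Whether this lives in code, a Slack workflow, or a paper checklist matters less than the forcing function: the outgoing IC cannot hand off until every field, including what has NOT been checked, is filled in.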

Common Pitfalls

  1. Thinking you're immune to bias — You're not. The biases are strongest in people who believe they're purely rational. Awareness helps, but process (timers, checklists, role rotation) is more reliable than willpower.
  2. Running postmortems while emotions are hot — The meeting turns into blame or defensiveness. Wait 24-48 hours. Let people process the emotions first.
  3. Ignoring the human cost of incidents — Frequent high-severity incidents cause cumulative stress. Track incident load per person, not just per service. Burnout is a reliability risk.
  4. Not training for stress — The time to practice incident response is not during an incident. Run game days, tabletop exercises, and chaos experiments so the process is muscle memory when it matters.
  5. Equating speed with competence — The fastest debugger is not always the best incident responder. Coordination, communication, and judgment under pressure matter more than raw technical speed.
  6. Treating the "just reboot it" engineer as incompetent — Sometimes a reboot IS the right call. The problem is when it's the ONLY tool in the toolbox. Build diagnostic skills alongside operational instincts.

Wiki Navigation

  • Incident Command & On-Call (Topic Pack, L2) — Incident Response, On-Call & Incident Command
  • Runbook Craft (Topic Pack, L1) — Incident Response, On-Call & Incident Command
  • Vendor Management & Escalation (Topic Pack, L1) — Incident Response, On-Call & Incident Command
  • Change Management (Topic Pack, L1) — Incident Response
  • Chaos Engineering Scripts (CLI) (Exercise Set, L2) — Incident Response
  • Debugging Methodology (Topic Pack, L1) — Incident Response
  • Incident Psychology Flashcards (CLI) (flashcard_deck, L1) — Incident Psychology
  • Incident Response Flashcards (CLI) (flashcard_deck, L1) — Incident Response
  • Incident Simulator (18 scenarios) (CLI) (Exercise Set, L2) — Incident Response
  • Investigation Engine (CLI) (Exercise Set, L2) — Incident Response