The Psychology of Incidents - Street-Level Ops

What experienced incident responders know about the human side of outages — the biases that extend them, the team dynamics that help or hurt, and the mental models that save time.

Quick Diagnosis Commands

# These aren't system commands — they're human diagnostic checks.
# Run these mentally at the start of every incident:

# 1. Bias check: What's my first theory? Write it down.
#    Now write two alternatives. Investigate the fastest-to-disprove first.

# 2. Time check: How long have I been on this?
#    Set a 15-minute timer. When it rings, reassess.

# 3. Stress check: Am I past the peak of the stress-performance curve?
#    Signs: irritability, tunnel vision, repeating steps, surprise at how much time has passed.

# 4. Team check: Is anyone working on the same thing as me?
#    Ask in the channel: "I'm checking [X]. Anyone else looking at this?"

# 5. Evidence check: What have I actually proven (not assumed)?
#    List confirmed facts vs. theories. Act on facts.

# For the IC — run this loop every 10 minutes:
# "What do we know? What are we investigating?
#  Who is assigned to what? What's blocked? Who needs help?"
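Sketch: check 1 says to investigate the fastest-to-disprove theory first. If you keep your hypotheses in a scratch file, that ordering is a one-line sort. This is a minimal illustration, not a tool the checklist requires; the `Hypothesis` class, the example theories, and the minute estimates are all made up for the example.

```python
from dataclasses import dataclass


@dataclass
class Hypothesis:
    description: str
    minutes_to_disprove: int  # rough gut estimate (hypothetical values below)


def investigation_order(hypotheses):
    """Return hypotheses sorted so the cheapest one to rule out comes first."""
    return sorted(hypotheses, key=lambda h: h.minutes_to_disprove)


# Example theories are illustrative only.
candidates = [
    Hypothesis("Bad deploy", 10),
    Hypothesis("Database replica lag", 2),
    Hypothesis("Upstream DNS failure", 5),
]

for h in investigation_order(candidates):
    print(f"{h.minutes_to_disprove:>3} min  {h.description}")
```

The point is the habit rather than the code: your first theory goes into the same list as the alternatives and has to compete on how quickly it can be ruled out.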

Gotcha: The 3am Brain

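It's 3am. You've been asleep for 2 hours. The page fires. Sleep inertia and sleep loss put your cognitive function well below normal (figures like 60% get quoted, but treat any single number as a rough guide). Your short-term memory is impaired. Your decision-making resembles mild intoxication: sleep-deprivation studies often compare it to a blood alcohol concentration of around 0.05%. You are not equipped to make complex judgment calls.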

Fix: At 3am, follow the runbook. Do not improvise. Do not make architectural decisions. If the runbook doesn't cover this situation, execute the simplest safe action (rollback, failover to standby, page the secondary) and escalate. Write a note of what you observe — your 3am observations are useful data, but your 3am judgments are unreliable.

Gotcha: The Expert Who Freezes

Your most senior engineer joins the war room. Everyone expects them to solve it. They stare at the screen. They don't type anything. Five minutes pass. They seem stuck. What's happening: analysis paralysis. They know enough to see how complex the problem might be. Juniors would have already started poking things. The expert is frozen because they're running through all the ways their first action could make things worse.

Fix: Give experts permission to be wrong. "What's your best guess with current information? We can always pivot." Break the paralysis by asking for a small, reversible action: "What's the cheapest thing we can check to narrow this down?" Expertise sometimes needs a nudge to move from analysis to action.

Under the hood: This is loosely the flip side of the Dunning-Kruger effect. Novices act quickly because they underestimate complexity. Experts hesitate because they see all the ways an action could cascade. Neither extreme is optimal during an incident. The IC's job is to calibrate the pace: slow enough for evidence-based decisions, fast enough to prevent customer impact from compounding.

Gotcha: The Quiet Person With the Answer

Fifteen people are in the war room. A junior engineer in their second week has noticed something in the logs that everyone else missed. They don't speak up because the senior engineers are talking and they don't want to seem presumptuous. The answer sits in their terminal for 20 minutes.

Fix: IC must actively solicit input from quiet participants. "Alice, you've been looking at the logs — anything unusual?" Create a low-barrier way to contribute: a dedicated thread for "observations" where people post raw data without needing to frame it as a theory. Normalize partial information: "I don't know what it means, but I see X."

Gotcha: Action Bias - Doing Something Wrong Feels Better Than Waiting

The monitoring shows a gradual degradation. The right move is to observe for 5 more minutes to gather data. But the pressure to "do something" is overwhelming. Someone restarts a service. Someone changes a config. Someone scales up the cluster. Now there are three simultaneous changes and you can't tell what's a cause, what's a fix, and what's making things worse.

Fix: The IC controls the pace of action. One change at a time. "We're going to observe for 5 minutes before making any changes." If someone wants to take action: "Tell me what you want to do and why. I'll approve or queue it." Controlled, sequential changes with observation windows between them.
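Sketch: "one change at a time, with an observation window" can be modeled as a small queue. The code below is a minimal illustration under stated assumptions, not an existing tool: the `ChangeQueue` class, its method names, and the example proposals are hypothetical. The 5-minute window comes from the fix above.

```python
from collections import deque
from datetime import datetime, timedelta


class ChangeQueue:
    """Release one proposed change at a time, with an observation window after each."""

    def __init__(self, observe_minutes=5):
        self.observe = timedelta(minutes=observe_minutes)
        self.pending = deque()
        self.last_applied_at = None

    def propose(self, who, what, why):
        """Record a proposed change; nothing is applied until the IC releases it."""
        self.pending.append((who, what, why))

    def next_change(self, now):
        """Return the next change to apply, or None if still observing or nothing is queued."""
        still_observing = (
            self.last_applied_at is not None
            and now - self.last_applied_at < self.observe
        )
        if still_observing or not self.pending:
            return None
        self.last_applied_at = now
        return self.pending.popleft()


# Hypothetical incident: two people want to act at the same moment.
queue = ChangeQueue(observe_minutes=5)
start = datetime(2024, 1, 1, 3, 0)
queue.propose("alice", "restart the API service", "memory has been climbing")
queue.propose("bob", "scale up the cluster", "CPU is saturated on two nodes")

first = queue.next_change(start)                           # Alice's change is released
blocked = queue.next_change(start + timedelta(minutes=2))  # None: still observing
second = queue.next_change(start + timedelta(minutes=5))   # Bob's change is released
```

In practice the queue is usually just the IC's notes. What matters is the rule the sketch encodes: every change needs a "what" and a "why", and nothing new is released until the previous change has been observed.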

Gotcha: Blame During the Incident

Fifteen minutes into the incident, someone asks "Who deployed this?" The tone is accusatory. The engineer who deployed goes quiet. They had context about what changed and how to roll it back. That context is now locked behind fear.

Fix: The IC shuts this down immediately and publicly: "We don't do blame during incidents. We're focused on recovery. Alice, can you tell us about the recent deploy so we can assess whether a rollback would help?" Redirect from "who did this" to "what happened and what do we do about it." The postmortem is where you discuss process failures — never during the incident.

Pattern: The IC's Mental Loop

Every 10 minutes, the IC should run this mental checklist:

The IC Loop (every 10 minutes):

  1. STATUS:   "What is the current customer impact?"
  2. THEORY:   "What is our leading hypothesis?"
  3. EVIDENCE: "What evidence supports or contradicts it?"
  4. ACTION:   "What is the next action and who is doing it?"
  5. BLOCKED:  "Is anyone stuck or waiting for something?"
  6. TIME:     "How long has this been going on? Do we need to escalate?"
  7. COMMS:    "When was the last external update? Is one due?"
  8. PEOPLE:   "Is anyone fatigued? Does anyone need a break or rotation?"

  Post the summary to the channel:
  "Update: [impact]. Theory: [X]. Evidence: [Y]. Next step: [Z] by @person.
   Next update in 10 minutes."
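Sketch: the summary template above is easy to turn into a helper so every update has the same shape. This is a minimal illustration; the function name `format_ic_update`, its parameters, and the example incident details are assumptions, not part of any chat tool's API.

```python
def format_ic_update(impact, theory, evidence, next_step, owner, next_update_minutes=10):
    """Fill in the IC status template from the loop above."""
    return (
        f"Update: {impact}. Theory: {theory}. Evidence: {evidence}. "
        f"Next step: {next_step} by @{owner}. "
        f"Next update in {next_update_minutes} minutes."
    )


# Example values are hypothetical.
update = format_ic_update(
    impact="checkout errors for some users",
    theory="recent deploy broke the payment client",
    evidence="error rate rose right after the deploy",
    next_step="roll back the deploy",
    owner="alice",
)
print(update)
```

A fixed format means nobody has to parse a new style of message every 10 minutes, and a missing field (no owner, no evidence) is obvious at a glance.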

Pattern: The Pre-Incident Briefing

Before each on-call shift, brief the incoming engineer on current context and psychological readiness:

Pre-Shift Briefing:

  1. Known risks: "Marketing push tomorrow — expect 2x traffic"
  2. Recent incidents: "Database failover last week — watch for replica lag"
  3. Team state: "Bob is on vacation, secondary is Carol this week"
  4. Self-care reminders:
     - Keep your phone charged and ringer on
     - Don't drink alcohol during your shift
     - If you get paged at night, take a light day the next day
     - If you feel overwhelmed during an incident, escalate — it's the smart move
  5. Runbook locations: [links]
  6. Escalation contacts: [names and numbers]

Pattern: The Debrief Cooldown

After a major incident, don't jump straight into the postmortem. Use a structured cooldown:

Incident Cooldown Protocol:

  Hour 0 (resolution): IC declares "all clear."
    - Thank everyone who participated
    - Assign postmortem author
    - Schedule postmortem review for 48 hours out

  Hour 0-2: Decompress
    - Team takes a break (coffee, walk, food)
    - No technical discussion about the incident
    - Check in on the person most affected: "How are you doing?"

  Hour 2-4: Light documentation
    - Scribe cleans up the timeline from the incident channel
    - Preserve artifacts (dashboards, log snippets, chat transcripts)
    - No analysis yet — just facts

  Hour 24-48: Postmortem draft
    - Author writes draft in calm, reflective state
    - Uses facts collected, not 3am memories

  Hour 48-72: Review meeting
    - Team reviews postmortem together
    - Focus on systems and processes, not individuals
    - Assign action items with owners and dates
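Sketch: the protocol's hour offsets can be turned into concrete calendar windows at the moment the IC declares "all clear". This is a minimal illustration; the names `COOLDOWN_PHASES` and `cooldown_schedule` and the example resolution time are assumptions. The phase names and hour ranges are copied from the protocol above.

```python
from datetime import datetime, timedelta

# (phase name, start hour, end hour), measured from resolution.
COOLDOWN_PHASES = [
    ("Decompress", 0, 2),
    ("Light documentation", 2, 4),
    ("Postmortem draft", 24, 48),
    ("Review meeting", 48, 72),
]


def cooldown_schedule(resolved_at):
    """Return (phase, window_start, window_end) tuples for each cooldown phase."""
    return [
        (name, resolved_at + timedelta(hours=start), resolved_at + timedelta(hours=end))
        for name, start, end in COOLDOWN_PHASES
    ]


# Hypothetical resolution time.
resolved = datetime(2024, 1, 1, 3, 30)
schedule = cooldown_schedule(resolved)
for name, window_start, window_end in schedule:
    print(f"{name:<20} {window_start:%a %H:%M} to {window_end:%a %H:%M}")
```

Posting the resulting windows alongside the "all clear" message makes it explicit that no one is expected to start analysis right away.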

Pattern: The Cognitive De-Bias Toolkit

Bias                | Counter-Measure               | When to Apply
────────────────────┼───────────────────────────────┼─────────────────────────────
Anchoring           | Write 3 hypotheses first      | Start of incident
Confirmation        | Try to disprove your theory   | During investigation
Sunk cost           | 15-minute timebox             | Every 15 minutes
Plan continuation   | Pre-define abort criteria     | Before any remediation
Availability        | Check data, not memory        | When "this happened before"
Action bias         | Observe 5 min before acting   | When urge to "do something"
HiPPO effect        | IC leads, rank doesn't        | When senior person speaks
Diffusion of resp.  | Assign names to tasks         | In war room

None of these are natural. They all require practice.
Drill them in game days so they're automatic in incidents.

Remember: The two most dangerous biases in incident response are anchoring (fixating on the first theory) and sunk cost (continuing a failing approach because you already invested time). Counter both with one question: "If I just walked in fresh, would I still pursue this path?" If not, pivot immediately.

Emergency: Team Member in Visible Distress During Incident

Someone on the team is clearly overwhelmed — shaky voice on the bridge, silence after being asked a question, or visible frustration/anger in chat.

1. IC privately messages them: "Hey, how are you doing?
   No pressure — you can step back if you need to."

2. Give them an easy, concrete task or permission to disengage:
   "Can you handle comms updates for the next 30 minutes?"
   or "Why don't you take a break and come back fresh?"

3. Do NOT call attention to their distress publicly.

4. After the incident, check in privately.
   "That was a tough one. Want to talk about it?"

5. If this happens repeatedly, it may indicate:
   - On-call load too high (systemic issue)
   - Prior incident trauma (needs support)
   - Role mismatch (not everyone is wired for incident response)
   All of these are management problems, not personal failures.

One-liner: "Are you okay?" is the most powerful and underused question in incident response. Asking it privately via DM takes 10 seconds and can help prevent burnout, stress-induced mistakes, and long-term attrition from your on-call rotation.

Emergency: IC Is the One Who Made the Mistake

You're IC. You realize the incident was caused by your deploy, your config change, or your decision. Your impulse is to hide it or to over-compensate by working frantically.

1. Disclose immediately. "I think the deploy I pushed at 14:00
   may be related. Let me share what changed."

2. The information you have is the fastest path to resolution.
   Withholding it to protect your ego extends the incident.

3. If you can't be objective: hand off IC to someone else.
   "I'm too close to this — Carol, can you take IC?"

4. The team will respect you more for transparency.
   Everyone causes incidents. Not everyone has the integrity
   to say so during one.

5. In the postmortem, the same honesty applies.
   "I deployed without checking the canary metrics."
   The postmortem fixes the process, not the person.