The Psychology of Incidents: Footguns

Mistakes that turn your incident response into a blame circus, your war rooms into stress factories, and your team's mental health into collateral damage.


1. Postmortem immediately after the incident

The incident resolves at 4am after a 3-hour battle. Someone schedules the postmortem for 9am that morning. The team is exhausted, emotional, and defensive. The postmortem devolves into finger-pointing or awkward silence. Nobody learns anything useful. Several people leave the meeting feeling worse.

Fix: Schedule the postmortem 24-48 hours after resolution. People need time to process emotions, rest, and gain perspective. Collect timeline facts immediately (while fresh), but do the analysis after the adrenaline has cleared.


2. "Blameless" postmortem that blames with softer words

Your postmortem says "The engineer should have verified the migration on a staging database of comparable size." That's blame in a blazer. The engineer who ran the migration knows it's about them. Everyone knows. The word "blameless" in your template doesn't make it so.

Fix: Rewrite every "should have" as a system failure. Not "the engineer should have checked" but "the process did not include a mandatory staging verification step for migrations over 1M rows." Fix the system, not the sentence structure.

Remember Sidney Dekker's "Just Culture" substitution test: replace the person's name with "a new hire in their first week." If the same outcome is plausible, the system failed, not the individual. This reframing reliably exposes process gaps that blame language hides.
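
If you want to catch blame language before a draft circulates, a reviewer script can flag it mechanically. A minimal sketch in Python; the phrase list and the script itself are illustrative assumptions, not part of any standard postmortem tooling:

```python
import re
import sys

# Phrases that usually signal person-blame rather than system analysis.
# This pattern list is illustrative, not exhaustive.
BLAME_PATTERNS = [
    r"\bshould have\b",
    r"\bfailed to\b",
    r"\bforgot to\b",
    r"\bdidn't (check|verify|test|read)\b",
    r"\bhuman error\b",
]

def lint_postmortem(text: str) -> list[tuple[int, str]]:
    """Return (line_number, line) pairs that contain blame language."""
    findings = []
    for lineno, line in enumerate(text.splitlines(), start=1):
        if any(re.search(p, line, re.IGNORECASE) for p in BLAME_PATTERNS):
            findings.append((lineno, line.strip()))
    return findings

if __name__ == "__main__":
    draft = open(sys.argv[1]).read()
    for lineno, line in lint_postmortem(draft):
        print(f"line {lineno}: {line}")
        print("  -> reframe as a system gap: what guardrail was missing?")
```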


3. On-call as punishment

A junior engineer makes a mistake. Their manager puts them on extra on-call shifts to "build ownership." The engineer learns that mistakes lead to suffering. They stop reporting near-misses. They hide issues rather than escalating.

Fix: On-call is a responsibility distributed by rotation, never a disciplinary action. If someone needs to develop skills, pair them with a senior on-call as secondary — with mentorship, not punishment.

War story: A 2019 Honeycomb blog post documented how one team's "punishment on-call" policy led to a 40% attrition rate among junior engineers within 18 months. The engineers who stayed developed learned helplessness — they stopped reporting near-misses and hid incidents to avoid additional on-call shifts. The team's incident detection rate dropped 60%.


4. Ignoring on-call burnout because "everyone does it"

Your four-person team has been doing 24/7 on-call for two years. One person has chronic sleep disruption. Another has developed anxiety about their phone buzzing. You rationalize: "Every ops team does on-call." Yes, but healthy ops teams track on-call health and intervene before burnout.

Fix: Track pages per shift, sleep interruptions, and subjective on-call satisfaction. Review monthly. If any metric is degrading, act: fix noisy alerts, add team members, or reduce on-call scope. Burnout is a trailing indicator — by the time someone says "I can't do this," the damage is months old.

Under the hood: Sleep science research shows that a single night of interrupted sleep (common during on-call) reduces cognitive performance by 20-30% the next day, roughly equivalent to a blood alcohol level of 0.05-0.1%. An on-call engineer who was paged at 3am is making impaired decisions the following day.
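
To make the monthly review concrete, here is a minimal sketch of computing pages per shift and night interruptions from pager data. The record schema and field names are assumptions; adapt them to whatever your paging tool exports:

```python
from datetime import datetime

# Hypothetical page records as exported from a paging tool; the schema
# (engineer, ts) is an assumption for illustration.
pages = [
    {"engineer": "aisha", "ts": "2024-05-02T03:14:00"},
    {"engineer": "aisha", "ts": "2024-05-02T03:40:00"},
    {"engineer": "ben",   "ts": "2024-05-09T14:05:00"},
]

NIGHT_HOURS = range(0, 6)  # pages between midnight and 6am count as sleep interruptions

def oncall_health(pages: list[dict], shifts_per_engineer: int) -> dict:
    per_engineer: dict[str, dict] = {}
    for p in pages:
        ts = datetime.fromisoformat(p["ts"])
        stats = per_engineer.setdefault(p["engineer"], {"pages": 0, "night_pages": 0})
        stats["pages"] += 1
        if ts.hour in NIGHT_HOURS:
            stats["night_pages"] += 1
    # Normalize to per-shift rates so month-over-month trends are comparable.
    return {
        eng: {
            "pages_per_shift": s["pages"] / shifts_per_engineer,
            "night_pages_per_shift": s["night_pages"] / shifts_per_engineer,
        }
        for eng, s in per_engineer.items()
    }

print(oncall_health(pages, shifts_per_engineer=4))
```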


5. Senior engineers crowding out junior voices

SEV-1 incident. Three senior engineers are debating theories on the bridge. Two junior engineers have been quietly investigating and found relevant data. They don't interrupt because the seniors are talking authoritatively. The data that would resolve the incident sits unshared.

Fix: The IC must actively solicit input from all participants, especially the quieter ones. "We've heard from the seniors. Anyone else see anything?" Give people a structured way to contribute: a shared doc where anyone can post observations without having to interrupt a verbal discussion.


6. Treating every incident like a crisis

A SEV-3 issue affects one customer. The on-call engineer treats it like a SEV-1: opens a war room, pages three teams, sends a company-wide email. The actual SEV-3 is resolved in 10 minutes, but the organizational disruption takes an hour to settle. Next time a real SEV-1 happens, people are desensitized to the escalation.

Fix: Match response intensity to severity. SEV-3 doesn't need a war room — it needs one engineer and a ticket. Reserve the full incident response machinery for SEV-1 and SEV-2. If everything is a crisis, nothing is.
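
One way to make response intensity a default rather than an in-the-moment judgment call is to encode the severity-to-response mapping in your tooling. A minimal sketch, with hypothetical plan fields and thresholds:

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class ResponsePlan:
    open_war_room: bool
    teams_to_page: int
    broadcast_update: bool

# Severity-to-response mapping; the exact values are assumptions
# and should reflect your own severity definitions.
RESPONSE_PLANS = {
    "SEV-1": ResponsePlan(open_war_room=True,  teams_to_page=3, broadcast_update=True),
    "SEV-2": ResponsePlan(open_war_room=True,  teams_to_page=1, broadcast_update=False),
    "SEV-3": ResponsePlan(open_war_room=False, teams_to_page=0, broadcast_update=False),
}

def plan_for(severity: str) -> ResponsePlan:
    # Default to the quietest response; escalating later is cheaper
    # than desensitizing the org with false alarms.
    return RESPONSE_PLANS.get(severity, RESPONSE_PLANS["SEV-3"])

print(plan_for("SEV-3"))  # one engineer and a ticket, no war room
```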


7. No debrief for incidents that resolve quickly

The incident lasted 8 minutes. Quick rollback, no damage. No postmortem because "it was a short one." But the same deployment pattern has caused three 8-minute incidents this quarter. Without a postmortem, nobody connects the dots. The fourth time, the rollback doesn't work and it becomes a 2-hour SEV-1.

Fix: Quick incidents still deserve lightweight retrospectives. Not a full postmortem — a 5-minute written note: what happened, why, what prevents recurrence. The pattern emerges from the collection of small incidents, not from individual big ones.

Under the hood: Heinrich's Triangle (industrial safety research) found that for every major incident, there are roughly 29 minor incidents and 300 near-misses. Software systems show a similar pattern. Each quick rollback is a near-miss signal. If you only investigate major outages, you're ignoring 99% of your learning opportunities.
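
Connecting the dots only works if the 5-minute notes land somewhere queryable. A minimal sketch that groups retro notes by a cause tag and flags repeats; the note schema and threshold are assumptions:

```python
from collections import Counter

# Hypothetical 5-minute retro notes; 'cause_tag' is a free-form label
# chosen by whoever writes the note.
retro_notes = [
    {"date": "2024-04-03", "duration_min": 8, "cause_tag": "schema-migration-deploy"},
    {"date": "2024-05-11", "duration_min": 6, "cause_tag": "schema-migration-deploy"},
    {"date": "2024-05-20", "duration_min": 9, "cause_tag": "schema-migration-deploy"},
    {"date": "2024-05-28", "duration_min": 4, "cause_tag": "expired-tls-cert"},
]

REPEAT_THRESHOLD = 3  # arbitrary: three occurrences in a quarter = a pattern

def recurring_causes(notes: list[dict]) -> list[str]:
    counts = Counter(n["cause_tag"] for n in notes)
    return [tag for tag, n in counts.items() if n >= REPEAT_THRESHOLD]

for tag in recurring_causes(retro_notes):
    print(f"pattern detected: '{tag}' has recurred; schedule a real postmortem")
```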


8. Measuring mean-time-to-resolve but not human cost

Your MTTR is 25 minutes. Leadership celebrates. But those 25-minute resolutions involve a single on-call engineer at 3am, operating under extreme stress, making risky judgment calls solo. The MTTR is low because the human absorbs all the cost.

Fix: Measure both MTTR and human cost: pages per shift, sleep interruptions, time-to-first-escalation, post-incident satisfaction. A team that resolves in 25 minutes with two people and no stress is healthier than one that resolves in 15 minutes with one person and a panic attack.
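
A minimal sketch of reporting MTTR and human cost side by side, so the human cost cannot hide behind a good headline number. The incident record fields, including the 1-5 stress rating, are assumptions:

```python
from statistics import mean

# Hypothetical incident records combining timing and human-cost fields.
incidents = [
    {"minutes_to_resolve": 25, "responders": 1, "started_hour": 3,  "stress_rating": 5},
    {"minutes_to_resolve": 25, "responders": 2, "started_hour": 14, "stress_rating": 2},
]

def report(incidents: list[dict]) -> dict:
    return {
        "mttr_minutes": mean(i["minutes_to_resolve"] for i in incidents),
        "avg_responders": mean(i["responders"] for i in incidents),
        # Share of incidents handled solo in the middle of the night:
        # the cases where a low MTTR is being paid for in human terms.
        "solo_night_fraction": mean(
            1 if (i["responders"] == 1 and i["started_hour"] < 6) else 0
            for i in incidents
        ),
        "avg_stress": mean(i["stress_rating"] for i in incidents),  # 1-5 post-incident survey
    }

print(report(incidents))
```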


9. The hero culture

One engineer resolves every major incident. They're celebrated as the hero. The team becomes dependent on them. When they take vacation, the team can't handle a SEV-2. The hero burns out and quits. The team's incident capability drops to zero.

Fix: Distribute incident response skills deliberately. Pair junior engineers with seniors during incidents. Rotate the IC role so everyone builds the muscle. Celebrate team resilience, not individual heroics. The goal is a team that can handle anything, not a person.
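
You can put a number on hero dependence by measuring how concentrated incident resolution is. A minimal sketch; the resolver log is a hypothetical stand-in for whatever your incident tracker records:

```python
from collections import Counter

# Hypothetical log of who led the resolution of each incident this quarter.
resolved_by = ["kim", "kim", "kim", "kim", "priya", "kim", "sam", "kim"]

def hero_index(resolvers: list[str]) -> float:
    """Fraction of incidents resolved by the single busiest person.
    Near 1.0 means one hero; near 1/len(team) means well distributed."""
    counts = Counter(resolvers)
    return counts.most_common(1)[0][1] / len(resolvers)

idx = hero_index(resolved_by)
print(f"hero index: {idx:.2f}")
if idx > 0.5:
    print("one person is carrying most incidents: rotate the IC role deliberately")
```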


10. Not accounting for cognitive bias in your tools and processes

You know about anchoring, confirmation bias, and sunk cost fallacy. You've read the blog posts. But your incident process has no structural safeguards against them. No timers, no mandatory hypothesis rotation, no escalation triggers. You're relying on individuals to overcome their own biases in real time, under stress. They can't. Nobody can.

Fix: Build bias counter-measures into the process, not the person. Mandatory 15-minute reassessment timer. Three-hypothesis rule before investigation begins. Time-boxed escalation criteria. Pre-defined abort criteria for remediations. These are process steps, not suggestions. They run whether you feel biased or not.

Gotcha: Confirmation bias peaks during incidents: once someone declares "it's the database," the team unconsciously filters all evidence to support that theory and dismisses contradicting data. The three-hypothesis rule forces the team to actively maintain competing explanations until one is definitively proven.
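
A minimal sketch of the three-hypothesis rule and the reassessment timer as process gates rather than suggestions. All class and field names here are hypothetical, not from any incident framework:

```python
from datetime import datetime, timedelta

REASSESS_EVERY = timedelta(minutes=15)  # mandatory reassessment interval
MIN_HYPOTHESES = 3                      # the three-hypothesis rule

class Incident:
    def __init__(self) -> None:
        self.hypotheses: list[str] = []
        self.last_reassessment = datetime.now()

    def add_hypothesis(self, h: str) -> None:
        self.hypotheses.append(h)

    def start_investigation(self) -> None:
        # Process gate: the team cannot anchor on a single theory.
        if len(self.hypotheses) < MIN_HYPOTHESES:
            raise RuntimeError(
                f"need {MIN_HYPOTHESES} competing hypotheses, have {len(self.hypotheses)}"
            )
        print("investigating:", self.hypotheses)

    def reassessment_due(self) -> bool:
        # The timer runs whether or not anyone feels biased.
        return datetime.now() - self.last_reassessment >= REASSESS_EVERY

inc = Incident()
inc.add_hypothesis("primary database is saturated")
inc.add_hypothesis("bad deploy in the checkout service")
inc.add_hypothesis("upstream DNS resolution is flaky")
inc.start_investigation()
print("reassessment due:", inc.reassessment_due())
```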