
How Incident Response Actually Works

  • lesson
  • incident-command
  • triage
  • communication
  • cognitive-biases
  • postmortems
  • blameless-culture

Topics: incident command, triage, communication, cognitive biases, postmortems, blameless culture
Level: L1–L2 (Foundations → Operations)
Time: 60–75 minutes
Prerequisites: None


The Mission

It's 2:47 AM. PagerDuty fires. Your service is down. Users are affected. What do you do in the first 60 seconds? The first 5 minutes? The first hour?

Most engineers have never been formally taught incident response. They learn by osmosis — watching senior engineers during real incidents, absorbing habits both good and bad. This lesson teaches the structured approach that companies like Google, PagerDuty, and Netflix use, plus the cognitive traps that make incidents worse.


The First 60 Seconds: AAVCE

When the page fires, follow AAVCE:

  1. Acknowledge — Stop the escalation timer. Tell the system you're on it.
  2. Assess — Read the alert. What service? What symptom? What severity?
  3. Verify — Is it real? Check a second data source. Don't trust one metric.
  4. Communicate — Open an incident channel. Post what you know (even if it's little).
  5. Escalate — If you can't fix it in 15 minutes, escalate NOW, not later.
Example timeline:

2:47 AM — PagerDuty fires: "API error rate > 5%"
2:47:15 — Acknowledge (stop escalation timer)
2:47:30 — Read alert. API service. Error rate 8%. Started 2:43.
2:48 — Verify: check Grafana dashboard. Confirm errors are real, not a metric glitch.
2:48:30 — Open #incident-2026-0322 in Slack. Post:
           "Investigating API error rate spike. 8% errors since 2:43. Checking."
2:49 — Check recent deploys. Check recent config changes. Check dependency health.

Mental Model: The first 5 minutes of an incident set the trajectory for the entire response. A structured start — acknowledge, assess, communicate — prevents the chaos spiral of multiple people investigating independently, duplicating work, and nobody coordinating.
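The AAVCE sequence above can be sketched as a set of deadlines relative to the page. This is an illustrative sketch; the cutoffs mirror this lesson's suggestions, not any universal standard.

```python
from datetime import datetime, timedelta

# Each AAVCE step with a latest-acceptable offset from the page time.
# These offsets are this lesson's suggestions, not a standard.
AAVCE_STEPS = [
    ("Acknowledge", timedelta(minutes=1)),   # stop the escalation timer
    ("Assess",      timedelta(minutes=2)),   # service, symptom, severity
    ("Verify",      timedelta(minutes=3)),   # confirm with a second data source
    ("Communicate", timedelta(minutes=5)),   # incident channel open, status posted
    ("Escalate",    timedelta(minutes=15)),  # escalate if not fixable by now
]

def aavce_deadlines(paged_at: datetime) -> dict:
    """Return the latest time each AAVCE step should be done by."""
    return {step: paged_at + delta for step, delta in AAVCE_STEPS}

# Paged at 2:47 AM → escalate no later than 3:02 AM.
deadlines = aavce_deadlines(datetime(2026, 3, 22, 2, 47))
```

Encoding the 15-minute escalation cutoff as a concrete timestamp matters: "escalate eventually" drifts, while "escalate by 3:02" is checkable.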


Roles: Separate Coordination from Investigation

In a serious incident, one person cannot both investigate AND coordinate. Assign roles:

| Role | Responsibility | Does NOT |
|---|---|---|
| Incident Commander (IC) | Coordinates, assigns tasks, makes decisions | Debug code, SSH into servers |
| Technical Lead | Hands on keyboard, investigates, implements fixes | Talk to customers, update statuspage |
| Communications Lead | Updates statuspage, writes user-facing messages | Investigate technical details |
| Scribe | Records timeline, actions taken, decisions made | Make decisions or investigate |

For small teams (1–2 people on-call), one person wears all hats. For serious incidents (SEV-1, multiple teams involved), separating roles prevents the IC from getting tunnel-visioned on one theory.

Name Origin: The Incident Command System (ICS) was developed by California firefighters in the 1970s after wildfire coordination failures killed people. The FIRESCOPE program created a structured hierarchy for emergency response. Tech companies adapted it in the 2010s, led by PagerDuty and Google SRE. The discipline translates perfectly: in emergencies, unclear roles cost uptime (or lives).

The IC's job: "Coordinate, don't operate"

The IC does NOT:
  • Debug code
  • SSH into servers
  • Write queries
  • Get tunnel-visioned on one theory

The IC DOES:
  • Assign investigation tasks to specific people
  • Set 15-minute check-in timers
  • Ask "what do we know?" and "what have we tried?"
  • Decide when to escalate, when to rollback, when to communicate
  • Shut down blame language ("we're fixing, not faulting")


Severity: How Bad Is It?

Is the service completely down?
  YES → SEV-1 (all hands, page everyone)

Are >10% of users affected without workaround?
  YES → SEV-2 (incident team, active investigation)

Are some users affected?
  YES → SEV-3 (on-call investigates, normal hours follow-up)

No users affected?
  SEV-4 (log it, fix in sprint)

| Severity | Response | Communication | Escalation |
|---|---|---|---|
| SEV-1 | All-hands war room | External statuspage + customer email | VP notified |
| SEV-2 | Incident team assembled | External statuspage | Manager notified |
| SEV-3 | On-call investigates | Internal Slack | Team lead notified |
| SEV-4 | Normal priority | Jira ticket | None |
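The decision tree above is mechanical enough to write down as code. A minimal sketch, assuming the thresholds in this lesson's table (real organizations tune their own):

```python
def classify_severity(fully_down: bool, pct_users_affected: float,
                      workaround_exists: bool) -> int:
    """Map this lesson's severity decision tree to a SEV level (sketch)."""
    if fully_down:
        return 1  # all hands, page everyone
    if pct_users_affected > 10 and not workaround_exists:
        return 2  # incident team, active investigation
    if pct_users_affected > 0:
        return 3  # on-call investigates, normal-hours follow-up
    return 4      # log it, fix in sprint
```

For example, `classify_severity(False, 15, False)` returns 2: more than 10% of users affected with no workaround, but the service is not fully down.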

The 3R Mitigation Priority

When the system is broken, your first job is to restore service — not find root cause.

  1. Rollback — Revert the last deploy. This resolves ~60% of SEV-1s.
  2. Restart — Bounce the service. Clears transient state.
  3. Rescale — Add capacity (more pods, bigger instance).

If none of these work, then investigate. Root cause analysis can wait — user impact cannot.

Gotcha: "But we need to understand what happened!" — Yes, after the fire is out. Investigating root cause during an active incident often means the outage lasts longer. Rollback first, ask questions later. The logs, metrics, and state will still be there.
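The 3R priority can be expressed as a simple loop: try each mitigation in order and stop as soon as the service recovers. The mitigation callables below are placeholders; in practice they would shell out to your deploy or orchestration tooling (a rollback script, a service restart, an autoscaler bump).

```python
def run_3r(mitigations, service_healthy):
    """Try mitigations in priority order; stop when the service recovers.
    Returns the name of the mitigation that worked, or None."""
    for name, action in mitigations:
        action()
        if service_healthy():
            return name
    return None  # none of the 3Rs worked: now investigate

# Toy usage: pretend only the restart fixes this incident.
state = {"healthy": False}
attempts = []
mitigations = [
    ("rollback", lambda: attempts.append("rollback")),
    ("restart",  lambda: (attempts.append("restart"), state.update(healthy=True))),
    ("rescale",  lambda: attempts.append("rescale")),
]
result = run_3r(mitigations, lambda: state["healthy"])
# result == "restart"; rescale was never attempted
```

The point of the ordering is cost: rollback is cheapest and fixes the most SEV-1s, so it goes first; rescale is last because added capacity rarely fixes a correctness bug.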


The Cognitive Traps

Incidents make your brain worse at its job. Knowing the traps helps you avoid them.

Anchoring bias

The first theory you hear becomes the anchor for everything that follows. "I think it's the database" at minute 2 means the team investigates the database for an hour — even when evidence points elsewhere.

Counter-measure: Write 3 hypotheses before investigating. Revisit them every 15 minutes.

Confirmation bias

You look for evidence supporting your theory and ignore contradicting evidence. "The deploy caused it" — you find one error near deploy time and ignore that the same errors existed yesterday.

Counter-measure: Ask "what would prove me wrong?" If you can't answer, you're confirming, not investigating.

Sunk cost fallacy

You've spent 45 minutes on a theory. You don't want to "waste" that time by switching. So you keep going despite weak evidence.

Counter-measure: 15-minute timer. After each 15 minutes: "What have I learned? Is this still the best path?" Past time is irrelevant to the correct next action.

The "just reboot it" impulse

Tempting because it's a known action with a predictable outcome. But it destroys evidence: process state, memory contents, connection state, the contents of /proc. And it masks the root cause — you'll be back at 3 AM next week.

When rebooting IS correct: Kernel panic. Unresponsive kernel. Hardware error requiring reinitialization. Documented bug with reboot-as-fix.

The HiPPO effect

The Highest Paid Person's Opinion. VP joins the bridge call, states a theory, and everyone stops their investigation to follow the VP's hunch. The VP hasn't seen any of the debugging data.

Counter-measure: IC runs the investigation. VP's role: remove blockers and authorize resources. Not technical direction.


Communication During Incidents

Internal (war room)

[2:52] IC: "Status check. @alice what have you found?"
[2:53] Alice: "Database connections look normal. I can rule out the DB."
[2:53] IC: "@bob, what about the deploy at 2:30?"
[2:54] Bob: "Deploy was a config change to logging. Doesn't touch request path.
             But I see the same error pattern started at 2:43, 13 minutes after deploy."
[2:55] IC: "Let's check what else changed at 2:43. @carol, check CloudTrail
            for any infra changes in that window."

Good incident communication:
  • Short, factual statements
  • Share findings, not theories (unless clearly labeled)
  • IC polls for updates — don't wait to be asked
  • "I can rule out X" is as valuable as "I found Y"

External (statuspage)

[2:55] Investigating: We are investigating elevated error rates on the API.
[3:15] Identified: The issue has been identified as [brief description].
       We are implementing a fix.
[3:45] Resolved: The issue has been resolved. [Brief explanation].
       We will publish a full incident report within 48 hours.

Rules:
  • Update every 15–30 minutes, even if it's "still investigating"
  • Never lie ("a small number of users" when 80% are affected)
  • Don't speculate on cause until confirmed
  • Promise and deliver a follow-up report


The Postmortem: Fixing Systems, Not People

After the incident is resolved, write a postmortem. The goal: prevent this class of incident from recurring. Not: assign blame.

The blameless approach

Engineers don't make mistakes because they're careless. They make mistakes because the system placed them in conditions where that mistake was the most likely outcome given the information they had.

Fix the person → one less person. Fix the system → prevent the entire class of future incidents.

Postmortem structure

  1. Timeline — Minute-by-minute, factual, no judgment. "At 14:23, the engineer concluded from available metrics that the service was healthy" — not "the engineer incorrectly assumed."

  2. Contributing factors (not "root cause") — Multiple systemic factors, not one human error. "Migration not tested against production-sized data" + "No timeout protection" + "Health check didn't verify database connectivity."

  3. What went well — Acknowledge effective responses. Rollback was fast. Communication was clear. This encourages good behavior.

  4. What went poorly — Detection was slow. Runbook was outdated. Escalation was delayed.

  5. Action items — Specific, assigned, time-bounded. "Improve monitoring" is not actionable. "Add alert for connection pool utilization > 80%, assigned to @alice, due April 1" is.

Gotcha: The most common postmortem failure: excellent analysis, action items created, then nobody tracks them. Within 6 weeks, forgotten. Same incident recurs. Move action items to your issue tracker with owners and due dates — not a Google Doc nobody re-reads.
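One cheap guardrail is to scan action items for missed due dates. A hedged sketch: the line format below ("task (@owner, due YYYY-MM-DD)") is invented for this example; in practice you would query your issue tracker's API instead of parsing a document.

```python
import re
from datetime import date

# Hypothetical action-item line format for this sketch:
#   "<task> (@<owner>, due <YYYY-MM-DD>)"
ITEM_RE = re.compile(r"^(?P<task>.+?) \(@(?P<owner>\w+), due (?P<due>\d{4}-\d{2}-\d{2})\)$")

def overdue_items(lines, today):
    """Return (owner, task) for every action item whose due date has passed."""
    overdue = []
    for line in lines:
        m = ITEM_RE.match(line.strip())
        if m and date.fromisoformat(m["due"]) < today:
            overdue.append((m["owner"], m["task"]))
    return overdue

items = [
    "Add alert for connection pool utilization > 80% (@alice, due 2026-04-01)",
    "Change Sentinel quorum from 1 to 2 (@bob, due 2026-01-15)",
]
late = overdue_items(items, date(2026, 3, 22))
# late == [("bob", "Change Sentinel quorum from 1 to 2")]
```

Even this much automation would have surfaced the Redis quorum fix in the war story below before the incident repeated.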

War Story: A team wrote 73 postmortems over 2 years with 219 action items. Completion rate: 15.5%. Eleven incidents repeated. Three repeated three times. In one case, a Redis split-brain postmortem documented the exact fix (change Sentinel quorum from 1 to 2). The action item sat unfinished for 3 months. The exact same split-brain happened again — 90 minutes of data loss, 42,000 users logged out. In the bridge call, someone read the January postmortem aloud. Silence. Then: "I thought we fixed that."


Flashcard Check

Q1: What does the IC do during an incident?

Coordinates, assigns tasks, polls for updates, makes decisions (rollback/escalate). Does NOT debug, SSH, or write queries.

Q2: AAVCE — what does each letter stand for?

Acknowledge (stop escalation), Assess (read the alert), Verify (second data source), Communicate (open incident channel), Escalate (if not fixable in 15 min).

Q3: What are the 3Rs of mitigation?

Rollback, Restart, Rescale — in that order. Restore service first, investigate second.

Q4: Anchoring bias — what is it and how do you counter it?

The first theory heard dominates all investigation. Counter: write 3 hypotheses before starting. Revisit every 15 minutes.

Q5: Why is "root cause" the wrong framing?

Incidents have multiple contributing factors, not one cause. "Root cause = human error" leads to blame. "Contributing factors = systemic gaps" leads to prevention.

Q6: Postmortem action items — what makes them actually get done?

Move to your issue tracker (Jira, Linear, GitHub Issues) with an owner, a due date, and SLA tracking. Not a Google Doc — those get abandoned.


Exercises

Exercise 1: Run a tabletop incident (team exercise)

Pick a scenario from the war stories in this training library. Set a 30-minute timer. Assign IC, Technical Lead, Communications. Work through AAVCE, the 3Rs, and write a mock statuspage update.

Debrief: What worked? What felt awkward? Where did roles overlap?

Exercise 2: Write a postmortem (practice)

Take this incident summary and write a postmortem:

At 3 PM, a database migration added an index to a 50-million-row table. The migration took 45 minutes. During that time, the table was locked. All queries to that table timed out. The API returned 503 errors. The on-call engineer was paged at 3:12 PM. They identified the lock at 3:25 PM and killed the migration. Service restored at 3:28 PM. The index was later added during a maintenance window with CREATE INDEX CONCURRENTLY.

One approach:

**Timeline:**
  • 15:00 — Migration started (CREATE INDEX on users table, 50M rows)
  • 15:00 — Table locked (ACCESS EXCLUSIVE)
  • 15:02 — First query timeouts
  • 15:05 — Error rate crosses 5%
  • 15:12 — PagerDuty alert fires, on-call acknowledges
  • 15:15 — IC identifies 503 errors from API, suspects database
  • 15:25 — `pg_stat_activity` shows CREATE INDEX holding lock for 25 minutes
  • 15:26 — `SELECT pg_terminate_backend(pid)` kills the migration
  • 15:28 — Queries resume, error rate normalizes

**Contributing factors:**
  1. Migration not tested against production-sized table (staging had 10K rows)
  2. No timeout on migration Jobs (ran indefinitely)
  3. No pre-migration check for table size / estimated lock duration
  4. Alert took 12 minutes to fire (`for: 10m` + scrape delay)

**Action items:**
- [ ] Add pre-migration table size check to migration tooling (@bob, Apr 5)
- [ ] Use `CREATE INDEX CONCURRENTLY` for tables > 1M rows (@alice, Apr 1)
- [ ] Add query timeout for migration Jobs: 10 minutes (@carol, Apr 5)
- [ ] Reduce alert `for:` duration to 3m for 503 errors (@dave, Mar 28)

Cheat Sheet

AAVCE (First 60 Seconds)

  1. Acknowledge the page
  2. Assess the alert (service, symptom, severity)
  3. Verify with second data source
  4. Communicate (open incident channel, post status)
  5. Escalate if not fixable in 15 minutes

3R Mitigation

  1. Rollback (last deploy)
  2. Restart (bounce service)
  3. Rescale (add capacity)

Severity Quick Reference

| SEV | Who's affected | Response |
|---|---|---|
| 1 | Everyone | All hands, statuspage, VP notified |
| 2 | >10% of users | Incident team, statuspage |
| 3 | Some users | On-call, internal comms |
| 4 | No users | Sprint work |

Cognitive Trap Countermeasures

| Trap | Counter |
|---|---|
| Anchoring | 3 hypotheses before investigating |
| Confirmation | "What would prove me wrong?" |
| Sunk cost | 15-minute re-evaluation timer |
| HiPPO | IC runs investigation, not VP |

Takeaways

  1. Structure beats heroics. AAVCE in the first 60 seconds prevents chaos. Roles prevent duplication. Communication prevents isolation.

  2. Restore first, investigate second. The 3Rs (Rollback, Restart, Rescale) fix most incidents. Root cause analysis can happen after users are unblocked.

  3. Your brain is worse during incidents. Anchoring, confirmation bias, sunk cost, and HiPPO all get worse under stress. Structured protocols compensate.

  4. Blameless postmortems fix systems. Contributing factors, not root cause. Action items with owners, not aspirational improvement goals.

  5. Track postmortem actions like bugs. If they're not in the issue tracker with deadlines, they won't get done. And the incident will repeat.


Related Lessons

  • Prometheus and the Art of Not Alerting — what triggers the page
  • The Cascading Timeout — a common incident pattern
  • The Mysterious Latency Spike — how to investigate once the page fires