Incident Command & On-Call Footguns

Mistakes that turn your incident response into a circus, your on-call into a burnout machine, and your communication into panic.


1. Everyone joins the war room and nobody leads

SEV-1 fires. Fifteen engineers pile into the Slack channel. Everyone starts investigating their own theory. Three people SSH into the same server. Two people attempt conflicting remediations simultaneously. Forty-five minutes later, someone asks "wait, who's running this?"

Fix: First person to respond declares themselves Incident Commander (IC) and posts the incident template. If you're not the IC, don't act without coordination. The IC assigns investigation threads to specific people. No freelancing.

War story: The 2017 AWS S3 outage lasted roughly four hours, far longer than the triggering mistake (a mistyped command during routine debugging) warranted, in part because recovery required restarting index subsystems that hadn't been fully restarted in years. Incidents at that scale are exactly where a single coordinating IC, sequencing investigation and recovery, saves hours.


2. Statuspage silence during a major outage

Your service has been down for 40 minutes. Customers are tweeting. Your support team is fielding calls saying "we don't know anything." Nobody updated the statuspage because everyone is focused on fixing the problem. Customer trust is now damaged beyond what the outage itself caused.

Fix: Assign a Communications Lead in the first 5 minutes. Their only job is external updates. First statuspage update within 10 minutes of incident declaration, then every 15 minutes. "We are investigating" is better than silence.

Under the hood: The "silence penalty" is real. The content of the first update matters less than its timing: "We are aware of an issue affecting X and are investigating," posted promptly, is commonly reported to cut inbound support ticket volume by 30-40%, while teams that go 30+ minutes without a first update pay more in customer trust than the outage itself costs. Atlassian's own Statuspage guidance makes the same point: acknowledge first, explain later.
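The cadence above (first update within 10 minutes of declaration, then every 15) is mechanical enough to encode. A minimal sketch the Comms Lead's tooling could call; the threshold values are the ones from this section, not universal constants:

```python
from datetime import datetime, timedelta

FIRST_UPDATE_DEADLINE = timedelta(minutes=10)  # first post after declaration
UPDATE_INTERVAL = timedelta(minutes=15)        # cadence thereafter

def next_update_due(declared_at, last_update_at=None):
    """Return the time by which the Communications Lead must post next."""
    if last_update_at is None:
        return declared_at + FIRST_UPDATE_DEADLINE
    return last_update_at + UPDATE_INTERVAL

def update_overdue(declared_at, last_update_at, now):
    """True when the statuspage is already in silence-penalty territory."""
    return now >= next_update_due(declared_at, last_update_at)
```

Wire `update_overdue` into a bot that nags the incident channel and the cadence stops depending on anyone remembering it mid-firefight.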


3. PagerDuty routing to email

Your alerting integration sends critical alerts to email. The on-call engineer has 2,000 unread emails. The page arrives at 2am and sits in the inbox until 8am. Six hours of undetected outage.

Fix: Critical alerts must use push notification + SMS + phone call. Never email for anything that needs a response in minutes. Configure escalation policies that call a phone if the push notification isn't acked within 5 minutes.

Default trap: PagerDuty's out-of-the-box notification rules for a new user typically start with email and push only: no SMS, no phone call. Every new on-call engineer must configure their own notification rules, or they'll sleep through 3am pages.
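This is auditable. A sketch that flags users missing high-urgency channels, assuming rule dicts shaped roughly like PagerDuty's REST API notification-rule objects (an `urgency` field plus a nested `contact_method` with a `type`); verify the exact field names against the API you actually use:

```python
# Channels every on-call engineer should have for high-urgency alerts.
REQUIRED = {"push_notification_contact_method",
            "sms_contact_method",
            "phone_contact_method"}

def missing_channels(notification_rules):
    """Return the high-urgency contact channels a user has NOT configured."""
    configured = {r["contact_method"]["type"]
                  for r in notification_rules
                  if r.get("urgency") == "high"}
    return sorted(REQUIRED - configured)
```

Run it against every user on the schedule during onboarding, not after the first missed page.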


4. On-call rotation with two people

Your team has two engineers sharing on-call. Each person is on-call every other week. After three months, both are exhausted and resentful. One quits. The other is now on-call 24/7 indefinitely.

Fix: Minimum viable on-call rotation is four people (one week on, three weeks off). If your team is smaller than four, share on-call across teams or use a follow-the-sun model. Two-person rotations are a staffing emergency, not a plan.


5. No escalation criteria defined

On-call engineer has been fighting an issue for two hours alone. They don't want to "bother" anyone. The issue is a SEV-1 affecting thousands of users. By the time they escalate, the damage is massive and avoidable.

Fix: Escalation criteria must be explicit and time-boxed. "If you can't identify root cause in 15 minutes, page secondary. If customer impact exceeds 30 minutes, page the manager." Write these rules down. Repeat them at every on-call handoff. Escalation is not weakness.
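Written rules this explicit can live in the paging tooling itself. A sketch using the two example thresholds from the text (15 minutes without root cause, 30 minutes of customer impact); the target names are placeholders:

```python
def escalation_targets(minutes_investigating, minutes_of_customer_impact,
                       root_cause_identified):
    """Apply the time-boxed escalation rules; thresholds are the example
    values from this section, tune per team."""
    targets = []
    if not root_cause_identified and minutes_investigating >= 15:
        targets.append("secondary-oncall")
    if minutes_of_customer_impact >= 30:
        targets.append("engineering-manager")
    return targets
```

The point of encoding it: escalation becomes a rule firing, not a judgment call a tired engineer has to make against their own reluctance to "bother" anyone.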


6. Runbook says "contact the team" with no names

Your runbook for database failover says "Contact the database team for assistance." It's 3am. Who is on the database team? What's their PagerDuty schedule? Are they even on-call? The on-call engineer spends 20 minutes finding a name while the database is down.

Fix: Runbooks must contain specific escalation targets: PagerDuty service name, schedule ID, or specific humans with phone numbers. "Contact the database team" is not actionable at 3am. "@alice (primary) or @bob (secondary) via PagerDuty service DB-Oncall" is actionable.
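Vague escalation text is also lintable in CI. A hypothetical runbook linter; the pattern list is a starting point to extend with your org's favorite non-answers:

```python
import re

# Phrases that are not actionable at 3am.
VAGUE_PATTERNS = [
    r"contact the \w+ team",
    r"reach out to",
    r"ask (someone|somebody)",
]

def lint_runbook(text):
    """Return the vague escalation phrases found in a runbook's text."""
    hits = []
    for pattern in VAGUE_PATTERNS:
        hits.extend(m.group(0) for m in re.finditer(pattern, text, re.IGNORECASE))
    return hits
```

Fail the runbook's pull request when `lint_runbook` returns anything, and the 3am name-hunt gets fixed at review time instead.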


7. Investigating the symptom, not checking for recent changes

Alert fires: high error rate. On-call dives into log analysis, database query plans, network traces. Forty minutes later, someone asks "did anyone deploy today?" Yes. Deployed 30 minutes before the alert. Rollback takes 3 minutes. Total wasted investigation time: 37 minutes.

Fix: First question in every incident: "Was anything deployed or changed in the last 4 hours?" Check deploy history before deep-diving into root cause. The majority of incidents correlate with recent changes. Rollback first, investigate second.

Remember: Google's SRE book estimates that roughly 70% of production incidents are triggered by a change to a running system. Three questions answer most SEV-1s faster than any debugging tool: (1) What changed? (2) When? (3) Can we roll it back?
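The "what changed?" check is one query against your deploy history. A sketch, assuming deploy records are dicts with `service` and `deployed_at` fields (adapt to whatever your deploy system actually emits):

```python
from datetime import datetime, timedelta

CHANGE_WINDOW = timedelta(hours=4)  # "anything changed in the last 4 hours?"

def recent_changes(alert_time, deploys):
    """Return deploys inside the change window before the alert,
    newest first: these are the rollback candidates."""
    candidates = [d for d in deploys
                  if alert_time - CHANGE_WINDOW <= d["deployed_at"] <= alert_time]
    return sorted(candidates, key=lambda d: d["deployed_at"], reverse=True)
```

Better still, have the alert itself post this list into the incident channel so nobody has to remember to ask.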


8. On-call with no access to fix things

On-call engineer gets paged at 2am. They can see the problem in monitoring but don't have permissions to restart the service, access the database, or deploy a fix. They spend 30 minutes trying to find someone with the right access.

Fix: On-call engineers must have break-glass access to every system they're responsible for. Pre-provision credentials, VPN access, SSH keys, and cloud console access before the rotation starts. Test access during the handoff, not during the incident.

Gotcha: AWS STS session tokens are temporary by design; assumed-role sessions default to 1 hour and max out at 12. If your break-glass procedure relies on credentials generated ahead of time, they may already be expired when the 2am page arrives. Break-glass should assume the role at page time from a long-lived base identity, not stash pre-generated session tokens.
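A small sketch of the decision that belongs at the top of any break-glass script: check whether cached temporary credentials will outlive the incident before trusting them. The 10-minute margin is an assumed value; `expiration` is the `Expiration` timestamp STS returns alongside the credentials:

```python
from datetime import datetime, timedelta, timezone

# Safety margin: treat credentials expiring soon as already dead,
# so they don't die mid-remediation.
EXPIRY_MARGIN = timedelta(minutes=10)

def credentials_usable(expiration, now=None):
    """True only if cached temporary credentials outlive the safety margin."""
    now = now or datetime.now(timezone.utc)
    return expiration - now > EXPIRY_MARGIN

# At page time: if not credentials_usable(cached["Expiration"]), call
# sts.assume_role(...) again from the long-lived base identity instead of
# relying on a token generated hours earlier.
```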


9. Post-incident review that never happens

Incident resolves. Everyone goes back to feature work. The postmortem is "scheduled for next week." Next week it's bumped. Then forgotten. Three months later, the same failure happens. Nobody remembers the first incident clearly enough to write a useful retrospective.

Fix: Postmortem draft within 48 hours. Review meeting within one week. No exceptions. Put it on the calendar during incident resolution, not after. The IC's last act is scheduling the postmortem review and assigning the author.

Remember: John Allspaw's work on blameless postmortems at Etsy argued that the most valuable part of a postmortem isn't the action items but the narrative. How did people make decisions with the information they had? What did the system look like from the operator's perspective? That narrative is the organizational learning. Action items without narrative are just a task list.


10. Treating on-call as free labor

Engineers are on-call 24/7 with no additional compensation, no comp time, and no reduction in sprint commitments. They're expected to deliver the same feature output while also responding to 3am pages. The best engineers transfer to teams without on-call. Your on-call rotation is now staffed by the people who couldn't leave.

Fix: On-call must be compensated — stipend, comp time, or reduced sprint load. After a night page, the next day is light duty. On-call load factors into sprint planning: on-call engineers commit to less. This isn't a perk; it's recognizing that on-call is real work.