Pattern: Missing Escalation Criteria
ID: FP-051 | Family: Human Error Amplifier | Frequency: Very Common | Blast Radius: Incident (extended duration) | Detection Difficulty: Obvious (in postmortem)
The Shape
An on-call engineer is paged for an incident. Not wanting to "bother" colleagues at 3am, they work alone to diagnose and fix it. Without clear escalation criteria (for example, "if you have not made progress in 30 minutes, page your manager"), the engineer fights a losing battle for hours. A problem that needs two people with specialized knowledge takes one generalist much longer, or never gets resolved. The "culture of not wanting to escalate" is a system design failure, not a personal failing.
How You'll See It
In Incident Command
Incident timeline: paged at 2am. Engineer works alone until 4:30am (2.5 hours). No progress. Escalates. Specialist joins at 4:45am. Issue resolved at 5:15am (30 minutes with specialist). Total impact: 3 hours 15 minutes. With proper escalation at 30 minutes, impact would have been ~75 minutes (30 minutes solo, 15 minutes for the specialist to join, 30 minutes to fix). The extra 2 hours of customer impact was preventable.
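The arithmetic behind this timeline can be checked with a short sketch. It assumes the specialist's 15-minute join delay and 30-minute fix time would have been the same had escalation happened at the 30-minute mark; the calendar date and all names below are arbitrary and used only for illustration.

```python
from datetime import datetime, timedelta


def impact_minutes(start: datetime, end: datetime) -> int:
    """Whole minutes of customer impact between two timestamps."""
    return int((end - start).total_seconds() // 60)


# Timeline from the incident above (the date itself is arbitrary).
paged = datetime(2024, 1, 1, 2, 0)
escalated = datetime(2024, 1, 1, 4, 30)
specialist_joined = datetime(2024, 1, 1, 4, 45)
resolved = datetime(2024, 1, 1, 5, 15)

actual = impact_minutes(paged, resolved)

# Counterfactual: escalate after 30 minutes, and assume the same
# join delay and the same fix time once the specialist is present.
join_delay = specialist_joined - escalated
fix_time = resolved - specialist_joined
counterfactual_resolved = paged + timedelta(minutes=30) + join_delay + fix_time
counterfactual = impact_minutes(paged, counterfactual_resolved)

preventable = actual - counterfactual

print(f"actual: {actual} min, with 30-min rule: {counterfactual} min, "
      f"preventable: {preventable} min")
```

Running the same calculation over past incidents is one way to put a number on what a missing escalation rule has cost.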
In Kubernetes
A Kubernetes networking issue requires CNI knowledge. The on-call engineer is a generalist. Without escalation criteria, they debug networking for 2 hours (checking pods, checking services, checking DNS) without touching the CNI layer. A CNI specialist would have identified the issue in 10 minutes.
In CI/CD
A build system outage requires infra knowledge. The on-call developer tries restarting services for 1 hour. Without escalation criteria, they don't call the infra team. The infra team's monitoring shows the same issue, but they don't know anyone is fighting it because no incident channel was created.
The Tell
The incident postmortem shows 1-2 hours of single-engineer work before escalation. The escalation point was "when the engineer gave up," not "when a predefined criterion was met." The issue was resolved quickly after escalation. The runbook had no section on when or how to escalate.
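As a rough illustration, the tell can be expressed as a check over postmortem timelines. This is a minimal sketch: the `PostmortemTimeline` fields, the 60-minute solo threshold, and the "resolved in at most half the solo time" ratio are all assumptions chosen for the example, not values defined by this pattern.

```python
from dataclasses import dataclass


@dataclass
class PostmortemTimeline:
    minutes_solo: int              # paged -> first escalation
    minutes_after_escalation: int  # first escalation -> resolved
    criterion_triggered: bool      # True if a predefined rule caused the escalation


def matches_tell(pm: PostmortemTimeline,
                 solo_threshold: int = 60,
                 quick_ratio: float = 0.5) -> bool:
    """Flag incidents with long solo work, ad hoc escalation, and a fast fix afterwards."""
    long_solo = pm.minutes_solo >= solo_threshold
    quick_after = pm.minutes_after_escalation <= pm.minutes_solo * quick_ratio
    ad_hoc = not pm.criterion_triggered
    return long_solo and quick_after and ad_hoc


# 90 minutes alone, then a 20-minute fix after an ad hoc escalation.
print(matches_tell(PostmortemTimeline(90, 20, criterion_triggered=False)))
```

Incidents flagged this way are candidates for a closer look at whether the runbook gave any escalation guidance.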
Common Misdiagnosis
| Looks Like | But Actually | How to Tell the Difference |
|---|---|---|
| Engineer incompetence | System design failure (no escalation path) | The engineer did everything right; the process gave them no guidance on when to get help |
| Slow incident response | Missing escalation criteria | Response was fast; escalation was delayed by lack of process |
The Fix (Generic)
- Immediate: Escalate now; create an incident channel; invite anyone who might be relevant.
- Short-term: Add escalation criteria to all runbooks: "If not resolved within 30 minutes, escalate to [team/person]." Define severity levels with automatic escalation at each level.
- Long-term: Define SEV levels (SEV1: data loss/full outage, SEV2: degraded service, SEV3: partial impact), each with its own automatic escalation timer. Configure the incident management system (e.g., PagerDuty) to escalate automatically when an incident is not acknowledged within the timer for its level.
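The severity levels and timers described above could be written down as data, so that "when to escalate" becomes a lookup and not a judgment call at 3am. The sketch below is hypothetical: only the SEV definitions and the 30-minute figure come from this page; the other timer values, the target names, and the `due_escalations` helper are assumptions for illustration, and none of this is PagerDuty's API.

```python
from dataclasses import dataclass
from enum import Enum


class Severity(Enum):
    SEV1 = "data loss / full outage"
    SEV2 = "degraded service"
    SEV3 = "partial impact"


@dataclass(frozen=True)
class EscalationStep:
    after_minutes: int  # minutes without measurable progress
    target: str         # hypothetical role or channel to page


# Hypothetical policy; timers and targets are placeholders to adapt per team.
POLICY = {
    Severity.SEV1: [EscalationStep(0, "incident-commander"),
                    EscalationStep(15, "specialist-on-call")],
    Severity.SEV2: [EscalationStep(30, "specialist-on-call"),
                    EscalationStep(60, "engineering-manager")],
    Severity.SEV3: [EscalationStep(30, "team-channel")],
}


def due_escalations(severity: Severity, minutes_without_progress: int) -> list[str]:
    """Return every target that should already have been paged."""
    return [step.target
            for step in POLICY[severity]
            if minutes_without_progress >= step.after_minutes]


print(due_escalations(Severity.SEV2, 45))
```

Keeping the policy as plain data makes it easy to review in a postmortem and to mirror in whatever paging tool the team uses.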
Real-World Examples
- Example 1: DBA on-call worked a Postgres issue alone for 3 hours at 3am. The issue required OS-level expertise (disk I/O stall, FP-008). A simple escalation to the infrastructure team would have identified the disk issue in 15 minutes.
- Example 2: Kubernetes networking issue. On-call engineer didn't escalate "because it seemed solvable." After 90 minutes, escalated to CNI expert. Issue resolved in 20 minutes. Total downtime: 110 minutes. With 30-minute escalation rule: 50 minutes.
War Story
I got paged at 11pm. I was convinced I could fix it. I really was. I worked alone until 3:30am (4.5 hours). Didn't escalate because "I didn't want to wake anyone." At 3:30am, I was out of ideas and paged my manager. My manager paged the database specialist. The specialist joined at 4am and fixed it in 22 minutes. Total downtime: about 5 hours 20 minutes. With a 30-minute escalation rule, it would have been under 90 minutes. The "culture of not wanting to escalate" is learned: you get positive reinforcement when you fix things alone. It's only in postmortems that you see the hidden cost. Our runbooks now say: "If you have not made measurable progress in 30 minutes, you MUST escalate. This is not optional."
Cross-References
- Topic Packs: incident-command
- Footguns: incident-command/footguns.md — "Missing alert escalation criteria"
- Related Patterns: FP-050 (runbook no contacts — often the reason escalation doesn't happen), FP-049 (port-forward as fix — same "incident prematurely closed" pattern)