Portal | Level: L2: Operations | Topics: On-Call & Incident Command, Incident Response | Domain: DevOps & Tooling
Incident Command & On-Call - Primer¶
Why This Matters¶
Every organization that runs production systems will have incidents. The question isn't whether — it's whether your response will be structured or chaotic. Incident command is the discipline of turning a panicked Slack channel into a coordinated response with clear roles, communication, and escalation. Good incident command doesn't prevent outages, but it dramatically reduces their duration and blast radius.
On-call is the other half: who gets woken up, how fast they respond, and whether they burn out doing it. A bad on-call rotation destroys morale faster than almost anything else in engineering. A good one is a sustainable system that distributes load fairly and gives people the tools to respond effectively.
Core Concepts¶
Name origin: Incident Command System (ICS) was developed by California firefighters in the 1970s after disastrous wildfires exposed coordination failures between agencies. FIRESCOPE (Firefighting Resources of California Organized for Potential Emergencies) created ICS to standardize roles, communication, and escalation. Tech companies adopted ICS principles in the 2010s, led by PagerDuty and Google SRE. The core insight translates perfectly: in an emergency, unclear roles and ad-hoc communication cost lives (or uptime).
1. Incident Severity Levels¶
Before you can respond to incidents, everyone must agree on what constitutes one. Severity levels create a shared language.
| Level | Customer Impact | Response | Communication | Example |
|---|---|---|---|---|
| SEV-1 | Major: service down or data loss | All hands, war room, exec notified | Statuspage + customer email | Complete outage, data corruption |
| SEV-2 | Significant: degraded for many users | On-call team + service owners | Statuspage update | 50% error rate, 10x latency |
| SEV-3 | Minor: small subset affected, workaround exists | On-call engineer | Internal Slack update | Single endpoint down, one region degraded |
| SEV-4 | Minimal: no customer impact | Best effort, next business day | Team channel | Internal tool broken, monitoring gap |
Severity Decision Tree:
Is the service completely down?
Yes → SEV-1
No → Are >10% of users affected without workaround?
Yes → SEV-2
No → Are any users affected?
Yes → SEV-3
No → SEV-4 (or not an incident)
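The decision tree above can be sketched as a small function. This is illustrative only: the thresholds mirror the tree, and the function name and parameters are not from any particular tool.

```python
def classify_severity(service_down: bool, pct_users_affected: float,
                      workaround_exists: bool) -> str:
    """Walk the severity decision tree from top to bottom.

    pct_users_affected is 0-100; a value of 0 means no users affected.
    """
    if service_down:
        return "SEV-1"  # complete outage
    if pct_users_affected > 10 and not workaround_exists:
        return "SEV-2"  # significant degradation, no workaround
    if pct_users_affected > 0:
        return "SEV-3"  # minor impact or workaround exists
    return "SEV-4"      # no customer impact (maybe not an incident at all)
```

Note the ordering matters: a 50% outage with a workaround still lands at SEV-3 under this tree, which is why teams often add a judgment override for the IC.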
2. Incident Commander Role¶
The Incident Commander (IC) runs the response. They do NOT debug the problem — they coordinate people, communication, and decisions.
IC Responsibilities:
┌─────────────────────────────────────────────────┐
│ Incident Commander │
│ │
│ □ Declare the incident and severity │
│ □ Open the war room (Slack channel + bridge) │
│ □ Assign roles (comms lead, technical lead) │
│ □ Set the investigation direction │
│ □ Decide when to escalate │
│ □ Approve risky remediation actions │
│ □ Call for statuspage updates at regular cadence │
│ □ Declare resolution │
│ □ Assign postmortem owner │
│ │
│ The IC does NOT: │
│ × Debug code │
│ × SSH into servers │
│ × Write queries │
│ × Get tunnel-visioned on one theory │
│ │
│ Remember: "Coordinate, don't operate." │
└─────────────────────────────────────────────────┘
3. Incident Roles¶
| Role | Responsibility | Who |
|---|---|---|
| Incident Commander (IC) | Coordinates response, makes decisions | Senior on-call or designated IC |
| Technical Lead | Drives investigation and remediation | Engineer closest to the problem |
| Communications Lead | Updates statuspage, customers, stakeholders | On-call comms person or PM |
| Scribe | Records timeline, decisions, actions in real-time | Anyone available |
War Room Structure:
Slack: #inc-20260315-api-latency
Bridge: Zoom/Meet link pinned in channel topic
IC: @alice (running the show)
Tech Lead: @bob (hands on keyboard)
Comms: @carol (statuspage + customer updates)
Scribe: @dave (timestamped notes in thread)
4. PagerDuty / OpsGenie Setup¶
Alerting tools route pages to the right person at the right time. The setup matters more than the tool choice.
Escalation Policy Structure:
Level 1: Primary on-call
→ 5-minute ack window
→ If not acked: escalate
Level 2: Secondary on-call
→ 5-minute ack window
→ If not acked: escalate
Level 3: Engineering manager
→ 10-minute ack window
→ If not acked: escalate
Level 4: VP Engineering (phone call)
→ This should almost never happen
→ If it does, your L1-L3 process is broken
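The worst-case timing of this policy is worth computing explicitly. A minimal sketch, assuming each level is paged the moment the previous level's ack window lapses (level names and windows taken from the policy above; the function itself is hypothetical):

```python
from datetime import datetime, timedelta

# (level, ack window in minutes); None = last line of defense, no further escalation
ACK_WINDOWS = [("primary", 5), ("secondary", 5), ("eng-manager", 10), ("vp-eng", None)]

def escalation_timeline(alert_time: datetime) -> list[tuple[str, datetime]]:
    """Return (level, page_time) pairs: when each level is paged
    if nobody ever acknowledges."""
    timeline = []
    t = alert_time
    for level, window in ACK_WINDOWS:
        timeline.append((level, t))
        if window is None:
            break
        t = t + timedelta(minutes=window)
    return timeline
```

Under these windows an entirely unacknowledged alert reaches the VP 20 minutes after it fires, which is the number you should sanity-check against your SEV-1 response target.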
| Configuration | Recommended Setting | Why |
|---|---|---|
| Ack timeout | 5 minutes | Long enough for someone to wake up, short enough to not waste time |
| Escalation timeout | 5 minutes after ack timeout | Don't let a missed ack block the response |
| Auto-resolve | 30 minutes after alert clears | Prevents stale incidents clogging the queue |
| De-duplication | 5-minute window | Same alert firing 10x should be one page, not ten |
| Low-urgency hours | Business hours only | Non-critical alerts should not wake people up |
| High-urgency delivery | Push + SMS + phone call | Critical alerts must break through Do Not Disturb |
War story: In 2017, GitLab experienced a major database outage in which an engineer accidentally ran `rm -rf` on a production database directory during incident response. Five out of five backup methods failed to produce a usable restore. The incident became a landmark case study in incident response because GitLab live-streamed the recovery on YouTube, published a brutally honest postmortem, and used the experience to rebuild their backup validation process from the ground up. Key lesson: test your backups by actually restoring from them, not just by verifying they exist.
5. On-Call Rotations¶
Rotation Design:
Primary: [Alice] → [Bob] → [Carol] → [Dave] → [Alice] ...
Secondary: [Bob] → [Carol] → [Dave] → [Alice] → [Bob] ...
↑ secondary is always next week's primary
Shift length: 1 week (Mon 09:00 → Mon 09:00)
Handoff: 30-minute overlap with written summary
Coverage: 24/7 for SEV-1/2, business hours for SEV-3/4
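The primary/secondary offset in the diagram above is just a rotation of the roster by one position. A minimal sketch (the roster names are the hypothetical ones from the diagram):

```python
ENGINEERS = ["alice", "bob", "carol", "dave"]  # hypothetical roster

def on_call_pair(week: int, roster: list[str] = ENGINEERS) -> tuple[str, str]:
    """Primary for a given week number, with secondary = next week's primary."""
    primary = roster[week % len(roster)]
    secondary = roster[(week + 1) % len(roster)]
    return primary, secondary
```

The "secondary is next week's primary" property falls out of the offset: whoever backs you up this week already has context when they take over.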
On-Call Handoff Template¶
On-Call Handoff: [Date]
Outgoing: @alice
Incoming: @bob
Active issues:
- Database replica lag intermittent (ticket #1234)
Last occurrence: 2 hours ago, self-resolved
Runbook: runbooks/db-replica-lag.md
- Cert renewal for api.example.com due in 5 days
Tracked in: ticket #1235
Recent changes:
- Deployed v2.3.4 yesterday (new caching layer)
- Redis cluster scaled from 3 → 5 nodes Tuesday
Watch items:
- Marketing campaign Thursday may spike traffic 3x
- Planned maintenance window Friday 02:00-04:00 UTC
Noise alerts to know about:
- "disk_usage_high" on log-collector-03 fires every 6 hours
Known issue, ticket #1200, safe to ack
6. Communication Templates¶
Statuspage Update — Investigating¶
[Investigating] Elevated error rates on API
We are investigating elevated error rates affecting the API.
Some users may experience slower response times or intermittent errors.
Our team is actively investigating the root cause.
We will provide an update within 30 minutes.
Posted at: 2026-03-15 14:05 UTC
Statuspage Update — Identified¶
[Identified] Elevated error rates on API
We have identified the root cause as a database connection pool
exhaustion following today's deployment.
We are rolling back the deployment and expect service to recover
within 15 minutes.
Next update in 15 minutes or upon resolution.
Posted at: 2026-03-15 14:20 UTC
Statuspage Update — Resolved¶
[Resolved] Elevated error rates on API
The deployment has been rolled back and service has fully recovered.
Error rates have returned to normal levels.
Duration: 14:00 - 14:35 UTC (35 minutes)
Impact: Approximately 15% of API requests returned errors during the window.
We will publish a detailed postmortem within 48 hours.
Posted at: 2026-03-15 14:40 UTC
Slack War Room Opening Message¶
:rotating_light: INCIDENT DECLARED: API latency spike
Severity: SEV-2
Impact: ~30% of requests > 5s latency
IC: @alice
Tech Lead: @bob
Comms: @carol
Bridge: [Zoom link]
Dashboard: [Grafana link]
Statuspage: [link]
Next update: 14:15 UTC
Thread all investigation in this channel.
Non-incident conversation → #engineering
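The channel name in the template follows an `#inc-YYYYMMDD-slug` convention. A sketch of a helper that enforces it (the function is illustrative, not part of Slack or any incident tool):

```python
from datetime import date

def incident_channel(summary: str, day: date) -> str:
    """Build a war-room channel name like #inc-20260315-api-latency
    from a short incident summary and the declaration date."""
    slug = summary.lower().replace(" ", "-")
    return f"#inc-{day.strftime('%Y%m%d')}-{slug}"
```

Encoding the date in the name keeps channels sortable and avoids collisions when the same failure mode recurs months later.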
7. Runbook-Driven Response¶
Every common failure mode should have a runbook that any on-call engineer can follow — even if they've never worked on that service before.
Runbook Template:
Title: [Service] — [Failure Mode]
Last Updated: YYYY-MM-DD
Author: @name
Symptoms:
- What alerts fire
- What users see
- What dashboards show
Diagnosis:
Step 1: Check [specific dashboard/query]
Step 2: Run [specific command]
Step 3: If [condition], go to Remediation A
If [other condition], go to Remediation B
Remediation A: [Detailed steps]
1. Run: kubectl rollout undo deployment/api -n production
2. Verify: watch error rate dashboard for 5 minutes
3. If not resolved: escalate to @service-owner
Remediation B: [Detailed steps]
...
Escalation:
- If remediation fails after 15 minutes: page @secondary
- If data loss suspected: page @database-team
- If customer-facing > 30 minutes: notify @vp-eng
8. On-Call Health and Burnout Prevention¶
| Warning Sign | Intervention |
|---|---|
| > 5 pages per on-call shift | Tune alerts, fix noisy sources |
| Engineer dreading their on-call week | Review alert volume, add secondary support |
| Same person always on-call (no swaps) | Enforce rotation, backfill the team |
| Pages during sleep (midnight - 6am) | Review: are these truly urgent? Can they wait? |
| On-call engineer fixing the same issue repeatedly | Invest in permanent fix, not band-aids |
| Post on-call exhaustion (needs recovery day) | Formalize comp time, reduce shift length |
On-Call Health Metrics:
Track monthly:
- Pages per shift (target: < 2 per day shift, 0 per night)
- Time-to-ack (target: < 5 minutes)
- Time-to-resolve (target: < 30 minutes for P1)
- Sleep interruptions per night shift
- On-call satisfaction survey score (1-5, target: > 3.5)
Red flags:
- Pages per shift trending up
- Same alerts recurring weekly
- Satisfaction score below 3.0
- Engineer requesting permanent removal from rotation
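The red flags above lend themselves to a periodic automated check. A rough sketch, assuming you already collect pages-per-shift and survey scores somewhere (the thresholds follow the text; the function shape and "trending up" heuristic are illustrative):

```python
def health_red_flags(pages_per_shift: list[float],
                     recurring_alerts: int,
                     satisfaction: float) -> list[str]:
    """Flag the on-call health warning signs listed above."""
    flags: list[str] = []
    # Crude "trending up" signal: last shift noisier than the first.
    if len(pages_per_shift) >= 2 and pages_per_shift[-1] > pages_per_shift[0]:
        flags.append("pages per shift trending up")
    if recurring_alerts > 0:
        flags.append("same alerts recurring")
    if satisfaction < 3.0:
        flags.append("satisfaction below 3.0")
    return flags
```

A real version would use a proper trend test over several months, but even this crude check surfaces rot before an engineer asks to leave the rotation.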
Common Pitfalls¶
Interview tip: When asked about incident response in interviews, walk through the lifecycle: Detect, Triage, Respond, Resolve, Learn. The strongest signal of experience is mentioning the communication cadence ("we updated stakeholders every 15 minutes") and the postmortem ("blameless, focused on systemic fixes"). Companies want to hear that you treat incidents as learning opportunities, not blame events.
- IC who debugs instead of coordinating — The moment the IC starts SSHing into boxes, nobody is running the incident. Delegate technical work.
- No communication cadence — Stakeholders hear nothing for an hour and assume the worst. Set a timer: update every 15-30 minutes even if the update is "still investigating."
- Escalation as failure — Engineers wait too long to escalate because they think it makes them look incompetent. Reframe: escalation is the system working correctly.
- On-call with no runbooks — Waking someone up at 3am and expecting them to figure it out from scratch is cruel and slow. Every page should have a corresponding runbook.
- Handoff by disappearing — Outgoing on-call goes dark without telling incoming what's happening. Written handoffs are non-negotiable.
- Alert routing to a Slack channel instead of a pager — Slack messages get lost in noise. Critical alerts must page a human directly with escalation.
Wiki Navigation¶
Prerequisites¶
- Postmortems & SLOs (Topic Pack, L2)
Related Content¶
- Runbook Craft (Topic Pack, L1) — Incident Response, On-Call & Incident Command
- The Psychology of Incidents (Topic Pack, L2) — Incident Response, On-Call & Incident Command
- Vendor Management & Escalation (Topic Pack, L1) — Incident Response, On-Call & Incident Command
- Change Management (Topic Pack, L1) — Incident Response
- Chaos Engineering Scripts (CLI) (Exercise Set, L2) — Incident Response
- Debugging Methodology (Topic Pack, L1) — Incident Response
- Incident Response Flashcards (CLI) (flashcard_deck, L1) — Incident Response
- Incident Simulator (18 scenarios) (CLI) (Exercise Set, L2) — Incident Response
- Investigation Engine (CLI) (Exercise Set, L2) — Incident Response
- On Call Flashcards (CLI) (flashcard_deck, L1) — On-Call & Incident Command
Pages that link here¶
- Anti-Primer: Incident Command
- Change Management
- Comparison: Alerting & Paging
- Debugging Methodology
- Incident Command & On-Call
- Master Curriculum: 40 Weeks
- Postmortems & SLOs
- Production Readiness Review: Answer Key
- Production Readiness Review: Study Plans
- Runbook Craft
- The Psychology of Incidents
- Vendor Management & Escalation