Portal | Level: L2: Operations | Topics: On-Call & Incident Command, Incident Response | Domain: DevOps & Tooling
Incident Command & On-Call - Primer¶
Why This Matters¶
Every organization that runs production systems will have incidents. The question isn't whether — it's whether your response will be structured or chaotic. Incident command is the discipline of turning a panicked Slack channel into a coordinated response with clear roles, communication, and escalation. Good incident command doesn't prevent outages, but it dramatically reduces their duration and blast radius.
On-call is the other half: who gets woken up, how fast they respond, and whether they burn out doing it. A bad on-call rotation destroys morale faster than almost anything else in engineering. A good one is a sustainable system that distributes load fairly and gives people the tools to respond effectively.
Core Concepts¶
Name origin: Incident Command System (ICS) was developed by California firefighters in the 1970s after disastrous wildfires exposed coordination failures between agencies. FIRESCOPE (Firefighting Resources of California Organized for Potential Emergencies) created ICS to standardize roles, communication, and escalation. Tech companies adopted ICS principles in the 2010s, led by PagerDuty and Google SRE. The core insight translates perfectly: in an emergency, unclear roles and ad-hoc communication cost lives (or uptime).
1. Incident Severity Levels¶
Before you can respond to incidents, everyone must agree on what constitutes one. Severity levels create a shared language.
| Level | Customer Impact | Response | Communication | Example |
|---|---|---|---|---|
| SEV-1 | Major: service down or data loss | All hands, war room, exec notified | Statuspage + customer email | Complete outage, data corruption |
| SEV-2 | Significant: degraded for many users | On-call team + service owners | Statuspage update | 50% error rate, 10x latency |
| SEV-3 | Minor: small subset affected, workaround exists | On-call engineer | Internal Slack update | Single endpoint down, one region degraded |
| SEV-4 | Minimal: no customer impact | Best effort, next business day | Team channel | Internal tool broken, monitoring gap |
Severity Decision Tree:
Is the service completely down?
Yes → SEV-1
No → Are >10% of users affected without workaround?
Yes → SEV-2
No → Are any users affected?
Yes → SEV-3
No → SEV-4 (or not an incident)
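The decision tree above can be sketched as a small function. This is illustrative only: the thresholds mirror the tree, and the function name and parameters are not from any particular tool.

```python
def classify_severity(service_down: bool, pct_users_affected: float,
                      workaround_exists: bool) -> str:
    """Walk the severity decision tree from top to bottom.

    pct_users_affected is 0-100; a value of 0 means no users affected.
    """
    if service_down:
        return "SEV-1"  # complete outage
    if pct_users_affected > 10 and not workaround_exists:
        return "SEV-2"  # significant degradation, no workaround
    if pct_users_affected > 0:
        return "SEV-3"  # minor impact or workaround exists
    return "SEV-4"      # no customer impact (maybe not an incident at all)
```

Note the ordering matters: a 50% outage with a workaround still lands at SEV-3 under this tree, which is why teams often add a judgment override for the IC.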
2. Incident Commander Role¶
The Incident Commander (IC) runs the response. They do NOT debug the problem — they coordinate people, communication, and decisions.
IC Responsibilities:
┌─────────────────────────────────────────────────┐
│ Incident Commander │
│ │
│ □ Declare the incident and severity │
│ □ Open the war room (Slack channel + bridge) │
│ □ Assign roles (comms lead, technical lead) │
│ □ Set the investigation direction │
│ □ Decide when to escalate │
│ □ Approve risky remediation actions │
│ □ Call for statuspage updates at regular cadence │
│ □ Declare resolution │
│ □ Assign postmortem owner │
│ │
│ The IC does NOT: │
│ × Debug code │
│ × SSH into servers │
│ × Write queries │
│ × Get tunnel-visioned on one theory │
│ │
│ Remember: "Coordinate, don't operate." │
└─────────────────────────────────────────────────┘
3. Incident Roles¶
| Role | Responsibility | Who |
|---|---|---|
| Incident Commander (IC) | Coordinates response, makes decisions | Senior on-call or designated IC |
| Technical Lead | Drives investigation and remediation | Engineer closest to the problem |
| Communications Lead | Updates statuspage, customers, stakeholders | On-call comms person or PM |
| Scribe | Records timeline, decisions, actions in real-time | Anyone available |
War Room Structure:
Slack: #inc-20260315-api-latency
Bridge: Zoom/Meet link pinned in channel topic
IC: @alice (running the show)
Tech Lead: @bob (hands on keyboard)
Comms: @carol (statuspage + customer updates)
Scribe: @dave (timestamped notes in thread)
4. PagerDuty / OpsGenie Setup¶
Alerting tools route pages to the right person at the right time. The setup matters more than the tool choice.
Escalation Policy Structure:
Level 1: Primary on-call
→ 5-minute ack window
→ If not acked: escalate
Level 2: Secondary on-call
→ 5-minute ack window
→ If not acked: escalate
Level 3: Engineering manager
→ 10-minute ack window
→ If not acked: escalate
Level 4: VP Engineering (phone call)
→ This should almost never happen
→ If it does, your L1-L3 process is broken
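The worst-case timing of this policy is worth computing explicitly. A minimal sketch, assuming each level is paged the moment the previous level's ack window lapses (level names and windows taken from the policy above; the function itself is hypothetical):

```python
from datetime import datetime, timedelta

# (level, ack window in minutes); None = last line of defense, no further escalation
ACK_WINDOWS = [("primary", 5), ("secondary", 5), ("eng-manager", 10), ("vp-eng", None)]

def escalation_timeline(alert_time: datetime) -> list[tuple[str, datetime]]:
    """Return (level, page_time) pairs: when each level is paged
    if nobody ever acknowledges."""
    timeline = []
    t = alert_time
    for level, window in ACK_WINDOWS:
        timeline.append((level, t))
        if window is None:
            break
        t = t + timedelta(minutes=window)
    return timeline
```

Under these windows an entirely unacknowledged alert reaches the VP 20 minutes after it fires, which is the number you should sanity-check against your SEV-1 response target.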
| Configuration | Recommended Setting | Why |
|---|---|---|
| Ack timeout | 5 minutes | Long enough for someone to wake up, short enough to not waste time |
| Escalation timeout | 5 minutes after ack timeout | Don't let a missed ack block the response |
| Auto-resolve | 30 minutes after alert clears | Prevents stale incidents clogging the queue |
| De-duplication | 5-minute window | Same alert firing 10x should be one page, not ten |
| Low-urgency hours | Business hours only | Non-critical alerts should not wake people up |
| High-urgency delivery | Push + SMS + phone call | Critical alerts must break through Do Not Disturb |
War story: In 2017, GitLab experienced a major database outage in which an engineer accidentally ran `rm -rf` on a production database directory during incident response. Five out of five backup methods failed to produce a usable restore. The incident became a landmark case study in incident response because GitLab live-streamed the recovery on YouTube, published a brutally honest postmortem, and used the experience to rebuild their backup validation process from the ground up. Key lesson: test your backups by actually restoring from them, not just by verifying they exist.
5. On-Call Rotations¶
Rotation Design:
Primary: [Alice] → [Bob] → [Carol] → [Dave] → [Alice] ...
Secondary: [Bob] → [Carol] → [Dave] → [Alice] → [Bob] ...
↑ secondary is always next week's primary
Shift length: 1 week (Mon 09:00 → Mon 09:00)
Handoff: 30-minute overlap with written summary
Coverage: 24/7 for SEV-1/2, business hours for SEV-3/4
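The primary/secondary offset in the diagram above is just a rotation of the roster by one position. A minimal sketch (the roster names are the hypothetical ones from the diagram):

```python
ENGINEERS = ["alice", "bob", "carol", "dave"]  # hypothetical roster

def on_call_pair(week: int, roster: list[str] = ENGINEERS) -> tuple[str, str]:
    """Primary for a given week number, with secondary = next week's primary."""
    primary = roster[week % len(roster)]
    secondary = roster[(week + 1) % len(roster)]
    return primary, secondary
```

The "secondary is next week's primary" property falls out of the offset: whoever backs you up this week already has context when they take over.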
On-Call Handoff Template¶
On-Call Handoff: [Date]
Outgoing: @alice
Incoming: @bob
Active issues:
- Database replica lag intermittent (ticket #1234)
Last occurrence: 2 hours ago, self-resolved
Runbook: runbooks/db-replica-lag.md
- Cert renewal for api.example.com due in 5 days
Tracked in: ticket #1235
Recent changes:
- Deployed v2.3.4 yesterday (new caching layer)
- Redis cluster scaled from 3 → 5 nodes Tuesday
Watch items:
- Marketing campaign Thursday may spike traffic 3x
- Planned maintenance window Friday 02:00-04:00 UTC
Noise alerts to know about:
- "disk_usage_high" on log-collector-03 fires every 6 hours
Known issue, ticket #1200, safe to ack
6. Communication Templates¶
Statuspage Update — Investigating¶
[Investigating] Elevated error rates on API
We are investigating elevated error rates affecting the API.
Some users may experience slower response times or intermittent errors.
Our team is actively investigating the root cause.
We will provide an update within 30 minutes.
Posted at: 2026-03-15 14:05 UTC
Statuspage Update — Identified¶
[Identified] Elevated error rates on API
We have identified the root cause as a database connection pool
exhaustion following today's deployment.
We are rolling back the deployment and expect service to recover
within 15 minutes.
Next update in 15 minutes or upon resolution.
Posted at: 2026-03-15 14:20 UTC
Statuspage Update — Resolved¶
[Resolved] Elevated error rates on API
The deployment has been rolled back and service has fully recovered.
Error rates have returned to normal levels.
Duration: 14:00 - 14:35 UTC (35 minutes)
Impact: Approximately 15% of API requests returned errors during the window.
We will publish a detailed postmortem within 48 hours.
Posted at: 2026-03-15 14:40 UTC
Slack War Room Opening Message¶
:rotating_light: INCIDENT DECLARED: API latency spike
Severity: SEV-2
Impact: ~30% of requests > 5s latency
IC: @alice
Tech Lead: @bob
Comms: @carol
Bridge: [Zoom link]
Dashboard: [Grafana link]
Statuspage: [link]
Next update: 14:15 UTC
Thread all investigation in this channel.
Non-incident conversation → #engineering
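The channel name in the template follows an `#inc-YYYYMMDD-slug` convention. A sketch of a helper that enforces it (the function is illustrative, not part of Slack or any incident tool):

```python
from datetime import date

def incident_channel(summary: str, day: date) -> str:
    """Build a war-room channel name like #inc-20260315-api-latency
    from a short incident summary and the declaration date."""
    slug = summary.lower().replace(" ", "-")
    return f"#inc-{day.strftime('%Y%m%d')}-{slug}"
```

Encoding the date in the name keeps channels sortable and avoids collisions when the same failure mode recurs months later.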
7. Runbook-Driven Response¶
Every common failure mode should have a runbook that any on-call engineer can follow — even if they've never worked on that service before.
Runbook Template:
Title: [Service] — [Failure Mode]
Last Updated: YYYY-MM-DD
Author: @name
Symptoms:
- What alerts fire
- What users see
- What dashboards show
Diagnosis:
Step 1: Check [specific dashboard/query]
Step 2: Run [specific command]
Step 3: If [condition], go to Remediation A
If [other condition], go to Remediation B
Remediation A: [Detailed steps]
1. Run: kubectl rollout undo deployment/api -n production
2. Verify: watch error rate dashboard for 5 minutes
3. If not resolved: escalate to @service-owner
Remediation B: [Detailed steps]
...
Escalation:
- If remediation fails after 15 minutes: page @secondary
- If data loss suspected: page @database-team
- If customer-facing > 30 minutes: notify @vp-eng
8. On-Call Health and Burnout Prevention¶
| Warning Sign | Intervention |
|---|---|
| > 5 pages per on-call shift | Tune alerts, fix noisy sources |
| Engineer dreading their on-call week | Review alert volume, add secondary support |
| Same person always on-call (no swaps) | Enforce rotation, backfill the team |
| Pages during sleep (midnight - 6am) | Review: are these truly urgent? Can they wait? |
| On-call engineer fixing the same issue repeatedly | Invest in permanent fix, not band-aids |
| Post on-call exhaustion (needs recovery day) | Formalize comp time, reduce shift length |
On-Call Health Metrics:
Track monthly:
- Pages per shift (target: < 2 per day shift, 0 per night)
- Time-to-ack (target: < 5 minutes)
- Time-to-resolve (target: < 30 minutes for P1)
- Sleep interruptions per night shift
- On-call satisfaction survey score (1-5, target: > 3.5)
Red flags:
- Pages per shift trending up
- Same alerts recurring weekly
- Satisfaction score below 3.0
- Engineer requesting permanent removal from rotation
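The red flags above lend themselves to a periodic automated check. A rough sketch, assuming you already collect pages-per-shift and survey scores somewhere (the thresholds follow the text; the function shape and "trending up" heuristic are illustrative):

```python
def health_red_flags(pages_per_shift: list[float],
                     recurring_alerts: int,
                     satisfaction: float) -> list[str]:
    """Flag the on-call health warning signs listed above."""
    flags: list[str] = []
    # Crude "trending up" signal: last shift noisier than the first.
    if len(pages_per_shift) >= 2 and pages_per_shift[-1] > pages_per_shift[0]:
        flags.append("pages per shift trending up")
    if recurring_alerts > 0:
        flags.append("same alerts recurring")
    if satisfaction < 3.0:
        flags.append("satisfaction below 3.0")
    return flags
```

A real version would use a proper trend test over several months, but even this crude check surfaces rot before an engineer asks to leave the rotation.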
Common Pitfalls¶
Interview tip: When asked about incident response in interviews, walk through the lifecycle: Detect, Triage, Respond, Resolve, Learn. The strongest signal of experience is mentioning the communication cadence ("we updated stakeholders every 15 minutes") and the postmortem ("blameless, focused on systemic fixes"). Companies want to hear that you treat incidents as learning opportunities, not blame events.
- IC who debugs instead of coordinating — The moment the IC starts SSHing into boxes, nobody is running the incident. Delegate technical work.
- No communication cadence — Stakeholders hear nothing for an hour and assume the worst. Set a timer: update every 15-30 minutes even if the update is "still investigating."
- Escalation as failure — Engineers wait too long to escalate because they think it makes them look incompetent. Reframe: escalation is the system working correctly.
- On-call with no runbooks — Waking someone up at 3am and expecting them to figure it out from scratch is cruel and slow. Every page should have a corresponding runbook.
- Handoff by disappearing — Outgoing on-call goes dark without telling incoming what's happening. Written handoffs are non-negotiable.
- Alert routing to a Slack channel instead of a pager — Slack messages get lost in noise. Critical alerts must page a human directly with escalation.
Wiki Navigation¶
Prerequisites¶
- Postmortems & SLOs (Topic Pack, L2)
Related Content¶
- Runbook Craft (Topic Pack, L1) — Incident Response, On-Call & Incident Command
- The Psychology of Incidents (Topic Pack, L2) — Incident Response, On-Call & Incident Command
- Vendor Management & Escalation (Topic Pack, L1) — Incident Response, On-Call & Incident Command
- Change Management (Topic Pack, L1) — Incident Response
- Chaos Engineering Scripts (CLI) (Exercise Set, L2) — Incident Response
- Debugging Methodology (Topic Pack, L1) — Incident Response
- Incident Response Flashcards (CLI) (flashcard_deck, L1) — Incident Response
- Incident Simulator (18 scenarios) (CLI) (Exercise Set, L2) — Incident Response
- Investigation Engine (CLI) (Exercise Set, L2) — Incident Response
- On Call Flashcards (CLI) (flashcard_deck, L1) — On-Call & Incident Command
Pages that link here¶
- Anti-Primer: Incident Command
- Change Management
- Comparison: Alerting & Paging
- Debugging Methodology
- Incident Command & On-Call
- Master Curriculum: 40 Weeks
- Postmortems & SLOs
- Production Readiness Review: Answer Key
- Production Readiness Review: Study Plans
- Runbook Craft
- The Psychology of Incidents
- Vendor Management & Escalation