Portal | Level: L2: Operations | Topics: On-Call & Incident Command, Incident Response | Domain: DevOps & Tooling

Incident Command & On-Call - Primer

Why This Matters

Every organization that runs production systems will have incidents. The question isn't whether an incident will happen — it's whether your response will be structured or chaotic. Incident command is the discipline of turning a panicked Slack channel into a coordinated response with clear roles, communication, and escalation. Good incident command doesn't prevent outages, but it dramatically reduces their duration and blast radius.

On-call is the other half: who gets woken up, how fast they respond, and whether they burn out doing it. A bad on-call rotation destroys morale faster than almost anything else in engineering. A good one is a sustainable system that distributes load fairly and gives people the tools to respond effectively.

Core Concepts

Name origin: Incident Command System (ICS) was developed by California firefighters in the 1970s after disastrous wildfires exposed coordination failures between agencies. FIRESCOPE (Firefighting Resources of California Organized for Potential Emergencies) created ICS to standardize roles, communication, and escalation. Tech companies adopted ICS principles in the 2010s, led by PagerDuty and Google SRE. The core insight translates perfectly: in an emergency, unclear roles and ad-hoc communication cost lives (or uptime).

1. Incident Severity Levels

Before you can respond to incidents, everyone must agree on what constitutes one. Severity levels create a shared language.

  Level | Customer Impact                                 | Response                           | Communication               | Example
  SEV-1 | Major: service down or data loss                | All hands, war room, exec notified | Statuspage + customer email | Complete outage, data corruption
  SEV-2 | Significant: degraded for many users            | On-call team + service owners      | Statuspage update           | 50% error rate, 10x latency
  SEV-3 | Minor: small subset affected, workaround exists | On-call engineer                   | Internal Slack update       | Single endpoint down, one region degraded
  SEV-4 | Minimal: no customer impact                     | Best effort, next business day     | Team channel                | Internal tool broken, monitoring gap
Severity Decision Tree:

  Is the service completely down?
    Yes → SEV-1
    No  → Are >10% of users affected without workaround?
           Yes → SEV-2
           No  → Are any users affected?
                  Yes → SEV-3
                  No  → SEV-4 (or not an incident)
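The decision tree above can be sketched as a small function. The 10% threshold, the workaround check, and the level names come straight from the tree; everything else is illustrative.

```python
# Minimal sketch of the severity decision tree. Inputs are the three
# questions the tree asks; the thresholds mirror the tree exactly.

def classify_severity(service_down: bool,
                      pct_users_affected: float,
                      workaround_exists: bool) -> str:
    """Map impact facts to a SEV level, following the decision tree."""
    if service_down:
        return "SEV-1"                      # completely down
    if pct_users_affected > 10 and not workaround_exists:
        return "SEV-2"                      # widespread, no workaround
    if pct_users_affected > 0:
        return "SEV-3"                      # some users affected
    return "SEV-4"                          # no customer impact
```

Note that a workaround demotes a widespread issue to SEV-3 — the tree treats "can users route around it?" as the dividing line between SEV-2 and SEV-3.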

2. Incident Commander Role

The Incident Commander (IC) runs the response. They do NOT debug the problem — they coordinate people, communication, and decisions.

IC Responsibilities:

  ┌───────────────────────────────────────────────────┐
  │ Incident Commander                                │
  │                                                   │
  │  □ Declare the incident and severity              │
  │  □ Open the war room (Slack channel + bridge)     │
  │  □ Assign roles (comms lead, technical lead)      │
  │  □ Set the investigation direction                │
  │  □ Decide when to escalate                        │
  │  □ Approve risky remediation actions              │
  │  □ Call for statuspage updates at regular cadence │
  │  □ Declare resolution                             │
  │  □ Assign postmortem owner                        │
  │                                                   │
  │  The IC does NOT:                                 │
  │  × Debug code                                     │
  │  × SSH into servers                               │
  │  × Write queries                                  │
  │  × Get tunnel-visioned on one theory              │
  │                                                   │
  │  Remember: "Coordinate, don't operate."           │
  └───────────────────────────────────────────────────┘

3. Incident Roles

  Role                    | Responsibility                                    | Who
  Incident Commander (IC) | Coordinates response, makes decisions             | Senior on-call or designated IC
  Technical Lead          | Drives investigation and remediation              | Engineer closest to the problem
  Communications Lead     | Updates statuspage, customers, stakeholders       | On-call comms person or PM
  Scribe                  | Records timeline, decisions, actions in real time | Anyone available

War Room Structure:

  Slack: #inc-20260315-api-latency
  Bridge: Zoom/Meet link pinned in channel topic

  IC:          @alice (running the show)
  Tech Lead:   @bob (hands on keyboard)
  Comms:       @carol (statuspage + customer updates)
  Scribe:      @dave (timestamped notes in thread)

4. PagerDuty / OpsGenie Setup

Alerting tools route pages to the right person at the right time. The setup matters more than the tool choice.

Escalation Policy Structure:

  Level 1: Primary on-call
           → 5-minute ack window
           → If not acked: escalate

  Level 2: Secondary on-call
           → 5-minute ack window
           → If not acked: escalate

  Level 3: Engineering manager
           → 10-minute ack window
           → If not acked: escalate

  Level 4: VP Engineering (phone call)
           → This should almost never happen
           → If it does, your L1-L3 process is broken
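The four-level policy can be modeled as data plus a short escalation walk. The levels and ack windows come from the policy above; the `page` function is a hypothetical stand-in for what a tool like PagerDuty or Opsgenie does internally, not its real API.

```python
# Sketch of the escalation policy above. Each entry is (level, ack window
# in minutes); the final level has no further escalation.

ESCALATION_POLICY = [
    ("primary on-call",      5),
    ("secondary on-call",    5),
    ("engineering manager", 10),
    ("VP engineering",      None),   # last resort
]

def page(acked_by):
    """Walk the policy, paging each level until someone acks.

    `acked_by` names the level that acknowledges (None means nobody does).
    Returns the list of levels that were paged, in order.
    """
    paged = []
    for level, _ack_window_min in ESCALATION_POLICY:
        paged.append(level)
        if level == acked_by:
            break    # acked inside the window; stop escalating
    return paged
```

If the walk ever reaches the last entry, that is the "your L1-L3 process is broken" signal from the diagram.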

  Configuration         | Recommended Setting           | Why
  Ack timeout           | 5 minutes                     | Long enough for someone to wake up, short enough not to waste time
  Escalation timeout    | 5 minutes after ack timeout   | Don't let a missed ack block the response
  Auto-resolve          | 30 minutes after alert clears | Prevents stale incidents clogging the queue
  De-duplication        | 5-minute window               | Same alert firing 10x should be one page, not ten
  Low-urgency hours     | Business hours only           | Non-critical alerts should not wake people up
  High-urgency delivery | Push + SMS + phone call       | Critical alerts must break through Do Not Disturb

War story: In 2017, GitLab experienced a major database outage when an engineer accidentally ran rm -rf on a production database directory during an incident response. All five of their backup mechanisms failed to produce a usable restore. The incident became a landmark case study in incident response because GitLab live-streamed the recovery on YouTube, published a brutally honest postmortem, and used the experience to completely rebuild their backup validation process. Key lesson: test your backups by actually restoring from them, not just verifying they exist.
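The 5-minute de-duplication window from the recommended settings can be sketched as follows; the event shapes and alert names are illustrative.

```python
# Sketch of alert de-duplication: repeated firings of the same alert
# inside the window collapse into a single page.

DEDUP_WINDOW_SEC = 5 * 60

def dedupe(events):
    """events: (timestamp_sec, alert_name) pairs in time order.

    Returns only the events that should actually page a human.
    """
    last_paged = {}    # alert_name -> timestamp of the last page sent
    pages = []
    for ts, name in events:
        if name not in last_paged or ts - last_paged[name] >= DEDUP_WINDOW_SEC:
            last_paged[name] = ts
            pages.append((ts, name))
    return pages
```

Ten firings of the same alert in five minutes become one page; a different alert still pages immediately.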

5. On-Call Rotations

Rotation Design:

  Primary:    [Alice] → [Bob] → [Carol] → [Dave] → [Alice] ...
  Secondary:  [Bob]   → [Carol] → [Dave] → [Alice] → [Bob] ...
              ↑ secondary is always next week's primary

  Shift length: 1 week (Mon 09:00 → Mon 09:00)
  Handoff: 30-minute overlap with written summary
  Coverage: 24/7 for SEV-1/2, business hours for SEV-3/4
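The "secondary is always next week's primary" invariant falls out naturally if both schedules are derived from one roster with an offset of one. A minimal sketch, using the example engineers from the diagram:

```python
# Sketch of the rotation design above: one roster, two derived schedules.

ENGINEERS = ["Alice", "Bob", "Carol", "Dave"]

def rotation(week: int):
    """Return (primary, secondary) for a given week number.

    Secondary is offset by one, so this week's secondary is
    automatically next week's primary.
    """
    n = len(ENGINEERS)
    primary = ENGINEERS[week % n]
    secondary = ENGINEERS[(week + 1) % n]
    return primary, secondary
```

Deriving both schedules from one list means adding or removing an engineer updates primary and secondary together, so the invariant can't drift.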

On-Call Handoff Template

On-Call Handoff: [Date]

Outgoing: @alice
Incoming: @bob

Active issues:
  - Database replica lag intermittent (ticket #1234)
    Last occurrence: 2 hours ago, self-resolved
    Runbook: runbooks/db-replica-lag.md

  - Cert renewal for api.example.com due in 5 days
    Tracked in: ticket #1235

Recent changes:
  - Deployed v2.3.4 yesterday (new caching layer)
  - Redis cluster scaled from 3 → 5 nodes Tuesday

Watch items:
  - Marketing campaign Thursday may spike traffic 3x
  - Planned maintenance window Friday 02:00-04:00 UTC

Noise alerts to know about:
  - "disk_usage_high" on log-collector-03 fires every 6 hours
    Known issue, ticket #1200, safe to ack

6. Communication Templates

Statuspage Update — Investigating

[Investigating] Elevated error rates on API

We are investigating elevated error rates affecting the API.
Some users may experience slower response times or intermittent errors.

Our team is actively investigating the root cause.
We will provide an update within 30 minutes.

Posted at: 2026-03-15 14:05 UTC

Statuspage Update — Identified

[Identified] Elevated error rates on API

We have identified the root cause as a database connection pool
exhaustion following today's deployment.

We are rolling back the deployment and expect service to recover
within 15 minutes.

Next update in 15 minutes or upon resolution.

Posted at: 2026-03-15 14:20 UTC

Statuspage Update — Resolved

[Resolved] Elevated error rates on API

The deployment has been rolled back and service has fully recovered.
Error rates have returned to normal levels.

Duration: 14:00 - 14:35 UTC (35 minutes)
Impact: Approximately 15% of API requests returned errors during the window.

We will publish a detailed postmortem within 48 hours.

Posted at: 2026-03-15 14:40 UTC

Slack War Room Opening Message

:rotating_light: INCIDENT DECLARED: API latency spike
Severity: SEV-2
Impact: ~30% of requests > 5s latency
IC: @alice
Tech Lead: @bob
Comms: @carol

Bridge: [Zoom link]
Dashboard: [Grafana link]
Statuspage: [link]

Next update: 14:15 UTC

Thread all investigation in this channel.
Non-incident conversation → #engineering

7. Runbook-Driven Response

Every common failure mode should have a runbook that any on-call engineer can follow — even if they've never worked on that service before.

Runbook Template:

  Title: [Service] — [Failure Mode]
  Last Updated: YYYY-MM-DD
  Author: @name

  Symptoms:
    - What alerts fire
    - What users see
    - What dashboards show

  Diagnosis:
    Step 1: Check [specific dashboard/query]
    Step 2: Run [specific command]
    Step 3: If [condition], go to Remediation A
            If [other condition], go to Remediation B

  Remediation A: [Detailed steps]
    1. Run: kubectl rollout undo deployment/api -n production
    2. Verify: watch error rate dashboard for 5 minutes
    3. If not resolved: escalate to @service-owner

  Remediation B: [Detailed steps]
    ...

  Escalation:
    - If remediation fails after 15 minutes: page @secondary
    - If data loss suspected: page @database-team
    - If customer-facing > 30 minutes: notify @vp-eng
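One way to enforce "every page has a runbook" is a registry mapping alert names to runbook paths, checked before an alert is allowed to page. This is a hypothetical sketch — the alert names, paths, and registry shape are illustrative, not a real tool's API.

```python
# Sketch of a runbook registry: alerts without a runbook fail loudly
# at configuration time instead of at 3am.

RUNBOOKS = {
    "db_replica_lag": "runbooks/db-replica-lag.md",
    "api_error_rate": "runbooks/api-error-rate.md",
}

def runbook_for(alert_name: str) -> str:
    """Return the runbook path for an alert, or fail loudly if none exists."""
    try:
        return RUNBOOKS[alert_name]
    except KeyError:
        raise KeyError(
            f"No runbook for alert '{alert_name}': "
            "add one before this alert is allowed to page"
        )
```

A CI check that calls `runbook_for` on every configured alert turns the "no runbook" pitfall into a build failure rather than a 3am surprise.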

8. On-Call Health and Burnout Prevention

  Warning Sign                                      | Intervention
  > 5 pages per on-call shift                       | Tune alerts, fix noisy sources
  Engineer dreading their on-call week              | Review alert volume, add secondary support
  Same person always on-call (no swaps)             | Enforce rotation, backfill the team
  Pages during sleep (midnight - 6am)               | Review: are these truly urgent? Can they wait?
  On-call engineer fixing the same issue repeatedly | Invest in permanent fix, not band-aids
  Post on-call exhaustion (needs recovery day)      | Formalize comp time, reduce shift length

On-Call Health Metrics:

  Track monthly:
    - Pages per shift (target: < 2 per day shift, 0 per night)
    - Time-to-ack (target: < 5 minutes)
    - Time-to-resolve (target: < 30 minutes for P1)
    - Sleep interruptions per night shift
    - On-call satisfaction survey score (1-5, target: > 3.5)

  Red flags:
    - Pages per shift trending up
    - Same alerts recurring weekly
    - Satisfaction score below 3.0
    - Engineer requesting permanent removal from rotation
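The monthly targets above translate directly into an automated health check. A minimal sketch — the thresholds come from the targets and red flags listed; the function name and inputs are assumptions:

```python
# Sketch of a monthly on-call health check against the stated targets:
# < 2 pages per day shift, 0 night pages, satisfaction above 3.0.

def health_flags(pages_per_day_shift: float,
                 night_pages: int,
                 satisfaction: float) -> list:
    """Return the red flags tripped by this month's numbers."""
    flags = []
    if pages_per_day_shift > 2:
        flags.append("too many pages per day shift (target: < 2)")
    if night_pages > 0:
        flags.append("night pages should be 0")
    if satisfaction < 3.0:
        flags.append("satisfaction score below 3.0")
    return flags
```

An empty list means the rotation is healthy this month; anything else is a prompt to tune alerts or rebalance the rotation before burnout sets in.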

Common Pitfalls

Interview tip: When asked about incident response in interviews, walk through the lifecycle: Detect, Triage, Respond, Resolve, Learn. The strongest signal of experience is mentioning the communication cadence ("we updated stakeholders every 15 minutes") and the postmortem ("blameless, focused on systemic fixes"). Companies want to hear that you treat incidents as learning opportunities, not blame events.

  1. IC who debugs instead of coordinating — The moment the IC starts SSHing into boxes, nobody is running the incident. Delegate technical work.
  2. No communication cadence — Stakeholders hear nothing for an hour and assume the worst. Set a timer: update every 15-30 minutes even if the update is "still investigating."
  3. Escalation as failure — Engineers wait too long to escalate because they think it makes them look incompetent. Reframe: escalation is the system working correctly.
  4. On-call with no runbooks — Waking someone up at 3am and expecting them to figure it out from scratch is cruel and slow. Every page should have a corresponding runbook.
  5. Handoff by disappearing — Outgoing on-call goes dark without telling incoming what's happening. Written handoffs are non-negotiable.
  6. Alert routing to a Slack channel instead of a pager — Slack messages get lost in noise. Critical alerts must page a human directly with escalation.

Wiki Navigation

Prerequisites

  • Runbook Craft (Topic Pack, L1) — Incident Response, On-Call & Incident Command
  • The Psychology of Incidents (Topic Pack, L2) — Incident Response, On-Call & Incident Command
  • Vendor Management & Escalation (Topic Pack, L1) — Incident Response, On-Call & Incident Command
  • Change Management (Topic Pack, L1) — Incident Response
  • Chaos Engineering Scripts (CLI) (Exercise Set, L2) — Incident Response
  • Debugging Methodology (Topic Pack, L1) — Incident Response
  • Incident Response Flashcards (CLI) (flashcard_deck, L1) — Incident Response
  • Incident Simulator (18 scenarios) (CLI) (Exercise Set, L2) — Incident Response
  • Investigation Engine (CLI) (Exercise Set, L2) — Incident Response
  • On Call Flashcards (CLI) (flashcard_deck, L1) — On-Call & Incident Command