
Incident Command & On-Call - Street-Level Ops

What experienced incident commanders and on-call engineers know about keeping their head when everything is on fire.

Quick Diagnosis Commands

# Check current on-call schedule (PagerDuty CLI)
pd oncall list --schedule-ids P123ABC

# List open incidents
pd incident list --statuses triggered,acknowledged

# Acknowledge an incident from CLI
pd incident ack --ids P456DEF

# Check who got paged in the last 24 hours (PagerDuty API; GNU date syntax)
curl -s -H "Authorization: Token token=YOUR_TOKEN" \
  "https://api.pagerduty.com/incidents?since=$(date -u -d '24 hours ago' +%Y-%m-%dT%H:%M:%SZ)&until=$(date -u +%Y-%m-%dT%H:%M:%SZ)" \
  | jq '.incidents[] | {id, title, status, created_at}'

# Quick service health check during incident
kubectl get pods -n production --field-selector status.phase!=Running
kubectl top pods -n production --sort-by=cpu | head -10
curl -w "\n%{http_code} %{time_total}s\n" -o /dev/null -s https://api.example.com/health

# Check recent deploys (potential incident trigger)
kubectl rollout history deployment/api -n production | tail -5
git log --oneline --since="4 hours ago" --all

Gotcha: The IC Who Gets Pulled Into Debugging

Incident is declared. You're IC. The technical discussion gets interesting. You catch yourself looking at Grafana dashboards and suggesting queries. Twenty minutes pass. Nobody has updated the statuspage. Nobody has paged the database team. Nobody is running the incident because you're debugging.

Fix: The IC mantra: "I coordinate, I don't debug." The moment you feel the pull to investigate, assign that investigation to someone else. Your job is: Who is doing what? What's the status? Who needs to be notified? What's blocking progress? If nobody else is available to debug, hand off IC to someone else.

Gotcha: War Room With 30 People and No Structure

SEV-1 declared. Everyone joins the Slack channel. Fifteen people are talking simultaneously. Three different theories are being investigated in the same thread. Nobody knows who's in charge. The signal-to-noise ratio is zero.

Fix: IC immediately sets structure:

1. Pin the incident summary message with roles
2. "All investigation goes in threads, not main channel"
3. "Tech leads for each theory, report status every 10 minutes"
4. "If you're observing and not assigned a role, mute the channel"
5. Keep the main channel for IC updates, decisions, and escalations only
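
As a sketch, the pinned summary from step 1 might look like the following. Every name, link, and detail here is a placeholder, not a prescribed format:

```shell
# Illustrative pinned-message generator; all fields are placeholders.
incident_summary() {
  cat <<'EOF'
SEV-1 | api 5xx spike | started 14:02 UTC
IC: @alice | Tech Lead: @bob | Comms: @carol
Impact: checkout failing for a subset of users
Rules: investigation in threads; main channel is for IC updates only
Dashboards: <grafana link> | Statuspage: <link>
EOF
}
incident_summary
```

Whatever the exact fields, the roles line and the "threads only" rule are the parts that kill the 15-people-talking-at-once problem.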

Gotcha: Escalation Delayed Because "I Almost Have It"

On-call engineer has been debugging for 45 minutes. They're convinced the fix is around the corner. Meanwhile, customer impact is growing. They finally escalate after 90 minutes, and the fresh eyes find the problem in 10 minutes.

Fix: Time-box your solo investigation. Hard rule: if you haven't made meaningful progress in 15 minutes, escalate. "Meaningful progress" means you've identified the failing component, not that you have a theory. The 15-minute rule saves hours.

Remember: Escalation timer mnemonic: 15-30-60 — 15 minutes of solo debugging before escalating to a second person, 30 minutes before declaring a formal incident, 60 minutes before escalating to management. Adjust for severity: for SEV-1 (full outage), cut all timers in half.
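
The 15-30-60 rule with the SEV-1 halving can be written down as a tiny sketch; the function name and output format here are made up for illustration (note the integer division rounds 15 down to 7):

```shell
# Sketch of the 15-30-60 escalation timers, halved for SEV-1.
escalation_deadlines() {
  local solo=15 formal=30 mgmt=60
  if [ "$1" = "SEV-1" ]; then
    solo=$((solo / 2)); formal=$((formal / 2)); mgmt=$((mgmt / 2))
  fi
  echo "second-person:${solo}m formal-incident:${formal}m management:${mgmt}m"
}
escalation_deadlines SEV-2   # → second-person:15m formal-incident:30m management:60m
escalation_deadlines SEV-1   # → second-person:7m formal-incident:15m management:30m
```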

Gotcha: Handoff That's Just "Good Luck"

Friday 5pm. Outgoing on-call Slacks "you're on-call now, gl" and disappears. Incoming has no context on the three alerts that fired today, the maintenance window tomorrow, or the flaky test that's been causing false pages.

Fix: Mandatory written handoff. Block 30 minutes at rotation boundary. Cover: active issues, recent changes, known noise, upcoming events, and anything you wish you'd known at the start of your shift. If the outgoing person won't write it, the manager enforces it.
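
A skeleton for that written handoff, covering the five items above (the wording is illustrative; adapt the fields to your team):

```shell
# Illustrative handoff-note skeleton; all content is placeholder text.
handoff_note() {
  cat <<'EOF'
ON-CALL HANDOFF
Active issues:    <anything still open or smoldering>
Recent changes:   <deploys, config, infra in the last few days>
Known noise:      <flaky alerts and how to recognize them>
Upcoming events:  <maintenance windows, launches, traffic spikes>
Wish I'd known:   <context that would have saved you time this week>
EOF
}
handoff_note
```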

Gotcha: PagerDuty Escalation That Goes to a Slack Channel

Your escalation policy's Level 3 goes to a Slack channel instead of a specific person. At 3am, the Level 2 times out, and the "escalation" is a message in #engineering-managers that nobody sees until morning. The incident runs for 5 hours unattended.

Fix: Every escalation level must page a specific human with phone call delivery. Slack channels are not escalation targets. If your L3 is a manager, they accept the page contract — including 3am phone calls.
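
One way to audit this is to dump each escalation policy (e.g. GET /escalation_policies/&lt;ID&gt; from the PagerDuty REST API, which uses target types like "user_reference" and "schedule_reference") and flag any target that is not an individual user. The helper and sample data below are a sketch: schedules are usually fine, anything else deserves scrutiny.

```shell
# Flag escalation targets that are not user_reference in a saved policy dump.
flag_non_user_targets() {
  grep -o '"type": *"[a-z_]*_reference"' "$1" | grep -v user_reference || true
}

# Sample policy JSON standing in for a real API response.
cat > /tmp/policy.json <<'EOF'
{"escalation_rules":[
  {"targets":[{"type":"user_reference","summary":"Alice"}]},
  {"targets":[{"type":"schedule_reference","summary":"Managers schedule"}]}
]}
EOF
flag_non_user_targets /tmp/policy.json   # → "type":"schedule_reference"
```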

Pattern: The First Five Minutes

The first five minutes of incident response determine the trajectory. Here's the sequence:

Minute 0:    Alert fires. On-call acks.

Minute 1:    Quick triage: Is this real? Check the dashboard.
             Real → continue. False positive → resolve and document.

Minute 2:    Open incident channel: #inc-YYYYMMDD-short-name
             Post template message (severity, impact, roles, links).

Minute 3:    Check: was anything deployed in the last 2 hours?
             Yes → strong candidate for rollback.
             No  → begin systematic diagnosis.

Minute 4:    Assign roles if SEV-2+:
             IC (you or delegate), Tech Lead, Comms.

Minute 5:    First statuspage update: "Investigating [problem].
             Next update in 15 minutes."
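
The minute-2 channel naming from the sequence above, as a one-line helper (the naming scheme is the one shown; the function name is made up):

```shell
# Generate an incident channel name: #inc-YYYYMMDD-short-name (UTC date).
incident_channel() {
  echo "#inc-$(date -u +%Y%m%d)-$1"
}
incident_channel api-5xx-spike   # e.g. #inc-20240607-api-5xx-spike
```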

Pattern: The Rollback Decision

Most incidents correlate with recent changes. The rollback decision framework:

Should I roll back?

  Was there a deploy in the last 4 hours?
    No  → Rollback won't help. Investigate other causes.
    Yes → Does the deploy correlate with symptom onset?
           No  → Probably coincidence. Investigate, but keep rollback ready.
           Yes → Can I roll back safely?
                  Yes → ROLL BACK NOW. Investigate after.
                  No  → (irreversible: data migration, schema change)
                        → Investigate forward. Get help.

The golden rule: When in doubt, roll back.
Rollback is not a failure. It's the fastest mitigation.
Ego has no place in incident response.

One-liner: The single fastest way to restore service is almost always kubectl rollout undo deployment/<name> or reverting the last infrastructure change. Spend 2 minutes on rollback before spending 30 minutes on root cause analysis.
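
The decision tree above can be encoded as a sketch. The yes/no inputs are answers a human supplies during triage, not anything detected automatically:

```shell
# Encode the rollback decision tree; arguments: deployed? correlates? safe?
rollback_decision() {
  local deployed="$1" correlates="$2" safe="$3"
  if [ "$deployed" = "no" ]; then echo "investigate-other-causes"; return; fi
  if [ "$correlates" = "no" ]; then echo "investigate-keep-rollback-ready"; return; fi
  if [ "$safe" = "yes" ]; then echo "ROLL-BACK-NOW"; else echo "investigate-forward-get-help"; fi
}
rollback_decision yes yes yes   # → ROLL-BACK-NOW
rollback_decision no  -   -     # → investigate-other-causes
```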

Pattern: Communication Cadence

SEV-1:
  - Internal update: every 10 minutes
  - Statuspage update: every 15 minutes
  - Exec update: every 30 minutes
  - Customer communication: first update within 15 minutes

SEV-2:
  - Internal update: every 15 minutes
  - Statuspage update: every 30 minutes
  - Exec update: only if > 1 hour duration

SEV-3:
  - Internal update: every 30 minutes
  - Statuspage update: if customer-visible, every hour
  - Exec update: not needed

Even if nothing has changed, post "Still investigating.
Current theory: [X]. Next step: [Y]. ETA: [Z]."
Silence breeds panic.

War story: During a 4-hour SEV-1, the IC went silent for 45 minutes while deep in investigation. The VP of Engineering assumed the incident was unattended and pulled in a second team, who started making conflicting changes. The resulting confusion extended the outage by another hour. Even "no update — still investigating" every 15 minutes prevents this spiral.
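
The internal-update cadence from the table can be wired into a dumb reminder loop so the IC never has to remember it. The helper below mirrors the table; the loop is commented out because the posting mechanism (Slack webhook, bot, whatever) is yours to fill in:

```shell
# Internal-update interval in minutes, per the cadence table.
cadence_minutes() {
  case "$1" in
    SEV-1) echo 10 ;;
    SEV-2) echo 15 ;;
    *)     echo 30 ;;
  esac
}
cadence_minutes SEV-1   # → 10
# Reminder loop sketch (swap echo for your Slack webhook post):
# while true; do echo "time to post an update"; sleep $(( $(cadence_minutes SEV-1) * 60 )); done
```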

Pattern: On-Call Compensation That Works

Fair on-call compensation models:

  Model A: Flat stipend
    $500/week for being on-call
    + $100 per after-hours page
    Simple, predictable

  Model B: Comp time
    1 hour of comp time per page
    Half day off after any night page
    Full day off after a SEV-1 shift

  Model C: Hybrid
    $300/week stipend
    + comp time for night pages
    + bonus for months with < 3 pages (reward reliability work)

  Non-negotiable regardless of model:
    - On-call load must be distributed evenly
    - Nobody on-call more than 1 week per month (minimum 4-person rotation)
    - Engineers can swap shifts without manager approval
    - Post-SEV-1 recovery time is automatic, not requested

Emergency: On-Call Engineer Unreachable

Primary on-call hasn't acked in 10 minutes. Secondary hasn't acked in 5.

1. Check PagerDuty: did the page actually send? (Check delivery log)
2. If page sent: call primary's phone directly (bypass app)
3. If no answer: call secondary's phone directly
4. If no answer: page the engineering manager (L3 escalation)
5. If all else fails: the first engineer who sees the alert
   becomes the IC and owns the response
6. Post-incident: review why two people were unreachable
   - Were phones on DND? → Configure PagerDuty to bypass DND
   - Were they traveling? → On-call swaps must be arranged in advance
   - Was the page not sent? → Fix the integration
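
Step 1's delivery check can be scripted against a saved dump of the incident's log entries (GET /incidents/&lt;ID&gt;/log_entries in the PagerDuty REST API; "notify_log_entry" is a real log-entry type there). The helper and sample data are illustrative: zero notification entries means the page never went out.

```shell
# Count notification entries in a saved log-entries dump.
count_notifications() {
  grep -c '"type": *"notify_log_entry"' "$1"
}

# Sample dump standing in for a real API response.
cat > /tmp/log_entries.json <<'EOF'
{"log_entries":[
  {"type":"trigger_log_entry"},
  {"type":"notify_log_entry","summary":"Notified Alice by phone"},
  {"type":"notify_log_entry","summary":"Notified Bob by SMS"}
]}
EOF
count_notifications /tmp/log_entries.json   # → 2
```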

Emergency: SEV-1 During On-Call Handoff

Incident fires at the exact rotation boundary. Both outgoing and incoming are partially context-loaded.

1. The person who acked the page is IC (regardless of rotation schedule)
2. The other person becomes secondary/tech lead
3. Both stay engaged until the incident is resolved or stable
4. The handoff completes after the incident, not during
5. The outgoing person should NOT hand off mid-incident
   unless they're genuinely unable to continue

Rule: incidents take priority over rotation boundaries.
Clean handoffs happen in peacetime.