- devops
- l1
- topic-pack
- incident-triage

Portal | Level: L1: Foundations | Topics: Incident Triage | Domain: DevOps & Tooling
Incident Triage Primer¶
Why This Matters¶
When an alert fires at 3 AM, the first five minutes determine whether you resolve the incident in 30 minutes or 3 hours. Triage is the disciplined process of assessing severity, identifying blast radius, communicating status, and routing to the right responder. Bad triage turns a P3 into a P1 through panic, and lets a real P1 sit unnoticed because nobody assessed impact.
Severity Classification¶
Fun fact: The incident management discipline originated in emergency services (fire, medical). The Incident Command System (ICS) was developed by California firefighters in the 1970s after catastrophic failures in coordinating responses to wildfires. Tech adopted ICS terminology — Incident Commander, Communications Lead — from this lineage. Google's SRE book formalized it for software operations.
Standard Severity Levels¶
| Level | Impact | Examples | Response Time |
|---|---|---|---|
| SEV-1 / P1 | Complete outage, data loss, security breach | Production down, data exfiltration, full service unavailable | Immediate (all hands) |
| SEV-2 / P2 | Major degradation, partial outage | 50% of users affected, key feature broken, significant latency | < 30 minutes |
| SEV-3 / P3 | Minor degradation, workaround exists | One region affected, non-critical feature down | < 4 hours |
| SEV-4 / P4 | Cosmetic, informational | Dashboard error, log noise, minor UI glitch | Next business day |
Classification Decision Tree¶
Is the issue customer-facing?
├── YES → Are >50% of users affected?
│   ├── YES → Is there data loss or security exposure?
│   │   ├── YES → SEV-1
│   │   └── NO → SEV-1 or SEV-2 (based on revenue impact)
│   └── NO → Is there a workaround?
│       ├── YES → SEV-3
│       └── NO → SEV-2
└── NO → Is it blocking internal operations?
    ├── YES → SEV-3
    └── NO → SEV-4
Triage Checklist¶
When an alert fires, work through this in order:
1. Acknowledge and Assess (0-2 minutes)¶
War story: Google's "The Site Reliability Workbook" documents a case where an on-call engineer spent 45 minutes debugging a false positive while a real P1 went unnoticed under a separate alert. The lesson: acknowledge the alert first, then spend 60 seconds checking whether other alerts are also firing. A cluster of alerts usually means the real problem is upstream.
- Acknowledge the alert (stop escalation timer)
- Read the alert message and linked runbook
- Check: is this a known issue or repeat incident?
- Determine initial severity classification
2. Verify the Signal (2-5 minutes)¶
- Confirm the alert is real (not a monitoring false positive)
- Check multiple data sources: metrics, logs, status page, synthetic checks
- Check for related alerts firing simultaneously
- Determine: is this a symptom or the root cause?
3. Assess Blast Radius (5-10 minutes)¶
Key questions:
- Which services are affected?
- Which regions/zones are affected?
- How many users are impacted?
- Is the issue getting worse, stable, or recovering?
- Are there dependent systems at risk?
- Is there data integrity risk?
4. Communicate (within 10 minutes for SEV-1/2)¶
- Post in the incident channel (create one if needed)
- Update the status page
- Notify stakeholders per severity escalation matrix
- Assign roles: Incident Commander, Communications Lead, Technical Lead
5. Mitigate or Escalate¶
Remember: The "3R" mitigation priority: Rollback (safest — revert the last deploy), Restart (cheap — bounce the service), Rescale (fast — add capacity). Try them in this order before diving into root cause analysis. Most SEV-1s are caused by recent changes, and rollback resolves them in minutes. Finding root cause can wait until the fire is out.
- If you can mitigate: do so (rollback, failover, restart, scale up)
- If you cannot: escalate to the owning team with context gathered so far
- Do not spend more than 15 minutes attempting a fix alone on a SEV-1
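The 3R order from the tip above can be captured in a tiny helper. The `mitigation_cmd` wrapper and the `payment-svc` deployment name are hypothetical, but the `kubectl` subcommands it emits (`rollout undo`, `rollout restart`, `scale`) are real.

```shell
# Hypothetical helper mapping a 3R strategy to the kubectl command that
# implements it. Try strategies in this order: rollback, restart, rescale.
mitigation_cmd() {
  local strategy=$1 deploy=$2 replicas=${3:-1}
  case "$strategy" in
    rollback) echo "kubectl rollout undo deployment/$deploy" ;;
    restart)  echo "kubectl rollout restart deployment/$deploy" ;;
    rescale)  echo "kubectl scale deployment/$deploy --replicas=$replicas" ;;
    *)        echo "unknown strategy: $strategy" >&2; return 1 ;;
  esac
}

# Print the candidate commands in 3R priority order for review before running:
for s in rollback restart rescale; do
  mitigation_cmd "$s" payment-svc 10
done
```

Printing the command rather than executing it directly gives the responder a moment to sanity-check the target deployment before pulling the trigger.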
Blast Radius Assessment¶
Mapping Impact¶
Alert: "Payment service error rate > 5%"
Blast radius check:
├── Payment service → directly affected
├── Checkout flow → depends on payment → affected
├── Order service → depends on checkout → potentially affected
├── Inventory service → independent → not affected
├── User service → independent → not affected
└── Mobile app → calls checkout → affected
Impact: 3 of 5 services affected, checkout flow blocked
Revenue impact: ~$X per minute
Severity: SEV-1
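The dependency walk above can be automated once you maintain a service map. This is a sketch under assumptions: the `DEPS` table mirrors the example services, and the `affected_by` function simply follows "depends on" edges backwards from the failing service (requires bash 4+ for associative arrays).

```shell
# Illustrative service map: each service -> what it depends on,
# mirroring the payment example above.
declare -A DEPS=(
  [checkout]="payment"
  [order]="checkout"
  [mobile-app]="checkout"
  [inventory]=""
  [user]=""
)

# Print every service in the blast radius of the failing service, i.e.
# the service itself plus everything that transitively depends on it.
affected_by() {
  local root=$1 svc dep
  echo "$root"
  for svc in "${!DEPS[@]}"; do
    for dep in ${DEPS[$svc]}; do
      if [ "$dep" = "$root" ]; then
        affected_by "$svc"
      fi
    done
  done
}

affected_by payment | sort   # checkout, mobile-app, order, payment
```

In practice the map would come from a service catalog or tracing data rather than a hand-maintained table, but the traversal logic is the same.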
Tools for Assessment¶
# Check service health endpoints
for svc in payment checkout order inventory; do
  echo "$svc: $(curl -s -o /dev/null -w '%{http_code}' https://$svc.internal/health)"
done

# Check error rates in metrics (PromQL)
# rate(http_requests_total{status=~"5.."}[5m]) / rate(http_requests_total[5m])

# Check recent deployments (common cause)
kubectl rollout history deployment/payment-svc
git log --oneline --since="2 hours ago" -- payment-svc/
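To run the PromQL check above from a terminal, you can send it to Prometheus' query API. Assumptions in this sketch: the `service` label name in your metric schema, the `prometheus:9090` host, and the `error_rate_query` helper name.

```shell
# Build the per-service PromQL error-rate expression (the "service"
# label name is an assumption about your metric schema).
error_rate_query() {
  printf 'rate(http_requests_total{service="%s",status=~"5.."}[5m]) / rate(http_requests_total{service="%s"}[5m])' "$1" "$1"
}

error_rate_query payment

# Then send it to Prometheus' HTTP API (host is an assumption):
# curl -sG http://prometheus:9090/api/v1/query \
#   --data-urlencode "query=$(error_rate_query payment)"
```

`curl -G --data-urlencode` handles the URL-encoding of the PromQL expression, which is easy to get wrong by hand.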
Communication During Incidents¶
Incident Channel Template¶
INCIDENT: [SEV-X] Brief description
STATUS: Investigating | Identified | Monitoring | Resolved
IMPACT: What users/systems are affected
IC: @person (Incident Commander)
LAST UPDATE: HH:MM UTC — what we know now
NEXT UPDATE: HH:MM UTC
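The template can be stamped out by a small function so updates stay structurally consistent under pressure. The `incident_update` function and its argument order are illustrative, not part of any real chat-ops tool.

```shell
# Illustrative formatter for the incident-channel template above.
# Args: severity, description, status, impact, IC handle,
#       last-update time, next-update time, latest findings.
incident_update() {
  local sev=$1 desc=$2 status=$3 impact=$4 ic=$5 now=$6 next=$7 note=$8
  printf 'INCIDENT: [%s] %s\n' "$sev" "$desc"
  printf 'STATUS: %s\n' "$status"
  printf 'IMPACT: %s\n' "$impact"
  printf 'IC: %s (Incident Commander)\n' "$ic"
  printf 'LAST UPDATE: %s UTC -- %s\n' "$now" "$note"
  printf 'NEXT UPDATE: %s UTC\n' "$next"
}

incident_update SEV-1 "Payment errors >5%" Investigating \
  "Checkout blocked for all users" @alice 03:15 03:30 \
  "Correlates with 03:02 deploy; rollback in progress"
```

Always filling in NEXT UPDATE, even when there is nothing new to say, is what prevents the "going silent for 30+ minutes" failure listed below.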
Stakeholder Updates¶
| Audience | What They Need |
|---|---|
| Engineering | Technical details, service graph, logs |
| Support | Customer-facing impact, ETA, workarounds |
| Management | Business impact, severity, timeline |
| Customers | Status page update, honest timeline |
Avoid: blaming individuals, providing unsupportable ETAs, going silent for 30+ minutes on SEV-1, sharing technical details with non-technical stakeholders, and forgetting to update the status page.
Escalation¶
When to Escalate¶
- You have spent 15 minutes and do not understand the failure mode
- The issue is in a system you do not own
- Severity needs to be raised
- Additional expertise is required (DBA, security, network)
- Customer or business impact is growing
Escalation Checklist¶
When escalating, provide:
- What is happening (symptoms, not theories)
- What has been tried so far
- When it started
- Current blast radius
- Links to dashboards, logs, and alerts
Post-Triage¶
Once mitigated:
- Write a timeline of events while memory is fresh
- Keep the incident channel open for follow-up
- Schedule a blameless postmortem within 48 hours for SEV-1/2
- Track action items from the postmortem to completion
- Update runbooks if the existing ones were inadequate
Interview tip: "Tell me about your incident response process" is a common DevOps interview question. A strong answer covers: severity classification, on-call rotation, communication cadence, escalation paths, blameless postmortems, and — critically — that you track postmortem action items to completion. Many teams do postmortems but never follow up on the actions, meaning the same incident recurs.
Common Triage Mistakes¶
- Alert fatigue: Ignoring alerts because most are noise — fix signal-to-noise ratio instead
- Hero culture: One person tries to fix everything alone instead of escalating
- Premature root cause: Declaring root cause before verifying — leads to fixing symptoms
- Tunnel vision: Focusing on one hypothesis and ignoring contradicting evidence
- No communication: Working silently while stakeholders assume nothing is happening
- Skipping verification: Assuming the alert is accurate without confirming from multiple sources
Remember: The triage workflow mnemonic: "AAVCE" — Acknowledge, Assess, Verify, Communicate, Escalate. Work through these steps in order. Skipping to "fix it" before verifying the signal is the most common triage mistake.
Gotcha: The most expensive triage mistake is not escalating a SEV-1 fast enough. A 15-minute delay in escalation can turn a 30-minute outage into a 3-hour outage. The second most expensive: incorrectly downgrading severity because "it's probably fine." When in doubt, treat it as one severity higher than you think.
Wiki Navigation¶
Related Content¶
- Incident Triage Flashcards (CLI) (flashcard_deck, L1) — Incident Triage
- Runbook: CVE Response (Critical Vulnerability) (Runbook, L2) — Incident Triage
- Runbook: Unauthorized Access Investigation (Runbook, L2) — Incident Triage