- devops
- l1
- topic-pack
- incident-triage

Portal | Level: L1: Foundations | Topics: Incident Triage | Domain: DevOps & Tooling
Incident Triage Primer¶
Why This Matters¶
When an alert fires at 3 AM, the first five minutes determine whether you resolve the incident in 30 minutes or 3 hours. Triage is the disciplined process of assessing severity, identifying blast radius, communicating status, and routing to the right responder. Bad triage turns a P3 into a P1 through panic, and lets a real P1 sit unnoticed because nobody assessed impact.
Severity Classification¶
Fun fact: The incident management discipline originated in emergency services (fire, medical). The Incident Command System (ICS) was developed by California firefighters in the 1970s after catastrophic failures in coordinating responses to wildfires. Tech adopted ICS terminology — Incident Commander, Communications Lead — from this lineage. Google's SRE book formalized it for software operations.
Standard Severity Levels¶
| Level | Impact | Examples | Response Time |
|---|---|---|---|
| SEV-1 / P1 | Complete outage, data loss, security breach | Production down, data exfiltration, full service unavailable | Immediate (all hands) |
| SEV-2 / P2 | Major degradation, partial outage | 50% of users affected, key feature broken, significant latency | < 30 minutes |
| SEV-3 / P3 | Minor degradation, workaround exists | One region affected, non-critical feature down | < 4 hours |
| SEV-4 / P4 | Cosmetic, informational | Dashboard error, log noise, minor UI glitch | Next business day |
Classification Decision Tree¶
Is the issue customer-facing?
├── YES → Are >50% of users affected?
│   ├── YES → Is there data loss or security exposure?
│   │   ├── YES → SEV-1
│   │   └── NO → SEV-1 or SEV-2 (based on revenue impact)
│   └── NO → Is there a workaround?
│       ├── YES → SEV-3
│       └── NO → SEV-2
└── NO → Is it blocking internal operations?
    ├── YES → SEV-3
    └── NO → SEV-4
Triage Checklist¶
When an alert fires, work through this in order:
1. Acknowledge and Assess (0-2 minutes)¶
War story: Google's "The Site Reliability Workbook" documents a case where an on-call engineer spent 45 minutes debugging a false positive while a real P1 went unnoticed under a separate alert. The lesson: acknowledge the alert first, then spend 60 seconds checking whether other alerts are also firing. A cluster of alerts usually means the real problem is upstream.
- Acknowledge the alert (stop escalation timer)
- Read the alert message and linked runbook
- Check: is this a known issue or repeat incident?
- Determine initial severity classification
2. Verify the Signal (2-5 minutes)¶
- Confirm the alert is real (not a monitoring false positive)
- Check multiple data sources: metrics, logs, status page, synthetic checks
- Check for related alerts firing simultaneously
- Determine: is this a symptom or the root cause?
3. Assess Blast Radius (5-10 minutes)¶
Key questions:
- Which services are affected?
- Which regions/zones are affected?
- How many users are impacted?
- Is the issue getting worse, stable, or recovering?
- Are there dependent systems at risk?
- Is there data integrity risk?
4. Communicate (within 10 minutes for SEV-1/2)¶
- Post in the incident channel (create one if needed)
- Update the status page
- Notify stakeholders per severity escalation matrix
- Assign roles: Incident Commander, Communications Lead, Technical Lead
5. Mitigate or Escalate¶
Remember: The "3R" mitigation priority: Rollback (safest — revert the last deploy), Restart (cheap — bounce the service), Rescale (fast — add capacity). Try them in this order before diving into root cause analysis. Most SEV-1s are caused by recent changes, and rollback resolves them in minutes. Finding root cause can wait until the fire is out.
- If you can mitigate: do so (rollback, failover, restart, scale up)
- If you cannot: escalate to the owning team with context gathered so far
- Do not spend more than 15 minutes attempting a fix alone on a SEV-1
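The 3R order from the tip above can be captured in a tiny helper. The `mitigation_cmd` wrapper and the `payment-svc` deployment name are hypothetical, but the `kubectl` subcommands it emits (`rollout undo`, `rollout restart`, `scale`) are real.

```shell
# Hypothetical helper mapping a 3R strategy to the kubectl command that
# implements it. Try strategies in this order: rollback, restart, rescale.
mitigation_cmd() {
  local strategy=$1 deploy=$2 replicas=${3:-1}
  case "$strategy" in
    rollback) echo "kubectl rollout undo deployment/$deploy" ;;
    restart)  echo "kubectl rollout restart deployment/$deploy" ;;
    rescale)  echo "kubectl scale deployment/$deploy --replicas=$replicas" ;;
    *)        echo "unknown strategy: $strategy" >&2; return 1 ;;
  esac
}

# Print the candidate commands in 3R priority order for review before running:
for s in rollback restart rescale; do
  mitigation_cmd "$s" payment-svc 10
done
```

Printing the command rather than executing it directly gives the responder a moment to sanity-check the target deployment before pulling the trigger.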
Blast Radius Assessment¶
Mapping Impact¶
Alert: "Payment service error rate > 5%"
Blast radius check:
├── Payment service → directly affected
├── Checkout flow → depends on payment → affected
├── Order service → depends on checkout → potentially affected
├── Inventory service → independent → not affected
├── User service → independent → not affected
└── Mobile app → calls checkout → affected
Impact: 3 of 5 services affected, checkout flow blocked
Revenue impact: ~$X per minute
Severity: SEV-1
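The dependency walk above can be automated once you maintain a service map. This is a sketch under assumptions: the `DEPS` table mirrors the example services, and the `affected_by` function simply follows "depends on" edges backwards from the failing service (requires bash 4+ for associative arrays).

```shell
# Illustrative service map: each service -> what it depends on,
# mirroring the payment example above.
declare -A DEPS=(
  [checkout]="payment"
  [order]="checkout"
  [mobile-app]="checkout"
  [inventory]=""
  [user]=""
)

# Print every service in the blast radius of the failing service, i.e.
# the service itself plus everything that transitively depends on it.
affected_by() {
  local root=$1 svc dep
  echo "$root"
  for svc in "${!DEPS[@]}"; do
    for dep in ${DEPS[$svc]}; do
      if [ "$dep" = "$root" ]; then
        affected_by "$svc"
      fi
    done
  done
}

affected_by payment | sort   # checkout, mobile-app, order, payment
```

In practice the map would come from a service catalog or tracing data rather than a hand-maintained table, but the traversal logic is the same.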
Tools for Assessment¶
# Check service health endpoints
for svc in payment checkout order inventory; do
  echo "$svc: $(curl -s -o /dev/null -w '%{http_code}' https://$svc.internal/health)"
done

# Check error rates in metrics (PromQL)
# rate(http_requests_total{status=~"5.."}[5m]) / rate(http_requests_total[5m])

# Check recent deployments (common cause)
kubectl rollout history deployment/payment-svc
git log --oneline --since="2 hours ago" -- payment-svc/
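To run the PromQL check above from a terminal, you can send it to Prometheus' query API. Assumptions in this sketch: the `service` label name in your metric schema, the `prometheus:9090` host, and the `error_rate_query` helper name.

```shell
# Build the per-service PromQL error-rate expression (the "service"
# label name is an assumption about your metric schema).
error_rate_query() {
  printf 'rate(http_requests_total{service="%s",status=~"5.."}[5m]) / rate(http_requests_total{service="%s"}[5m])' "$1" "$1"
}

error_rate_query payment

# Then send it to Prometheus' HTTP API (host is an assumption):
# curl -sG http://prometheus:9090/api/v1/query \
#   --data-urlencode "query=$(error_rate_query payment)"
```

`curl -G --data-urlencode` handles the URL-encoding of the PromQL expression, which is easy to get wrong by hand.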
Communication During Incidents¶
Incident Channel Template¶
INCIDENT: [SEV-X] Brief description
STATUS: Investigating | Identified | Monitoring | Resolved
IMPACT: What users/systems are affected
IC: @person (Incident Commander)
LAST UPDATE: HH:MM UTC — what we know now
NEXT UPDATE: HH:MM UTC
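The template can be stamped out by a small function so updates stay structurally consistent under pressure. The `incident_update` function and its argument order are illustrative, not part of any real chat-ops tool.

```shell
# Illustrative formatter for the incident-channel template above.
# Args: severity, description, status, impact, IC handle,
#       last-update time, next-update time, latest findings.
incident_update() {
  local sev=$1 desc=$2 status=$3 impact=$4 ic=$5 now=$6 next=$7 note=$8
  printf 'INCIDENT: [%s] %s\n' "$sev" "$desc"
  printf 'STATUS: %s\n' "$status"
  printf 'IMPACT: %s\n' "$impact"
  printf 'IC: %s (Incident Commander)\n' "$ic"
  printf 'LAST UPDATE: %s UTC -- %s\n' "$now" "$note"
  printf 'NEXT UPDATE: %s UTC\n' "$next"
}

incident_update SEV-1 "Payment errors >5%" Investigating \
  "Checkout blocked for all users" @alice 03:15 03:30 \
  "Correlates with 03:02 deploy; rollback in progress"
```

Always filling in NEXT UPDATE, even when there is nothing new to say, is what prevents the "going silent for 30+ minutes" failure listed below.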
Stakeholder Updates¶
| Audience | What They Need |
|---|---|
| Engineering | Technical details, service graph, logs |
| Support | Customer-facing impact, ETA, workarounds |
| Management | Business impact, severity, timeline |
| Customers | Status page update, honest timeline |
Avoid: blaming individuals, providing unsupportable ETAs, going silent for 30+ minutes on SEV-1, sharing technical details with non-technical stakeholders, and forgetting to update the status page.
Escalation¶
When to Escalate¶
- You have spent 15 minutes and do not understand the failure mode
- The issue is in a system you do not own
- Severity needs to be raised
- Additional expertise is required (DBA, security, network)
- Customer or business impact is growing
Escalation Checklist¶
When escalating, provide:
- What is happening (symptoms, not theories)
- What has been tried so far
- When it started
- Current blast radius
- Links to dashboards, logs, and alerts
Post-Triage¶
Once mitigated:
- Write a timeline of events while memory is fresh
- Keep the incident channel open for follow-up
- Schedule a blameless postmortem within 48 hours for SEV-1/2
- Track action items from the postmortem to completion
- Update runbooks if the existing ones were inadequate
Interview tip: "Tell me about your incident response process" is a common DevOps interview question. A strong answer covers: severity classification, on-call rotation, communication cadence, escalation paths, blameless postmortems, and — critically — that you track postmortem action items to completion. Many teams do postmortems but never follow up on the actions, meaning the same incident recurs.
Common Triage Mistakes¶
- Alert fatigue: Ignoring alerts because most are noise — fix signal-to-noise ratio instead
- Hero culture: One person tries to fix everything alone instead of escalating
- Premature root cause: Declaring root cause before verifying — leads to fixing symptoms
- Tunnel vision: Focusing on one hypothesis and ignoring contradicting evidence
- No communication: Working silently while stakeholders assume nothing is happening
- Skipping verification: Assuming the alert is accurate without confirming from multiple sources
Remember: The triage workflow mnemonic: "AAVCE" — Acknowledge, Assess, Verify, Communicate, Escalate. Work through these steps in order. Skipping to "fix it" before verifying the signal is the most common triage mistake.
Gotcha: The most expensive triage mistake is not escalating a SEV-1 fast enough. A 15-minute delay in escalation can turn a 30-minute outage into a 3-hour outage. The second most expensive: incorrectly downgrading severity because "it's probably fine." When in doubt, treat it as one severity higher than you think.
Wiki Navigation¶
Related Content¶
- Incident Triage Flashcards (CLI) (flashcard_deck, L1) — Incident Triage
- Runbook: CVE Response (Critical Vulnerability) (Runbook, L2) — Incident Triage
- Runbook: Unauthorized Access Investigation (Runbook, L2) — Incident Triage