
Incident Triage


20 cards — 🟢 4 easy | 🟡 10 medium | 🔴 6 hard

🟢 Easy (4)

1. What are the four standard incident severity levels and their meanings?

Answer: SEV-1: complete outage or data loss (immediate response). SEV-2: major degradation or partial outage (< 30 min response). SEV-3: minor degradation with workaround (< 4 hours). SEV-4: cosmetic or informational (next business day).

2. What is the first thing you should do when an alert fires?

Answer: Acknowledge the alert to stop the escalation timer, then read the alert message and any linked runbook. Check if it is a known issue or repeat incident before diving into diagnosis.

3. What information should an incident channel post contain?

Answer: Severity level, brief description, current status (Investigating/Identified/Monitoring/Resolved), impact summary, Incident Commander name, last update timestamp, and next update time.

Remember: Mnemonic SSCINL — Severity, Status, Commander, Impact, Next update, Link. Six elements for every incident post.

Gotcha: Missing the next-update-time is the #1 communication failure. Silence breeds panic.
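The six SSCINL elements above can be assembled mechanically. A minimal sketch (function name and layout are illustrative, not tied to any specific chat tool):

```python
from datetime import datetime, timezone

def format_incident_post(severity, description, status, impact,
                         commander, next_update_minutes=30):
    """Build an incident-channel post containing all six SSCINL elements.

    Including next_update_minutes by default guards against the #1
    communication failure: a post with no next-update time.
    """
    now = datetime.now(timezone.utc)
    return (
        f"[{severity}] {description}\n"
        f"Status: {status}\n"
        f"Impact: {impact}\n"
        f"IC: {commander}\n"
        f"Last update: {now:%H:%M} UTC\n"
        f"Next update in {next_update_minutes} min"
    )

post = format_incident_post(
    "SEV-2", "Elevated 5xx on checkout API", "Investigating",
    "~10% of checkout requests failing in us-east", "Alice")
print(post)
```

Making the next-update field a default parameter rather than an optional argument means a post simply cannot be sent without one.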

4. What makes a runbook effective during an incident versus one that gets ignored?

Answer: Effective runbooks have: a clear trigger condition (when to use this), step-by-step commands (copy-pasteable), expected output for each step, escalation contacts with current names, and a last-verified date. Runbooks older than 90 days without verification are unreliable.
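The 90-day rule above can be checked mechanically, e.g. in a CI job over runbook metadata. A sketch, assuming each runbook records a last-verified date (the metadata format is an assumption, not a standard):

```python
from datetime import date, timedelta

STALE_AFTER = timedelta(days=90)  # per the 90-day rule above

def is_stale(last_verified: date, today=None) -> bool:
    """A runbook not verified within 90 days is treated as unreliable."""
    today = today or date.today()
    return today - last_verified > STALE_AFTER

# Example: verified 120 days ago -> stale; 30 days ago -> still trusted.
today = date(2024, 6, 1)
print(is_stale(date(2024, 2, 2), today))  # 120 days ago -> True
print(is_stale(date(2024, 5, 2), today))  # 30 days ago -> False
```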

🟡 Medium (10)

1. What questions should you ask when assessing blast radius?

Answer: Which services are affected? Which regions/zones? How many users impacted? Is it getting worse, stable, or recovering? Are dependent systems at risk? Is there data integrity risk?

Remember: Blast radius assessment = scope the damage before fixing. Mnemonic SWIG-D: Services, Width (regions), Impact (users), Getting worse?, Data integrity.

Gotcha: Skip blast radius assessment and you may fix one symptom while the real problem spreads.

2. When should you escalate an incident instead of continuing to troubleshoot alone?

Answer: Escalate when: you have spent 15 minutes without understanding the failure mode, the issue is in a system you do not own, severity needs to be raised, additional expertise is required, or customer impact is growing.

3. What five pieces of context should you provide when escalating an incident?

Answer: 1) What is happening (symptoms, not theories). 2) What has been tried so far. 3) When it started. 4) Current blast radius. 5) Links to dashboards, logs, and alerts.

Remember: Escalation context mnemonic WWTBL: What happened, What tried, Time started, Blast radius, Links to dashboards.

Gotcha: Never escalate with just "it is broken." Context saves the next responder 15+ minutes of re-discovery.
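A sketch of an escalation message carrying all five WWTBL elements (function name and field layout are illustrative):

```python
def escalation_message(symptoms, tried, started_utc, blast_radius, links):
    """Render the five WWTBL context elements for the next responder.

    Symptoms, not theories: describe what is observed, not a suspected cause.
    """
    return "\n".join([
        f"What is happening: {symptoms}",
        f"Tried so far: {'; '.join(tried)}",
        f"Started at: {started_utc} UTC",
        f"Blast radius: {blast_radius}",
        "Links: " + " ".join(links),
    ])

msg = escalation_message(
    symptoms="p99 latency 10x baseline on payments-api",
    tried=["restarted pods", "checked recent deploys (none)"],
    started_utc="14:05",
    blast_radius="us-east only, error rate stable",
    links=["<dashboard-url>", "<logs-url>"])
print(msg)
```

Forcing every field through the function signature means an escalation cannot be sent with context missing.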

4. Why should you verify the alert signal before mobilizing a full response?

Answer: Monitoring can produce false positives from misconfigured thresholds, flapping metrics, or stale checks. Confirming from multiple data sources (metrics, logs, synthetic checks) prevents wasting time and team energy on phantom incidents.

5. When and how should the Incident Commander role be transferred during a long-running incident?

Answer: Transfer IC when the current IC is fatigued (2+ hours), when a shift boundary is reached, or when a subject-matter expert should lead. The outgoing IC briefs the incoming IC on status, timeline, blast radius, and next actions, then announces the handoff in the incident channel.

6. What should an incident timeline capture, and when should you start writing it?

Answer: Start the timeline immediately when the incident is declared. Record: alert fire time, first responder actions, each escalation, key diagnostic findings, mitigation attempts (successful and failed), and resolution confirmation. Timestamps should be in UTC.
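The timeline above can be kept as append-only entries stamped in UTC. A minimal sketch:

```python
from datetime import datetime, timezone

timeline = []  # append-only list of (timestamp, event) pairs

def log_event(event: str) -> None:
    """Record an event with a UTC timestamp, per the guidance above."""
    ts = datetime.now(timezone.utc).strftime("%Y-%m-%d %H:%M:%S UTC")
    timeline.append((ts, event))

log_event("Alert fired: checkout-5xx-rate")
log_event("Escalated to payments on-call")
log_event("Mitigation: rolled back latest deploy")

for ts, event in timeline:
    print(ts, "-", event)
```

Stamping entries at write time, in UTC, avoids the post-incident scramble to reconstruct who did what when across time zones.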

7. What are practical blast radius containment techniques during an active incident?

Answer: Disable the feature flag that triggered the issue, shift traffic away from the affected region (DNS or load balancer), scale down the offending service, block the problematic endpoint at the gateway, or isolate the affected database replica. Goal: stop the bleeding before diagnosing root cause.

8. What are the typical severity levels in an incident management system?

Answer: SEV-1 (critical, customer-facing outage), SEV-2 (major degradation, partial outage), SEV-3 (minor issue, limited impact), SEV-4/SEV-5 (informational, cosmetic). Each level triggers different response expectations.

Remember: SEV-1 = all-hands, SEV-2 = team response, SEV-3 = next sprint, SEV-4/5 = backlog. Response-time expectations relax with each level.

Gotcha: Under-declaring severity delays response. Over-declaring causes alert fatigue. When in doubt, declare higher and downgrade.
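The level-to-expectation mapping can be written down as a simple lookup. A sketch using the targets from this deck (exact SLAs vary by organization):

```python
# Response expectations per severity, mirroring the deck's cards.
# Exact targets vary by organization.
RESPONSE_TARGET = {
    "SEV1": "immediate, all-hands",
    "SEV2": "< 30 min, owning team",
    "SEV3": "< 4 hours, workaround exists",
    "SEV4": "next business day, backlog",
}

def response_for(severity: str) -> str:
    # Accept both "SEV-1" and "SEV1" spellings.
    # When in doubt, declare higher and downgrade later.
    return RESPONSE_TARGET.get(severity.replace("-", ""), "unknown severity")

print(response_for("SEV-1"))
print(response_for("SEV3"))
```

Encoding the table once, next to the paging config, keeps responders from debating response targets mid-incident.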

9. Why should you declare an incident early rather than waiting to confirm the issue?

Answer: Declaring early enables coordination and visibility before the problem grows. False alarms are cheap; delayed response to real incidents is expensive. "Declare first, investigate second" reduces MTTR.

Remember: Declare first, investigate second. The cost of a false alarm is low; the cost of a delayed response is high.

Analogy: Like pulling a fire alarm — better to evacuate for a false alarm than to delay during a real fire.

10. What makes a runbook useful during incident triage?

Answer: A good runbook provides: symptom-to-action mapping, diagnostic commands to run, escalation paths, and known fixes. It reduces cognitive load and ensures consistent response regardless of who is on-call.

Remember: Good runbook = symptom-to-action mapping. Bad runbook = 50-page document nobody reads during a 3 AM outage.

Gotcha: Untested runbooks are worse than no runbook — they give false confidence. Verify quarterly.

🔴 Hard (6)

1. What is "premature root cause" and why is it dangerous during triage?

Answer: Declaring root cause before verifying it leads to fixing symptoms while the real problem continues. It creates tunnel vision where contradicting evidence is ignored, potentially extending the incident.

Analogy: Like a doctor diagnosing chest pain as heartburn before running an EKG — premature diagnosis can be fatal.

Remember: Correlation is not causation — the deploy happened before the outage, but that does not mean it caused it.

2. What are the key communication anti-patterns during an incident?

Answer: Blaming individuals during the incident, giving ETAs you cannot support, going silent for more than 30 minutes on a SEV-1, drowning non-technical stakeholders in raw technical detail, and forgetting to update the status page.

3. What should happen after an incident is mitigated but before it is closed?

Answer: Write a timeline while memory is fresh, keep the incident channel open for follow-up, schedule a blameless postmortem within 48 hours for SEV-1/2, track action items to completion, and update runbooks if they were inadequate.

4. How do you decide between rolling back a change and pushing a forward-fix during an incident?

Answer: Roll back when: the failing change is identified with high confidence, rollback is tested and fast (< 5 min), and data integrity is not at risk. Forward-fix when: rollback is impossible (schema migration, data change), the fix is small and well-understood, or rollback would cause equal or worse impact.
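The criteria above can be sketched as a decision function. The boolean inputs are judgment calls the IC makes, not automatable signals:

```python
def should_roll_back(change_identified: bool,
                     rollback_fast_and_tested: bool,
                     data_integrity_safe: bool,
                     rollback_possible: bool) -> bool:
    """True if rollback is preferred over a forward-fix.

    Mirrors the criteria above: roll back only when the failing change
    is known, the rollback path is tested and fast, data is safe, and a
    rollback is possible at all (schema migrations often are not).
    """
    return (rollback_possible and change_identified
            and rollback_fast_and_tested and data_integrity_safe)

# A schema migration broke reads: rollback impossible -> forward-fix.
print(should_roll_back(True, True, True, rollback_possible=False))  # False
# Bad config deploy, tested rollback, no data risk -> roll back.
print(should_roll_back(True, True, True, rollback_possible=True))   # True
```

Note the conjunction: a single failed criterion (e.g. an untested rollback path) is enough to tip the decision toward a forward-fix.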

5. What should customer-facing communication include during a SEV-1 incident?

Answer: Acknowledge the issue without speculating on cause, state the known impact, provide an estimated next-update time (not ETA to resolution), use plain language without internal jargon, and update at regular intervals (every 30 min for SEV-1) even if status has not changed.

6. How do you assess blast radius during the first 5 minutes of an incident?

Answer: Check: which services are affected (dependency graph), which customers are impacted (traffic/error dashboards), is the issue spreading (error rate trend), and what changed recently (deploy log, config changes).

Remember: First 5 minutes: dependency graph + error dashboards + deploy log + error trend. Four data sources in parallel.

Gotcha: "Is it spreading?" is the most critical question. A growing blast radius means containment is priority #1.
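The four parallel checks above can be kept as a literal checklist. A sketch; the data-source names are placeholders for your own monitoring stack:

```python
# The four data sources to consult in parallel during the first 5 minutes.
# Source names are placeholders; wire each to your own dashboards.
FIRST_FIVE_MINUTES = {
    "Which services are affected?": "dependency graph",
    "Which customers are impacted?": "traffic/error dashboards",
    "Is it spreading?": "error rate trend",
    "What changed recently?": "deploy log + config changes",
}

def containment_priority(spreading: bool) -> str:
    # A growing blast radius makes containment, not diagnosis, priority #1.
    return "contain first" if spreading else "diagnose"

for question, source in FIRST_FIVE_MINUTES.items():
    print(f"{question} -> {source}")
print(containment_priority(spreading=True))
```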