Incident Response

12 cards: 🟢 4 easy | 🟡 5 medium | 🔴 3 hard

🟢 Easy (4)

1. What is the Incident Commander (IC) role?

The IC owns the incident lifecycle: declares severity, coordinates responders, manages communication, and decides when to escalate or resolve. The IC does NOT debug — they facilitate. They run the war room, keep a timeline, and ensure status updates go out on schedule.

2. What should an initial incident notification include?

Severity level, affected service(s), user impact summary, current status (Investigating/Identified/Monitoring/Resolved), who is responding, next update ETA, and a link to the incident channel or bridge call. Over-communicate early — silence breeds confusion.
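
As a rough illustration of the fields listed above, the sketch below bundles them into one structure and renders a short message. The class name `IncidentNotification`, its field names, and the message layout are hypothetical choices for this example, not a standard format.

```python
from dataclasses import dataclass
from datetime import datetime, timezone

# Status values taken from the card above.
STATUSES = ("Investigating", "Identified", "Monitoring", "Resolved")


@dataclass
class IncidentNotification:
    """Hypothetical container for the fields an initial notification needs."""
    severity: str
    services: list[str]
    impact: str
    status: str
    responders: list[str]
    next_update: datetime
    channel_url: str

    def render(self) -> str:
        # Reject statuses outside the agreed vocabulary so updates stay consistent.
        if self.status not in STATUSES:
            raise ValueError(f"unknown status: {self.status}")
        return "\n".join([
            f"[{self.severity}] {', '.join(self.services)}: {self.status}",
            f"Impact: {self.impact}",
            f"Responders: {', '.join(self.responders)}",
            f"Next update: {self.next_update:%H:%M} UTC",
            f"Channel: {self.channel_url}",
        ])
```

Making every field required means a notification cannot be built with, say, the next update time missing, which is one simple way to enforce the checklist.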

3. What is a runbook and what should it contain?

A runbook is a step-by-step guide for handling a specific alert or operational task. It should include: alert description, impact assessment, diagnostic commands, remediation steps, escalation path, and verification steps. Keep runbooks in version control, link them from alerts, and update after every incident where the runbook was insufficient.

4. What is the purpose of a status page during an incident?

A status page (e.g., Statuspage.io, Cachet) communicates outage status to customers and internal stakeholders, reducing support ticket volume and building trust through transparency. Update it at every severity change and at regular intervals. Include: affected components, current status, ETA if known, and workarounds.

🟡 Medium (5)

1. When should you roll back vs fix forward?

Roll back when the cause is a recent deploy, rollback is safe and fast, and no irreversible data migrations have run. Fix forward when rollback would cause data loss, the fix is small and well understood, or the issue predates the last deploy. If unsure, default to rollback, because speed matters in outages.
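
The decision above can be sketched as a small function. The function name, its parameters, and the order in which the conditions are checked are assumptions made for this example; the card itself does not rank the criteria, so treat this as one possible reading.

```python
def choose_remediation(caused_by_recent_deploy: bool,
                       rollback_safe_and_fast: bool,
                       irreversible_migration_ran: bool = False,
                       rollback_loses_data: bool = False,
                       fix_small_and_understood: bool = False) -> str:
    """Return "rollback" or "fix_forward" (hypothetical helper)."""
    # A rollback that destroys data or cannot undo a migration is off the table.
    if rollback_loses_data or irreversible_migration_ran:
        return "fix_forward"
    # If the issue predates the last deploy, rolling back will not help.
    if not caused_by_recent_deploy:
        return "fix_forward"
    if rollback_safe_and_fast:
        return "rollback"
    if fix_small_and_understood:
        return "fix_forward"
    # Default when unsure: roll back.
    return "rollback"
```

For example, `choose_remediation(True, True)` picks a rollback, while setting `irreversible_migration_ran=True` switches the answer to fixing forward.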

2. What are the first three things to check during a production outage?

1) Monitoring dashboards: scope the blast radius (total vs partial, which regions/services).
2) Recent changes: check deploy logs, config changes, infra changes in the last hour.
3) Cluster/infra health: node status, pod health, external dependency status.

Do NOT start fixing until you understand the scope.

3. How do you determine incident severity?

SEV-1: complete outage or data loss/breach. SEV-2: major degradation, significant user impact. SEV-3: minor degradation with workaround available. SEV-4: cosmetic or low-impact. Key factors: number of users affected, revenue impact, data integrity risk, and whether a workaround exists.
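
A minimal sketch of the severity ladder above, assuming simple boolean inputs. The function and parameter names are hypothetical, and the handling of a minor degradation with no workaround (bumped to SEV-2 here) is an assumption, since the card does not cover that case.

```python
def classify_severity(*,
                      complete_outage: bool = False,
                      data_loss_or_breach: bool = False,
                      major_degradation: bool = False,
                      minor_degradation: bool = False,
                      workaround_available: bool = True) -> str:
    """Map incident facts to a SEV level (hypothetical helper)."""
    if complete_outage or data_loss_or_breach:
        return "SEV-1"
    if major_degradation:
        return "SEV-2"
    if minor_degradation:
        # Assumption: without a workaround, a minor degradation is treated as SEV-2.
        return "SEV-3" if workaround_available else "SEV-2"
    # Cosmetic or low-impact issues.
    return "SEV-4"
```

Real severity matrices usually also weigh the number of affected users and revenue impact, which this sketch leaves out for brevity.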

4. When should you escalate an incident?

Escalate when: you've been troubleshooting for 15+ minutes without progress, the issue is outside your domain expertise, severity is increasing, customer/business impact is growing, or you need access/permissions you don't have. Escalating early is not failure — it is responsible incident management.
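
The escalation triggers above can be written as a checklist function that returns every reason that currently applies. The function name, parameter names, and reason strings are hypothetical; the 15-minute limit comes from the card and is best treated as a tunable default.

```python
from datetime import timedelta

# 15 minutes comes from the card; adjust to match your own policy.
NO_PROGRESS_LIMIT = timedelta(minutes=15)


def escalation_reasons(time_without_progress: timedelta,
                       outside_expertise: bool = False,
                       severity_increasing: bool = False,
                       impact_growing: bool = False,
                       missing_access: bool = False) -> list[str]:
    """Return the triggers that apply; escalate if the list is non-empty."""
    reasons = []
    if time_without_progress >= NO_PROGRESS_LIMIT:
        reasons.append("no progress for 15+ minutes")
    if outside_expertise:
        reasons.append("outside domain expertise")
    if severity_increasing:
        reasons.append("severity increasing")
    if impact_growing:
        reasons.append("customer/business impact growing")
    if missing_access:
        reasons.append("missing access or permissions")
    return reasons
```

Returning the reasons rather than a bare boolean gives the responder text to paste straight into the escalation message.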

5. Why is maintaining an incident timeline critical?

The timeline captures what happened, when, and what actions were taken. During the incident, it prevents duplicate work and keeps new responders oriented. After the incident, it is the foundation for the postmortem and helps identify delays in detection, response, and resolution. Use a shared doc or incident tool — never rely on memory.
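
To make the idea concrete, here is a small sketch of an append-only timeline that can also report the delay between two milestones, such as start and detection. The `Timeline` and `TimelineEntry` classes and the label names are hypothetical; a real team would more likely use a shared doc or incident tool, as the card suggests.

```python
from dataclasses import dataclass, field
from datetime import datetime, timedelta, timezone
from typing import Optional


@dataclass(frozen=True)
class TimelineEntry:
    at: datetime
    author: str
    note: str
    label: Optional[str] = None  # optional milestone tag, e.g. "detected"


@dataclass
class Timeline:
    entries: list[TimelineEntry] = field(default_factory=list)

    def record(self, author: str, note: str,
               label: Optional[str] = None,
               at: Optional[datetime] = None) -> TimelineEntry:
        # Default to the current UTC time so entries are comparable across responders.
        entry = TimelineEntry(at or datetime.now(timezone.utc), author, note, label)
        self.entries.append(entry)
        return entry

    def elapsed(self, start_label: str, end_label: str) -> timedelta:
        # Look up milestone timestamps by label; raises KeyError if one is missing.
        by_label = {e.label: e.at for e in self.entries if e.label}
        return by_label[end_label] - by_label[start_label]
```

With entries labeled `"started"`, `"detected"`, and `"resolved"`, calls like `timeline.elapsed("started", "detected")` give the detection delay that a postmortem would examine.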

🔴 Hard (3)

1. What makes a good blameless postmortem?

Focus on systemic causes, not individual mistakes. Include: timeline of events, what went well, what went poorly, root cause analysis (5 Whys), action items with owners and deadlines. Share widely. The goal is learning and prevention, not punishment. Track action item completion.

2. How do you manage multiple simultaneous incidents?

Assign a separate Incident Commander to each incident. Determine whether the incidents are related (common root cause). If related, merge them into one incident with increased severity. If independent, ensure no responder is overloaded by working several incidents at once. Prioritize by severity: SEV-1 gets resources first.
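
The merge-then-prioritize logic above might be sketched as follows. The `Incident` class, the `triage` function, and the use of a shared `suspected_cause` string to detect related incidents are all hypothetical. Interpreting "increased severity" as one level above the most severe member is also an assumption.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Incident:
    ids: list[str]
    severity: int  # 1 = SEV-1 (most severe) ... 4 = SEV-4
    suspected_cause: Optional[str] = None


def triage(incidents: list[Incident]) -> list[Incident]:
    """Merge incidents sharing a suspected cause, then sort most severe first."""
    groups: dict[str, list[Incident]] = {}
    result: list[Incident] = []
    for inc in incidents:
        if inc.suspected_cause is None:
            result.append(inc)  # no known cause: keep it independent
        else:
            groups.setdefault(inc.suspected_cause, []).append(inc)
    for cause, members in groups.items():
        if len(members) == 1:
            result.append(members[0])
            continue
        worst = min(m.severity for m in members)
        result.append(Incident(
            ids=[i for m in members for i in m.ids],
            # Assumption: raise severity one level, but never above SEV-1.
            severity=max(1, worst - 1),
            suspected_cause=cause,
        ))
    return sorted(result, key=lambda inc: inc.severity)
```

The sorted output gives the order in which responders and other resources would be assigned, with each remaining incident still getting its own Incident Commander.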

3. How does chaos engineering improve incident response?

Chaos engineering (e.g., GameDays, Chaos Monkey) intentionally injects failures under controlled conditions to test detection, alerting, runbooks, and team response. It reveals gaps before real incidents do: missing alerts, unclear runbooks, slow escalation paths, and single points of failure. Run exercises regularly and track improvements.