Portal | Level: L2: Operations | Topics: Incident Response | Domain: DevOps & Tooling

Track: Incident Response

Incidents, forensics, runbooks, postmortems, interview scenarios.

Goals

Respond to production incidents systematically (detect, triage, mitigate, resolve)
Use runbooks for structured troubleshooting
Capture forensic evidence during incidents
Practice time-boxed incident challenges
Prepare for SRE/DevOps interview scenarios
Apply chaos engineering safely
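
The four response phases named in the first goal (detect, triage, mitigate, resolve) can be pictured as a strict sequence. The sketch below is illustrative only and is not part of the training repo: the names Phase, Incident, and complete_phase are hypothetical. It records one note per phase, so the resulting timeline could later feed a postmortem.

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import List, Tuple


class Phase(Enum):
    """Response phases in the order the track's goals list them."""
    DETECT = 1
    TRIAGE = 2
    MITIGATE = 3
    RESOLVE = 4


@dataclass
class Incident:
    """Hypothetical incident record; not an API from the training repo."""
    title: str
    phase: Phase = Phase.DETECT
    closed: bool = False
    # (phase name, note) pairs in the order they were recorded
    timeline: List[Tuple[str, str]] = field(default_factory=list)

    def complete_phase(self, note: str) -> None:
        """Record a note for the current phase, then move to the next one."""
        if self.closed:
            raise ValueError("incident already resolved")
        self.timeline.append((self.phase.name, note))
        if self.phase is Phase.RESOLVE:
            self.closed = True
        else:
            self.phase = Phase(self.phase.value + 1)


if __name__ == "__main__":
    incident = Incident("checkout returns 5xx")
    incident.complete_phase("alert fired on error rate")
    incident.complete_phase("scoped to a single deployment")
    incident.complete_phase("rolled back the last release")
    incident.complete_phase("root cause fixed and verified")
    for phase_name, note in incident.timeline:
        print(f"{phase_name}: {note}")
```

Forcing the phases to run in order is a deliberate simplification. Real incidents often loop back from mitigation to triage, which this sketch does not model.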

Prerequisites

Concepts: kubernetes, helm_upgrade, prometheus, loki_logging, readiness_probe, resource_limits
All previous tracks completed (or equivalent experience)
make deploy-all completed

Read: training/library/runbooks/crashloopbackoff.md — study runbook format
Read: training/library/runbooks/kubernetes/readiness_probe_failed.md — probe troubleshooting
Study: All 15 failure patterns (FP-001 through FP-015) — review runbooks and incident scenarios
Practice: make incident YES=1 — inject random incident
Practice: make investigate — follow guided investigation
Practice: make hint — use progressive hints if stuck
Practice: make incident-resolve — mark resolved
Practice: make challenge YES=1 MINUTES=10 — time-boxed challenge
Practice: make incident-forensics — capture evidence bundle
Run: 2-3 chaos scripts from training/interactive/chaos/scripts/ — fault injection
Interview: training/library/interview-scenarios/01-deployment-stuck-progressing.md
Interview: training/library/interview-scenarios/05-helm-upgrade-broke-prod.md
Interview: training/library/interview-scenarios/08-pods-oomkilled.md
Interview: Work through remaining training/library/interview-scenarios/
Study: training/knowledge_architecture/commands/kubectl_debugging_flow.md — master the debugging decision tree
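
As a rough picture of what a debugging decision tree can look like, the sketch below maps a pod status reason to the kubectl commands one might try first. It is an assumption-laden illustration, not the content of kubectl_debugging_flow.md: the function name next_steps and the chosen mapping are hypothetical, and the function only builds command strings without running them.

```python
from typing import Dict, List


def next_steps(reason: str, pod: str, namespace: str = "default") -> List[str]:
    """Suggest first-look kubectl commands for a pod status reason.

    Hypothetical helper: the mapping is one plausible starting point,
    not the decision tree from the training material.
    """
    target = f"{pod} -n {namespace}"
    describe = f"kubectl describe pod {target}"
    events = f"kubectl get events -n {namespace} --sort-by=.lastTimestamp"

    steps: Dict[str, List[str]] = {
        # Container keeps restarting: read the logs of the previous run.
        "CrashLoopBackOff": [f"kubectl logs {target} --previous", describe],
        # Image cannot be pulled: the pod events usually name the cause.
        "ImagePullBackOff": [describe],
        # Pod not scheduled: check pod events, then namespace events.
        "Pending": [describe, events],
        # Memory limit exceeded: compare limits with current usage.
        # Note: kubectl top requires metrics-server in the cluster.
        "OOMKilled": [describe, f"kubectl top pod {target}"],
    }
    # Unknown reason: fall back to describe plus current logs.
    return steps.get(reason, [describe, f"kubectl logs {target}"])


if __name__ == "__main__":
    for command in next_steps("CrashLoopBackOff", "web-0", "prod"):
        print(command)
```

Returning strings instead of executing them keeps the sketch safe to run anywhere and makes the suggested order easy to review before touching a cluster.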

Complete all 18 incident scenarios (see make incident-list)
training/interactive/investigation/ — full guided investigation engine
training/interactive/knowledge/data/cards/chaos-engineering.tsv — chaos flashcards
make scoreboard — track your resolution times

Change Management (Topic Pack, L1) — Incident Response
Chaos Engineering Scripts (CLI) (Exercise Set, L2) — Incident Response
Debugging Methodology (Topic Pack, L1) — Incident Response
Incident Command & On-Call (Topic Pack, L2) — Incident Response
Incident Response Flashcards (CLI) (Flashcard Deck, L1) — Incident Response
Incident Simulator (18 scenarios) (CLI) (Exercise Set, L2) — Incident Response
Investigation Engine (CLI) (Exercise Set, L2) — Incident Response
Ops War Stories & Pattern Recognition (Topic Pack, L2) — Incident Response
Postmortems & SLOs (Topic Pack, L2) — Incident Response
Runbook Craft (Topic Pack, L1) — Incident Response