Portal | Level: L2: Operations | Topics: Incident Response | Domain: DevOps & Tooling
Track: Incident Response
Incidents, forensics, runbooks, postmortems, interview scenarios.
Goals
- Respond to production incidents systematically (detect, triage, mitigate, resolve)
- Use runbooks for structured troubleshooting
- Capture forensic evidence during incidents
- Practice time-boxed incident challenges
- Prepare for SRE/DevOps interview scenarios
- Apply chaos engineering safely
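As a rough illustration of the evidence-capture goal above, the sketch below saves the output of a few standard kubectl commands into text files before a failing pod is restarted. It is not a description of what `make incident-forensics` does internally, since that target's behavior is not documented on this page. The function names, the choice of commands, and the file layout are all assumptions made for illustration.

```python
import subprocess
from datetime import datetime, timezone
from pathlib import Path
from typing import Callable, Dict, List, Optional


def evidence_commands(namespace: str, pod: str) -> Dict[str, List[str]]:
    """Map a short label to a kubectl invocation worth saving (hypothetical selection)."""
    return {
        "pods": ["kubectl", "get", "pods", "-n", namespace, "-o", "wide"],
        "describe": ["kubectl", "describe", "pod", pod, "-n", namespace],
        # --previous returns logs from the last terminated container, if any.
        "previous-logs": ["kubectl", "logs", pod, "-n", namespace, "--previous"],
        "events": ["kubectl", "get", "events", "-n", namespace,
                   "--sort-by=.lastTimestamp"],
    }


def run_command(cmd: List[str]) -> str:
    """Run a command and return stdout plus stderr; a failing command is still evidence."""
    result = subprocess.run(cmd, capture_output=True, text=True, check=False)
    return result.stdout + result.stderr


def capture(namespace: str, pod: str, out_dir: Path,
            run: Optional[Callable[[List[str]], str]] = None) -> List[Path]:
    """Write one text file per command, each headed by the command and a UTC timestamp."""
    run = run or run_command
    out_dir.mkdir(parents=True, exist_ok=True)
    written = []
    for label, cmd in evidence_commands(namespace, pod).items():
        stamp = datetime.now(timezone.utc).isoformat()
        path = out_dir / f"{label}.txt"
        path.write_text(f"# {' '.join(cmd)}\n# captured {stamp}\n{run(cmd)}")
        written.append(path)
    return written
```

Against a live cluster, `capture("demo", "api-0", Path("evidence"))` would write four files (the namespace and pod name here are placeholders). Passing a stub function as `run` lets you try the layout without kubectl installed.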
Prerequisites
- Concepts: kubernetes, helm_upgrade, prometheus, loki_logging, readiness_probe, resource_limits
- All previous tracks completed (or equivalent experience)
- make deploy-all completed
Primary Path (15 steps)
- Read: training/library/runbooks/crashloopbackoff.md — study runbook format
- Read: training/library/runbooks/kubernetes/readiness_probe_failed.md — probe troubleshooting
- Study: All 15 failure patterns (FP-001 through FP-015) — review runbooks and incident scenarios
- Practice: make incident YES=1 — inject random incident
- Practice: make investigate — follow guided investigation
- Practice: make hint — use progressive hints if stuck
- Practice: make incident-resolve — mark resolved
- Practice: make challenge YES=1 MINUTES=10 — time-boxed challenge
- Practice: make incident-forensics — capture evidence bundle
- Run: 2-3 chaos scripts from training/interactive/chaos/scripts/ — fault injection
- Interview: training/library/interview-scenarios/01-deployment-stuck-progressing.md
- Interview: training/library/interview-scenarios/05-helm-upgrade-broke-prod.md
- Interview: training/library/interview-scenarios/08-pods-oomkilled.md
- Interview: Work through remaining training/library/interview-scenarios/
- Study: training/knowledge_architecture/commands/kubectl_debugging_flow.md — master the debugging decision tree
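The core practice loop in the steps above (inject, investigate, optionally take a hint, resolve) can be sketched as a small driver that prints the make commands and runs them only on request. The target names come from this page. The wrapper functions and the `dry_run` flag are hypothetical helpers, and the sketch assumes you run it from the directory that holds the training Makefile.

```python
import subprocess
from typing import List


def practice_round(use_hint: bool = False) -> List[List[str]]:
    """Return the make invocations for one practice round, in the order listed above."""
    steps = [
        ["make", "incident", "YES=1"],   # inject random incident
        ["make", "investigate"],         # follow guided investigation
    ]
    if use_hint:
        steps.append(["make", "hint"])   # progressive hint if stuck
    steps.append(["make", "incident-resolve"])  # mark resolved
    return steps


def run_round(steps: List[List[str]], dry_run: bool = True) -> List[str]:
    """Print each command; execute it only when dry_run is False."""
    shown = []
    for cmd in steps:
        line = "$ " + " ".join(cmd)
        print(line)
        shown.append(line)
        if not dry_run:
            # check=True stops the round at the first failing target.
            subprocess.run(cmd, check=True)
    return shown
```

Calling `run_round(practice_round())` only prints the sequence, which is a safe way to review it before running `run_round(practice_round(), dry_run=False)` against the lab environment.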
Optional Deepening
- Complete all 18 incident scenarios: make incident-list
- training/interactive/investigation/ — full guided investigation engine
- training/interactive/knowledge/data/cards/chaos-engineering.tsv — chaos flashcards
- make scoreboard — track your resolution times
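If you want to keep your own log alongside the scoreboard, resolution times can be derived from four timestamps per incident, following the detect, triage, mitigate, resolve phases named in the goals. How `make scoreboard` computes its numbers is not described on this page, so the class and field names below are assumptions. Treat this as a minimal sketch.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import Dict


@dataclass
class IncidentTimeline:
    """Timestamps you note by hand during a practice incident (hypothetical fields)."""
    started: datetime    # fault injected
    detected: datetime   # first alert or symptom noticed
    mitigated: datetime  # user impact stopped
    resolved: datetime   # root cause fixed and incident closed

    def durations(self) -> Dict[str, timedelta]:
        """Measure every phase from the moment the fault began."""
        return {
            "time_to_detect": self.detected - self.started,
            "time_to_mitigate": self.mitigated - self.started,
            "time_to_resolve": self.resolved - self.started,
        }


def within_time_box(timeline: IncidentTimeline, minutes: int) -> bool:
    """True if the incident was resolved inside a time box of the given length."""
    return timeline.resolved - timeline.started <= timedelta(minutes=minutes)
```

Measuring every phase from the same start point keeps the three numbers directly comparable across incidents. `within_time_box(timeline, 10)` mirrors the MINUTES=10 limit used in the time-boxed challenge step.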
Wiki Navigation
Prerequisites
- Track: Kubernetes Core (Reference, L1)
- Track: Observability (Reference, L2)
Related Content
- Change Management (Topic Pack, L1) — Incident Response
- Chaos Engineering Scripts (CLI) (Exercise Set, L2) — Incident Response
- Debugging Methodology (Topic Pack, L1) — Incident Response
- Incident Command & On-Call (Topic Pack, L2) — Incident Response
- Incident Response Flashcards (CLI) (flashcard_deck, L1) — Incident Response
- Incident Simulator (18 scenarios) (CLI) (Exercise Set, L2) — Incident Response
- Investigation Engine (CLI) (Exercise Set, L2) — Incident Response
- Ops War Stories & Pattern Recognition (Topic Pack, L2) — Incident Response
- Postmortems & SLOs (Topic Pack, L2) — Incident Response
- Runbook Craft (Topic Pack, L1) — Incident Response
Pages that link here
- Debugging Methodology
- Kubernetes_Core
- Observability
- Ops War Stories & Pattern Recognition
- Runbook Craft
- Runbook: Readiness Probe Failed
- Scenario: Deployment Stuck Progressing
- Scenario: Helm Upgrade Broke Prod — Recover Fast
- Scenario: Pods OOMKilled Under Load
- Training Curriculum
- kubectl Debugging Decision Flow