Level: L2 (Operations) | Topics: Incident Response | Domain: DevOps & Tooling

Track: Incident Response

Covers incidents, forensics, runbooks, postmortems, and interview scenarios.

Goals

  • Respond to production incidents systematically (detect, triage, mitigate, resolve)
  • Use runbooks for structured troubleshooting
  • Capture forensic evidence during incidents
  • Practice time-boxed incident challenges
  • Prepare for SRE/DevOps interview scenarios
  • Apply chaos engineering safely

Prerequisites

  • Concepts: kubernetes, helm_upgrade, prometheus, loki_logging, readiness_probe, resource_limits
  • All previous tracks completed (or equivalent experience)
  • make deploy-all has been run to completion

Primary Path (15 steps)

  1. Read: training/library/runbooks/crashloopbackoff.md — study runbook format
  2. Read: training/library/runbooks/kubernetes/readiness_probe_failed.md — probe troubleshooting
  3. Study: All 15 failure patterns (FP-001 through FP-015) — review runbooks and incident scenarios
  4. Practice: make incident YES=1 — inject random incident
  5. Practice: make investigate — follow guided investigation
  6. Practice: make hint — use progressive hints if stuck
  7. Practice: make incident-resolve — mark resolved
  8. Practice: make challenge YES=1 MINUTES=10 — time-boxed challenge
  9. Practice: make incident-forensics — capture evidence bundle
  10. Run: 2-3 chaos scripts from training/interactive/chaos/scripts/ — fault injection
  11. Interview: training/library/interview-scenarios/01-deployment-stuck-progressing.md
  12. Interview: training/library/interview-scenarios/05-helm-upgrade-broke-prod.md
  13. Interview: training/library/interview-scenarios/08-pods-oomkilled.md
  14. Interview: Work through remaining training/library/interview-scenarios/
  15. Study: training/knowledge_architecture/commands/kubectl_debugging_flow.md — master the debugging decision tree
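
The practice steps (4 through 9) can be strung together into one repeatable round. The script below is a minimal sketch, not part of the track itself: the function names, the DRY_RUN switch, and the idea of wrapping the targets in a script are assumptions made for illustration. Only the make targets and their arguments come from the steps above, and they are kept in the same order. By default the script only prints the commands it would run.

```shell
#!/usr/bin/env bash
# Sketch of one practice round built from the make targets in steps 4-9.
# Assumption: DRY_RUN and the function names are illustrative, not part of the track.
set -euo pipefail

DRY_RUN="${DRY_RUN:-1}"    # 1 = print commands only, 0 = actually run them
MINUTES="${MINUTES:-10}"   # time box for the challenge (step 8 uses 10)

# run: print the command in dry-run mode, otherwise execute it.
run() {
  if [ "$DRY_RUN" = "1" ]; then
    echo "+ $*"
  else
    "$@"
  fi
}

# Steps 4, 5, 7: inject an incident, investigate, mark it resolved.
# Step 6 (make hint) is left out here because it is only needed when stuck.
practice_round() {
  run make incident YES=1
  run make investigate
  run make incident-resolve
}

# Step 8: time-boxed challenge.
timed_round() {
  run make challenge YES=1 MINUTES="$MINUTES"
}

# Step 9: capture the evidence bundle.
capture_evidence() {
  run make incident-forensics
}

practice_round
timed_round
capture_evidence
```

Saved under a hypothetical name such as practice_round.sh, the script prints the five commands when run as is. Running it with DRY_RUN=0 from the directory that holds the Makefile would execute the targets, assuming make deploy-all has already completed.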

Optional Deepening


Wiki Navigation

Prerequisites

  • Change Management (Topic Pack, L1) — Incident Response
  • Chaos Engineering Scripts (CLI) (Exercise Set, L2) — Incident Response
  • Debugging Methodology (Topic Pack, L1) — Incident Response
  • Incident Command & On-Call (Topic Pack, L2) — Incident Response
  • Incident Response Flashcards (CLI) (Flashcard Deck, L1) — Incident Response
  • Incident Simulator (18 scenarios) (CLI) (Exercise Set, L2) — Incident Response
  • Investigation Engine (CLI) (Exercise Set, L2) — Incident Response
  • Ops War Stories & Pattern Recognition (Topic Pack, L2) — Incident Response
  • Postmortems & SLOs (Topic Pack, L2) — Incident Response
  • Runbook Craft (Topic Pack, L1) — Incident Response