Level 5: SRE & Incident Response
Incidents, chaos engineering, forensics, postmortems, interview preparation.
Concepts
chaos_engineering, all failure patterns, all debugging flows
Failure Patterns You Should Be Able to Resolve
ALL 15 patterns (FP-001 through FP-015)
Commands You Should Be Fluent With
All commands from Levels 1-4, plus:
- make incident YES=1 / make incident-resolve
- make challenge YES=1 MINUTES=10
- make investigate / make hint / make explain
- make incident-forensics
- Chaos scripts (all 7)
Assets to Complete
Incident practice
- make incident YES=1: complete 5+ random incidents
- make challenge YES=1 MINUTES=10: complete 3+ timed challenges
- make incident-forensics: capture at least one evidence bundle
Chaos engineering
- training/interactive/chaos/scripts/kill_pods.sh --yes
- training/interactive/chaos/scripts/break_readiness.sh --yes
- training/interactive/chaos/scripts/toggle_networkpolicy.sh --yes
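The three chaos scripts listed above can be driven from one small wrapper. The following is a minimal sketch, not part of the training repo: it assumes you run it from the repository root, and the function name `run_chaos_round`, the `PAUSE_SECONDS` variable, and the skip-if-missing behavior are all hypothetical conveniences. Only the script paths and the `--yes` flag come from the list above.

```shell
#!/usr/bin/env bash
# Sketch: run the listed chaos scripts one at a time.
# Assumptions (not from the training repo): the helper name, PAUSE_SECONDS,
# and the decision to skip scripts that are missing or not executable.
set -u

CHAOS_DIR="training/interactive/chaos/scripts"
CHAOS_SCRIPTS="kill_pods.sh break_readiness.sh toggle_networkpolicy.sh"

run_chaos_round() {
  for script in $CHAOS_SCRIPTS; do
    path="$CHAOS_DIR/$script"
    if [ -x "$path" ]; then
      echo "run: $path --yes"
      "$path" --yes || echo "failed: $path"
      # Optional gap so each fault can be investigated before the next one.
      sleep "${PAUSE_SECONDS:-0}"
    else
      # Wrong working directory, or the script is not executable.
      echo "skip: $path"
    fi
  done
}

run_chaos_round
```

Setting `PAUSE_SECONDS` to a few hundred seconds leaves time to investigate and restore the cluster between faults, which is usually easier to learn from than several faults stacked at once.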
Guided investigation
- make investigate: use the full investigation loop at least 3 times
- Write journal entries with make explain
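One practice round might chain the make targets named on this page. The sketch below is an assumption about ordering, not documented behavior: the targets themselves come from the command list above, but the sequence (inject, investigate, explain, resolve), the function name `practice_round`, and the `DRY_RUN` switch are hypothetical. It defaults to printing the plan so nothing touches the cluster until you opt in.

```shell
#!/usr/bin/env bash
# Sketch of one incident practice round.
# Assumptions: the step order and the DRY_RUN switch are illustrative;
# some targets may be interactive and are better run by hand.
set -u

practice_round() {
  # DRY_RUN=1 (default) only prints the plan; DRY_RUN=0 executes it.
  dry_run="${DRY_RUN:-1}"
  for step in "make incident YES=1" "make investigate" "make explain" "make incident-resolve"; do
    if [ "$dry_run" = "1" ]; then
      echo "would run: $step"
    else
      echo "running: $step"
      # Unquoted on purpose so the step splits into command and arguments.
      $step || return 1
    fi
  done
}

practice_round
```

Run it with `DRY_RUN=0` from the repository root to execute the steps; `make hint` is left out deliberately so it stays a manual fallback when you are stuck.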
Interview preparation
- training/library/interview-scenarios/01-deployment-stuck-progressing.md
- training/library/interview-scenarios/02-hpa-not-scaling.md
- training/library/interview-scenarios/03-prometheus-target-down.md
- training/library/interview-scenarios/04-loki-logs-disappeared.md
- training/library/interview-scenarios/05-helm-upgrade-broke-prod.md
- training/library/interview-scenarios/06-ci-vuln-scan-failed.md
- training/library/interview-scenarios/07-config-drift-detected.md
- training/library/interview-scenarios/08-pods-oomkilled.md
- training/library/interview-scenarios/09-rbac-forbidden.md
- training/library/interview-scenarios/10-ingress-404.md
Self-assessment
- Complete all skillchecks in training/library/skillchecks/
- Review training/knowledge_architecture/commands/kubectl_debugging_flow.md
- Review all runbooks in training/library/runbooks/
Review (flashcards)
- training/interactive/knowledge/data/cards/chaos-engineering.tsv
- training/interactive/knowledge/data/cards/devops.tsv