Lab 15: Incident Response¶
| Field | Value |
|---|---|
| Tier | 3 — Operations |
| Estimated Time | 60 minutes |
| Prerequisites | k3s cluster |
| Auto-Grade | Yes |
Scenario¶
It is 2:00 PM on a Tuesday. PagerDuty fires a P1 alert: "Payment processing failure rate > 50%." You are the incident commander. The clock is ticking — every minute of downtime costs the company approximately $2,000 in lost revenue. Your team is watching you in the incident Slack channel.
The environment consists of a frontend pod, a payment API pod, and a database pod. Something is broken. You need to follow the incident response framework: detect, triage, diagnose, mitigate, resolve, and write a postmortem. The setup script will deploy a broken application stack, and your job is to work through the incident lifecycle systematically.
Objectives¶
- Triage: identify which component is failing (check pod status, logs, events)
- Diagnose: determine the root cause from evidence (not guessing)
- Mitigate: apply a temporary fix to restore service (within 15 minutes)
- Verify: confirm the error rate has dropped (all pods healthy)
- Timeline: write an incident timeline to /tmp/lab-incident/timeline.txt
- Postmortem: write a postmortem to /tmp/lab-incident/postmortem.txt
Setup¶
The setup script deploys a broken application stack in the lab-incident namespace.
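A first-pass triage and mitigation sequence might look like the following sketch. Note that `<failing-pod>` and `<failing-deployment>` are placeholders; the actual resource names depend on what the setup script deploys, and these commands require a live cluster.

```shell
# Triage: overall pod status in the incident namespace
kubectl get pods -n lab-incident

# Drill into the failing pod: events, probe failures, OOM kills, image errors
kubectl describe pod <failing-pod> -n lab-incident

# Current and previous container logs (--previous shows the last crashed run)
kubectl logs <failing-pod> -n lab-incident
kubectl logs <failing-pod> -n lab-incident --previous

# Recent namespace events, oldest first
kubectl get events -n lab-incident --sort-by=.metadata.creationTimestamp

# Mitigate: if a bad image or config rollout is the cause, roll it back
kubectl rollout undo deployment/<failing-deployment> -n lab-incident
kubectl rollout status deployment/<failing-deployment> -n lab-incident
```

Prefer the smallest mitigation that restores service; the full fix can wait until after the incident is closed.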
Hints¶
Hint 1: Triage steps
Start with `kubectl get pods -n lab-incident` to see pod status, then `kubectl describe pod <pod-name> -n lab-incident` for events and container state.
Hint 2: Common K8s failures
Check for: CrashLoopBackOff, ImagePullBackOff, OOMKilled, failed probes, missing ConfigMaps/Secrets, resource exhaustion, and wrong environment variables.
Hint 3: Timeline format
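The lab does not prescribe a timeline format; a common convention (an assumption here, not a grader requirement) is one timestamped entry per event, oldest first. The 14:00 alert time comes from the scenario; the later timestamps are purely illustrative:

```shell
# Write the timeline file the grader checks; entries are examples only
mkdir -p /tmp/lab-incident
cat > /tmp/lab-incident/timeline.txt <<'EOF'
14:00 - PagerDuty P1 alert: payment failure rate > 50%
14:02 - Triage started: kubectl get pods shows payment API pod crash-looping
14:06 - Root cause identified from pod events and logs
14:10 - Mitigation applied; pods returning to Ready
14:13 - Error rate back to baseline; incident resolved
EOF
```

Record real timestamps as you work; the timeline is the backbone of the postmortem.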
Hint 4: Postmortem structure
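One way to scaffold the postmortem file before filling it in. The section names follow the structure this hint suggests; the angle-bracket placeholders are deliberately left for you to complete during the lab:

```shell
# Create a postmortem skeleton at the path the grader reads
mkdir -p /tmp/lab-incident
cat > /tmp/lab-incident/postmortem.txt <<'EOF'
Title: Payment processing failure (P1)
Date: <date of incident>
Duration: <detection to resolution>
Impact: >50% payment failure rate; ~$2,000/minute revenue loss
Root Cause: <what actually broke, with evidence>
Timeline: see /tmp/lab-incident/timeline.txt
Action Items:
- <systemic improvement 1>
- <systemic improvement 2>
EOF
```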
Include: Title, Date, Duration, Impact, Root Cause, Timeline, Action Items. Focus on systemic improvements, not blame.
Hint 5: Quick mitigation
If a pod is crash-looping, determine whether the issue is configuration (fixable via `kubectl edit` or `kubectl patch`) or code (which may require a rollback to a known-good image).
Grading¶
Solution¶
See the solution/ directory for the incident walkthrough.