Lab 15: Incident Response¶
| Field | Value |
|---|---|
| Tier | 3 — Operations |
| Estimated Time | 60 minutes |
| Prerequisites | k3s cluster |
| Auto-Grade | Yes |
Scenario¶
It is 2:00 PM on a Tuesday. PagerDuty fires a P1 alert: "Payment processing failure rate > 50%." You are the incident commander. The clock is ticking — every minute of downtime costs the company approximately $2,000 in lost revenue. Your team is watching you in the incident Slack channel.
The environment consists of a frontend pod, a payment API pod, and a database pod. Something is broken. You need to follow the incident response framework: detect, triage, diagnose, mitigate, resolve, and write a postmortem. The setup script will deploy a broken application stack, and your job is to work through the incident lifecycle systematically.
Objectives¶
- Triage: identify which component is failing (check pod status, logs, events)
- Diagnose: determine the root cause from evidence (not guessing)
- Mitigate: apply a temporary fix to restore service (within 15 minutes)
- Verify: confirm the error rate has dropped (all pods healthy)
- Timeline: write an incident timeline to /tmp/lab-incident/timeline.txt
- Postmortem: write a postmortem to /tmp/lab-incident/postmortem.txt
Setup¶
The setup script deploys a broken application stack in the lab-incident namespace.
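A first-pass triage and mitigation sequence might look like the following sketch. Note that `<failing-pod>` and `<failing-deployment>` are placeholders; the actual resource names depend on what the setup script deploys, and these commands require a live cluster.

```shell
# Triage: overall pod status in the incident namespace
kubectl get pods -n lab-incident

# Drill into the failing pod: events, probe failures, OOM kills, image errors
kubectl describe pod <failing-pod> -n lab-incident

# Current and previous container logs (--previous shows the last crashed run)
kubectl logs <failing-pod> -n lab-incident
kubectl logs <failing-pod> -n lab-incident --previous

# Recent namespace events, oldest first
kubectl get events -n lab-incident --sort-by=.metadata.creationTimestamp

# Mitigate: if a bad image or config rollout is the cause, roll it back
kubectl rollout undo deployment/<failing-deployment> -n lab-incident
kubectl rollout status deployment/<failing-deployment> -n lab-incident
```

Prefer the smallest mitigation that restores service; the full fix can wait until after the incident is closed.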
Hints¶
Hint 1: Triage steps
Start with `kubectl get pods -n lab-incident` to see pod status, then `kubectl describe pod <pod-name> -n lab-incident` for events and container state.
Hint 2: Common K8s failures
Check for: CrashLoopBackOff, ImagePullBackOff, OOMKilled, failed probes, missing ConfigMaps/Secrets, resource exhaustion, and wrong environment variables.
Hint 3: Timeline format
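The lab does not prescribe a timeline format; a common convention (an assumption here, not a grader requirement) is one timestamped entry per event, oldest first. The 14:00 alert time comes from the scenario; the later timestamps are purely illustrative:

```shell
# Write the timeline file the grader checks; entries are examples only
mkdir -p /tmp/lab-incident
cat > /tmp/lab-incident/timeline.txt <<'EOF'
14:00 - PagerDuty P1 alert: payment failure rate > 50%
14:02 - Triage started: kubectl get pods shows payment API pod crash-looping
14:06 - Root cause identified from pod events and logs
14:10 - Mitigation applied; pods returning to Ready
14:13 - Error rate back to baseline; incident resolved
EOF
```

Record real timestamps as you work; the timeline is the backbone of the postmortem.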
Hint 4: Postmortem structure
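One way to scaffold the postmortem file before filling it in. The section names follow the structure this hint suggests; the angle-bracket placeholders are deliberately left for you to complete during the lab:

```shell
# Create a postmortem skeleton at the path the grader reads
mkdir -p /tmp/lab-incident
cat > /tmp/lab-incident/postmortem.txt <<'EOF'
Title: Payment processing failure (P1)
Date: <date of incident>
Duration: <detection to resolution>
Impact: >50% payment failure rate; ~$2,000/minute revenue loss
Root Cause: <what actually broke, with evidence>
Timeline: see /tmp/lab-incident/timeline.txt
Action Items:
- <systemic improvement 1>
- <systemic improvement 2>
EOF
```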
Include: Title, Date, Duration, Impact, Root Cause, Timeline, Action Items. Focus on systemic improvements, not blame.
Hint 5: Quick mitigation
If a pod is crash-looping, determine whether the issue is configuration (fixable via `kubectl edit` or `kubectl patch`) or code (which may require a rollback to a known-good image).
Grading¶
Solution¶
See the solution/ directory for the incident walkthrough.