
Lab 15: Incident Response

Field           Value
Tier            3 — Operations
Estimated Time  60 minutes
Prerequisites   k3s cluster
Auto-Grade      Yes

Scenario

It is 2:00 PM on a Tuesday. PagerDuty fires a P1 alert: "Payment processing failure rate > 50%." You are the incident commander. The clock is ticking — every minute of downtime costs the company approximately $2,000 in lost revenue. Your team is watching you in the incident Slack channel.

The environment consists of a frontend pod, a payment API pod, and a database pod. Something is broken. You need to follow the incident response framework: detect, triage, diagnose, mitigate, resolve, and write a postmortem. The setup script will deploy a broken application stack, and your job is to work through the incident lifecycle systematically.

Objectives

  • Triage: identify which component is failing (check pod status, logs, events)
  • Diagnose: determine the root cause from evidence (not guessing)
  • Mitigate: apply a temporary fix to restore service (within 15 minutes)
  • Verify: confirm the error rate has dropped (all pods healthy)
  • Timeline: write an incident timeline to /tmp/lab-incident/timeline.txt
  • Postmortem: write a postmortem to /tmp/lab-incident/postmortem.txt
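The triage and diagnose objectives above boil down to a short command sequence. A minimal sketch follows; the pod name `payment-api` is illustrative, since the actual names come from whatever the setup script deploys:

```shell
# Triage: inspect the namespace the setup script created
kubectl get pods -n lab-incident
kubectl get events -n lab-incident --sort-by=.lastTimestamp

# Diagnose: drill into the failing pod (name is hypothetical)
kubectl describe pod payment-api -n lab-incident
kubectl logs payment-api -n lab-incident --previous   # logs from the last crashed container
```

The `--previous` flag matters for crash-looping pods: the current container may have no output yet, while the previous instance logged the error that killed it.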

Setup

./setup.sh

Deploys a broken application stack in namespace lab-incident.

Hints

Hint 1: Triage steps Start with `kubectl get pods -n lab-incident` to see status, then `kubectl describe pod <pod-name> -n lab-incident` for events, and `kubectl logs <pod-name> -n lab-incident` for application output.
Hint 2: Common K8s failures Check for: CrashLoopBackOff, ImagePullBackOff, OOMKilled, failed probes, missing ConfigMaps/Secrets, resource exhaustion, and wrong environment variables.
Hint 3: Timeline format
14:00 - P1 alert fired: payment failure rate > 50%
14:02 - Incident commander assigned (you)
14:05 - Triage: identified payment-api pod in CrashLoopBackOff
14:08 - Diagnosis: ...
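One way to keep the timeline honest is to append entries to the required file as events happen, rather than reconstructing them afterward. A sketch in the hint's format (times and findings are examples, not the lab's actual answers):

```shell
# Create the incident directory and append timeline entries as you go
mkdir -p /tmp/lab-incident
{
  echo "14:00 - P1 alert fired: payment failure rate > 50%"
  echo "14:02 - Incident commander assigned (you)"
  echo "14:05 - Triage: identified payment-api pod in CrashLoopBackOff"
} >> /tmp/lab-incident/timeline.txt
```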
Hint 4: Postmortem structure Include: Title, Date, Duration, Impact, Root Cause, Timeline, Action Items. Focus on systemic improvements, not blame.
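A skeleton with the sections the hint lists can be written in one heredoc; the parenthesized contents are placeholders you fill in from your own diagnosis:

```shell
mkdir -p /tmp/lab-incident
cat > /tmp/lab-incident/postmortem.txt <<'EOF'
Title: Payment processing outage
Date: (date of incident)
Duration: (alert fired to service restored)
Impact: payment failure rate > 50%; roughly $2,000/minute in lost revenue
Root Cause: (what actually broke, supported by evidence)
Timeline: see /tmp/lab-incident/timeline.txt
Action Items: (systemic improvements, not blame)
EOF
```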
Hint 5: Quick mitigation If a pod is crash-looping, check if the issue is configuration (fixable via kubectl edit or patch) or code (might need a rollback to a working image).
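The two mitigation paths in Hint 5 map to commands like these; the deployment name and environment variable are illustrative, since the real ones depend on what you find during diagnosis:

```shell
# Config issue: fix it in place, e.g. correct a wrong environment variable
kubectl set env deployment/payment-api -n lab-incident DB_HOST=database

# Code issue: roll back to the previous working image and wait for rollout
kubectl rollout undo deployment/payment-api -n lab-incident
kubectl rollout status deployment/payment-api -n lab-incident
```

Either way, re-run `kubectl get pods -n lab-incident` afterward to verify all pods are healthy before declaring the incident mitigated.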

Grading

./grade.sh

Solution

See the solution/ directory for the incident walkthrough.