Lab 22: Incident Simulation¶
| Field | Value |
|---|---|
| Tier | 5 — Capstone |
| Estimated Time | 2 hours |
| Prerequisites | Incident Response lab (Lab 15) |
| Auto-Grade | Yes |
Scenario¶
This is a full-scale incident simulation. You are the on-call engineer at a fintech company. At 09:00, three alerts fire simultaneously: the payment gateway is returning 503s, the order service is logging database connection errors, and customer-facing latency has spiked to 15 seconds. The CEO is in a board meeting in 90 minutes and the VP of Engineering needs a status update in 30 minutes.
Unlike Lab 15, this simulation has multiple cascading failures. The root cause is not immediately obvious — you need to correlate information from multiple services, read through logs, check metrics, and think systematically. There are also red herrings (a recent deployment that looks suspicious but is not the cause, and a configuration change that seems related but is not).
The simulation runs on a timer. You have 90 minutes to detect, triage, mitigate, and resolve the incident. The grading script checks your timeline, your diagnosis accuracy, and the final system health.
Objectives¶
- Identify all three failing services and their symptoms
- Correctly identify the root cause (not the red herrings)
- Apply a mitigation that restores service within 30 minutes
- Verify all services return to healthy state
- Write a timeline to `/tmp/lab-incident-sim/timeline.txt`
- Write a status update (for the VP) to `/tmp/lab-incident-sim/status-update.txt`
- Write a postmortem to `/tmp/lab-incident-sim/postmortem.txt`
- Ensure the postmortem includes at least 3 action items
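Under time pressure it is easy to forget a deliverable, so it can help to scaffold all three files before triage begins. A minimal sketch — the paths come from the objectives above, but every file's content here is a placeholder to be filled in during the incident:

```shell
#!/bin/sh
# Scaffold the three deliverables the grading script expects.
DIR=/tmp/lab-incident-sim
mkdir -p "$DIR"

# Timeline: one timestamped entry per observation or action.
cat > "$DIR/timeline.txt" <<'EOF'
09:00 Alerts fired: payment gateway 503s, order service DB errors, latency 15s
09:05 TODO: first triage observation
EOF

# Status update: short and executive-friendly (see Hint 4).
cat > "$DIR/status-update.txt" <<'EOF'
WHAT IS BROKEN: TODO
WHO IS AFFECTED: TODO
WHAT WE ARE DOING: TODO
ETA TO RESOLUTION: TODO
CUSTOMER IMPACT: TODO
EOF

# Postmortem: needs at least 3 specific, time-bound action items (see Hint 5).
cat > "$DIR/postmortem.txt" <<'EOF'
Summary: TODO
Root cause: TODO
Action items:
- [ ] TODO (owner, due date)
- [ ] TODO (owner, due date)
- [ ] TODO (owner, due date)
EOF
```

Replace each TODO as you learn more; an empty or skeleton file will not pass grading on its own.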
Setup¶
The setup script deploys a multi-service stack with cascading failures into the `lab-incident-sim` namespace.
Hints¶
Hint 1: Start with the symptoms
Check all services: `kubectl get pods -n lab-incident-sim`. Look for `CrashLoopBackOff`, high restart counts, and `Pending` pods. Then check events and logs.
Hint 2: Red herrings
Not everything that looks suspicious is the cause. A recent deployment might have happened right before the incident without causing it. Focus on evidence, not coincidence.
Hint 3: Cascading failures
One root cause can cascade through multiple services. Fix the root cause first, not the symptoms. If the database is down, fixing the API config will not help.
Hint 4: Status update format
Keep it short and actionable for executives: what is broken, who is affected, what we are doing, when we expect resolution, and what the customer impact is.
Hint 5: Postmortem action items
Good action items are specific, assigned, measurable, and time-bound. "Improve monitoring" is bad; "Add PagerDuty alert for DB connection pool > 80% by next sprint" is good.
Grading¶
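Before the timer expires, you can sanity-check your deliverables locally. The sketch below only verifies that the three files exist, are non-empty, and that the postmortem lists at least three `- `-prefixed action items — the actual grading script also checks timeline accuracy, diagnosis, and cluster health, which a local script cannot:

```shell
#!/bin/sh
# check_deliverables DIR: verify the three deliverable files exist and are
# non-empty, and that the postmortem has at least 3 action items (Hint 5).
check_deliverables() {
  dir=$1
  ok=0
  for f in timeline.txt status-update.txt postmortem.txt; do
    [ -s "$dir/$f" ] || { echo "missing or empty: $f"; ok=1; }
  done
  items=$(grep -c '^- ' "$dir/postmortem.txt" 2>/dev/null)
  [ "${items:-0}" -ge 3 ] || { echo "fewer than 3 action items"; ok=1; }
  return $ok
}

# Demo on a throwaway directory; in the lab, point it at /tmp/lab-incident-sim.
demo=$(mktemp -d)
echo "09:00 alerts fired" > "$demo/timeline.txt"
echo "CUSTOMER IMPACT: checkout down" > "$demo/status-update.txt"
printf 'Root cause: TODO\n- item 1\n- item 2\n- item 3\n' > "$demo/postmortem.txt"
check_deliverables "$demo" && echo "deliverables look complete"
```

This assumes action items are written as `- ` bullet lines; adjust the pattern if you format your postmortem differently.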
Solution¶
See the `solution/` directory for the incident walkthrough.