Lab 22: Incident Simulation¶
| Field | Value |
|---|---|
| Tier | 5 — Capstone |
| Estimated Time | 2 hours |
| Prerequisites | Incident Response lab (Lab 15) |
| Auto-Grade | Yes |
Scenario¶
This is a full-scale incident simulation. You are the on-call engineer at a fintech company. At 09:00, three alerts fire simultaneously: the payment gateway is returning 503s, the order service is logging database connection errors, and customer-facing latency has spiked to 15 seconds. The CEO is in a board meeting in 90 minutes and the VP of Engineering needs a status update in 30 minutes.
Unlike Lab 15, this simulation has multiple cascading failures. The root cause is not immediately obvious — you need to correlate information from multiple services, read through logs, check metrics, and think systematically. There are also red herrings (a recent deployment that looks suspicious but is not the cause, and a configuration change that seems related but is not).
The simulation runs on a timer. You have 90 minutes to detect, triage, mitigate, and resolve the incident. The grading script checks your timeline, your diagnosis accuracy, and the final system health.
Objectives¶
- Identify all three failing services and their symptoms
- Correctly identify the root cause (not the red herrings)
- Apply a mitigation that restores service within 30 minutes
- Verify all services return to healthy state
- Write a timeline to `/tmp/lab-incident-sim/timeline.txt`
- Write a status update (for the VP) to `/tmp/lab-incident-sim/status-update.txt`
- Write a postmortem to `/tmp/lab-incident-sim/postmortem.txt`
- Ensure the postmortem includes at least 3 action items
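Under time pressure it is easy to forget a deliverable, so it can help to scaffold all three files before triage begins. A minimal sketch — the paths come from the objectives above, but every file's content here is a placeholder to be filled in during the incident:

```shell
#!/bin/sh
# Scaffold the three deliverables the grading script expects.
DIR=/tmp/lab-incident-sim
mkdir -p "$DIR"

# Timeline: one timestamped entry per observation or action.
cat > "$DIR/timeline.txt" <<'EOF'
09:00 Alerts fired: payment gateway 503s, order service DB errors, latency 15s
09:05 TODO: first triage observation
EOF

# Status update: short and executive-friendly (see Hint 4).
cat > "$DIR/status-update.txt" <<'EOF'
WHAT IS BROKEN: TODO
WHO IS AFFECTED: TODO
WHAT WE ARE DOING: TODO
ETA TO RESOLUTION: TODO
CUSTOMER IMPACT: TODO
EOF

# Postmortem: needs at least 3 specific, time-bound action items (see Hint 5).
cat > "$DIR/postmortem.txt" <<'EOF'
Summary: TODO
Root cause: TODO
Action items:
- [ ] TODO (owner, due date)
- [ ] TODO (owner, due date)
- [ ] TODO (owner, due date)
EOF
```

Replace each TODO as you learn more; an empty or skeleton file will not pass grading on its own.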
Setup¶
The setup script deploys a multi-service stack with cascading failures into the `lab-incident-sim` namespace.
Hints¶
Hint 1: Start with the symptoms
Check all services: `kubectl get pods -n lab-incident-sim`. Look for `CrashLoopBackOff`, high restart counts, and `Pending` pods. Then check events and logs.
Hint 2: Red herrings
Not everything that looks suspicious is the cause. A recent deployment might have happened right before the incident without causing it. Focus on evidence, not coincidence.
Hint 3: Cascading failures
One root cause can cascade through multiple services. Fix the root cause first, not the symptoms. If the database is down, fixing the API config will not help.
Hint 4: Status update format
Keep it short and actionable for executives: what is broken, who is affected, what we are doing, when we expect resolution, and what the customer impact is.
Hint 5: Postmortem action items
Good action items are specific, assigned, measurable, and time-bound. "Improve monitoring" is bad; "Add PagerDuty alert for DB connection pool > 80% by next sprint" is good.
Grading¶
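Before the timer expires, you can sanity-check your deliverables locally. The sketch below only verifies that the three files exist, are non-empty, and that the postmortem lists at least three `- `-prefixed action items — the actual grading script also checks timeline accuracy, diagnosis, and cluster health, which a local script cannot:

```shell
#!/bin/sh
# check_deliverables DIR: verify the three deliverable files exist and are
# non-empty, and that the postmortem has at least 3 action items (Hint 5).
check_deliverables() {
  dir=$1
  ok=0
  for f in timeline.txt status-update.txt postmortem.txt; do
    [ -s "$dir/$f" ] || { echo "missing or empty: $f"; ok=1; }
  done
  items=$(grep -c '^- ' "$dir/postmortem.txt" 2>/dev/null)
  [ "${items:-0}" -ge 3 ] || { echo "fewer than 3 action items"; ok=1; }
  return $ok
}

# Demo on a throwaway directory; in the lab, point it at /tmp/lab-incident-sim.
demo=$(mktemp -d)
echo "09:00 alerts fired" > "$demo/timeline.txt"
echo "CUSTOMER IMPACT: checkout down" > "$demo/status-update.txt"
printf 'Root cause: TODO\n- item 1\n- item 2\n- item 3\n' > "$demo/postmortem.txt"
check_deliverables "$demo" && echo "deliverables look complete"
```

This assumes action items are written as `- ` bullet lines; adjust the pattern if you format your postmortem differently.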
Solution¶
See the `solution/` directory for the incident walkthrough.