Lab 24: On-Call Shift
| Field | Value |
|---|---|
| Tier | 5 — Capstone |
| Estimated Time | 3 hours |
| Prerequisites | Monitoring + Incident Response labs |
| Auto-Grade | Yes |
Scenario
Welcome to your first simulated on-call shift. Over the next three hours, you will manage a production Kubernetes environment that receives five alerts at unpredictable intervals. Some alerts are genuine incidents requiring immediate action. Others are false positives or low-priority issues that should be acknowledged but not escalated. Part of being on-call is making that judgment.
The environment consists of 6 services: a web frontend, an API gateway, a user service, a payment service, a notification service, and a database. Each alert will target one or more of these services. You need to respond to each alert within the expected response time, take appropriate action, and document everything in your shift log.
Objectives
- Deploy the full 6-service stack in namespace `lab-oncall`
- Respond to Alert 1: Pod crash-loop in user-service (fix within 15 min)
- Respond to Alert 2: High memory usage on payment-service (scale or limit)
- Respond to Alert 3: Certificate expiry warning (acknowledge, document)
- Respond to Alert 4: Database connection spike (investigate, mitigate)
- Respond to Alert 5: Disk pressure on a node (clean up or evict)
- Write shift log to `/tmp/lab-oncall/shift-log.txt`
- Write handoff notes to `/tmp/lab-oncall/handoff.txt`
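
The last two objectives name fixed output paths, so it can help to confirm both files exist before ending the shift. The following is a minimal Python sketch of such a self-check. It is not the lab's grader, whose actual checks are not shown here. The helper name `missing_deliverables` is made up for illustration, and treating an empty file as missing is an assumption.

```python
from pathlib import Path

# Directory and file names taken from the objectives above.
LAB_DIR = Path("/tmp/lab-oncall")
DELIVERABLES = ("shift-log.txt", "handoff.txt")


def missing_deliverables(lab_dir=LAB_DIR, names=DELIVERABLES):
    """Return the deliverable file names that are absent or empty.

    Treating an empty file as missing is an assumption of this sketch.
    """
    missing = []
    for name in names:
        path = Path(lab_dir) / name
        if not path.is_file() or path.stat().st_size == 0:
            missing.append(name)
    return missing


if __name__ == "__main__":
    problems = missing_deliverables()
    if problems:
        print("Missing or empty:", ", ".join(problems))
    else:
        print("Both deliverables are present and non-empty.")
```

Running the script at the end of the shift prints which of the two files, if any, still needs to be written.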
Setup
Setup deploys the full stack and triggers the first alert. Subsequent alerts are triggered by checking specific conditions.
Hints
Hint 1: Alert prioritization
- P1 (page): Service down, customer impact. Respond in 5 min.
- P2 (urgent): Degraded service, potential customer impact. Respond in 15 min.
- P3 (info): Warning, no current impact. Acknowledge within 1 hour.
Hint 2: Shift log format
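
The lab's own log template is not reproduced in this section, so what follows is only one possible layout and not necessarily the format the grader expects. It is a minimal Python sketch that builds one line per alert and derives the response deadline from the priority tiers in Hint 1. The pipe-separated field order, the `format_entry` and `append_entry` names, and the example timestamp and service label are all assumptions made for illustration.

```python
from datetime import datetime, timedelta
from pathlib import Path

# Response windows in minutes, taken from Hint 1.
RESPONSE_MINUTES = {"P1": 5, "P2": 15, "P3": 60}

# Shift log path from the lab objectives.
SHIFT_LOG = Path("/tmp/lab-oncall/shift-log.txt")


def format_entry(fired_at, priority, service, summary, action):
    """Build one pipe-separated log line (hypothetical layout)."""
    deadline = fired_at + timedelta(minutes=RESPONSE_MINUTES[priority])
    return " | ".join([
        fired_at.strftime("%Y-%m-%d %H:%M"),
        priority,
        service,
        summary,
        f"respond by {deadline.strftime('%H:%M')}",
        f"action: {action}",
    ])


def append_entry(line, log_path=SHIFT_LOG):
    """Append a line to the shift log, creating the directory if needed."""
    log_path.parent.mkdir(parents=True, exist_ok=True)
    with log_path.open("a", encoding="utf-8") as fh:
        fh.write(line + "\n")


if __name__ == "__main__":
    entry = format_entry(
        datetime(2024, 1, 1, 9, 0),   # example timestamp, not real data
        "P3",                         # certificate warnings are P3 per Hint 3
        "(service not specified)",    # the lab does not name the affected service
        "certificate expiry warning",
        "acknowledged, ticket created",
    )
    append_entry(entry)
    print(entry)
```

Keeping the deadline in the entry makes it easy to see afterwards whether each alert was handled inside its response window.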
Hint 3: False positives
Not every alert requires a fix. A certificate expiry warning 30 days out is P3: acknowledge it, create a ticket, and move on. Do not spend 30 minutes on it during an on-call shift.
Hint 4: Handoff notes
At the end of your shift, write notes for the next on-call:
- What happened during the shift
- Any ongoing issues
- Things to watch
- Tickets created
Hint 5: When in doubt, document
If you are unsure whether an alert is real or a false positive, document your reasoning. Show your work. The next on-call will thank you.
Grading
Solution
See the solution/ directory for the expected alert responses.