Lab 24: On-Call Shift

Field	Value
Tier	5 — Capstone
Estimated Time	3 hours
Prerequisites	Monitoring + Incident Response labs
Auto-Grade	Yes

Scenario

Welcome to your first simulated on-call shift. Over the next three hours, you will manage a production Kubernetes environment that receives five alerts at unpredictable intervals. Some alerts are genuine incidents requiring immediate action. Others are false positives or low-priority issues that should be acknowledged but not escalated. Part of being on-call is making that judgment.

The environment consists of 6 services: a web frontend, an API gateway, a user service, a payment service, a notification service, and a database. Each alert will target one or more of these services. You need to respond to each alert within the expected response time, take appropriate action, and document everything in your shift log.

Objectives

Deploy the full 6-service stack in namespace lab-oncall
Respond to Alert 1: Pod crash-loop in user-service (fix within 15 min)
Respond to Alert 2: High memory usage on payment-service (scale or limit)
Respond to Alert 3: Certificate expiry warning (acknowledge, document)
Respond to Alert 4: Database connection spike (investigate, mitigate)
Respond to Alert 5: Disk pressure on a node (clean up or evict)
Write shift log to /tmp/lab-oncall/shift-log.txt
Write handoff notes to /tmp/lab-oncall/handoff.txt
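
The alerts above mostly start with the same first look at the cluster. The following is a minimal sketch of triage helpers built on standard kubectl subcommands. The function names and the NS variable are hypothetical, not part of the lab files, and pod or node names must be taken from your own cluster. `kubectl top` only works if metrics-server is installed, which this lab does not state.

```shell
# Triage helpers for the lab namespace. Function names are illustrative;
# pod and node names are arguments you look up yourself.
NS="${NS:-lab-oncall}"

# Overview: pod status plus the 20 most recent events.
ns_overview() {
  kubectl -n "$NS" get pods -o wide
  kubectl -n "$NS" get events --sort-by=.lastTimestamp | tail -n 20
}

# Crash-loop triage: describe the pod, then show logs from the
# previous (crashed) container run.
crashloop_triage() {
  pod="$1"
  kubectl -n "$NS" describe pod "$pod"
  kubectl -n "$NS" logs "$pod" --previous --tail=50
}

# Memory check: per-pod usage (assumes metrics-server is available).
memory_usage() {
  kubectl -n "$NS" top pod
}

# Node check: conditions such as DiskPressure appear in the output.
node_conditions() {
  kubectl describe node "$1"
}
```

Source the file, run `ns_overview` first, then pass a pod name from its output to `crashloop_triage`.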

Setup

./setup.sh

The script deploys the full stack and triggers the first alert. Each subsequent alert is triggered once the lab detects that a specific condition has been met.

Hints

Hint 1: Alert prioritization

P1 (page): Service down, customer impact. Respond in 5 min.
P2 (urgent): Degraded service, potential customer impact. Respond in 15 min.
P3 (info): Warning, no current impact. Acknowledge within 1 hour.
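
The priority table can be encoded as a small lookup so that response deadlines are easy to check while writing the shift log. This is a sketch only: the function names `response_minutes` and `within_window` are hypothetical, and the minute values are taken directly from the hint above.

```shell
# Map a priority label to its response window in minutes (values from Hint 1).
response_minutes() {
  case "$1" in
    P1) echo 5 ;;
    P2) echo 15 ;;
    P3) echo 60 ;;
    *)  echo "unknown priority: $1" >&2; return 1 ;;
  esac
}

# Hypothetical helper: succeed if a response is still within its window.
# Arguments: priority label, minutes elapsed since the alert fired.
within_window() {
  limit=$(response_minutes "$1") || return 2
  [ "$2" -le "$limit" ]
}
```

For example, `within_window P2 12 && echo "on time"` prints `on time`, because 12 minutes is inside the 15-minute P2 window.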

Hint 2: Shift log format

09:00 — Shift start. All services green.
09:15 — ALERT: user-service CrashLoopBackOff. Priority: P1. Investigating.
09:18 — Root cause: missing ConfigMap. Fixed. Service restored.
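
Typing timestamps by hand is easy to get wrong under pressure. Below is a minimal sketch of a helper that appends entries in the same shape as the example above. The function name `log_entry` and the `SHIFT_LOG` override variable are assumptions, and the default path comes from the objectives. Whether grade.sh checks the exact line format is not stated in this lab, so treat the format as a convention rather than a requirement.

```shell
# Default path from the objectives; set SHIFT_LOG to write elsewhere.
SHIFT_LOG="${SHIFT_LOG:-/tmp/lab-oncall/shift-log.txt}"

# Append one line: current time (HH:MM), the separator used in Hint 2,
# then the message built from all arguments.
log_entry() {
  mkdir -p "$(dirname "$SHIFT_LOG")"
  printf '%s — %s\n' "$(date +%H:%M)" "$*" >> "$SHIFT_LOG"
}
```

A call such as `log_entry "ALERT: user-service CrashLoopBackOff. Priority: P1. Investigating."` adds one timestamped line to the log without overwriting earlier entries.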

Hint 3: False positives

Not every alert requires a fix. A certificate expiry warning 30 days out is P3 — acknowledge it, create a ticket, and move on. Do not spend 30 minutes on it during an on-call shift.

Hint 4: Handoff notes

At the end of your shift, write notes for the next on-call:
- What happened during the shift
- Any ongoing issues
- Things to watch
- Tickets created
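
The four items in this hint can be scaffolded as an empty template to fill in at the end of the shift. This is a sketch under assumptions: the function name `write_handoff_template`, the `HANDOFF_FILE` override variable, and the section headings are illustrative, and only the default path comes from the objectives.

```shell
# Default path from the objectives; set HANDOFF_FILE to write elsewhere.
HANDOFF_FILE="${HANDOFF_FILE:-/tmp/lab-oncall/handoff.txt}"

# Create (or overwrite) the handoff file with one heading per item in Hint 4.
write_handoff_template() {
  mkdir -p "$(dirname "$HANDOFF_FILE")"
  cat > "$HANDOFF_FILE" <<EOF
Handoff notes ($(date +%Y-%m-%d))

What happened during the shift:

Ongoing issues:

Things to watch:

Tickets created:
EOF
}
```

Note that the function overwrites any existing file, so run it once before you start filling in the sections.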

Hint 5: When in doubt, document

If you are unsure whether an alert is real or a false positive, document your reasoning. Show your work. The next on-call will thank you.

Grading

./grade.sh

Solution

See the solution/ directory for the expected alert responses.