On-Call Survival Guides
Pocket-card reference for your first on-call rotation. Each guide: what to check first, in what order, and when to escalate.
Designed for 3 AM. Terse. Scannable. No fluff.
| Domain | Common Alerts | Guide |
|---|---|---|
| Kubernetes | Pod crash, node down, deploy stuck, PVC pending, OOMKill | kubernetes.md |
| Networking | DNS failure, TLS error, ingress 404/502, cert expiry | networking.md |
| Linux/OS | Disk full, OOM, CPU spike, zombie processes, systemd failure | linux.md |
| Databases | Connection exhaustion, replication lag, disk full, long-running query, locks | databases.md |
| Observability | Alert storm, missing metrics, Grafana blank, Loki gap, Tempo gap | observability.md |
| CI/CD | Build failure, deploy failure, rollback needed, registry issue | cicd.md |
| Security | Compromised creds, unauthorized access, CVE alert, cert issue | security.md |
| Cloud/Infrastructure | Provider outage, Terraform drift, capacity limit, cost spike | cloud-infrastructure.md |
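If you want to script the lookup, the table above can be held as a plain mapping. This is a minimal sketch in Python, not part of the guides themselves: the keywords and filenames are copied from the table, while `find_guides` and the naive substring matching are assumptions made for illustration. Short keywords such as "oom" or "locks" will over-match, and an ambiguous alert returns more than one guide.

```python
# Alert keywords per guide, copied from the table above (lowercased).
GUIDES = {
    "kubernetes.md": ["pod crash", "node down", "deploy stuck", "pvc pending", "oomkill"],
    "networking.md": ["dns failure", "tls error", "ingress 404/502", "cert expiry"],
    "linux.md": ["disk full", "oom", "cpu spike", "zombie processes", "systemd failure"],
    "databases.md": ["connection exhaustion", "replication lag", "disk full",
                     "long-running query", "locks"],
    "observability.md": ["alert storm", "missing metrics", "grafana blank",
                         "loki gap", "tempo gap"],
    "cicd.md": ["build failure", "deploy failure", "rollback needed", "registry issue"],
    "security.md": ["compromised creds", "unauthorized access", "cve alert", "cert issue"],
    "cloud-infrastructure.md": ["provider outage", "terraform drift",
                                "capacity limit", "cost spike"],
}


def find_guides(alert: str) -> list[str]:
    """Return every guide whose keyword list matches the alert text.

    Hypothetical helper: case-insensitive substring match, table order preserved.
    """
    text = alert.lower()
    return [
        guide
        for guide, keywords in GUIDES.items()
        if any(keyword in text for keyword in keywords)
    ]
```

An alert like "Disk full" appears under both Linux/OS and Databases, so the helper returns both guides and leaves the choice to you.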
How to use these guides
- Find your alert in the domain guide.
- Run the first command — it tells you what's actually wrong.
- Follow the decision tree — each branch ends in an action or an escalation.
- Know your limits — Safe vs Dangerous tables tell you what you can do alone.
- Escalate early. Better to page someone and not need them than to need them and not page.
Shift handoff template
Status: [GREEN / YELLOW / RED]
Active incidents: [none / description]
Recent deploys: [list from last 24h]
Known flaky alerts: [list]
Things to watch: [anything unusual]
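The template can also be filled in programmatically so that every handoff has the same five fields. This is a minimal sketch in Python under stated assumptions: the `Handoff` class and its field names are hypothetical, and rendering an empty list as "none" is an assumption extended from the "Active incidents" line to the other list fields.

```python
from dataclasses import dataclass, field

VALID_STATUSES = ("GREEN", "YELLOW", "RED")


def _join(items: list[str]) -> str:
    # Assumption: an empty list is written as "none", as the template
    # does for active incidents.
    return "; ".join(items) if items else "none"


@dataclass
class Handoff:
    """Hypothetical container for the five handoff fields."""

    status: str
    active_incidents: list[str] = field(default_factory=list)
    recent_deploys: list[str] = field(default_factory=list)  # last 24h
    flaky_alerts: list[str] = field(default_factory=list)
    things_to_watch: list[str] = field(default_factory=list)

    def render(self) -> str:
        # Reject anything other than the three statuses the template allows.
        if self.status not in VALID_STATUSES:
            raise ValueError(
                f"status must be one of {VALID_STATUSES}, got {self.status!r}"
            )
        return "\n".join([
            f"Status: {self.status}",
            f"Active incidents: {_join(self.active_incidents)}",
            f"Recent deploys: {_join(self.recent_deploys)}",
            f"Known flaky alerts: {_join(self.flaky_alerts)}",
            f"Things to watch: {_join(self.things_to_watch)}",
        ])


# Example values are made up for illustration.
note = Handoff(
    status="YELLOW",
    active_incidents=["elevated checkout latency"],
    recent_deploys=["api rollout"],
)
print(note.render())
```

Validating the status up front keeps a typo such as "AMBER" from reaching the next shift.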
Full runbooks
- training/library/runbooks/: comprehensive step-by-step procedures.
- training/library/decision-trees/: structured decision flows per scenario.