On-Call Survival Guides

Pocket-card reference for your first on-call rotation. Each guide: what to check first, in what order, and when to escalate.

Designed for 3 AM. Terse. Scannable. No fluff.

Domain	Common Alerts	Guide
Kubernetes	Pod crash, node down, deploy stuck, PVC pending, OOMKill	kubernetes.md
Networking	DNS failure, TLS error, ingress 404/502, cert expiry	networking.md
Linux/OS	Disk full, OOM, CPU spike, zombie processes, systemd failure	linux.md
Databases	Connection exhaustion, replication lag, disk full, long-running query, locks	databases.md
Observability	Alert storm, missing metrics, Grafana blank, Loki gap, Tempo gap	observability.md
CI/CD	Build failure, deploy failure, rollback needed, registry issue	cicd.md
Security	Compromised creds, unauthorized access, CVE alert, cert issue	security.md
Cloud/Infrastructure	Provider outage, Terraform drift, capacity limit, cost spike	cloud-infrastructure.md
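If you want the table above as a scriptable lookup, a minimal sketch follows. The alert names and guide filenames come from the table; the `GUIDES` dict and `find_guides` helper are hypothetical names, not part of any existing tooling. Matching is exact and case-insensitive, so an alert listed under two domains returns both guides.

```python
# Alert-to-guide lookup built from the domain table.
# GUIDES and find_guides are illustrative names (assumptions), not existing tooling.
GUIDES = {
    "kubernetes.md": ["pod crash", "node down", "deploy stuck", "pvc pending", "oomkill"],
    "networking.md": ["dns failure", "tls error", "ingress 404/502", "cert expiry"],
    "linux.md": ["disk full", "oom", "cpu spike", "zombie processes", "systemd failure"],
    "databases.md": ["connection exhaustion", "replication lag", "disk full",
                     "long-running query", "locks"],
    "observability.md": ["alert storm", "missing metrics", "grafana blank",
                         "loki gap", "tempo gap"],
    "cicd.md": ["build failure", "deploy failure", "rollback needed", "registry issue"],
    "security.md": ["compromised creds", "unauthorized access", "cve alert", "cert issue"],
    "cloud-infrastructure.md": ["provider outage", "terraform drift",
                                "capacity limit", "cost spike"],
}


def find_guides(alert: str) -> list[str]:
    """Return every guide that lists the alert (exact, case-insensitive match)."""
    needle = alert.strip().lower()
    return [guide for guide, alerts in GUIDES.items() if needle in alerts]
```

"Disk full" appears under both Linux/OS and Databases, so `find_guides("Disk full")` returns two guides. Check which host raised the alert before picking one.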

How to use these guides

Find your alert in the domain guide.
Run the first command — it tells you what's actually wrong.
Follow the decision tree — each branch ends in an action or an escalation.
Know your limits — Safe vs Dangerous tables tell you what you can do alone.
Escalate early. Better to page someone and not need them than to need them and not page.
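The shape of a decision tree as described above, sketched in code: every branch ends in a leaf that is either an action or an escalation. The `Leaf`, `Check`, and `walk` names and the sample questions are hypothetical; the real trees live in the domain guides.

```python
from dataclasses import dataclass
from typing import Union


@dataclass
class Leaf:
    kind: str  # "action" or "escalate"
    text: str


@dataclass
class Check:
    question: str
    yes: "Node"
    no: "Node"


Node = Union[Leaf, Check]

# Hypothetical tree: the questions are placeholders, not taken from a real guide.
TREE = Check(
    question="Did the first command identify the cause?",
    yes=Check(
        question="Is the fix listed as Safe?",
        yes=Leaf("action", "Apply the fix from the guide."),
        no=Leaf("escalate", "Page someone before touching anything."),
    ),
    no=Leaf("escalate", "Page someone and attach the command output."),
)


def walk(node: Node, answers: dict[str, bool]) -> Leaf:
    """Follow yes/no answers down the tree until a leaf is reached."""
    while isinstance(node, Check):
        node = node.yes if answers[node.question] else node.no
    return node
```

Usage: `walk(TREE, {"Did the first command identify the cause?": False})` returns the escalation leaf without asking the second question.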

Shift handoff template

Status: [GREEN / YELLOW / RED]
Active incidents: [none / description]
Recent deploys: [list from last 24h]
Known flaky alerts: [list]
Things to watch: [anything unusual]
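To fill the template from a script instead of by hand, a minimal sketch follows. The five field labels come from the template above; the `Handoff` class, its attribute names, and printing "none" for every empty list are assumptions.

```python
from dataclasses import dataclass, field

VALID_STATUS = {"GREEN", "YELLOW", "RED"}


@dataclass
class Handoff:
    # Hypothetical container; attribute names mirror the template fields.
    status: str
    active_incidents: list[str] = field(default_factory=list)
    recent_deploys: list[str] = field(default_factory=list)
    flaky_alerts: list[str] = field(default_factory=list)
    watch: list[str] = field(default_factory=list)

    def render(self) -> str:
        """Return the handoff note as text, one template field per line."""
        if self.status not in VALID_STATUS:
            raise ValueError(f"status must be one of {sorted(VALID_STATUS)}")

        def fmt(items: list[str]) -> str:
            # Assumption: empty lists print as "none" for every field.
            return ", ".join(items) if items else "none"

        return "\n".join([
            f"Status: {self.status}",
            f"Active incidents: {fmt(self.active_incidents)}",
            f"Recent deploys: {fmt(self.recent_deploys)}",
            f"Known flaky alerts: {fmt(self.flaky_alerts)}",
            f"Things to watch: {fmt(self.watch)}",
        ])
```

Usage: `print(Handoff("YELLOW", active_incidents=["replication lag on db-2"]).render())`. The incident text here is a made-up example.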

Full runbooks

training/library/runbooks/: comprehensive step-by-step procedures.
training/library/decision-trees/: structured decision flows per scenario.
