
On-Call Survival Guides

Pocket-card reference for your first on-call rotation. Each guide: what to check first, in what order, and when to escalate.

Designed for 3 AM. Terse. Scannable. No fluff.


| Domain | Common Alerts | Guide |
|---|---|---|
| Kubernetes | Pod crash, node down, deploy stuck, PVC pending, OOMKill | kubernetes.md |
| Networking | DNS failure, TLS error, ingress 404/502, cert expiry | networking.md |
| Linux/OS | Disk full, OOM, CPU spike, zombie processes, systemd failure | linux.md |
| Databases | Connection exhaustion, replication lag, disk full, long-running query, locks | databases.md |
| Observability | Alert storm, missing metrics, Grafana blank, Loki gap, Tempo gap | observability.md |
| CI/CD | Build failure, deploy failure, rollback needed, registry issue | cicd.md |
| Security | Compromised creds, unauthorized access, CVE alert, cert issue | security.md |
| Cloud/Infrastructure | Provider outage, Terraform drift, capacity limit, cost spike | cloud-infrastructure.md |

How to use these guides

  1. Find your alert in the domain guide.
  2. Run the first command — it tells you what's actually wrong.
  3. Follow the decision tree — each branch ends in an action or an escalation.
  4. Know your limits — Safe vs Dangerous tables tell you what you can do alone.
  5. Escalate early. Better to page someone and not need them than to need them and not page.
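
The "each branch ends in an action or an escalation" rule in step 3 can be pictured as a small tree structure. The sketch below is illustrative only: the `Check`, `Action`, and `Escalate` types, the `walk` helper, and the disk-full questions are hypothetical and are not taken from the guides themselves.

```python
from dataclasses import dataclass
from typing import Callable, Union


@dataclass
class Action:
    """Terminal node: something you can safely do alone."""
    description: str


@dataclass
class Escalate:
    """Terminal node: stop and page someone."""
    to: str


@dataclass
class Check:
    """Branch node: a yes/no question answered by running a command."""
    question: str
    yes: "Node"
    no: "Node"


Node = Union[Check, Action, Escalate]


def walk(node: Node, answer: Callable[[str], bool]) -> Node:
    """Follow branches until reaching an Action or an Escalate."""
    while isinstance(node, Check):
        node = node.yes if answer(node.question) else node.no
    return node


# Hypothetical tree for a disk-full alert (not copied from linux.md).
disk_full = Check(
    question="Is the disk over 90% full?",
    yes=Check(
        question="Is the growth from logs you are allowed to rotate?",
        yes=Action("Rotate logs, then re-check usage"),
        no=Escalate("service owner"),
    ),
    no=Action("Note the alert as flaky in the shift handoff"),
)
```

Because `walk` only stops on an `Action` or an `Escalate`, every path through a tree built this way ends in one of the two outcomes the guides promise.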

Shift handoff template

Status: [GREEN / YELLOW / RED]
Active incidents: [none / description]
Recent deploys: [list from last 24h]
Known flaky alerts: [list]
Things to watch: [anything unusual]
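
If you would rather generate the handoff than type it, a minimal sketch along these lines could fill in the template above. The `Handoff` class, its field names, and `render_handoff` are assumptions made for illustration; only the five output labels and the three status values come from the template.

```python
from dataclasses import dataclass, field

VALID_STATUSES = ("GREEN", "YELLOW", "RED")


@dataclass
class Handoff:
    """Hypothetical container for the five template fields."""
    status: str
    active_incidents: list[str] = field(default_factory=list)
    recent_deploys: list[str] = field(default_factory=list)
    flaky_alerts: list[str] = field(default_factory=list)
    watch_items: list[str] = field(default_factory=list)


def _join(items: list[str]) -> str:
    """Render a list on one line, or 'none' when it is empty."""
    return ", ".join(items) if items else "none"


def render_handoff(h: Handoff) -> str:
    """Return the handoff note in the same order as the template."""
    if h.status not in VALID_STATUSES:
        raise ValueError(f"status must be one of {VALID_STATUSES}, got {h.status!r}")
    return "\n".join([
        f"Status: {h.status}",
        f"Active incidents: {_join(h.active_incidents)}",
        f"Recent deploys: {_join(h.recent_deploys)}",
        f"Known flaky alerts: {_join(h.flaky_alerts)}",
        f"Things to watch: {_join(h.watch_items)}",
    ])
```

Rejecting an unknown status up front keeps a typo from producing a note that the next shift cannot read at a glance.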

Full runbooks

training/library/runbooks/: comprehensive step-by-step procedures.

training/library/decision-trees/: structured decision flows per scenario.