Operational Runbooks
Step-by-step procedures for common operational incidents. Designed for engineers on their first on-call rotation — no tribal knowledge required.
How to Use
- Alert fires → find the matching runbook by domain and trigger below
- Quick Assessment → run the 30-second triage command first
- Follow steps in order — do not skip steps
- Hit escalation trigger? → page the next person immediately, don't wait
- After resolution → update the runbook if steps were wrong or missing
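The first step above (match the alert to a runbook by domain and trigger) can be sketched as a small lookup. This is only an illustration: the `Runbook` class, the `INDEX` list, and `find_runbooks` are hypothetical names, and the rows are a handful copied from the index tables on this page rather than a complete catalogue.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Runbook:
    name: str
    domain: str
    trigger: str
    severity: str


# A few rows copied from the index tables on this page (not exhaustive).
INDEX = [
    Runbook("Pod CrashLoopBackOff", "Kubernetes", "container_restart_rate > 5/min", "P2"),
    Runbook("Node NotReady", "Kubernetes", "kube_node_status_condition{Ready=false}", "P1"),
    Runbook("Disk Full", "Linux", "filesystem_avail < 10%", "P1/P2"),
    Runbook("Prometheus Target Down", "Observability", "up == 0 for >2 min", "P2"),
]


def find_runbooks(domain: str, keyword: str) -> list[Runbook]:
    """Return runbooks in `domain` whose name or trigger contains `keyword`.

    Matching is case-insensitive so that an alert name pasted from a pager
    message still finds the right row.
    """
    kw = keyword.lower()
    return [
        rb
        for rb in INDEX
        if rb.domain.lower() == domain.lower()
        and (kw in rb.name.lower() or kw in rb.trigger.lower())
    ]


if __name__ == "__main__":
    for rb in find_runbooks("kubernetes", "restart"):
        print(f"{rb.severity}: {rb.name} (trigger: {rb.trigger})")
```

Searching both the name and the trigger text is a deliberate choice here: an on-call engineer may only have the metric name from the alert, not the runbook title.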
Runbook Index
Kubernetes
| Runbook | Alert Trigger | Severity | Est. Time |
|---|---|---|---|
| Pod CrashLoopBackOff | `container_restart_rate > 5/min` | P2 | 15-30 min |
| Node NotReady | `kube_node_status_condition{Ready=false}` | P1 | 20-45 min |
| Deployment Stuck / Rollout Stalled | Available replicas < desired for >10 min | P2 | 15-30 min |
| PVC Stuck Pending | PVC in Pending state >5 min | P2 | 15-25 min |
| OOMKilled Container | Exit code 137 / OOMKilled event | P2 | 10-20 min |
| Ingress 502 Bad Gateway | HTTP 502 rate elevated | P1 | 15-30 min |
| HPA Thrashing | Replica count changing rapidly | P3 | 20-40 min |
| etcd High Latency | etcd WAL fsync >10ms sustained | P1 | 30-60 min |
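The `137` in the OOMKilled row follows the common convention that a process killed by a signal reports an exit code of 128 plus the signal number, so 137 corresponds to signal 9 (SIGKILL). The sketch below decodes that arithmetic; `describe_exit_code` is a hypothetical helper, and it assumes a POSIX system where Python's `signal` module knows the signal numbers. Note that 137 alone only indicates SIGKILL; the OOMKilled event or reason in the pod status is what confirms the kernel OOM killer was responsible.

```python
import signal


def describe_exit_code(code: int) -> str:
    """Explain a container exit code using the 128 + signal-number convention."""
    if code == 0:
        return "exited cleanly"
    if code > 128:
        signum = code - 128
        try:
            # signal.Signals raises ValueError for numbers it does not know.
            name = signal.Signals(signum).name
        except ValueError:
            return f"killed by unknown signal {signum}"
        return f"killed by {name} (signal {signum})"
    return f"exited with application error code {code}"


if __name__ == "__main__":
    for code in (0, 1, 137, 143):
        print(code, "->", describe_exit_code(code))
```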
Networking
| Runbook | Alert Trigger | Severity | Est. Time |
|---|---|---|---|
| DNS Resolution Failure | Service name not resolving | P1 | 15-30 min |
| TLS Certificate Expiry | `ssl_certificate_expiry_seconds < 604800` | P1/P2 | 30-60 min |
| Load Balancer Health Check Failure | Unhealthy LB targets | P1 | 20-40 min |
| Network Partition | Asymmetric connectivity, elevated errors | P1 | 30-90 min |
| MTU Mismatch | Large packet drops, big transfers fail | P2 | 30-60 min |
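The `604800` in the TLS Certificate Expiry trigger is a count of seconds, which works out to exactly seven days. A minimal sketch of the same threshold check follows; `should_alert` and `EXPIRY_THRESHOLD_SECONDS` are hypothetical names, and the certificate's expiry time is assumed to be already available as a timezone-aware `datetime`.

```python
from datetime import datetime, timedelta, timezone

# Threshold taken from the index table: 604800 seconds = 7 days.
EXPIRY_THRESHOLD_SECONDS = 604800


def should_alert(not_after: datetime, now: datetime) -> bool:
    """Return True when the certificate expires in less than the threshold."""
    remaining = (not_after - now).total_seconds()
    return remaining < EXPIRY_THRESHOLD_SECONDS


if __name__ == "__main__":
    now = datetime.now(timezone.utc)
    print(timedelta(seconds=EXPIRY_THRESHOLD_SECONDS))      # 7 days, 0:00:00
    print(should_alert(now + timedelta(days=6), now))       # True
    print(should_alert(now + timedelta(days=8), now))       # False
```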
Linux
| Runbook | Alert Trigger | Severity | Est. Time |
|---|---|---|---|
| Disk Full | `filesystem_avail < 10%` | P1/P2 | 15-30 min |
| OOM Killer Activated | `node_vmstat_oom_kill > 0` | P1 | 20-40 min |
| High CPU / Runaway Process | `node_cpu_idle < 10%` sustained | P2 | 15-30 min |
| Zombie Processes | `node_processes_state{Z} > 10` | P3 | 20-40 min |
| Systemd Service Crash Loop | Service unit repeatedly restarting | P2 | 15-30 min |
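For a quick local check of the Disk Full condition, the 10% threshold can be reproduced with Python's standard library. This is a sketch, not the alert's actual definition: it assumes `filesystem_avail` means the percentage of the filesystem still free, and `percent_available` and `disk_is_low` are hypothetical helper names.

```python
import shutil


def percent_available(total: int, free: int) -> float:
    """Free space as a percentage of total capacity."""
    return 100.0 * free / total


def disk_is_low(path: str = "/", threshold_percent: float = 10.0) -> bool:
    """Return True when free space on the filesystem holding `path` is below the threshold."""
    usage = shutil.disk_usage(path)  # named tuple: total, used, free (bytes)
    return percent_available(usage.total, usage.free) < threshold_percent


if __name__ == "__main__":
    usage = shutil.disk_usage("/")
    print(f"/ has {percent_available(usage.total, usage.free):.1f}% free")
    print("below 10% threshold:", disk_is_low("/"))
```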
Databases
| Runbook | Alert Trigger | Severity | Est. Time |
|---|---|---|---|
| PostgreSQL Connection Exhaustion | Connections ≥ 90% of max_connections | P1 | 15-30 min |
| PostgreSQL Replication Lag | `pg_replication_lag_seconds > 30` | P2 | 20-45 min |
| Long-Running Query / Lock Contention | Transaction duration >5 min | P2 | 15-30 min |
| PostgreSQL Disk Space Critical | DB volume <10% free | P1 | 20-45 min |
Observability
| Runbook | Alert Trigger | Severity | Est. Time |
|---|---|---|---|
| Prometheus Target Down | `up == 0` for >2 min | P2 | 15-30 min |
| Grafana Dashboard Blank | Panels show "No data" | P2 | 15-30 min |
| Alert Storm | >20 alerts firing simultaneously | P1 | 20-45 min |
| Log Pipeline Backpressure | Logs missing in Grafana >5 min | P2 | 20-40 min |
CI/CD
| Runbook | Alert Trigger | Severity | Est. Time |
|---|---|---|---|
| Build Failure Triage | Main branch pipeline failing | P2 | 15-30 min |
| Deploy Rollback | Elevated errors after deployment | P1 | 10-20 min |
| Container Registry Pull Failure | ImagePullBackOff pods | P1/P2 | 15-30 min |
| Pipeline Stuck / Hung Job | Job running >2x expected duration | P2 | 15-30 min |
Security
| Runbook | Alert Trigger | Severity | Est. Time |
|---|---|---|---|
| Credential Rotation (Exposed Secret) | Secret detected in logs/git/public | P1 | 30-60 min |
| CVE Response | Critical/high CVE in production image | P1/P2/P3 | 1-4 hours |
| Unauthorized Access Investigation | Anomalous API calls or security alert | P1 | 1-2 hours |
Cloud / Terraform
| Runbook | Alert Trigger | Severity | Est. Time |
|---|---|---|---|
| Terraform State Lock Stuck | `terraform plan` fails with state lock error | P2 | 15-30 min |
| Terraform Drift Detection | `terraform plan` shows unexpected changes | P2 | 30-60 min |
| Cloud Capacity Limit Hit | Resource creation fails with quota error | P1/P2 | 30-120 min |
Additional Runbooks
The following runbooks use the shorter format (Symptoms / Fast Triage / Fix / Prevention). They are still valid and actively maintained.
| Runbook | Domain | Trigger |
|---|---|---|
| ImagePullBackOff | Kubernetes | Can't pull container image |
| Readiness Probe Failed | Kubernetes | Pod running but not ready |
| Pod Eviction | Kubernetes | Pods evicted due to node resource pressure |
| Ingress 404 | Kubernetes | Ingress returning 404 for valid paths |
| NetworkPolicy Block | Kubernetes | Traffic blocked by network policy |
| Istio 503 Errors | Kubernetes | Service mesh returning 503s |
| Helm Upgrade Failed | CI/CD | Helm release stuck or failed |
| HPA Not Scaling | Kubernetes | Autoscaler not adding pods |
| RBAC Forbidden | Kubernetes | 403 errors from API server |
| Cert Renewal Failed | Security | TLS certificate expired or failing renewal |
| Secret Rotation | Security | Credential rotation procedure |
| Kyverno Blocking Workloads | Security | Policy engine rejecting deployments |
| Loki No Logs | Observability | Missing logs in Loki/Grafana |
| Tempo No Traces | Observability | Missing traces in Tempo |
| ArgoCD Out of Sync | CI/CD | Application drifted from Git state |
| VPC IP Exhaustion | Cloud | Running out of pod/node IPs |
| etcd Backup/Restore | Kubernetes | etcd data loss or corruption |
| Disaster Recovery | Kubernetes | Full cluster recovery procedure |
| Velero Backup/Restore | Kubernetes | Cluster backup and restore with Velero |
Related
- Case Studies — real incident write-ups (60 studies, 4 domains)
- Cheatsheets — quick command reference
- Topics — deep background on each domain
- Interview Scenarios — practice explaining these incidents