Operational Runbooks

Step-by-step procedures for common operational incidents, written for engineers on their first on-call rotation. No tribal knowledge is required.

How to Use

Runbook	Alert Trigger	Severity	Est. Time
Pod CrashLoopBackOff	`container_restart_rate > 5/min`	P2	15-30 min
Node NotReady	`kube_node_status_condition{condition="Ready",status="false"} == 1`	P1	20-45 min
Deployment Stuck / Rollout Stalled	Available replicas < desired for >10 min	P2	15-30 min
PVC Stuck Pending	PVC in Pending state >5 min	P2	15-25 min
OOMKilled Container	Exit code 137 / OOMKilled event	P2	10-20 min
Ingress 502 Bad Gateway	HTTP 502 rate elevated	P1	15-30 min
HPA Thrashing	Replica count changing rapidly	P3	20-40 min
etcd High Latency	etcd WAL fsync >10ms sustained	P1	30-60 min

Runbook	Alert Trigger	Severity	Est. Time
DNS Resolution Failure	Service name not resolving	P1	15-30 min
TLS Certificate Expiry	`ssl_certificate_expiry_seconds < 604800`	P1/P2	30-60 min
Load Balancer Health Check Failure	Unhealthy LB targets	P1	20-40 min
Network Partition	Asymmetric connectivity, elevated errors	P1	30-90 min
MTU Mismatch	Large packet drops, big transfers fail	P2	30-60 min
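
The TLS Certificate Expiry trigger above is expressed in raw seconds, which is hard to read at a glance. As a minimal sketch, the conversion and comparison can be written as follows. The function names `seconds_to_days` and `should_alert` are hypothetical helpers introduced for illustration and are not part of any runbook; only the 604800 threshold comes from the table.

```python
# Threshold copied from the TLS Certificate Expiry row (seconds).
EXPIRY_THRESHOLD_SECONDS = 604800

SECONDS_PER_DAY = 24 * 60 * 60  # 86400


def seconds_to_days(seconds: float) -> float:
    """Convert a duration in seconds to days."""
    return seconds / SECONDS_PER_DAY


def should_alert(seconds_until_expiry: float,
                 threshold: float = EXPIRY_THRESHOLD_SECONDS) -> bool:
    """Mirror the `< threshold` comparison used by the alert trigger."""
    return seconds_until_expiry < threshold


if __name__ == "__main__":
    # The 604800-second threshold corresponds to 7 days.
    print(seconds_to_days(EXPIRY_THRESHOLD_SECONDS))
    # A certificate with 3 days left is below the threshold.
    print(should_alert(3 * SECONDS_PER_DAY))
```

Because the comparison is strict, a certificate with exactly 7 days remaining does not trigger under this sketch.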

Runbook	Alert Trigger	Severity	Est. Time
Disk Full	`filesystem_avail < 10%`	P1/P2	15-30 min
OOM Killer Activated	`node_vmstat_oom_kill > 0`	P1	20-40 min
High CPU / Runaway Process	`node_cpu_idle < 10%` sustained	P2	15-30 min
Zombie Processes	`node_processes_state{state="Z"} > 10`	P3	20-40 min
Systemd Service Crash Loop	Service unit repeatedly restarting	P2	15-30 min

Runbook	Alert Trigger	Severity	Est. Time
PostgreSQL Connection Exhaustion	Connections ≥ 90% of max_connections	P1	15-30 min
PostgreSQL Replication Lag	`pg_replication_lag_seconds > 30`	P2	20-45 min
Long-Running Query / Lock Contention	Transaction duration >5 min	P2	15-30 min
PostgreSQL Disk Space Critical	DB volume <10% free	P1	20-45 min

Runbook	Alert Trigger	Severity	Est. Time
Prometheus Target Down	`up == 0` for >2 min	P2	15-30 min
Grafana Dashboard Blank	Panels show "No data"	P2	15-30 min
Alert Storm	>20 alerts firing simultaneously	P1	20-45 min
Log Pipeline Backpressure	Logs missing in Grafana >5 min	P2	20-40 min

Runbook	Alert Trigger	Severity	Est. Time
Build Failure Triage	Main branch pipeline failing	P2	15-30 min
Deploy Rollback	Elevated errors after deployment	P1	10-20 min
Container Registry Pull Failure	ImagePullBackOff pods	P1/P2	15-30 min
Pipeline Stuck / Hung Job	Job running >2x expected duration	P2	15-30 min

Runbook	Alert Trigger	Severity	Est. Time
Credential Rotation (Exposed Secret)	Secret detected in logs/git/public	P1	30-60 min
CVE Response	Critical/high CVE in production image	P1/P2/P3	1-4 hours
Unauthorized Access Investigation	Anomalous API calls or security alert	P1	1-2 hours

Runbook	Alert Trigger	Severity	Est. Time
Terraform State Lock Stuck	`terraform plan` fails with state lock error	P2	15-30 min
Terraform Drift Detection	`terraform plan` shows unexpected changes	P2	30-60 min
Cloud Capacity Limit Hit	Resource creation fails with quota error	P1/P2	30-120 min

The following runbooks use the shorter format (Symptoms / Fast Triage / Fix / Prevention). They are still valid and actively maintained.

Runbook	Domain	Trigger
ImagePullBackOff	Kubernetes	Can't pull container image
Readiness Probe Failed	Kubernetes	Pod running but not ready
Pod Eviction	Kubernetes	Pods evicted due to node resource pressure
Ingress 404	Kubernetes	Ingress returning 404 for valid paths
NetworkPolicy Block	Kubernetes	Traffic blocked by network policy
Istio 503 Errors	Kubernetes	Service mesh returning 503s
Helm Upgrade Failed	CI/CD	Helm release stuck or failed
HPA Not Scaling	Kubernetes	Autoscaler not adding pods
RBAC Forbidden	Kubernetes	403 errors from API server
Cert Renewal Failed	Security	TLS certificate expired or failing renewal
Secret Rotation	Security	Credential rotation procedure
Kyverno Blocking Workloads	Security	Policy engine rejecting deployments
Loki No Logs	Observability	Missing logs in Loki/Grafana
Tempo No Traces	Observability	Missing traces in Tempo
ArgoCD Out of Sync	CI/CD	Application drifted from Git state
VPC IP Exhaustion	Cloud	Running out of pod/node IPs
etcd Backup/Restore	Kubernetes	etcd data loss or corruption
Disaster Recovery	Kubernetes	Full cluster recovery procedure
Velero Backup/Restore	Kubernetes	Cluster backup and restore with Velero