Operational Runbooks

Step-by-step procedures for common operational incidents. Designed for engineers on their first on-call rotation — no tribal knowledge required.

How to Use

  1. Alert fires → find the matching runbook by domain and trigger below
  2. Quick Assessment → run the 30-second triage command first
  3. Follow steps in order — do not skip steps
  4. Hit escalation trigger? → page the next person immediately, don't wait
  5. After resolution → update the runbook if steps were wrong or missing

Runbook Index

Kubernetes

| Runbook | Alert Trigger | Severity | Est. Time |
|---|---|---|---|
| Pod CrashLoopBackOff | container_restart_rate > 5/min | P2 | 15-30 min |
| Node NotReady | kube_node_status_condition{Ready=false} | P1 | 20-45 min |
| Deployment Stuck / Rollout Stalled | Available replicas < desired for >10 min | P2 | 15-30 min |
| PVC Stuck Pending | PVC in Pending state >5 min | P2 | 15-25 min |
| OOMKilled Container | Exit code 137 / OOMKilled event | P2 | 10-20 min |
| Ingress 502 Bad Gateway | HTTP 502 rate elevated | P1 | 15-30 min |
| HPA Thrashing | Replica count changing rapidly | P3 | 20-40 min |
| etcd High Latency | etcd WAL fsync >10ms sustained | P1 | 30-60 min |

Networking

| Runbook | Alert Trigger | Severity | Est. Time |
|---|---|---|---|
| DNS Resolution Failure | Service name not resolving | P1 | 15-30 min |
| TLS Certificate Expiry | ssl_certificate_expiry_seconds < 604800 | P1/P2 | 30-60 min |
| Load Balancer Health Check Failure | Unhealthy LB targets | P1 | 20-40 min |
| Network Partition | Asymmetric connectivity, elevated errors | P1 | 30-90 min |
| MTU Mismatch | Large packet drops, big transfers fail | P2 | 30-60 min |
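
The TLS Certificate Expiry threshold of 604800 seconds is 7 days (7 × 86400). The following is a minimal Python sketch of the same comparison against a certificate's notAfter string. It assumes that ssl_certificate_expiry_seconds means seconds remaining until expiry; the helper names `seconds_until_expiry` and `should_alert` are hypothetical and not part of any existing tooling.

```python
import ssl
import time
from typing import Optional

# 7 days in seconds: the 604800 value used in the alert trigger.
EXPIRY_THRESHOLD_SECONDS = 7 * 24 * 60 * 60


def seconds_until_expiry(not_after: str, now: Optional[float] = None) -> float:
    """Return seconds left before a certificate expires.

    `not_after` uses the format found in ssl.SSLSocket.getpeercert(),
    for example "Jan 15 09:34:43 2030 GMT".
    """
    if now is None:
        now = time.time()
    return ssl.cert_time_to_seconds(not_after) - now


def should_alert(not_after: str, now: Optional[float] = None) -> bool:
    """True when the certificate expires in under 7 days (hypothetical helper)."""
    return seconds_until_expiry(not_after, now) < EXPIRY_THRESHOLD_SECONDS
```

Passing `now` explicitly keeps the check deterministic, which helps when replaying an incident timeline.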

Linux

| Runbook | Alert Trigger | Severity | Est. Time |
|---|---|---|---|
| Disk Full | filesystem_avail < 10% | P1/P2 | 15-30 min |
| OOM Killer Activated | node_vmstat_oom_kill > 0 | P1 | 20-40 min |
| High CPU / Runaway Process | node_cpu_idle < 10% sustained | P2 | 15-30 min |
| Zombie Processes | node_processes_state{Z} > 10 | P3 | 20-40 min |
| Systemd Service Crash Loop | Service unit repeatedly restarting | P2 | 15-30 min |
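
The Disk Full trigger fires when less than 10% of a filesystem is available. A rough Python equivalent for a single mount point is sketched below using the standard library's `shutil.disk_usage`. The helper names and the default path are illustrative assumptions; the real alert is presumably evaluated from node metrics rather than from a script like this.

```python
import shutil


def free_fraction(path: str = "/") -> float:
    """Fraction of the filesystem containing `path` that is still free."""
    usage = shutil.disk_usage(path)  # named tuple: total, used, free (bytes)
    return usage.free / usage.total


def is_disk_full(path: str = "/", threshold: float = 0.10) -> bool:
    """True when the free fraction is below `threshold` (10% by default)."""
    return free_fraction(path) < threshold
```

Running `is_disk_full("/var")` on a host gives a quick yes/no before working through the full runbook.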

Databases

| Runbook | Alert Trigger | Severity | Est. Time |
|---|---|---|---|
| PostgreSQL Connection Exhaustion | Connections ≥ 90% of max_connections | P1 | 15-30 min |
| PostgreSQL Replication Lag | pg_replication_lag_seconds > 30 | P2 | 20-45 min |
| Long-Running Query / Lock Contention | Transaction duration >5 min | P2 | 15-30 min |
| PostgreSQL Disk Space Critical | DB volume <10% free | P1 | 20-45 min |
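
The Connection Exhaustion trigger is a ratio check: current connections against 90% of max_connections. A small Python sketch of that arithmetic follows. The function name is hypothetical, and the two inputs are plain integers here; in practice they would come from the database itself.

```python
def connection_usage_alert(
    current: int, max_connections: int, threshold_percent: int = 90
) -> bool:
    """True when current connections are at or above the threshold.

    Integer arithmetic avoids floating-point rounding at the boundary.
    """
    if max_connections <= 0:
        raise ValueError("max_connections must be positive")
    return current * 100 >= max_connections * threshold_percent
```

With max_connections set to 100, the check trips at 90 connections and stays quiet at 89.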

Observability

| Runbook | Alert Trigger | Severity | Est. Time |
|---|---|---|---|
| Prometheus Target Down | up == 0 for >2 min | P2 | 15-30 min |
| Grafana Dashboard Blank | Panels show "No data" | P2 | 15-30 min |
| Alert Storm | >20 alerts firing simultaneously | P1 | 20-45 min |
| Log Pipeline Backpressure | Logs missing in Grafana >5 min | P2 | 20-40 min |

CI/CD

| Runbook | Alert Trigger | Severity | Est. Time |
|---|---|---|---|
| Build Failure Triage | Main branch pipeline failing | P2 | 15-30 min |
| Deploy Rollback | Elevated errors after deployment | P1 | 10-20 min |
| Container Registry Pull Failure | ImagePullBackOff pods | P1/P2 | 15-30 min |
| Pipeline Stuck / Hung Job | Job running >2x expected duration | P2 | 15-30 min |

Security

| Runbook | Alert Trigger | Severity | Est. Time |
|---|---|---|---|
| Credential Rotation (Exposed Secret) | Secret detected in logs/git/public | P1 | 30-60 min |
| CVE Response | Critical/high CVE in production image | P1/P2/P3 | 1-4 hours |
| Unauthorized Access Investigation | Anomalous API calls or security alert | P1 | 1-2 hours |

Cloud / Terraform

| Runbook | Alert Trigger | Severity | Est. Time |
|---|---|---|---|
| Terraform State Lock Stuck | terraform plan fails with state lock error | P2 | 15-30 min |
| Terraform Drift Detection | terraform plan shows unexpected changes | P2 | 30-60 min |
| Cloud Capacity Limit Hit | Resource creation fails with quota error | P1/P2 | 30-120 min |

Additional Runbooks

The following runbooks use the shorter format (Symptoms / Fast Triage / Fix / Prevention). They are still valid and actively maintained.

| Runbook | Domain | Trigger |
|---|---|---|
| ImagePullBackOff | Kubernetes | Can't pull container image |
| Readiness Probe Failed | Kubernetes | Pod running but not ready |
| Pod Eviction | Kubernetes | Pods evicted due to node resource pressure |
| Ingress 404 | Kubernetes | Ingress returning 404 for valid paths |
| NetworkPolicy Block | Kubernetes | Traffic blocked by network policy |
| Istio 503 Errors | Kubernetes | Service mesh returning 503s |
| Helm Upgrade Failed | CI/CD | Helm release stuck or failed |
| HPA Not Scaling | Kubernetes | Autoscaler not adding pods |
| RBAC Forbidden | Kubernetes | 403 errors from API server |
| Cert Renewal Failed | Security | TLS certificate expired or failing renewal |
| Secret Rotation | Security | Credential rotation procedure |
| Kyverno Blocking Workloads | Security | Policy engine rejecting deployments |
| Loki No Logs | Observability | Missing logs in Loki/Grafana |
| Tempo No Traces | Observability | Missing traces in Tempo |
| ArgoCD Out of Sync | CI/CD | Application drifted from Git state |
| VPC IP Exhaustion | Cloud | Running out of pod/node IPs |
| etcd Backup/Restore | Kubernetes | etcd data loss or corruption |
| Disaster Recovery | Kubernetes | Full cluster recovery procedure |
| Velero Backup/Restore | Kubernetes | Cluster backup and restore with Velero |