# On-Call Survival: Kubernetes
Print this. Pin it. Read it at 3 AM.
## Alert: Pod CrashLoopBackOff
Severity: P1 (if replicas = 0) / P2 (if some replicas up)
First command:

kubectl logs <pod> -n <ns> --previous

What you're looking for: The last error message before the crash (OOM kill, panic, missing env var, connection refused).

Decision tree:
Is it OOMKilled?
├── Yes → kubectl top pod <pod> -n <ns>; raise memory limit (runbook: runbooks/crashloopbackoff.md)
└── No → Is it a config/env error ("no such file", "connection refused", "env not set")?
├── Yes → Check ConfigMap/Secret mounts: kubectl describe pod <pod> -n <ns>
│ Fix config or rotate secret, then: kubectl rollout restart deploy/<name> -n <ns>
└── No → Is it a bad image?
├── Yes → kubectl set image deploy/<name> <container>=<previous-tag> -n <ns>
└── No → Escalate to app team: "Pod <name> in <ns> crashing, logs show: <last error>"
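The first branch of the tree can be scripted. A minimal sketch: the jsonpath query is standard kubectl (it needs a live cluster, so it is shown as a comment), while the helper and its action strings are illustrative names of ours, not part of this runbook:

```shell
# On a live cluster, pull the last termination reason for the first container:
#   kubectl get pod <pod> -n <ns> \
#     -o jsonpath='{.status.containerStatuses[0].lastState.terminated.reason}'
# Map that reason onto the decision tree above (action names are illustrative):
triage_crashloop() {
  case "$1" in
    OOMKilled) echo "raise-memory-limit" ;;       # top branch of the tree
    Error)     echo "read-previous-logs" ;;       # config/env or bad-image branch
    *)         echo "escalate-to-app-team" ;;     # anything else: escalate
  esac
}
```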
Escalation trigger: Pod has been crash-looping > 10 min, or replicas = 0 affecting user traffic.
Safe actions: Read logs, describe pod, describe node, get events.
Dangerous actions: kubectl delete pod (triggers restart), kubectl rollout restart (restarts all replicas), edit resource limits.
## Alert: Node NotReady
Severity: P1 (many nodes) / P2 (single node)
First command:

kubectl describe node <node>

What you're looking for: The Conditions section — DiskPressure, MemoryPressure, PIDPressure, NetworkUnavailable. Check Events at the bottom.

Decision tree:
Is it DiskPressure?
├── Yes → SSH to node, run: df -h; du -sh /var/lib/docker/* | sort -rh | head
│   Prune unused images: crictl rmi --prune; or cordon and drain the node
└── No → Is it MemoryPressure?
├── Yes → kubectl top nodes; look for noisy pod. Drain or wait for eviction.
└── No → Is it NetworkUnavailable?
├── Yes → Check the CNI pod on that node: kubectl get pods -n kube-system -o wide | grep <node>; on the node: journalctl -u kubelet -n 50
└── No → SSH to node; check: systemctl status kubelet. If dead → escalate to infra team.
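The condition check that drives this tree can also be scripted. A sketch under the same caveats as above: the jsonpath filter is standard kubectl (live cluster only, so commented out), and the helper name and action strings are illustrative:

```shell
# On a live cluster, list node conditions currently reporting status True:
#   kubectl get node <node> \
#     -o jsonpath='{range .status.conditions[?(@.status=="True")]}{.type}{"\n"}{end}'
# Map a firing pressure condition onto the tree above (names are illustrative):
triage_node() {
  case "$1" in
    DiskPressure)       echo "check-disk-usage" ;;   # df -h, prune images
    MemoryPressure)     echo "find-noisy-pod" ;;     # kubectl top nodes
    NetworkUnavailable) echo "check-cni" ;;          # CNI pod + kubelet logs
    *)                  echo "check-kubelet" ;;      # systemctl status kubelet
  esac
}
```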
Escalation trigger: Multiple nodes NotReady, cluster autoscaler not replacing node after 5 min, kubelet cannot be restarted.
Safe actions: Describe node, cordon node (kubectl cordon <node> — stops scheduling, does not evict), get events.
Dangerous actions: kubectl drain <node> (evicts all pods — check PDBs first), node reboot.
## Alert: Deployment Stuck / Rollout Not Progressing
Severity: P2
First command:

kubectl rollout status deploy/<name> -n <ns>

What you're looking for: "Waiting for deployment ... rollout to finish: X out of Y new replicas have been updated."

Decision tree:
Are new pods Pending (not scheduling)?
├── Yes → kubectl describe pod <new-pod> -n <ns> | grep -A 10 Events
│ → Insufficient resources? Scale cluster or reduce requests.
│ → Taints/tolerations mismatch? Fix deployment spec.
└── No → Are new pods CrashLoopBackOff?
├── Yes → Go to CrashLoopBackOff section above.
└── No → Are old pods not terminating?
├── Yes → Check PodDisruptionBudget: kubectl get pdb -n <ns>
│ PDB blocking? Temporarily scale up old deployment or wait.
└── No → Check image pull: kubectl describe pod <pod> -n <ns> | grep -i imagepull
Registry issue? → See CI/CD guide.
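The tree keys off the state of the new pods, which you can grab in one query. A sketch: the `-l app=<name>` selector is an assumption (adjust to your deployment's labels), and the helper with its action strings is illustrative:

```shell
# On a live cluster, list new-pod names and phases (label selector assumed):
#   kubectl get pods -n <ns> -l app=<name> \
#     -o jsonpath='{range .items[*]}{.metadata.name}{" "}{.status.phase}{"\n"}{end}'
# Map the observed state onto the tree above (names are illustrative):
triage_rollout() {
  case "$1" in
    Pending)                       echo "describe-pod-check-scheduling" ;;
    CrashLoopBackOff)              echo "see-crashloopbackoff-section" ;;
    ImagePullBackOff|ErrImagePull) echo "check-registry" ;;
    *)                             echo "check-pdb" ;;   # old pods stuck terminating
  esac
}
```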
Escalation trigger: Rollout stuck > 15 min; production traffic degraded; PDB cannot be satisfied.
Safe actions: Check rollout status, describe pods, check PDB.
Dangerous actions: kubectl rollout undo deploy/<name> (immediate rollback — coordinate with deploy team).
## Alert: PersistentVolumeClaim Pending
Severity: P2
First command:

kubectl describe pvc <pvc> -n <ns>

What you're looking for: The Events section — "no persistent volumes available", "storageclass not found", "volume binding failed."

Decision tree:
Is StorageClass missing or wrong?
├── Yes → kubectl get storageclass; fix PVC spec to match available class
└── No → Is it "no volumes available"?
├── Yes → kubectl get pv; check available PVs. Dynamic provisioning? Check provisioner pod.
└── No → Is it in a different AZ than the pod?
├── Yes → Topology mismatch — escalate to infra. May need zone-specific StorageClass.
└── No → kubectl describe storageclass <name>; check provisioner logs.
Escalate to storage/infra team.
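The event messages listed above map directly onto the branches, so classification can be scripted. A sketch; the helper and its labels are illustrative, and the matched substrings are the ones this runbook names (real event text may vary by provisioner):

```shell
# Classify a PVC event message (from kubectl describe pvc) onto the tree above:
triage_pvc() {
  case "$1" in
    *"storageclass"*)                    echo "fix-storageclass" ;;
    *"no persistent volumes available"*) echo "check-pv-and-provisioner" ;;
    *"volume node affinity conflict"*)   echo "escalate-topology-mismatch" ;;
    *)                                   echo "escalate-storage-team" ;;
  esac
}
```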
Escalation trigger: Database pod cannot start due to PVC pending; no PV provisioner running.
Safe actions: Describe PVC, describe PV, get storage classes.
Dangerous actions: Delete and recreate PVC (data loss if not backed up), patch PV reclaim policy.
## Alert: OOMKilled Pod
Severity: P2
First command:

kubectl top pod -n <ns> --sort-by=memory

What you're looking for: A pod consuming near or above its memory limit.

Decision tree:
Is this a sudden spike or gradual leak?
├── Sudden spike → Check recent deploys: kubectl rollout history deploy/<name> -n <ns>
│ New version? → kubectl rollout undo deploy/<name> -n <ns>
└── Gradual leak → Raise memory limit as temp fix:
kubectl set resources deploy/<name> --limits=memory=<new-limit> -n <ns>
File ticket for dev team to fix leak.
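When picking the new limit for the temp fix, a common stop-gap is to add ~50% headroom rather than guessing. A sketch: `bump_mi` is our helper name (not a kubectl feature) and only handles Mi-denominated limits:

```shell
# Stop-gap: compute a memory limit with 50% headroom over the current Mi value.
bump_mi() {
  local cur=${1%Mi}               # strip the Mi suffix, e.g. 512Mi -> 512
  echo "$(( cur + cur / 2 ))Mi"   # add 50% and restore the suffix
}
# Then, on a live cluster:
#   kubectl set resources deploy/<name> -n <ns> --limits=memory=$(bump_mi 512Mi)
```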
Escalation trigger: OOMKill loop preventing service from running; memory limits are already at maximum.
Safe actions: kubectl top pod, check rollout history.
Dangerous actions: Raise memory limits (may starve other pods), rollback.
## Quick Reference
### Most Useful Commands
```shell
# All pods in a namespace with status
kubectl get pods -n <ns> -o wide
# Pod logs (live)
kubectl logs -n <ns> <pod> -f
# Previous container logs (after crash)
kubectl logs -n <ns> <pod> --previous
# Events for a namespace (recent problems)
kubectl get events -n <ns> --sort-by='.lastTimestamp' | tail -20
# Resource usage by pod
kubectl top pods -n <ns> --sort-by=memory
# Resource usage by node
kubectl top nodes
# Describe a pod (events, mounts, limits)
kubectl describe pod -n <ns> <pod>
# Rollout history
kubectl rollout history deploy/<name> -n <ns>
# Immediate rollback
kubectl rollout undo deploy/<name> -n <ns>
# Cordon a node (stop scheduling, no eviction)
kubectl cordon <node>
# Drain a node (evict all pods — check PDBs first)
kubectl drain <node> --ignore-daemonsets --delete-emptydir-data
```
### Escalation Contacts
| Situation | Team | Channel |
|---|---|---|
| Node failure / infra issue | Infra / Platform | #infra-oncall |
| App crash (known bug) | App team | #dev-oncall |
| Cluster-wide outage | Platform lead | PagerDuty: platform-critical |
| Storage / PVC issue | Storage / Infra | #infra-oncall |
### Safe vs Dangerous Actions
| Safe (do without asking) | Dangerous (get approval) |
|---|---|
| Read logs | Restart production pods |
| Describe resources | Scale down replicas |
| Get events | Delete pods or PVCs |
| Cordon a node | Drain a node |
| Check rollout status | Roll back a deployment |
| Top pods/nodes | Edit resource limits |