
On-Call Survival: Kubernetes

Print this. Pin it. Read it at 3 AM.


Alert: Pod CrashLoopBackOff

Severity: P1 (if replicas = 0) / P2 (if some replicas up)

First command:

kubectl logs -n <ns> <pod> --previous
What you're looking for: The last error message before the crash (OOM kill, panic, missing env var, connection refused).

Decision tree:

Is it OOMKilled?
├── Yes → kubectl top pod <pod> -n <ns>; raise memory limit (runbook: runbooks/crashloopbackoff.md)
└── No → Is it a config/env error ("no such file", "connection refused", "env not set")?
    ├── Yes → Check ConfigMap/Secret mounts: kubectl describe pod <pod> -n <ns>
    │         Fix config or rotate secret, then: kubectl rollout restart deploy/<name> -n <ns>
    └── No → Is it a bad image?
        ├── Yes → kubectl set image deploy/<name> <container>=<previous-tag> -n <ns>
        └── No → Escalate to app team: "Pod <name> in <ns> crashing, logs show: <last error>"
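
The first branch of the tree can be scripted. A minimal sketch, assuming the crash reason has already been pulled with the jsonpath command in the comment; the helper name classify_crash is made up for illustration:

```shell
# Hypothetical helper: map a container's lastState.terminated.reason to a next step.
# The reason string comes from:
#   kubectl get pod <pod> -n <ns> \
#     -o jsonpath='{.status.containerStatuses[0].lastState.terminated.reason}'
classify_crash() {
  case "$1" in
    OOMKilled) echo "raise memory limit" ;;
    Error)     echo "read previous logs for config/env or image errors" ;;
    *)         echo "escalate to app team" ;;
  esac
}

classify_crash OOMKilled   # prints: raise memory limit
```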

Escalation trigger: Pod has been crash-looping > 10 min, or replicas = 0 affecting user traffic.

Safe actions: Read logs, describe pod, describe node, get events.

Dangerous actions: kubectl delete pod (triggers restart), kubectl rollout restart (restarts all replicas), edit resource limits.


Alert: Node NotReady

Severity: P1 (many nodes) / P2 (single node)

First command:

kubectl describe node <node-name> | tail -40
What you're looking for: Conditions section — DiskPressure, MemoryPressure, PIDPressure, NetworkUnavailable. Check Events at the bottom.

Decision tree:

Is it DiskPressure?
├── Yes → SSH to node, run: df -h; du -sh /var/lib/docker/* | sort -rh | head
│         Prune images: crictl rmi --prune; or cordon, then drain
└── No → Is it MemoryPressure?
    ├── Yes → kubectl top nodes; look for noisy pod. Drain or wait for eviction.
    └── No → Is it NetworkUnavailable?
        ├── Yes → Check CNI pods: kubectl get pods -n kube-system -o wide | grep <node-name>
                  Then on the node: systemctl status kubelet; journalctl -u kubelet -n 50
        └── No → SSH to node; check: systemctl status kubelet. If dead → escalate to infra team.
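
The Conditions check can be pulled as one TYPE=STATUS line per condition and mapped to the branches above. A sketch; the jsonpath command is in the comment, and triage_condition is a made-up helper name:

```shell
# List a node's conditions (node name is a placeholder):
#   kubectl get node <node-name> \
#     -o jsonpath='{range .status.conditions[*]}{.type}={.status}{"\n"}{end}'
# Hypothetical helper: map one TYPE=STATUS line to the branch above.
triage_condition() {
  case "$1" in
    DiskPressure=True)         echo "prune images / free disk" ;;
    MemoryPressure=True)       echo "find noisy pod" ;;
    NetworkUnavailable=True)   echo "check CNI and kubelet" ;;
    Ready=False|Ready=Unknown) echo "check kubelet, else escalate" ;;
    *)                         echo "no action" ;;
  esac
}

triage_condition DiskPressure=True   # prints: prune images / free disk
```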

Escalation trigger: Multiple nodes NotReady, cluster autoscaler not replacing node after 5 min, kubelet cannot be restarted.

Safe actions: Describe node, cordon node (kubectl cordon <node> — stops scheduling, does not evict), get events.

Dangerous actions: kubectl drain <node> (evicts all pods — check PDBs first), node reboot.


Alert: Deployment Stuck / Rollout Not Progressing

Severity: P2

First command:

kubectl rollout status deploy/<name> -n <ns>
What you're looking for: "Waiting for deployment ... rollout to finish: X out of Y new replicas have been updated."

Decision tree:

Are new pods Pending (not scheduling)?
├── Yes → kubectl describe pod <new-pod> -n <ns> | grep -A 10 Events
│         Insufficient resources? Scale cluster or reduce requests.
│         Taints/tolerations mismatch? Fix deployment spec.
└── No → Are new pods CrashLoopBackOff?
    ├── Yes → Go to CrashLoopBackOff section above.
    └── No → Are old pods not terminating?
        ├── Yes → Check PodDisruptionBudget: kubectl get pdb -n <ns>
        │         PDB blocking? Temporarily scale up old deployment or wait.
        └── No → Check image pull: kubectl describe pod <pod> -n <ns> | grep -i imagepull
                 Registry issue? → See CI/CD guide.
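
The Pending branch can be triaged from the FailedScheduling event message itself. A sketch; the substrings are common kube-scheduler phrasings, not an exhaustive list, and schedule_hint is a made-up name:

```shell
# Hypothetical helper: map a FailedScheduling event message to a next step.
schedule_hint() {
  case "$1" in
    *"Insufficient cpu"*|*"Insufficient memory"*)
      echo "scale cluster or reduce requests" ;;
    *"untolerated taint"*|*"didn't tolerate"*)
      echo "fix tolerations in deployment spec" ;;
    *)
      echo "read full pod events" ;;
  esac
}

schedule_hint "0/5 nodes are available: 5 Insufficient cpu."
# prints: scale cluster or reduce requests
```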

Escalation trigger: Rollout stuck > 15 min; production traffic degraded; PDB cannot be satisfied.

Safe actions: Check rollout status, describe pods, check PDB.

Dangerous actions: kubectl rollout undo deploy/<name> (immediate rollback — coordinate with deploy team).


Alert: PersistentVolumeClaim Pending

Severity: P2

First command:

kubectl describe pvc <name> -n <ns>
What you're looking for: Events section — "no persistent volumes available", "storageclass not found", "volume binding failed."

Decision tree:

Is StorageClass missing or wrong?
├── Yes → kubectl get storageclass; fix PVC spec to match an available class
└── No → Is it "no volumes available"?
    ├── Yes → kubectl get pv; check available PVs. Dynamic provisioning? Check provisioner pod.
    └── No → Is the volume in a different AZ than the pod?
        ├── Yes → Topology mismatch → escalate to infra. May need zone-specific StorageClass.
        └── No → kubectl describe storageclass <name>; check provisioner logs.
                 Escalate to storage/infra team.
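
The first branch (missing/wrong StorageClass) can be checked without eyeballing. A sketch assuming the two jsonpath commands in the comments; class_exists is a made-up helper:

```shell
# Requested class:  kubectl get pvc <name> -n <ns> -o jsonpath='{.spec.storageClassName}'
# Existing classes: kubectl get storageclass \
#                     -o jsonpath='{range .items[*]}{.metadata.name}{"\n"}{end}'
# Hypothetical helper: does the requested class appear in the existing list?
class_exists() {  # $1 = requested class, $2 = newline-separated class names
  printf '%s\n' "$2" | grep -qx -- "$1"
}

class_exists gp2 "$(printf 'gp2\nstandard')" && echo "class found"
```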

Escalation trigger: Database pod cannot start due to PVC pending; no PV provisioner running.

Safe actions: Describe PVC, describe PV, get storage classes.

Dangerous actions: Delete and recreate PVC (data loss if not backed up), patch PV reclaim policy.


Alert: OOMKilled Pod

Severity: P2

First command:

kubectl top pod -n <ns> --sort-by=memory | head -20
What you're looking for: Pod consuming near or above its memory limit.

Decision tree:

Is this a sudden spike or a gradual leak?
├── Sudden spike → Check recent deploys: kubectl rollout history deploy/<name> -n <ns>
│                  New version? → kubectl rollout undo deploy/<name> -n <ns>
└── Gradual leak → Raise memory limit as a temporary fix:
                   kubectl set resources deploy/<name> --limits=memory=<new-limit> -n <ns>
                   File a ticket for the dev team to fix the leak.
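
Whether a pod is flirting with its limit is simple arithmetic. A sketch, assuming both numbers are in Mi (usage from kubectl top pod, limit from kubectl describe pod); near_limit and the 90% threshold are illustrative choices:

```shell
# Hypothetical helper: true when usage is within 10% of the memory limit.
near_limit() {  # $1 = usage in Mi, $2 = limit in Mi
  [ $(( $1 * 100 / $2 )) -ge 90 ]
}

near_limit 950 1024 && echo "near limit"   # 92% -> prints: near limit
```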

Escalation trigger: OOMKill loop preventing service from running; memory limits are already at maximum.

Safe actions: kubectl top pod, check rollout history.

Dangerous actions: Raise memory limits (may starve other pods), rollback.


Quick Reference

Most Useful Commands

# All pods in a namespace with status
kubectl get pods -n <ns> -o wide

# Pod logs (live)
kubectl logs -n <ns> <pod> -f

# Previous container logs (after crash)
kubectl logs -n <ns> <pod> --previous

# Events for a namespace (recent problems)
kubectl get events -n <ns> --sort-by='.lastTimestamp' | tail -20

# Resource usage by pod
kubectl top pods -n <ns> --sort-by=memory

# Resource usage by node
kubectl top nodes

# Describe a pod (events, mounts, limits)
kubectl describe pod -n <ns> <pod>

# Rollout history
kubectl rollout history deploy/<name> -n <ns>

# Immediate rollback
kubectl rollout undo deploy/<name> -n <ns>

# Cordon a node (stop scheduling, no eviction)
kubectl cordon <node>

# Drain a node (evict all pods — check PDBs first)
kubectl drain <node> --ignore-daemonsets --delete-emptydir-data
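
For a first pass, the read-only commands above can be bundled. A sketch that only prints the commands (so they can be reviewed before pasting); first_look is a made-up name and the namespace is a parameter:

```shell
# Hypothetical helper: emit the safe "first look" commands for a namespace.
first_look() {
  ns="$1"
  printf 'kubectl get pods -n %s -o wide\n' "$ns"
  printf "kubectl get events -n %s --sort-by='.lastTimestamp' | tail -20\n" "$ns"
  printf 'kubectl top pods -n %s --sort-by=memory\n' "$ns"
}

first_look prod
```

All three are read-only, so running them during an incident is safe under the table below.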

Escalation Contacts

Situation                  | Team             | Channel
Node failure / infra issue | Infra / Platform | #infra-oncall
App crash (known bug)      | App team         | #dev-oncall
Cluster-wide outage        | Platform lead    | PagerDuty: platform-critical
Storage / PVC issue        | Storage / Infra  | #infra-oncall

Safe vs Dangerous Actions

Safe (do without asking) | Dangerous (get approval)
Read logs                | Restart production pods
Describe resources       | Scale down replicas
Get events               | Delete pods or PVCs
Cordon a node            | Drain a node
Check rollout status     | Roll back a deployment
Top pods/nodes           | Edit resource limits

Shift Handoff Template

Status: [GREEN/YELLOW/RED]
Active incidents: [none / description]
Recent deploys: [list from last 24h]
Known flaky alerts: [list]
Things to watch: [anything unusual]