
On-Call Survival: Kubernetes

Print this. Pin it. Read it at 3 AM.


Alert: Pod CrashLoopBackOff

Severity: P1 (if replicas = 0) / P2 (if some replicas up)

First command:

kubectl logs -n <ns> <pod> --previous
What you're looking for: The last error message before the crash (OOM kill, panic, missing env var, connection refused).

Decision tree:

Is it OOMKilled?
├── Yes → kubectl top pod <pod> -n <ns>; raise memory limit (runbook: runbooks/crashloopbackoff.md)
└── No → Is it a config/env error ("no such file", "connection refused", "env not set")?
    ├── Yes → Check ConfigMap/Secret mounts: kubectl describe pod <pod> -n <ns>
    │         Fix config or rotate secret, then: kubectl rollout restart deploy/<name> -n <ns>
    └── No → Is it a bad image?
        ├── Yes → kubectl set image deploy/<name> <container>=<previous-tag> -n <ns>
        └── No → Escalate to app team: "Pod <name> in <ns> crashing, logs show: <last error>"
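
The first branch of the tree can be scripted. A minimal sketch, assuming the crash reason has already been pulled with the jsonpath command in the comment; the helper name classify_crash is made up for illustration:

```shell
# Hypothetical helper: map a container's lastState.terminated.reason to a next step.
# The reason string comes from:
#   kubectl get pod <pod> -n <ns> \
#     -o jsonpath='{.status.containerStatuses[0].lastState.terminated.reason}'
classify_crash() {
  case "$1" in
    OOMKilled) echo "raise memory limit" ;;
    Error)     echo "read previous logs for config/env or image errors" ;;
    *)         echo "escalate to app team" ;;
  esac
}

classify_crash OOMKilled   # prints: raise memory limit
```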

Escalation trigger: Pod has been crash-looping > 10 min, or replicas = 0 affecting user traffic.

Safe actions: Read logs, describe pod, describe node, get events.

Dangerous actions: kubectl delete pod (triggers restart), kubectl rollout restart (restarts all replicas), edit resource limits.


Alert: Node NotReady

Severity: P1 (many nodes) / P2 (single node)

First command:

kubectl describe node <node-name> | tail -40
What you're looking for: Conditions section — DiskPressure, MemoryPressure, PIDPressure, NetworkUnavailable. Check Events at the bottom.

Decision tree:

Is it DiskPressure?
├── Yes → SSH to node, run: df -h; du -sh /var/lib/docker/* | sort -rh | head
│         Prune images: crictl rmi --prune; or cordon, then drain
└── No → Is it MemoryPressure?
    ├── Yes → kubectl top nodes; look for noisy pod. Drain or wait for eviction.
    └── No → Is it NetworkUnavailable?
        ├── Yes → Check CNI pods: kubectl get pods -n kube-system -o wide | grep <node-name>
                  Then on the node: systemctl status kubelet; journalctl -u kubelet -n 50
        └── No → SSH to node; check: systemctl status kubelet. If dead → escalate to infra team.
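
The Conditions check can be pulled as one TYPE=STATUS line per condition and mapped to the branches above. A sketch; the jsonpath command is in the comment, and triage_condition is a made-up helper name:

```shell
# List a node's conditions (node name is a placeholder):
#   kubectl get node <node-name> \
#     -o jsonpath='{range .status.conditions[*]}{.type}={.status}{"\n"}{end}'
# Hypothetical helper: map one TYPE=STATUS line to the branch above.
triage_condition() {
  case "$1" in
    DiskPressure=True)         echo "prune images / free disk" ;;
    MemoryPressure=True)       echo "find noisy pod" ;;
    NetworkUnavailable=True)   echo "check CNI and kubelet" ;;
    Ready=False|Ready=Unknown) echo "check kubelet, else escalate" ;;
    *)                         echo "no action" ;;
  esac
}

triage_condition DiskPressure=True   # prints: prune images / free disk
```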

Escalation trigger: Multiple nodes NotReady, cluster autoscaler not replacing node after 5 min, kubelet cannot be restarted.

Safe actions: Describe node, cordon node (kubectl cordon <node> — stops scheduling, does not evict), get events.

Dangerous actions: kubectl drain <node> (evicts all pods — check PDBs first), node reboot.


Alert: Deployment Stuck / Rollout Not Progressing

Severity: P2

First command:

kubectl rollout status deploy/<name> -n <ns>
What you're looking for: "Waiting for deployment ... rollout to finish: X out of Y new replicas have been updated."

Decision tree:

Are new pods Pending (not scheduling)?
├── Yes → kubectl describe pod <new-pod> -n <ns> | grep -A 10 Events
│         Insufficient resources? Scale cluster or reduce requests.
│         Taints/tolerations mismatch? Fix deployment spec.
└── No → Are new pods CrashLoopBackOff?
    ├── Yes → Go to CrashLoopBackOff section above.
    └── No → Are old pods not terminating?
        ├── Yes → Check PodDisruptionBudget: kubectl get pdb -n <ns>
        │         PDB blocking? Temporarily scale up old deployment or wait.
        └── No → Check image pull: kubectl describe pod <pod> -n <ns> | grep -i imagepull
                 Registry issue? → See CI/CD guide.
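
The Pending branch can be triaged from the FailedScheduling event message itself. A sketch; the substrings are common kube-scheduler phrasings, not an exhaustive list, and schedule_hint is a made-up name:

```shell
# Hypothetical helper: map a FailedScheduling event message to a next step.
schedule_hint() {
  case "$1" in
    *"Insufficient cpu"*|*"Insufficient memory"*)
      echo "scale cluster or reduce requests" ;;
    *"untolerated taint"*|*"didn't tolerate"*)
      echo "fix tolerations in deployment spec" ;;
    *)
      echo "read full pod events" ;;
  esac
}

schedule_hint "0/5 nodes are available: 5 Insufficient cpu."
# prints: scale cluster or reduce requests
```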

Escalation trigger: Rollout stuck > 15 min; production traffic degraded; PDB cannot be satisfied.

Safe actions: Check rollout status, describe pods, check PDB.

Dangerous actions: kubectl rollout undo deploy/<name> (immediate rollback — coordinate with deploy team).


Alert: PersistentVolumeClaim Pending

Severity: P2

First command:

kubectl describe pvc <name> -n <ns>
What you're looking for: Events section — "no persistent volumes available", "storageclass not found", "volume binding failed."

Decision tree:

Is StorageClass missing or wrong?
├── Yes → kubectl get storageclass; fix PVC spec to match an available class
└── No → Is it "no volumes available"?
    ├── Yes → kubectl get pv; check available PVs. Dynamic provisioning? Check provisioner pod.
    └── No → Is the volume in a different AZ than the pod?
        ├── Yes → Topology mismatch → escalate to infra. May need zone-specific StorageClass.
        └── No → kubectl describe storageclass <name>; check provisioner logs.
                 Escalate to storage/infra team.
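
The first branch (missing/wrong StorageClass) can be checked without eyeballing. A sketch assuming the two jsonpath commands in the comments; class_exists is a made-up helper:

```shell
# Requested class:  kubectl get pvc <name> -n <ns> -o jsonpath='{.spec.storageClassName}'
# Existing classes: kubectl get storageclass \
#                     -o jsonpath='{range .items[*]}{.metadata.name}{"\n"}{end}'
# Hypothetical helper: does the requested class appear in the existing list?
class_exists() {  # $1 = requested class, $2 = newline-separated class names
  printf '%s\n' "$2" | grep -qx -- "$1"
}

class_exists gp2 "$(printf 'gp2\nstandard')" && echo "class found"
```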

Escalation trigger: Database pod cannot start due to PVC pending; no PV provisioner running.

Safe actions: Describe PVC, describe PV, get storage classes.

Dangerous actions: Delete and recreate PVC (data loss if not backed up), patch PV reclaim policy.


Alert: OOMKilled Pod

Severity: P2

First command:

kubectl top pod -n <ns> --sort-by=memory | head -20
What you're looking for: Pod consuming near or above its memory limit.

Decision tree:

Is this a sudden spike or a gradual leak?
├── Sudden spike → Check recent deploys: kubectl rollout history deploy/<name> -n <ns>
│                  New version? → kubectl rollout undo deploy/<name> -n <ns>
└── Gradual leak → Raise memory limit as a temporary fix:
                   kubectl set resources deploy/<name> --limits=memory=<new-limit> -n <ns>
                   File a ticket for the dev team to fix the leak.
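
Whether a pod is flirting with its limit is simple arithmetic. A sketch, assuming both numbers are in Mi (usage from kubectl top pod, limit from kubectl describe pod); near_limit and the 90% threshold are illustrative choices:

```shell
# Hypothetical helper: true when usage is within 10% of the memory limit.
near_limit() {  # $1 = usage in Mi, $2 = limit in Mi
  [ $(( $1 * 100 / $2 )) -ge 90 ]
}

near_limit 950 1024 && echo "near limit"   # 92% -> prints: near limit
```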

Escalation trigger: OOMKill loop preventing service from running; memory limits are already at maximum.

Safe actions: kubectl top pod, check rollout history.

Dangerous actions: Raise memory limits (may starve other pods), rollback.


Quick Reference

Most Useful Commands

# All pods in a namespace with status
kubectl get pods -n <ns> -o wide

# Pod logs (live)
kubectl logs -n <ns> <pod> -f

# Previous container logs (after crash)
kubectl logs -n <ns> <pod> --previous

# Events for a namespace (recent problems)
kubectl get events -n <ns> --sort-by='.lastTimestamp' | tail -20

# Resource usage by pod
kubectl top pods -n <ns> --sort-by=memory

# Resource usage by node
kubectl top nodes

# Describe a pod (events, mounts, limits)
kubectl describe pod -n <ns> <pod>

# Rollout history
kubectl rollout history deploy/<name> -n <ns>

# Immediate rollback
kubectl rollout undo deploy/<name> -n <ns>

# Cordon a node (stop scheduling, no eviction)
kubectl cordon <node>

# Drain a node (evict all pods — check PDBs first)
kubectl drain <node> --ignore-daemonsets --delete-emptydir-data
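
For a first pass, the read-only commands above can be bundled. A sketch that only prints the commands (so they can be reviewed before pasting); first_look is a made-up name and the namespace is a parameter:

```shell
# Hypothetical helper: emit the safe "first look" commands for a namespace.
first_look() {
  ns="$1"
  printf 'kubectl get pods -n %s -o wide\n' "$ns"
  printf "kubectl get events -n %s --sort-by='.lastTimestamp' | tail -20\n" "$ns"
  printf 'kubectl top pods -n %s --sort-by=memory\n' "$ns"
}

first_look prod
```

All three are read-only, so running them during an incident is safe under the table below.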

Escalation Contacts

Situation                  | Team             | Channel
Node failure / infra issue | Infra / Platform | #infra-oncall
App crash (known bug)      | App team         | #dev-oncall
Cluster-wide outage        | Platform lead    | PagerDuty: platform-critical
Storage / PVC issue        | Storage / Infra  | #infra-oncall

Safe vs Dangerous Actions

Safe (do without asking) | Dangerous (get approval)
Read logs                | Restart production pods
Describe resources       | Scale down replicas
Get events               | Delete pods or PVCs
Cordon a node            | Drain a node
Check rollout status     | Roll back a deployment
Top pods/nodes           | Edit resource limits

Shift Handoff Template

Status: [GREEN/YELLOW/RED]
Active incidents: [none / description]
Recent deploys: [list from last 24h]
Known flaky alerts: [list]
Things to watch: [anything unusual]