- k8s
- l1
- runbook
- oom
- k8s-core
Runbook: OOMKilled Container¶
| Field | Value |
|---|---|
| Domain | Kubernetes |
| Alert | container_oom_events_total > 0 or pod status shows OOMKilled |
| Severity | P2 |
| Est. Resolution Time | 10-20 minutes |
| Escalation Timeout | 20 minutes; page if recurring OOMKills are not resolved |
| Last Tested | 2026-03-19 |
| Prerequisites | kubectl access, cluster-admin or namespace-admin, kubeconfig configured |
Quick Assessment (30 seconds)¶
# Run this first — it tells you the scope of the problem
kubectl get pods -n <NAMESPACE> -o wide | grep -E "OOMKilled|Error|CrashLoop"
# Check node-level memory pressure before proceeding
kubectl top nodes
If the output shows a single pod or one deployment OOMKilled → continue with the steps below. If many pods across several deployments are affected, suspect node-level memory pressure instead (see node-not-ready.md).
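To get a quick count of affected containers, the same JSON shape that Step 1 queries can be filtered in bulk. This is a sketch shown against a canned one-pod sample; in practice, pipe `kubectl get pods -n <NAMESPACE> -o json` into the same filter.

```shell
# Count OOMKilled containers from `kubectl get pods -o json` output.
# The sample below stands in for live cluster output.
sample='{"items":[{"status":{"containerStatuses":[{"name":"app","lastState":{"terminated":{"reason":"OOMKilled","exitCode":137}}}]}}]}'
count=$(echo "$sample" | python3 -c '
import json, sys
pods = json.load(sys.stdin)["items"]
print(sum(1 for p in pods
            for cs in p.get("status", {}).get("containerStatuses", [])
            if cs.get("lastState", {}).get("terminated", {}).get("reason") == "OOMKilled"))
')
echo "OOMKilled containers: $count"
```

A count greater than one across multiple deployments is the signal to treat this as node-level pressure rather than a single misbehaving workload.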
Step 1: Confirm OOMKill via Exit Code 137¶
Why: Exit code 137 is the definitive signal of an OOMKill (SIGKILL sent by the Linux OOM killer). Without confirming this, you may spend time debugging the wrong problem.
kubectl get pod <POD_NAME> -n <NAMESPACE> -o jsonpath='{.status.containerStatuses[*].lastState.terminated}'
{"containerID":"...","exitCode":137,"finishedAt":"2026-03-19T10:00:00Z","reason":"OOMKilled","startedAt":"2026-03-19T09:59:55Z"}
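The exit code encodes the terminating signal: values above 128 mean 128 plus the signal number, so 137 = 128 + 9 (SIGKILL). A small decoder helper, as a sketch:

```shell
# Map a container exit code to its signal meaning.
# Codes above 128 encode 128 + signal number.
explain_exit_code() {
  local code=$1
  if [ "$code" -gt 128 ]; then
    local sig=$((code - 128))
    case "$sig" in
      9)  echo "SIGKILL (9): likely OOMKilled by the kernel" ;;
      15) echo "SIGTERM (15): graceful termination requested" ;;
      11) echo "SIGSEGV (11): segmentation fault" ;;
      *)  echo "terminated by signal $sig" ;;
    esac
  else
    echo "application exit code $code (not signal-based)"
  fi
}

explain_exit_code 137   # SIGKILL (9): likely OOMKilled by the kernel
```

This is also why exit code 143 (SIGTERM) during a rollout is normal and not an OOMKill.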
Step 2: Check Memory Usage Trend¶
Why: A one-time OOMKill (e.g., from a traffic spike or a large batch job) needs a different response than a continuous memory leak. The trend tells you which you are dealing with.
# Current memory usage of pods in the namespace
kubectl top pods -n <NAMESPACE>
# Check node memory usage
kubectl top nodes
# Get memory limit for the affected container
kubectl get pod <POD_NAME> -n <NAMESPACE> -o jsonpath='{.spec.containers[*].resources}'
# Fallback if metrics-server is unavailable (cgroup v1 path; on cgroup v2, read /sys/fs/cgroup/memory.current instead)
kubectl exec -it <POD_NAME> -n <NAMESPACE> -- cat /sys/fs/cgroup/memory/memory.usage_in_bytes
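To judge how close usage was to the limit, a quick percentage check helps. The numbers below are illustrative; substitute the bytes read from the cgroup file and the limit from the pod spec.

```shell
# Illustrative values; substitute real readings.
usage_bytes=450000000           # e.g. from memory.usage_in_bytes
limit_bytes=536870912           # e.g. a 512Mi limit from the pod spec
pct=$((usage_bytes * 100 / limit_bytes))
echo "usage is ${pct}% of the limit"
```

Sustained usage above roughly 80-90% of the limit leaves little headroom for spikes and makes another OOMKill likely.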
Step 3: Check Memory Limits in Deployment¶
Why: The OOMKill is caused by the container exceeding its resources.limits.memory setting. You need to see the current value to know how much to raise it.
kubectl get deployment <DEPLOYMENT_NAME> -n <NAMESPACE> -o jsonpath='{.spec.template.spec.containers[*].resources}' | python3 -m json.tool
# Or in a more readable format:
kubectl describe deployment <DEPLOYMENT_NAME> -n <NAMESPACE> | grep -A 10 "Limits\|Requests"
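The spec reports quantities like 512Mi or 2Gi while the cgroup files report raw bytes. A small converter makes them comparable; this sketch covers only the binary suffixes, not the full Kubernetes quantity grammar (which also allows decimal suffixes like M and G).

```shell
# Convert a Kubernetes memory quantity (Ki/Mi/Gi) to bytes.
to_bytes() {
  case "$1" in
    *Gi) echo $(( ${1%Gi} * 1024 * 1024 * 1024 )) ;;
    *Mi) echo $(( ${1%Mi} * 1024 * 1024 )) ;;
    *Ki) echo $(( ${1%Ki} * 1024 )) ;;
    *)   echo "$1" ;;   # already plain bytes
  esac
}

to_bytes 512Mi   # 536870912
```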
Step 4: Analyze Heap/Memory Profile If Possible¶
Why: Blindly raising the memory limit without understanding whether the memory usage is legitimate or a leak will cost you again. Even a quick check saves a follow-up incident.
# Check previous container logs for memory-related warnings
kubectl logs <POD_NAME> -n <NAMESPACE> -p --tail=200 | grep -iE "memory|heap|oom|gc|out of"
# For JVM-based apps — check GC pressure
kubectl logs <POD_NAME> -n <NAMESPACE> -p --tail=200 | grep -iE "GC|heap|java.lang.OutOfMemory"
# For Node.js apps
kubectl logs <POD_NAME> -n <NAMESPACE> -p --tail=200 | grep -iE "heap|FATAL ERROR|Allocation failed"
# For Go apps
kubectl logs <POD_NAME> -n <NAMESPACE> -p --tail=200 | grep -iE "runtime: out of memory|signal: killed"
# If the app has a /debug/pprof endpoint (Go), exec into pod and profile before it crashes
kubectl exec -it <RUNNING_POD_NAME> -n <NAMESPACE> -- curl -s http://localhost:<PORT>/debug/pprof/heap -o /tmp/heap.prof
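To distinguish a leak from a plateau, sample `kubectl top pods` a few times over several minutes and check whether usage only ever climbs. The trend check itself can be sketched like this (canned samples in Mi; in practice, collect them a minute or so apart from the live command):

```shell
# Canned memory samples (Mi), oldest first; substitute real readings.
samples="210 340 470 600 730"
prev=0; leak=yes
for s in $samples; do
  # Any sample that fails to grow suggests a plateau, not a leak.
  [ "$s" -gt "$prev" ] || leak=no
  prev=$s
done
echo "leak-like trend: $leak"
```

A strictly climbing series points at a leak (file a bug per Post-Incident); a plateau under the limit points at an undersized limit.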
Step 5: Raise Memory Limit¶
Why: The immediate fix is to raise the memory limit so the container has headroom. You should add ~25-50% headroom above the observed peak usage from Step 2.
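Picking the new value can be mechanical: take the observed peak from Step 2, add the headroom percentage, and round up to a tidy increment. A sketch with illustrative numbers:

```shell
# Illustrative inputs; substitute the observed peak from Step 2.
peak_mi=700                                      # observed peak (Mi)
headroom_pct=50                                  # 25-50% per this runbook
raw=$((peak_mi * (100 + headroom_pct) / 100))    # 1050
new_limit=$(( (raw + 127) / 128 * 128 ))         # round up to a 128Mi step
echo "${new_limit}Mi"                            # 1152Mi
```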
kubectl edit deployment <DEPLOYMENT_NAME> -n <NAMESPACE>
# Under spec.template.spec.containers[].resources, change:
# limits.memory from <CURRENT_VALUE> to <NEW_VALUE>
# requests.memory to a value near expected steady-state usage (see Common Mistakes)
# Or use patch directly (non-interactive):
kubectl patch deployment <DEPLOYMENT_NAME> -n <NAMESPACE> \
-p '{"spec":{"template":{"spec":{"containers":[{"name":"<CONTAINER_NAME>","resources":{"limits":{"memory":"<NEW_MEMORY_LIMIT>"},"requests":{"memory":"<NEW_MEMORY_REQUEST>"}}}]}}}}'
Expected output after patch: The deployment triggers a rollout and new pods start with the higher limit. If the patch is rejected or the new limit is not applied, check for LimitRange objects that cap the maximum allowed limit:
kubectl get limitrange -n <NAMESPACE>
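Whichever route you take, the target resources block ends up shaped like this (values are illustrative; see Common Mistakes for how requests and limits should relate):

```yaml
resources:
  requests:
    memory: "512Mi"   # near expected steady-state usage
  limits:
    memory: "1Gi"     # observed peak plus headroom
```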
Step 6: Redeploy and Verify Memory Stabilizes¶
Why: After raising the limit, you need to confirm the pod stays healthy and memory usage is stable (not still climbing toward the new limit).
# Watch the rollout
kubectl rollout status deployment/<DEPLOYMENT_NAME> -n <NAMESPACE> --timeout=5m
# Watch memory usage over the next 5 minutes
watch -n 10 kubectl top pods -n <NAMESPACE>
# Check restart count is not incrementing
kubectl get pods -n <NAMESPACE> -l app=<APP_LABEL> -w
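A restart count that keeps climbing after the rollout means the fix did not take. The check can be sketched against canned `kubectl get pods` output (column 4 is RESTARTS); in practice, substitute the live command for the canned string.

```shell
# Canned `kubectl get pods --no-headers` output standing in for the
# live command: NAME, READY, STATUS, RESTARTS, AGE.
output="web-7d4f8-abcde   1/1   Running   0   3m
web-7d4f8-fghij   1/1   Running   4   2h"

# Flag any pod with a nonzero restart count.
echo "$output" | awk '$4 > 0 { print $1 " has " $4 " restarts" }'
```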
Verification¶
# Confirm the issue is resolved
kubectl get pods -n <NAMESPACE> -l app=<APP_LABEL>
kubectl describe pod <NEW_POD_NAME> -n <NAMESPACE> | grep -A 5 "Last State"
Expected state: pods Running with READY at full count, Last State no longer showing OOMKilled, and memory usage from kubectl top stable below the new limit.
If still broken: Escalate — see below.
Escalation¶
| Condition | Who to Page | What to Say |
|---|---|---|
| Not resolved in 20 min (recurring OOMKills) | SRE on-call | "Kubernetes OOMKill in <NAMESPACE>, recurring after memory limit raise" |
| Data loss suspected | Platform Lead | "Data loss risk: stateful pod <POD_NAME> in <NAMESPACE> OOMKilled" |
| Scope expanding beyond namespace | Platform team | "Multi-namespace impact: cluster-wide memory pressure, multiple pods OOMKilled" |
Post-Incident¶
- Update monitoring if alert was noisy or missing
- File postmortem if P1/P2
- Update this runbook if steps were wrong or incomplete
- File a bug to the development team if memory leak was detected in Step 4
- Update the deployment manifest in git with the new limits (do not leave it only patched live)
- Review whether HPA should scale out instead of relying solely on higher per-pod limits
Common Mistakes¶
- Raising the limit without setting requests appropriately: Kubernetes schedules pods based on resources.requests, not resources.limits. If you raise the limit to 2Gi but leave requests at 256Mi, the scheduler thinks the pod only needs 256Mi and may place it on a node that cannot actually support 2Gi. This puts the node under memory pressure, potentially OOMKilling other pods. Always set requests to a value close to (but not above) the expected steady-state memory usage, and set limits to the peak/maximum you want to allow.
- Raising the limit without investigating whether it is a memory leak: Raising the limit is only a temporary fix if the underlying issue is a leak. Memory that grows without bound will eventually OOMKill the pod again at the new, higher limit, possibly taking longer to manifest and catching the on-call engineer off guard. Always check logs for leak indicators (Step 4) before treating the limit raise as a permanent solution.
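The first mistake is easy to see with arithmetic: the scheduler bin-packs by requests, so a large gap between requests and limits lets worst-case memory demand vastly exceed what the node can provide (illustrative numbers):

```shell
# Illustrative scheduling math for a requests/limits gap.
node_allocatable_mi=8192    # 8Gi allocatable on the node
pod_request_mi=256          # what the scheduler accounts for
pod_limit_mi=2048           # what each pod may actually consume
pods_scheduled=$((node_allocatable_mi / pod_request_mi))   # 32 pods fit by requests
worst_case_mi=$((pods_scheduled * pod_limit_mi))           # 65536Mi potential demand
echo "scheduler fits ${pods_scheduled} pods; worst-case demand ${worst_case_mi}Mi vs ${node_allocatable_mi}Mi allocatable"
```

An 8x gap between worst-case demand and allocatable memory is exactly the setup for node pressure and collateral OOMKills.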
Cross-References¶
- Survival Guide: On-Call Survival Guide (pocket card version)
- Topic Pack: Kubernetes Topics (deep background)
- Related Runbook: pod-crashloop.md — if pod enters CrashLoopBackOff after OOMKill
- Related Runbook: hpa-thrashing.md — if OOMKills are triggering unwanted scale events
- Related Runbook: node-not-ready.md — if node-level memory pressure caused the OOMKill
- Lab: training/interactive/runtime-labs/lab-runtime-08-resource-limits-oom/
- Interview Scenario: training/interview-scenarios/08-pods-oomkilled.md
- Incident Scenario: training/interactive/incidents/scenarios/oomkill-low-memory.sh
Wiki Navigation¶
Related Content¶
- Case Study: Node Pressure Evictions (Case Study, L2) — Kubernetes Core, OOMKilled
- Case Study: Pod OOMKilled — Memory Leak in Sidecar, Fix Is Helm Values (Case Study, L2) — Kubernetes Core, OOMKilled
- Lab: Resource Limits OOMKilled (CLI) (Lab, L1) — Kubernetes Core, OOMKilled
- Ops Archaeology: The Session Store That Keeps Dying (Case Study, L2) — Kubernetes Core, OOMKilled
- Adversarial Interview Gauntlet (30 sequences) (Scenario, L2) — Kubernetes Core
- Case Study: Alert Storm — Flapping Health Checks (Case Study, L2) — Kubernetes Core
- Case Study: Canary Deploy Routing to Wrong Backend — Ingress Misconfigured (Case Study, L2) — Kubernetes Core
- Case Study: CrashLoopBackOff No Logs (Case Study, L1) — Kubernetes Core
- Case Study: DNS Looks Broken — TLS Expired, Fix Is Cert-Manager (Case Study, L2) — Kubernetes Core
- Case Study: DaemonSet Blocks Eviction (Case Study, L2) — Kubernetes Core
Pages that link here¶
- Decision Tree: Latency Has Increased
- Decision Tree: Memory Usage Is High
- Decision Tree: Node Is NotReady
- Decision Tree: Pod Won't Start
- Kubernetes Ops Domain
- Kubernetes Pod Lifecycle
- Level 3: Production Kubernetes
- OOMKilled - Street-Level Ops
- On-Call Survival Guides
- Oomkilled
- Operational Runbooks
- Ops Archaeology: The Session Store That Keeps Dying
- Primer
- Runbook: HPA Thrashing (Rapid Scale Up/Down)
- Runbook: Ingress 502 Bad Gateway