- k8s
- l1
- runbook
- oom
- k8s-core
Runbook: OOMKilled Container¶
| Field | Value |
|---|---|
| Domain | Kubernetes |
| Alert | container_oom_events_total > 0 or pod status shows OOMKilled |
| Severity | P2 |
| Est. Resolution Time | 10-20 minutes |
| Escalation Timeout | 20 minutes; page if recurring OOMKills are not resolved |
| Last Tested | 2026-03-19 |
| Prerequisites | kubectl access, cluster-admin or namespace-admin, kubeconfig configured |
Quick Assessment (30 seconds)¶
# Run this first — it tells you the scope of the problem
kubectl get pods -n <NAMESPACE> -o wide | grep -E "OOMKilled|Error|CrashLoop"
# Check node-level memory pressure before proceeding
kubectl top nodes
If the output shows a single pod or one deployment OOMKilled → continue with the steps below. If many pods across several deployments are affected, suspect node-level memory pressure instead (see node-not-ready.md).
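To get a quick count of affected containers, the same JSON shape that Step 1 queries can be filtered in bulk. This is a sketch shown against a canned one-pod sample; in practice, pipe `kubectl get pods -n <NAMESPACE> -o json` into the same filter.

```shell
# Count OOMKilled containers from `kubectl get pods -o json` output.
# The sample below stands in for live cluster output.
sample='{"items":[{"status":{"containerStatuses":[{"name":"app","lastState":{"terminated":{"reason":"OOMKilled","exitCode":137}}}]}}]}'
count=$(echo "$sample" | python3 -c '
import json, sys
pods = json.load(sys.stdin)["items"]
print(sum(1 for p in pods
            for cs in p.get("status", {}).get("containerStatuses", [])
            if cs.get("lastState", {}).get("terminated", {}).get("reason") == "OOMKilled"))
')
echo "OOMKilled containers: $count"
```

A count greater than one across multiple deployments is the signal to treat this as node-level pressure rather than a single misbehaving workload.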
Step 1: Confirm OOMKill via Exit Code 137¶
Why: Exit code 137 is the definitive signal of an OOMKill (SIGKILL sent by the Linux OOM killer). Without confirming this, you may spend time debugging the wrong problem.
kubectl get pod <POD_NAME> -n <NAMESPACE> -o jsonpath='{.status.containerStatuses[*].lastState.terminated}'
{"containerID":"...","exitCode":137,"finishedAt":"2026-03-19T10:00:00Z","reason":"OOMKilled","startedAt":"2026-03-19T09:59:55Z"}
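The exit code encodes the terminating signal: values above 128 mean 128 plus the signal number, so 137 = 128 + 9 (SIGKILL). A small decoder helper, as a sketch:

```shell
# Map a container exit code to its signal meaning.
# Codes above 128 encode 128 + signal number.
explain_exit_code() {
  local code=$1
  if [ "$code" -gt 128 ]; then
    local sig=$((code - 128))
    case "$sig" in
      9)  echo "SIGKILL (9): likely OOMKilled by the kernel" ;;
      15) echo "SIGTERM (15): graceful termination requested" ;;
      11) echo "SIGSEGV (11): segmentation fault" ;;
      *)  echo "terminated by signal $sig" ;;
    esac
  else
    echo "application exit code $code (not signal-based)"
  fi
}

explain_exit_code 137   # SIGKILL (9): likely OOMKilled by the kernel
```

This is also why exit code 143 (SIGTERM) during a rollout is normal and not an OOMKill.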
Step 2: Check Memory Usage Trend¶
Why: A one-time OOMKill (e.g., from a traffic spike or a large batch job) needs a different response than a continuous memory leak. The trend tells you which you are dealing with.
# Current memory usage of pods in the namespace
kubectl top pods -n <NAMESPACE>
# Check node memory usage
kubectl top nodes
# Get memory limit for the affected container
kubectl get pod <POD_NAME> -n <NAMESPACE> -o jsonpath='{.spec.containers[*].resources}'
# Fallback if metrics-server is unavailable (cgroup v1 path; on cgroup v2, read /sys/fs/cgroup/memory.current instead)
kubectl exec -it <POD_NAME> -n <NAMESPACE> -- cat /sys/fs/cgroup/memory/memory.usage_in_bytes
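To judge how close usage was to the limit, a quick percentage check helps. The numbers below are illustrative; substitute the bytes read from the cgroup file and the limit from the pod spec.

```shell
# Illustrative values; substitute real readings.
usage_bytes=450000000           # e.g. from memory.usage_in_bytes
limit_bytes=536870912           # e.g. a 512Mi limit from the pod spec
pct=$((usage_bytes * 100 / limit_bytes))
echo "usage is ${pct}% of the limit"
```

Sustained usage above roughly 80-90% of the limit leaves little headroom for spikes and makes another OOMKill likely.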
Step 3: Check Memory Limits in Deployment¶
Why: The OOMKill is caused by the container exceeding its resources.limits.memory setting. You need to see the current value to know how much to raise it.
kubectl get deployment <DEPLOYMENT_NAME> -n <NAMESPACE> -o jsonpath='{.spec.template.spec.containers[*].resources}' | python3 -m json.tool
# Or in a more readable format:
kubectl describe deployment <DEPLOYMENT_NAME> -n <NAMESPACE> | grep -A 10 "Limits\|Requests"
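The spec reports quantities like 512Mi or 2Gi while the cgroup files report raw bytes. A small converter makes them comparable; this sketch covers only the binary suffixes, not the full Kubernetes quantity grammar (which also allows decimal suffixes like M and G).

```shell
# Convert a Kubernetes memory quantity (Ki/Mi/Gi) to bytes.
to_bytes() {
  case "$1" in
    *Gi) echo $(( ${1%Gi} * 1024 * 1024 * 1024 )) ;;
    *Mi) echo $(( ${1%Mi} * 1024 * 1024 )) ;;
    *Ki) echo $(( ${1%Ki} * 1024 )) ;;
    *)   echo "$1" ;;   # already plain bytes
  esac
}

to_bytes 512Mi   # 536870912
```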
Step 4: Analyze Heap/Memory Profile If Possible¶
Why: Blindly raising the memory limit without understanding whether the memory usage is legitimate or a leak will cost you again. Even a quick check saves a follow-up incident.
# Check previous container logs for memory-related warnings
kubectl logs <POD_NAME> -n <NAMESPACE> -p --tail=200 | grep -iE "memory|heap|oom|gc|out of"
# For JVM-based apps — check GC pressure
kubectl logs <POD_NAME> -n <NAMESPACE> -p --tail=200 | grep -iE "GC|heap|java.lang.OutOfMemory"
# For Node.js apps
kubectl logs <POD_NAME> -n <NAMESPACE> -p --tail=200 | grep -iE "heap|FATAL ERROR|Allocation failed"
# For Go apps
kubectl logs <POD_NAME> -n <NAMESPACE> -p --tail=200 | grep -iE "runtime: out of memory|signal: killed"
# If the app has a /debug/pprof endpoint (Go), exec into pod and profile before it crashes
kubectl exec -it <RUNNING_POD_NAME> -n <NAMESPACE> -- curl -s http://localhost:<PORT>/debug/pprof/heap -o /tmp/heap.prof
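To distinguish a leak from a plateau, sample `kubectl top pods` a few times over several minutes and check whether usage only ever climbs. The trend check itself can be sketched like this (canned samples in Mi; in practice, collect them a minute or so apart from the live command):

```shell
# Canned memory samples (Mi), oldest first; substitute real readings.
samples="210 340 470 600 730"
prev=0; leak=yes
for s in $samples; do
  # Any sample that fails to grow suggests a plateau, not a leak.
  [ "$s" -gt "$prev" ] || leak=no
  prev=$s
done
echo "leak-like trend: $leak"
```

A strictly climbing series points at a leak (file a bug per Post-Incident); a plateau under the limit points at an undersized limit.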
Step 5: Raise Memory Limit¶
Why: The immediate fix is to raise the memory limit so the container has headroom. You should add ~25-50% headroom above the observed peak usage from Step 2.
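Picking the new value can be mechanical: take the observed peak from Step 2, add the headroom percentage, and round up to a tidy increment. A sketch with illustrative numbers:

```shell
# Illustrative inputs; substitute the observed peak from Step 2.
peak_mi=700                                      # observed peak (Mi)
headroom_pct=50                                  # 25-50% per this runbook
raw=$((peak_mi * (100 + headroom_pct) / 100))    # 1050
new_limit=$(( (raw + 127) / 128 * 128 ))         # round up to a 128Mi step
echo "${new_limit}Mi"                            # 1152Mi
```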
kubectl edit deployment <DEPLOYMENT_NAME> -n <NAMESPACE>
# Under spec.template.spec.containers[].resources, change:
# limits.memory from <CURRENT_VALUE> to <NEW_VALUE>
# requests.memory to a value near expected steady-state usage (see Common Mistakes)
# Or use patch directly (non-interactive):
kubectl patch deployment <DEPLOYMENT_NAME> -n <NAMESPACE> \
-p '{"spec":{"template":{"spec":{"containers":[{"name":"<CONTAINER_NAME>","resources":{"limits":{"memory":"<NEW_MEMORY_LIMIT>"},"requests":{"memory":"<NEW_MEMORY_REQUEST>"}}}]}}}}'
Expected output after patch: The deployment triggers a rollout and new pods start with the higher limit. If the patch is rejected or the new limit is not applied, check for LimitRange objects that cap the maximum allowed limit:
kubectl get limitrange -n <NAMESPACE>
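Whichever route you take, the target resources block ends up shaped like this (values are illustrative; see Common Mistakes for how requests and limits should relate):

```yaml
resources:
  requests:
    memory: "512Mi"   # near expected steady-state usage
  limits:
    memory: "1Gi"     # observed peak plus headroom
```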
Step 6: Redeploy and Verify Memory Stabilizes¶
Why: After raising the limit, you need to confirm the pod stays healthy and memory usage is stable (not still climbing toward the new limit).
# Watch the rollout
kubectl rollout status deployment/<DEPLOYMENT_NAME> -n <NAMESPACE> --timeout=5m
# Watch memory usage over the next 5 minutes
watch -n 10 kubectl top pods -n <NAMESPACE>
# Check restart count is not incrementing
kubectl get pods -n <NAMESPACE> -l app=<APP_LABEL> -w
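A restart count that keeps climbing after the rollout means the fix did not take. The check can be sketched against canned `kubectl get pods` output (column 4 is RESTARTS); in practice, substitute the live command for the canned string.

```shell
# Canned `kubectl get pods --no-headers` output standing in for the
# live command: NAME, READY, STATUS, RESTARTS, AGE.
output="web-7d4f8-abcde   1/1   Running   0   3m
web-7d4f8-fghij   1/1   Running   4   2h"

# Flag any pod with a nonzero restart count.
echo "$output" | awk '$4 > 0 { print $1 " has " $4 " restarts" }'
```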
Verification¶
# Confirm the issue is resolved
kubectl get pods -n <NAMESPACE> -l app=<APP_LABEL>
kubectl describe pod <NEW_POD_NAME> -n <NAMESPACE> | grep -A 5 "Last State"
Expected state: pods Running with READY at full count, Last State no longer showing OOMKilled, and memory usage from kubectl top stable below the new limit.
If still broken: Escalate — see below.
Escalation¶
| Condition | Who to Page | What to Say |
|---|---|---|
| Not resolved in 20 min (recurring OOMKills) | SRE on-call | "Kubernetes OOMKill in <NAMESPACE>, recurring after memory limit raise" |
| Data loss suspected | Platform Lead | "Data loss risk: stateful pod <POD_NAME> in <NAMESPACE> OOMKilled" |
| Scope expanding beyond namespace | Platform team | "Multi-namespace impact: cluster-wide memory pressure, multiple pods OOMKilled" |
Post-Incident¶
- Update monitoring if alert was noisy or missing
- File postmortem if P1/P2
- Update this runbook if steps were wrong or incomplete
- File a bug to the development team if memory leak was detected in Step 4
- Update the deployment manifest in git with the new limits (do not leave it only patched live)
- Review whether HPA should scale out instead of relying solely on higher per-pod limits
Common Mistakes¶
- Raising the limit without setting requests appropriately: Kubernetes schedules pods based on resources.requests, not resources.limits. If you raise the limit to 2Gi but leave requests at 256Mi, the scheduler thinks the pod only needs 256Mi and may place it on a node that cannot actually support 2Gi. This puts the node under memory pressure, potentially OOMKilling other pods. Always set requests to a value close to (but not above) the expected steady-state memory usage, and set limits to the peak/maximum you want to allow.
- Raising the limit without investigating whether it is a memory leak: Raising the limit is only a temporary fix if the underlying issue is a leak. Memory that grows without bound will eventually OOMKill the pod again at the new, higher limit, possibly taking longer to manifest and catching the on-call engineer off guard. Always check logs for leak indicators (Step 4) before treating the limit raise as a permanent solution.
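The first mistake is easy to see with arithmetic: the scheduler bin-packs by requests, so a large gap between requests and limits lets worst-case memory demand vastly exceed what the node can provide (illustrative numbers):

```shell
# Illustrative scheduling math for a requests/limits gap.
node_allocatable_mi=8192    # 8Gi allocatable on the node
pod_request_mi=256          # what the scheduler accounts for
pod_limit_mi=2048           # what each pod may actually consume
pods_scheduled=$((node_allocatable_mi / pod_request_mi))   # 32 pods fit by requests
worst_case_mi=$((pods_scheduled * pod_limit_mi))           # 65536Mi potential demand
echo "scheduler fits ${pods_scheduled} pods; worst-case demand ${worst_case_mi}Mi vs ${node_allocatable_mi}Mi allocatable"
```

An 8x gap between worst-case demand and allocatable memory is exactly the setup for node pressure and collateral OOMKills.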
Cross-References¶
- Survival Guide: On-Call Survival Guide (pocket card version)
- Topic Pack: Kubernetes Topics (deep background)
- Related Runbook: pod-crashloop.md — if pod enters CrashLoopBackOff after OOMKill
- Related Runbook: hpa-thrashing.md — if OOMKills are triggering unwanted scale events
- Related Runbook: node-not-ready.md — if node-level memory pressure caused the OOMKill
- Lab: training/interactive/runtime-labs/lab-runtime-08-resource-limits-oom/
- Interview Scenario: training/interview-scenarios/08-pods-oomkilled.md
- Incident Scenario: training/interactive/incidents/scenarios/oomkill-low-memory.sh
Wiki Navigation¶
Related Content¶
- Case Study: Node Pressure Evictions (Case Study, L2) — Kubernetes Core, OOMKilled
- Case Study: Pod OOMKilled — Memory Leak in Sidecar, Fix Is Helm Values (Case Study, L2) — Kubernetes Core, OOMKilled
- Lab: Resource Limits OOMKilled (CLI) (Lab, L1) — Kubernetes Core, OOMKilled
- Ops Archaeology: The Session Store That Keeps Dying (Case Study, L2) — Kubernetes Core, OOMKilled
- Adversarial Interview Gauntlet (30 sequences) (Scenario, L2) — Kubernetes Core
- Case Study: Alert Storm — Flapping Health Checks (Case Study, L2) — Kubernetes Core
- Case Study: Canary Deploy Routing to Wrong Backend — Ingress Misconfigured (Case Study, L2) — Kubernetes Core
- Case Study: CrashLoopBackOff No Logs (Case Study, L1) — Kubernetes Core
- Case Study: DNS Looks Broken — TLS Expired, Fix Is Cert-Manager (Case Study, L2) — Kubernetes Core
- Case Study: DaemonSet Blocks Eviction (Case Study, L2) — Kubernetes Core
Pages that link here¶
- Decision Tree: Latency Has Increased
- Decision Tree: Memory Usage Is High
- Decision Tree: Node Is NotReady
- Decision Tree: Pod Won't Start
- Kubernetes Ops Domain
- Kubernetes Pod Lifecycle
- Level 3: Production Kubernetes
- OOMKilled - Street-Level Ops
- On-Call Survival Guides
- Oomkilled
- Operational Runbooks
- Ops Archaeology: The Session Store That Keeps Dying
- Primer
- Runbook: HPA Thrashing (Rapid Scale Up/Down)
- Runbook: Ingress 502 Bad Gateway