OOMKilled - Street-Level Ops¶
Real-world workflows for diagnosing, fixing, and preventing OOMKilled pods in production.
Identify the OOMKill¶
# Quick scan: find all OOMKilled pods
kubectl get pods -A -o json | jq -r '.items[] | select(.status.containerStatuses[]? | .lastState.terminated.reason == "OOMKilled") | "\(.metadata.namespace)/\(.metadata.name) restarts=\(.status.containerStatuses[0].restartCount)"'
# Check a specific pod
kubectl describe pod myapp-7f8c9d6b4-x2k9p -n production | grep -A5 "Last State"
# Last State: Terminated
# Reason: OOMKilled
# Exit Code: 137
# Started: Sun, 15 Mar 2026 02:14:33 +0000
# Finished: Sun, 15 Mar 2026 02:17:58 +0000
# Extract programmatically
kubectl get pod myapp-7f8c9d6b4-x2k9p -o jsonpath='{.status.containerStatuses[0].lastState.terminated.reason}'
# OOMKilled
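To sanity-check the jq filter without touching a cluster, run it against a saved snapshot; the JSON below is a minimal, hypothetical sample shaped like kubectl get pods -A -o json output:

```shell
# Minimal sample of `kubectl get pods -A -o json` (hypothetical pod names)
cat <<'EOF' > /tmp/pods-sample.json
{"items":[
  {"metadata":{"namespace":"production","name":"myapp-7f8c9d6b4-x2k9p"},
   "status":{"containerStatuses":[{"restartCount":4,"lastState":{"terminated":{"reason":"OOMKilled"}}}]}},
  {"metadata":{"namespace":"default","name":"healthy-pod"},
   "status":{"containerStatuses":[{"restartCount":0,"lastState":{}}]}}
]}
EOF
# Same filter as the fleet-wide scan; only the OOMKilled pod should print
jq -r '.items[] | select(.status.containerStatuses[]? | .lastState.terminated.reason == "OOMKilled")
  | "\(.metadata.namespace)/\(.metadata.name) restarts=\(.status.containerStatuses[0].restartCount)"' /tmp/pods-sample.json
# production/myapp-7f8c9d6b4-x2k9p restarts=4
```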
Remember: OOMKill exit code mnemonic: 137 = 128 + 9 (SIGKILL). The kernel's OOM killer sends SIGKILL, so there is no graceful shutdown, no cleanup, no final log line. Exit code 137 means the process died from SIGKILL; paired with Reason: OOMKilled it confirms an OOM kill (a bare 137 can also come from kill -9 or the kubelet force-killing a container after the termination grace period expires). Exit code 143 (128 + 15, SIGTERM) is a normal graceful shutdown.
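The 128 + signal rule can be reproduced locally with any POSIX shell, no cluster needed:

```shell
# Kill a child shell with SIGKILL and SIGTERM; exit status is 128 + signal number
sh -c 'kill -9 $$' || echo "SIGKILL exit=$?"     # SIGKILL exit=137
sh -c 'kill -TERM $$' || echo "SIGTERM exit=$?"  # SIGTERM exit=143
```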
Check Current Memory Usage¶
# Pod-level memory usage
kubectl top pod -n production --sort-by=memory
# NAME CPU(cores) MEMORY(bytes)
# myapp-7f8c9d6b4-x2k9p 45m 498Mi
# myapp-7f8c9d6b4-r3m7q 38m 472Mi
# Check configured limits vs actual usage
kubectl get pod myapp-7f8c9d6b4-x2k9p -n production \
-o jsonpath='{.spec.containers[0].resources}'
# {"limits":{"memory":"512Mi"},"requests":{"memory":"256Mi"}}
# Usage is 498Mi out of 512Mi limit — OOMKill is imminent
Under the hood: the kernel OOM-kills a container when its cgroup's memory usage hits the limit and reclaim fails; reclaimable (inactive) file cache is dropped before the kill. What kubectl top reports, and what the kubelet uses for eviction decisions, is container_memory_working_set_bytes: total usage minus inactive file cache. The working set can be well below total usage, which is why a container at "490Mi usage" with a 512Mi limit may survive; its working set might only be 400Mi, with the rest being cache the kernel can reclaim under pressure.
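A quick shell-arithmetic sketch of that relationship, using the hypothetical figures from the note (490Mi usage, of which 90Mi is assumed to be inactive file cache):

```shell
# Hypothetical figures: 490Mi total usage, 90Mi inactive (reclaimable) file cache
usage_mib=490
inactive_file_mib=90
# Working set = total usage minus reclaimable inactive file cache
working_set_mib=$((usage_mib - inactive_file_mib))
echo "working_set=${working_set_mib}Mi limit=512Mi"   # working_set=400Mi limit=512Mi
```

On a live cgroup v2 node the raw numbers come from memory.current and the inactive_file line of memory.stat under /sys/fs/cgroup.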
Determine Container-Level vs Node-Level OOM¶
# Container-level OOM: single pod, reason is OOMKilled
kubectl describe pod myapp-7f8c9d6b4-x2k9p | grep -i "oomkilled"
# Node-level OOM: check kernel logs on the node
# SSH to the node where the pod was running
kubectl get pod myapp-7f8c9d6b4-x2k9p -o jsonpath='{.spec.nodeName}'
# worker-01
ssh worker-01
dmesg | grep -i "oom-kill\|out of memory" | tail -10
# [482918.392] myapp invoked oom-killer: gfp_mask=0xcc0, order=0
# [482918.401] Memory cgroup out of memory: Killed process 18234 (java)
# Check kubelet eviction activity
journalctl -u kubelet | grep -i "evict\|memory" | tail -10
# Check node memory pressure
kubectl describe node worker-01 | grep MemoryPressure
# MemoryPressure False (or True if under pressure)
Debug clue: dmesg shows "Memory cgroup out of memory" for container-level kills (cgroup limit hit) vs "Out of memory: Kill process" for node-level kills (system-wide OOM). The distinction matters: container-level means your limit is too low; node-level means the node is overcommitted and you may need to evict other workloads or add capacity.
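For log-scraping scripts, the dmesg distinction can be encoded as a small classifier; classify_oom is a hypothetical helper, and exact kernel message wording varies slightly across versions:

```shell
# classify_oom is a hypothetical helper: maps a dmesg line to the OOM kill scope
classify_oom() {
  case "$1" in
    *"Memory cgroup out of memory"*) echo "container-level (cgroup limit hit)" ;;
    *"Out of memory:"*)              echo "node-level (system-wide OOM)" ;;
    *)                               echo "not an OOM kill line" ;;
  esac
}
classify_oom "Memory cgroup out of memory: Killed process 18234 (java)"
# container-level (cgroup limit hit)
classify_oom "Out of memory: Killed process 9876 (chrome)"
# node-level (system-wide OOM)
```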
Check All Containers in the Pod¶
# Multi-container pods: the sidecar might be the memory hog
kubectl get pod myapp-7f8c9d6b4-x2k9p -o jsonpath='{range .spec.containers[*]}{.name}: limits={.resources.limits.memory}{"\n"}{end}'
# myapp: limits=512Mi
# istio-proxy: limits=256Mi
# Check which container was OOMKilled
kubectl get pod myapp-7f8c9d6b4-x2k9p -o json | \
jq '.status.containerStatuses[] | select(.lastState.terminated.reason == "OOMKilled") | .name'
Fix: Adjust Memory Limits¶
# Option 1: Increase limits in the deployment
kubectl set resources deployment/myapp -n production \
--limits=memory=1Gi --requests=memory=512Mi
# Option 2: Patch the deployment YAML directly
kubectl patch deployment myapp -n production -p \
'{"spec":{"template":{"spec":{"containers":[{"name":"myapp","resources":{"limits":{"memory":"1Gi"},"requests":{"memory":"512Mi"}}}]}}}}'
# Verify the change rolled out
kubectl rollout status deployment/myapp -n production
Default trap: kubectl set resources with --limits but no --requests leaves requests unchanged. If you raise limits to 1Gi but requests stay at 256Mi, the pod's QoS class remains Burstable (not Guaranteed), and it will be evicted before Guaranteed pods under memory pressure. Always set both together for critical workloads.
Fix: JVM Memory Configuration¶
# Check current JVM flags
kubectl exec -it myapp-abc123 -- java -XX:+PrintFlagsFinal -version 2>&1 | grep -i "maxheap\|maxram"
# Modern JVM (11+): use percentage of container memory
# Set JAVA_OPTS in the deployment:
# env:
# - name: JAVA_OPTS
# value: "-XX:MaxRAMPercentage=75.0"
# This leaves 25% for non-heap: metaspace, thread stacks, native allocs, NIO buffers
# Check what the JVM sees as available memory
kubectl exec -it myapp-abc123 -- java -XshowSettings:vm -version 2>&1 | grep "Max Heap"
Gotcha: JVMs before 10 (container support was later backported to 8u191) do not respect container memory limits; they see the host's total RAM and size the default heap from it. A 512Mi container on a 64GB node gets a default heap of ~16GB and is OOM-killed almost immediately. Use JVM 11+ with -XX:MaxRAMPercentage, or explicitly set -Xmx to a value safely below the container limit.
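A back-of-envelope sketch of what MaxRAMPercentage=75 yields in a 512Mi container (the real JVM applies its own rounding, so treat this as approximate):

```shell
# Approximate heap for -XX:MaxRAMPercentage=75 in a 512Mi container (sketch)
limit_mib=512
max_ram_percentage=75
heap_mib=$((limit_mib * max_ram_percentage / 100))
echo "max heap ~ ${heap_mib}Mi, leaving $((limit_mib - heap_mib))Mi for metaspace/stacks/native"
# max heap ~ 384Mi, leaving 128Mi for metaspace/stacks/native
```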
VPA Recommendations¶
# If VPA is installed, check its recommendations
kubectl get vpa myapp-vpa -n production -o json | \
jq '.status.recommendation.containerRecommendations[] | {container: .containerName, target: .target.memory, upperBound: .upperBound.memory}'
# {
# "container": "myapp",
# "target": "384Mi",
# "upperBound": "600Mi"
# }
# Use the upperBound as your limit, target as your request
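That mapping can be scripted; the snippet below runs the same jq shape against a saved, hypothetical VPA status and emits ready-to-paste kubectl set resources flags:

```shell
# Hypothetical VPA status mirroring the recommendation shown above
cat <<'EOF' > /tmp/vpa-sample.json
{"status":{"recommendation":{"containerRecommendations":[
  {"containerName":"myapp","target":{"memory":"384Mi"},"upperBound":{"memory":"600Mi"}}]}}}
EOF
# target -> request, upperBound -> limit
jq -r '.status.recommendation.containerRecommendations[]
  | "--requests=memory=\(.target.memory) --limits=memory=\(.upperBound.memory)"' /tmp/vpa-sample.json
# --requests=memory=384Mi --limits=memory=600Mi
```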
Monitor Before Setting Limits¶
# Watch memory usage over time (run in a loop)
while true; do
  echo "$(date): $(kubectl top pod -n production -l app=myapp --no-headers | awk '{print $1, $3}')"
  sleep 30
done
# Prometheus query for container memory near limit
# container_memory_working_set_bytes{container="myapp"}
# / on(namespace, pod, container)
# container_spec_memory_limit_bytes{container="myapp"} > 0.8
Scale note: At fleet scale, set up Prometheus alerting rules on container_memory_working_set_bytes / container_spec_memory_limit_bytes > 0.85 to catch OOM kills before they happen. A 15% buffer gives you time to right-size limits or investigate leaks before the kernel kills the process.
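If you run the prometheus-operator, that query can be wired into a PrometheusRule; the manifest below is a sketch (rule and alert names are made up), with the denominator filtered to > 0 so containers without limits don't divide by zero:

```shell
cat <<'EOF' | kubectl apply -f -
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: memory-near-limit
  namespace: production
spec:
  groups:
  - name: oom.rules
    rules:
    - alert: ContainerMemoryNearLimit
      expr: |
        container_memory_working_set_bytes{container!=""}
          / on(namespace, pod, container)
        (container_spec_memory_limit_bytes{container!=""} > 0)
          > 0.85
      for: 5m
      labels:
        severity: warning
      annotations:
        summary: "{{ $labels.namespace }}/{{ $labels.pod }} memory above 85% of limit"
EOF
```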
QoS Class Check¶
# Check pod QoS class
kubectl get pod myapp-7f8c9d6b4-x2k9p -o jsonpath='{.status.qosClass}'
# Burstable
# For critical workloads, set requests == limits for Guaranteed QoS
# Guaranteed pods are last to be evicted under node memory pressure
# Find all BestEffort pods (no limits set — first to be killed)
kubectl get pods -A -o json | jq -r '.items[] | select(.status.qosClass == "BestEffort") | "\(.metadata.namespace)/\(.metadata.name)"'
Remember: QoS eviction order mnemonic: B-B-G — BestEffort (killed first, no guarantees), Burstable (killed next, partial guarantees), Guaranteed (killed last, requests == limits). For production databases and stateful workloads, always set requests equal to limits to get Guaranteed QoS.
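The B-B-G ordering follows mechanically from requests and limits; qos_of below is a hypothetical helper that approximates the memory side of the classification (the real rules also consider CPU, every container in the pod, and defaulting of requests from limits):

```shell
# qos_of is a hypothetical helper: memory-only approximation of QoS classification
qos_of() {
  req="$1"; lim="$2"
  if [ -z "$req" ] && [ -z "$lim" ]; then echo "BestEffort"
  elif [ -n "$req" ] && [ "$req" = "$lim" ]; then echo "Guaranteed"
  else echo "Burstable"
  fi
}
qos_of "" ""            # BestEffort
qos_of 256Mi 512Mi      # Burstable
qos_of 512Mi 512Mi      # Guaranteed
```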
Prevention: LimitRange and ResourceQuota¶
# Set default limits for pods that forget to specify them
cat <<'EOF' | kubectl apply -f -
apiVersion: v1
kind: LimitRange
metadata:
  name: mem-limit-range
  namespace: production
spec:
  limits:
  - default:
      memory: 512Mi
    defaultRequest:
      memory: 256Mi
    type: Container
EOF
# Enforce namespace-level memory budget
cat <<'EOF' | kubectl apply -f -
apiVersion: v1
kind: ResourceQuota
metadata:
  name: mem-quota
  namespace: production
spec:
  hard:
    requests.memory: 8Gi
    limits.memory: 16Gi
EOF
# Check quota usage
kubectl describe resourcequota mem-quota -n production