
OOMKilled


10 cards — 🟢 3 easy | 🟡 5 medium | 🔴 2 hard

🟢 Easy (3)

1. What does OOMKilled mean in Kubernetes?

OOMKilled means the container exceeded its memory limit and the kernel's Out-Of-Memory (OOM) killer terminated it. The container exits with code 137 (128 + signal number 9, SIGKILL). The kubelet restarts the container per the pod's restartPolicy, and the restart count increments.

Remember: OOMKilled = exceeded memory limit. Exit 137 = 128+SIGKILL(9).

Gotcha: Can be container limit OR node OOM. Check both.

Name origin: OOM = Out Of Memory. The Linux OOM killer was introduced to prevent the entire system from hanging when memory is exhausted.

Number anchor: Exit code 137 = 128 + 9 (SIGKILL). This math is universal in Unix: exit code = 128 + signal number.
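The 128 + signal convention can be checked directly with Python's standard signal module (a quick illustration of the arithmetic, not anything Kubernetes-specific; the helper name is ours):

```python
import signal

def decode_exit_code(code: int) -> str:
    """Map a shell-style exit code back to the signal that caused it, if any."""
    if code > 128:
        # Exit code = 128 + signal number, so subtract to recover the signal.
        return signal.Signals(code - 128).name
    return f"exited normally with status {code}"

print(decode_exit_code(137))  # SIGKILL (128 + 9, the OOM kill signature)
print(decode_exit_code(143))  # SIGTERM (128 + 15, a graceful shutdown)
```

The same decoding works for any signal-death exit code, which is why 143 (SIGTERM) versus 137 (SIGKILL) is a quick way to tell a graceful stop from an OOM kill.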

2. How do you confirm a pod was OOMKilled?

Run kubectl describe pod and look for Last State: Terminated, Reason: OOMKilled, Exit Code: 137. You can also check kubectl get pod -o jsonpath='{.status.containerStatuses[0].lastState}'. At the node level, dmesg | grep -i oom on the node shows the kernel OOM killer invocation.

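The jsonpath check can also be done programmatically. A minimal sketch, assuming you have already parsed the output of kubectl get pod -o json into a dict (the function name and sample status below are illustrative):

```python
def was_oom_killed(pod: dict) -> bool:
    """Return True if any container in the pod was last terminated by the OOM killer."""
    for cs in pod.get("status", {}).get("containerStatuses", []):
        terminated = cs.get("lastState", {}).get("terminated")
        if terminated and terminated.get("reason") == "OOMKilled":
            return True
    return False

# Illustrative status shaped like kubectl reports it after an OOM kill:
pod = {"status": {"containerStatuses": [
    {"name": "app", "lastState": {"terminated": {"reason": "OOMKilled", "exitCode": 137}}},
]}}
print(was_oom_killed(pod))  # True
```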

3. What is the difference between OOMKilled and eviction?

OOMKilled: the container exceeded its cgroup memory limit and was killed by the kernel immediately. Eviction: the kubelet proactively removes pods when node memory pressure is high (memory.available below the eviction threshold). Evicted pods are rescheduled to other nodes; OOMKilled containers restart in place. Eviction is gentler: pods receive SIGTERM first.


🟡 Medium (5)

1. What is the difference between memory requests and limits in the OOM context?

Requests determine scheduling (which node has enough free memory). Limits set the hard ceiling: exceeding the limit triggers an OOM kill. A pod can use more than its request (if the node has slack) but never more than its limit. Under node memory pressure, pods using more than their request are evicted first.

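In a pod spec, requests and limits are set per container. A typical shape (the values here are illustrative; size them from real profiling data):

```yaml
# Per-container in the pod spec; values are examples only.
resources:
  requests:
    memory: "256Mi"   # used by the scheduler to place the pod
  limits:
    memory: "512Mi"   # hard cgroup ceiling; exceeding it -> OOMKilled
```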

2. Why are Java applications especially prone to OOMKilled in containers?

The JVM allocates a heap plus off-heap memory (Metaspace, thread stacks, JIT code cache, direct buffers). If the container limit only accounts for the heap (-Xmx), off-heap usage can push total RSS over the limit. Fix: set -Xmx to roughly 75% of the container limit, or use -XX:MaxRAMPercentage=75.0, which sizes the heap from the cgroup limit.

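One common way to apply the percentage-based flag is through the container environment, since the JVM picks up JAVA_TOOL_OPTIONS automatically (illustrative container-spec fragment; the 1Gi limit is an example):

```yaml
# The JVM sizes its heap from the cgroup limit instead of a hardcoded -Xmx.
env:
  - name: JAVA_TOOL_OPTIONS
    value: "-XX:MaxRAMPercentage=75.0"
resources:
  limits:
    memory: "1Gi"   # heap caps at ~768Mi, leaving headroom for off-heap
```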

3. How do you determine the right memory limit for a container?

1) Run the application under realistic load. 2) Monitor container_memory_working_set_bytes in Prometheus (not RSS: the working set excludes reclaimable cache and is what the kill decision tracks). 3) Set the limit to 1.5-2x the observed peak. 4) Set the request to the average steady-state usage. 5) Use VPA (Vertical Pod Autoscaler) recommendations as a starting point.

Remember: OOMKilled = out of memory. Exit 137. Check pod limits AND node memory.

Gotcha: Prevention beats firefighting: set proper limits, monitor trends, fix leaks early.
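The sizing heuristic above can be sketched as a small helper. This follows the card's rules (request = steady-state average, limit = ~2x observed peak); the function name and sample numbers are ours:

```python
def suggest_memory(working_set_samples_mi: list[float]) -> dict:
    """Suggest request/limit (in Mi) from observed working-set samples.

    Heuristic: request = average steady-state usage,
    limit = ~2x the observed peak (upper end of the 1.5-2x range).
    """
    peak = max(working_set_samples_mi)
    avg = sum(working_set_samples_mi) / len(working_set_samples_mi)
    return {"request_mi": round(avg), "limit_mi": round(peak * 2)}

# Five working-set samples (Mi) taken under realistic load:
print(suggest_memory([210, 230, 250, 240, 300]))
# {'request_mi': 246, 'limit_mi': 600}
```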

4. What are strategies to prevent OOMKilled in production?

1) Right-size limits based on profiling, not guesswork. 2) For JVM apps, align -Xmx (or MaxRAMPercentage) with the container limit. 3) Allow swap for burst tolerance where supported (cgroup v2 memory.swap.max). 4) Use VPA for automatic recommendations. 5) Monitor memory trends and alert before limits are hit. 6) Fix memory leaks: RSS that grows steadily over time indicates a leak.


5. What Prometheus metrics help diagnose OOM risk?

container_memory_working_set_bytes / container_spec_memory_limit_bytes gives the utilization ratio; alert at 80-90%. kube_pod_container_status_last_terminated_reason{reason="OOMKilled"} (from kube-state-metrics) flags OOM events. container_memory_rss tracks resident set size. node_memory_MemAvailable_bytes tracks node-level pressure.

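A utilization alert built from the first metric pair might look like this PromQL (thresholds are illustrative; the limit filter avoids dividing by the 0 reported for containers with no limit set):

```promql
# Fires when a container's working set exceeds 90% of its memory limit.
(container_memory_working_set_bytes / container_spec_memory_limit_bytes) > 0.9
  and container_spec_memory_limit_bytes > 0

# Flags containers whose last termination was an OOM kill (kube-state-metrics):
kube_pod_container_status_last_terminated_reason{reason="OOMKilled"} == 1
```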

🔴 Hard (2)

1. How does the Linux OOM killer choose which process to kill?

The kernel calculates an oom_score for each process, based primarily on its memory usage plus the oom_score_adj adjustment; higher score = more likely to be killed. When a container exceeds its limit, the kernel kills processes within that cgroup. oom_score_adj can protect critical processes (-1000 = never kill) or mark them for sacrifice first (+1000).


2. How do cgroup memory limits relate to OOMKilled?

Kubernetes sets cgroup memory.limit_in_bytes (cgroup v1) or memory.max (cgroup v2) to the container's memory limit. When the cgroup's memory usage hits this limit and the kernel cannot reclaim enough pages, it invokes the OOM killer within that cgroup. The container receives SIGKILL: no graceful shutdown, no chance to catch the signal.

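The "no chance to catch the signal" point can be demonstrated locally: SIGKILL terminates a process immediately, and a shell (or Kubernetes) reports the death as 128 + 9 = 137. A POSIX-only sketch using a throwaway sleep process in place of a container workload:

```python
import signal
import subprocess

# Start a process, then SIGKILL it, mimicking what the kernel
# OOM killer does to a process in an over-limit cgroup.
proc = subprocess.Popen(["sleep", "30"])
proc.send_signal(signal.SIGKILL)
proc.wait()

print(proc.returncode)        # -9: Python reports death-by-signal as -signum
print(128 + signal.SIGKILL)   # 137: how a shell (and Kubernetes) reports it
```

No cleanup handler in the child would have run: SIGKILL cannot be caught, blocked, or ignored, which is why OOMKilled containers never flush buffers or close connections gracefully.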