OOMKilled — Trivia & Interesting Facts¶
Surprising, historical, and little-known facts about OOMKilled in Kubernetes and Linux.
The OOM killer is a Linux kernel feature, not a Kubernetes feature¶
Kubernetes did not invent OOM killing — the Linux kernel's Out-Of-Memory killer predates Kubernetes by well over a decade, going back to the 2.x kernel series. When physical memory and swap are exhausted, the kernel must kill a process to survive. The OOM killer selects a victim using a scoring heuristic (oom_score) driven primarily by memory consumption, adjusted by per-process tunables; older kernels also weighed factors such as process age and privilege level. Kubernetes leverages this mechanism through cgroup memory limits.
The OOM killer uses a scoring algorithm, and you can influence it¶
Every process has an oom_score (0-1000) in /proc/<pid>/oom_score and an adjustable oom_score_adj (-1000 to 1000) in /proc/<pid>/oom_score_adj. Setting oom_score_adj to -1000 makes a process immune to OOM killing. Kubelet sets oom_score_adj based on QoS class: Guaranteed pods get -997, BestEffort pods get 1000, and Burstable pods get a value in between. This is why BestEffort pods are always killed first.
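The "value in between" for Burstable pods comes from a kubelet heuristic that scales with how small the memory request is relative to node capacity. A minimal sketch of that formula — paraphrased from kubelet's QoS policy code, with clamping details that vary by version:

```shell
# Kubelet's Burstable heuristic (assumption, paraphrased from kubelet source):
# oom_score_adj ≈ 1000 - 1000 * memory_request / node_capacity,
# clamped to stay between the Guaranteed (-997) and BestEffort (1000) extremes.
request=$((256 * 1024 * 1024))        # a 256 MiB memory request
capacity=$((8 * 1024 * 1024 * 1024))  # an 8 GiB node
echo $(( 1000 - 1000 * request / capacity ))  # → 969
```

The smaller the request relative to the node, the closer the score sits to BestEffort's 1000 — that is, the more expendable the pod looks to the OOM killer.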
Exit code 137 is the universal OOMKilled fingerprint¶
A container killed by the OOM killer exits with code 137, which is 128 + 9 (SIGKILL). The process receives no warning — SIGKILL cannot be caught, blocked, or ignored. There is no graceful shutdown, no cleanup, no final log message. The last log line before an OOM kill often has nothing to do with the actual cause, which makes post-mortem analysis particularly frustrating.
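The 128 + 9 arithmetic is easy to verify from any shell — kill a subshell with SIGKILL and inspect its exit status:

```shell
# SIGKILL is signal 9; a process killed by it exits with status 128 + 9 = 137
sh -c 'kill -KILL $$'   # the subshell sends SIGKILL to itself
echo "exit code: $?"    # → exit code: 137
```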
Container memory limits trigger OOM before system memory is exhausted¶
In Kubernetes, OOM kills happen at the cgroup level, not the system level. A container with a 256 MB memory limit is killed when it exceeds 256 MB, even if the node has 100 GB of free memory. This is intentional — cgroup limits protect other containers on the same node. The kernel's cgroup OOM killer is distinct from the global OOM killer and acts on the specific cgroup that exceeded its limit.
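A minimal pod spec illustrating the point — the names and image are placeholders; only the resources block matters:

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: limited-demo          # illustrative name
spec:
  containers:
  - name: app
    image: registry.example.com/app:latest   # placeholder image
    resources:
      requests:
        memory: "128Mi"
      limits:
        memory: "256Mi"       # the cgroup OOM killer fires at 256Mi,
                              # regardless of free memory on the node
```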
RSS, cache, and swap all count toward the memory limit¶
The memory limit applies to the cgroup's total charged memory, which includes: RSS (resident set size — actual program data), page cache (files read or written by the process), and, depending on cgroup configuration, swap (cgroup v1 can cap memory plus swap together; cgroup v2 limits swap separately). Page cache surprises many teams: a process that reads a 500 MB file charges up to 500 MB against the cgroup even if the application's heap is only 50 MB. The kernel reclaims clean page cache before resorting to an OOM kill, but dirty pages awaiting writeback cannot be dropped instantly — which is why I/O-heavy applications can get OOM-killed with seemingly low heap usage.
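One way to see the distinction in numbers: the "working set" that kubelet watches (and that container_memory_working_set_bytes reports) is total cgroup usage minus the inactive page cache the kernel could reclaim. A toy calculation with illustrative byte values — the field names match cgroup v2's memory.current and memory.stat, but the numbers are invented:

```shell
# Working set = total cgroup usage minus inactive (reclaimable) page cache.
# The values below are illustrative, not read from a real cgroup.
usage=268435456          # memory.current: RSS + page cache + kernel memory
inactive_file=104857600  # memory.stat inactive_file: cache the kernel can drop
echo $(( usage - inactive_file ))   # → 163577856
```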
Java applications are the most common OOMKilled victims in Kubernetes¶
The JVM allocates memory outside the heap (metaspace, thread stacks, direct buffers, JIT-compiled code, GC overhead) that is not controlled by -Xmx. A container with a 512 MB limit and -Xmx512m is virtually guaranteed to get OOMKilled once the heap fills, because total JVM memory = heap + non-heap, which exceeds 512 MB. The rule of thumb: set the container limit to at least 1.5-2x the -Xmx value. Java 10+ (backported to 8u191) added -XX:MaxRAMPercentage, which sizes the heap as a percentage of the container's memory limit.
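One way to apply the percentage-based approach is through an environment variable in the pod spec — the image and limit below are illustrative; JAVA_TOOL_OPTIONS and MaxRAMPercentage are real HotSpot mechanisms:

```yaml
containers:
- name: app
  image: eclipse-temurin:17-jre        # illustrative JVM base image
  env:
  - name: JAVA_TOOL_OPTIONS            # picked up automatically by HotSpot
    value: "-XX:MaxRAMPercentage=75.0" # heap = 75% of the container limit
  resources:
    limits:
      memory: "512Mi"                  # heap ≈ 384Mi, leaving headroom for non-heap
```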
Go programs can OOMKill themselves through goroutine stack growth¶
Each Go goroutine starts with a tiny stack (2 KB in modern runtimes) that grows dynamically. A program spawning millions of goroutines can consume gigabytes of memory in stack space alone, with no single large allocation visible in memory profiles. Goroutine stacks are not counted in runtime.MemStats heap metrics such as HeapAlloc — they appear separately in StackInuse, which most dashboards never chart — so standard heap-focused monitoring misses this growth pattern entirely. The OOMKill appears sudden and unexplained.
The dmesg output is the definitive OOMKilled diagnostic¶
When the cgroup OOM killer activates, it writes a detailed message to the kernel ring buffer (visible via dmesg on the node). This message includes: the killed process name, its RSS, the cgroup path, the memory limit, and a table of every process in the cgroup with their individual memory usage. This information does not appear in kubectl logs, kubectl describe pod, or application logs — only in dmesg or the node's syslog.
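On the node, dmesg -T piped through a filter like the one below locates the record. As a runnable sketch, here is that filter applied to a representative log line — the PID, process name, and sizes are invented, but the format mirrors real cgroup OOM output:

```shell
# Extract the victim from a cgroup OOM record; the sample line is
# representative of real dmesg output, not captured from a live node.
line='Memory cgroup out of memory: Killed process 13371 (java) total-vm:4832000kB, anon-rss:498000kB, file-rss:0kB'
echo "$line" | grep -o 'Killed process [0-9]* ([^)]*)'   # → Killed process 13371 (java)
```

On a real node, replace the echo with dmesg -T (or journalctl -k) as the input.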
Memory requests below actual usage cause pods to be evicted, not OOMKilled¶
OOMKilled (exit code 137) happens when a process exceeds the memory limit. But there is a different mechanism: kubelet eviction. When node memory pressure is high, kubelet preferentially evicts pods whose memory usage exceeds their memory request (even if it is still below their limit). The evicted pod is terminated with a "The node was low on resource: memory" message. This is softer than an OOM kill but equally disruptive, and it confuses teams who set limits correctly but forget about requests.
Memory ballooning and gradual leaks are harder to catch than sudden spikes¶
An application that leaks 1 MB per hour will run fine for days or weeks before hitting the memory limit and getting OOMKilled. The OOMKill appears sudden, but the root cause started long ago. Continuous memory monitoring (Prometheus + Grafana with container_memory_working_set_bytes) is the standard approach to catching slow leaks. Setting alerts at 80% of the memory limit gives teams hours to investigate before the kill.
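A sketch of such an alert as a Prometheus rule — the group and alert names are illustrative, and the limit metric assumes kube-state-metrics is installed:

```yaml
groups:
- name: memory-alerts              # illustrative group name
  rules:
  - alert: ContainerMemoryNearLimit
    expr: |
      max by (namespace, pod, container) (container_memory_working_set_bytes{container!=""})
        / max by (namespace, pod, container) (kube_pod_container_resource_limits{resource="memory"})
        > 0.8
    for: 15m                       # sustained usage, not a momentary spike
    annotations:
      summary: "Container above 80% of its memory limit"
```

The max by (...) aggregation on both sides keeps the division one-to-one even when cAdvisor emits duplicate series for a container.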
Kubernetes reports OOMKilled in pod status, but the details are elsewhere¶
kubectl describe pod shows Reason: OOMKilled in the container status, but it does not show which process was killed, how much memory it used, or what the limit was. For those details, you need: the node's dmesg (process-level information), the container's cgroup memory stats (real-time usage), or Prometheus metrics (historical trend). The gap between "Kubernetes tells you it happened" and "you can figure out why" is significant.
Swap in Kubernetes was disabled for a decade, then cautiously re-enabled¶
Kubernetes nodes traditionally required swap to be disabled (swapoff -a), and kubelet refused to start if swap was enabled. The reasoning: swap makes memory accounting unpredictable, and the OOM killer's behavior with swap is complex. Kubernetes 1.28 (August 2023) promoted the NodeSwap feature to beta, allowing limited swap usage within cgroup v2 memory limits. This is especially valuable for burstable workloads that briefly exceed their memory allocation.
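A sketch of the node-level configuration involved (Kubernetes 1.28+, cgroup v2 required; field names are from the KubeletConfiguration API, though defaults and behaviors have shifted between releases):

```yaml
apiVersion: kubelet.config.k8s.io/v1beta1
kind: KubeletConfiguration
failSwapOn: false            # allow kubelet to start with swap enabled
featureGates:
  NodeSwap: true             # beta feature gate as of 1.28
memorySwap:
  swapBehavior: LimitedSwap  # swap capped relative to the pod's memory settings
```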