OOMKilled Footguns¶
Mistakes that cause unpredictable pod kills, wasted investigation time, and cascading memory failures.
1. JVM heap size exceeding container memory limit¶
You set -Xmx1g for a Java app in a container limited to 512Mi. The JVM starts (heap space is only reserved as virtual memory), but as soon as the application touches more memory than the 512MB cgroup ceiling allows, the kernel kills the process.
What happens: Immediate OOMKill on startup or under first load. Exit code 137.
Why: An explicit -Xmx overrides the JVM's container awareness. The JVM commits more memory than the cgroup allows, regardless of the limit.
How to avoid: Use -XX:MaxRAMPercentage=75.0 instead of -Xmx. This tells the JVM to use 75% of the container's memory limit, leaving headroom for non-heap allocations.
Under the hood: The JVM's total memory consumption is heap + metaspace + thread stacks + direct buffers + native memory + GC overhead.
-Xmx only controls heap. A JVM with -Xmx512m can easily consume 700-800MB total. The 75% rule for MaxRAMPercentage leaves room for these non-heap allocations. For native-memory-heavy workloads (Netty, gRPC), use 60-65% instead.
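A minimal Deployment sketch of this sizing, assuming a placeholder app named myapp; JAVA_TOOL_OPTIONS is a standard environment variable the JVM reads at startup:

```yaml
# Sketch: container-aware JVM sizing. "myapp" and the image tag are placeholders.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: myapp
spec:
  selector:
    matchLabels: {app: myapp}
  template:
    metadata:
      labels: {app: myapp}
    spec:
      containers:
        - name: myapp
          image: myapp:latest
          env:
            - name: JAVA_TOOL_OPTIONS
              value: "-XX:MaxRAMPercentage=75.0"  # heap = 75% of the limit below
          resources:
            requests:
              memory: "512Mi"
            limits:
              memory: "512Mi"  # the JVM reads this cgroup limit as available RAM
```

With this setup the heap scales automatically if you later raise the memory limit, instead of requiring a matching -Xmx change.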
2. No resource limits set (BestEffort QoS)¶
You deploy pods without any resources.limits.memory. One pod has a memory leak. It consumes all available node memory. The kernel's global OOM killer fires and kills random pods on the node — not just the leaking one.
What happens: Collateral damage. Other pods on the same node are killed because one pod is misbehaving.
Why: Without limits, the container has no cgroup ceiling. Node-level OOM affects all processes.
How to avoid: Always set memory limits. Even a generous limit is better than none. Pods without limits get BestEffort QoS and are first in line for eviction.
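A minimal container resources fragment illustrating the point; values are illustrative, not recommendations:

```yaml
# Sketch: even a generous limit beats none. With a limit, only this
# container is killed if it leaks; without one, the node-level OOM
# killer can take out unrelated pods.
resources:
  requests:
    memory: "256Mi"  # what the scheduler reserves on the node
  limits:
    memory: "1Gi"    # cgroup ceiling; the blast radius of a leak
```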
3. Confusing container_memory_usage_bytes with container_memory_working_set_bytes¶
You set Prometheus alerts on container_memory_usage_bytes. The metric includes reclaimable page cache, so it shows 90% usage even when actual application memory is at 50%. Your alerts fire constantly and you ignore them. Then a real OOM happens.
What happens: Alert fatigue from false positives, leading to missed real OOMKill events.
Why: container_memory_usage_bytes includes kernel page cache, which the kernel can reclaim under pressure. container_memory_working_set_bytes excludes reclaimable cache and is the metric the kubelet uses for eviction decisions.
How to avoid: Use container_memory_working_set_bytes for OOM prediction. It approximates the memory the kernel cannot reclaim, which is what matters when the cgroup approaches its limit.
Debug clue: To see what the kernel sees, check the cgroup directly:
cat /sys/fs/cgroup/memory/kubepods/pod<uid>/<container-id>/memory.usage_in_bytes (v1) or memory.current (v2). Compare with memory.stat to break down anonymous pages vs cache. The OOM killer fires when memory.current exceeds memory.max and the kernel cannot reclaim enough cache pages.
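A sketch of an alert on the right metric, assuming cAdvisor metrics (exposed by the kubelet) and kube-state-metrics for the limit series; the group and alert names are placeholders:

```yaml
# Sketch: alert on working set vs. the memory limit, not usage_bytes.
groups:
  - name: oom-prediction
    rules:
      - alert: ContainerNearMemoryLimit
        expr: |
          container_memory_working_set_bytes{container!=""}
            / on (namespace, pod, container)
          kube_pod_container_resource_limits{resource="memory"} > 0.9
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "{{ $labels.pod }}/{{ $labels.container }} working set above 90% of limit"
```

Because working_set excludes reclaimable cache, this fires on genuine pressure rather than on a warm page cache.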
4. Ignoring sidecar memory consumption¶
Your app container is limited to 512Mi and uses 400Mi. An Istio sidecar proxy uses 150Mi. Each container has its own limit, but you sized the app limit without accounting for total pod memory on the node.
What happens: Node runs out of schedulable memory faster than expected because sidecar overhead was not budgeted.
Why: Each container in a pod has its own resource limits. The node must have enough capacity for all containers across all pods.
How to avoid: Account for sidecar memory when capacity planning. Check all containers: kubectl top pod myapp --containers.
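A pod spec fragment showing the budgeting; container names and values mirror the example above and are placeholders:

```yaml
# Sketch: the scheduler and the node must fit the SUM of all
# containers in the pod (662Mi here), not just the app container.
spec:
  containers:
    - name: myapp
      resources:
        limits:
          memory: "512Mi"
    - name: istio-proxy   # sidecar memory counts toward node capacity too
      resources:
        limits:
          memory: "150Mi"
```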
5. Setting request equal to limit for everything (over-provisioning)¶
You set requests.memory: 2Gi and limits.memory: 2Gi on every pod for Guaranteed QoS. Most pods use 500Mi. Your cluster can only schedule 1/4 as many pods as it could with realistic requests.
What happens: Wasted cluster capacity. Pods go Pending because the scheduler reserves more memory than workloads actually use.
Why: Requests are the scheduler's guarantee. If you request 2Gi, 2Gi is reserved on the node even if the pod only uses 500Mi.
How to avoid: Set requests based on typical usage and limits based on peak usage. Use Guaranteed QoS only for truly critical workloads. Use VPA recommendations to right-size.
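A fragment contrasting typical vs. peak sizing; the numbers follow the example above and are illustrative:

```yaml
# Sketch: request typical usage, limit peak usage (Burstable QoS).
# The scheduler packs nodes against requests, so realistic requests
# mean roughly 4x the pod density of requests == limits == 2Gi here.
resources:
  requests:
    memory: "512Mi"  # ~typical usage; reserved on the node
  limits:
    memory: "2Gi"    # peak headroom; cgroup ceiling
```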
6. Memory leak diagnosed as "needs more memory"¶
The pod gets OOMKilled after 6 hours. You increase the limit from 512Mi to 1Gi. Now it gets OOMKilled after 12 hours. You increase to 2Gi. It gets OOMKilled after 24 hours.
What happens: You keep increasing limits but the pod always eventually dies. The root cause is a memory leak, not undersized limits.
Why: A memory leak means usage grows linearly over time regardless of the limit. More memory just delays the inevitable.
How to avoid: If OOMKill time scales linearly with limit size, it is a leak. Profile the application with language-specific tools (pprof for Go, heap dumps for Java, tracemalloc for Python). Fix the leak instead of raising the limit.
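One way to catch the linear-growth signature before the kill is a predictive alert. This is a sketch assuming cAdvisor and kube-state-metrics series as in the earlier alert example; the 6h window and 4h horizon are assumptions to tune:

```yaml
# Sketch: flag containers whose working set is on track to exceed
# the memory limit, i.e. the linear-growth pattern of a leak.
groups:
  - name: leak-detection
    rules:
      - alert: PossibleMemoryLeak
        expr: |
          predict_linear(container_memory_working_set_bytes{container!=""}[6h], 4 * 3600)
            / on (namespace, pod, container)
          kube_pod_container_resource_limits{resource="memory"} > 1
        for: 30m
        annotations:
          summary: "{{ $labels.pod }} projected to exceed its memory limit within 4h"
```

The alert buys you time to take a heap profile while the process is still alive, which is far more useful than a post-mortem exit code 137.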
7. LimitRange defaults that are too low¶
You set a namespace LimitRange with default.memory: 128Mi as a safety net. New deployments inherit this limit. A developer deploys a service that needs 500Mi at startup. It gets OOMKilled immediately. The developer spends an hour debugging before discovering the LimitRange.
What happens: Silent, unexpected OOMKills on new deployments that do not explicitly set limits.
Why: LimitRange defaults apply when a container does not specify its own limits. Developers may not know the default exists.
How to avoid: Set LimitRange defaults to reasonable values (not minimums). Document the namespace defaults. Include the LimitRange info in your team's deployment templates.
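A LimitRange sketch with defaults sized for a typical service rather than a floor; the namespace and object names are placeholders:

```yaml
# Sketch: defaults that a typical service can actually start under.
apiVersion: v1
kind: LimitRange
metadata:
  name: mem-defaults
  namespace: team-a        # placeholder namespace
spec:
  limits:
    - type: Container
      default:
        memory: "768Mi"    # applied when a container sets no limit
      defaultRequest:
        memory: "512Mi"    # applied when a container sets no request
```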
8. Not checking oom_score_adj after QoS changes¶
You change a pod from Burstable to Guaranteed QoS by setting requests equal to limits. But a configuration management tool resets the deployment, and the pod reverts to Burstable. Under node pressure, it gets evicted before the critical pods you intended to protect.
What happens: Pod eviction priority is not what you expected. Critical workload is killed before less important ones.
Why: QoS class determines oom_score_adj. Burstable pods (2-999) are killed before Guaranteed pods (-997). If the QoS class reverted, the protection reverted too.
How to avoid: Verify QoS class after deployment: kubectl get pod <name> -o jsonpath='{.status.qosClass}'. Include QoS class checks in your CI pipeline.
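For reference, the shape that yields Guaranteed QoS; a pod only qualifies when requests equal limits for every resource in every container:

```yaml
# Sketch: Guaranteed QoS. Requests must equal limits for BOTH cpu and
# memory, in every container of the pod; any mismatch drops the pod
# back to Burstable and its oom_score_adj protection with it.
resources:
  requests:
    cpu: "500m"
    memory: "1Gi"
  limits:
    cpu: "500m"
    memory: "1Gi"
```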
9. ResourceQuota blocking legitimate deployments¶
You set a namespace ResourceQuota with limits.memory: 8Gi. Teams deploy more workloads. New pods are rejected with "exceeded quota" but the error message is buried in events. Developers think the cluster is full.
What happens: Deployment failures that look like capacity issues but are actually quota enforcement.
Why: ResourceQuota is enforced at admission time. Pods that would exceed the quota are rejected before scheduling.
How to avoid: Monitor quota usage: kubectl describe resourcequota -n <ns>. Alert at 80% quota utilization. Communicate quota limits to development teams.
Gotcha: ResourceQuota errors appear in kubectl get events, not in kubectl get pods. The pod simply doesn't exist, so there's nothing to describe. Developers run kubectl get pods, see no new pods, and assume the cluster is broken. Check events with kubectl get events -n <ns> --sort-by='.lastTimestamp' | grep quota to surface the real error.
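A ResourceQuota sketch matching the 8Gi example above; namespace and object names are placeholders:

```yaml
# Sketch: namespace quota. Admission rejects any pod whose totals
# would push the namespace past these ceilings.
apiVersion: v1
kind: ResourceQuota
metadata:
  name: mem-quota
  namespace: team-a         # placeholder namespace
spec:
  hard:
    limits.memory: "8Gi"    # sum of all containers' memory limits
    requests.memory: "6Gi"  # sum of all requests; optional but useful
```

Note that once limits.memory appears in a quota, every pod in the namespace must set a memory limit (or inherit one from a LimitRange), or it is rejected outright.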