# Out of Memory

Tags: lesson, linux-memory, oom-killer, cgroups, kubernetes-resources, containers, jvm, swap

Topics: Linux memory, OOM killer, cgroups, Kubernetes resources, containers, JVM, swap
Level: L1–L2 (Foundations → Operations)
Time: 60–90 minutes
Prerequisites: None (everything is explained from scratch)
## The Mission

Your monitoring dashboard lights up: a pod just restarted, and `kubectl describe pod` reports exit code 137.

Exit code 137 is 128 + 9 = SIGKILL. The kernel's OOM killer decided your process was using too much memory and executed it without warning. No graceful shutdown, no cleanup, no "I'm about to die" log line. One moment running, the next: dead.

"Out of memory" has at least six different root causes, and the fix for each is different. Is it a memory leak? A misconfigured limit? The JVM eating more than you expected? Page cache fooling you into thinking memory is full when it isn't? A missing swap partition that turns a slow problem into an instant kill?

This lesson teaches you to diagnose OOM kills systematically — confirm it happened, find what caused it, and fix the right layer.
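You can see the 128 + 9 arithmetic for yourself with any process. A minimal sketch (bash/sh; `sleep` stands in for your app):

```shell
# Start a background process, SIGKILL it, and inspect the exit code.
sleep 30 &
pid=$!
kill -9 "$pid"     # SIGKILL, signal number 9
wait "$pid"        # collect the child's status
code=$?
echo "exit code: $code"   # → exit code: 137 (128 + 9)
```

Any SIGKILL produces 137, which is exactly why the exit code alone can't tell you whether the OOM killer or a human ran `kill -9`.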
## What "Out of Memory" Actually Means

First, a critical concept that confuses everyone: Linux intentionally uses all available RAM for caching. High "used" memory is healthy. "Unused RAM is wasted RAM" is the kernel's philosophy.

```shell
free -h
#                total   used   free   shared   buff/cache   available
# Mem:             16G   4.2G   512M     128M        11.3G       11.1G
# Swap:             2G     0B     2G
```
| Column | What it means | Do you care? |
|---|---|---|
| `used` | Active process memory | Sometimes |
| `free` | Truly unused (wasted) | No — low `free` is normal |
| `buff/cache` | Used for caching, reclaimable under pressure | Not a problem |
| `available` | Memory available for new processes (free + reclaimable cache) | This is the number that matters |
Mental Model: Think of memory like a warehouse. `used` is inventory on shelves. `buff/cache` is boxes staged near the door for fast access — they can be moved back to make room. `available` is the actual floor space you can use. A warehouse with full shelves and staged boxes is efficient, not full.
The system is in trouble when available approaches zero — not when used is high.
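A minimal health check based on this idea, reading `MemAvailable` straight from `/proc/meminfo` (assumes Linux; the 10% threshold is an illustrative choice, not a kernel constant):

```shell
# Compare available memory to total, and warn as it approaches zero.
avail_kb=$(awk '/^MemAvailable/ {print $2}' /proc/meminfo)
total_kb=$(awk '/^MemTotal/ {print $2}' /proc/meminfo)
pct=$((100 * avail_kb / total_kb))
echo "available: ${pct}% of total RAM"
if [ "$pct" -lt 10 ]; then
  echo "WARNING: approaching memory pressure"
fi
```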
## The Diagnostic Ladder

When you see exit code 137 or OOMKilled:

```
OOM Kill detected
│
├── Confirm it's actually OOM (not app crash)
│     dmesg | grep -i "oom\|killed process"
│
├── Where did the OOM happen?
│   ├── Host-level? → System ran out of memory
│   ├── Cgroup/container? → Container hit its memory limit
│   └── Kubernetes? → Pod exceeded resource limit
│
├── Why is memory high?
│   ├── Memory leak? → RSS growing linearly over time
│   ├── Limit too tight? → RSS spikes transiently, then stabilizes
│   ├── JVM off-heap? → Heap fine, but metaspace/threads/direct buffers growing
│   └── Page cache? → File I/O filling cgroup limit, not actual process memory
│
└── Fix the right layer
    ├── Leak → Find and fix the code (don't just raise the limit)
    ├── Tight limit → Set limit = 1.5-2x observed peak
    ├── JVM → Cap every memory region explicitly
    └── Page cache → Separate file I/O to volumes outside the cgroup
```
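The first two rungs of the ladder can be run as a quick triage sketch on the affected host or node (assumes Linux; `dmesg` may need root, hence the redirect):

```shell
# Rung 1: any recent kernel OOM events?
echo "--- recent kernel OOM events ---"
dmesg 2>/dev/null | grep -iE "out of memory|oom|killed process" | tail -5

# Rung 2: is the host itself under pressure?
echo "--- memory availability ---"
free -h | awk 'NR <= 2'   # header + Mem: line; watch the 'available' column
```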
## Step 1: Confirm It's Actually OOM

Exit code 137 is SIGKILL, which usually means OOM — but it could also be a manual `kill -9` or a Kubernetes eviction. Confirm with the kernel log:

```shell
# On the host (or node)
sudo dmesg | grep -i "out of memory\|oom\|killed process" | tail -10
# → [Mar 22 14:23:01] Out of memory: Killed process 5678 (java) total-vm:4194304kB,
#   anon-rss:2097152kB, file-rss:0kB, shmem-rss:0kB, UID:1000,
#   oom_score_adj:1000

# In Kubernetes
kubectl describe pod myapp | grep -A5 "Last State"
kubectl get events -A | grep -i oom
```
The dmesg output tells you exactly what happened:
- `total-vm` = total virtual memory (requested, not all resident)
- `anon-rss` = anonymous pages (heap, stack — the real memory usage)
- `file-rss` = file-backed pages (cache)
- `oom_score_adj` = priority for killing (1000 = kill first, -1000 = protect)
## Step 2: Host OOM vs Container OOM

This distinction matters because the fix is completely different.
### Host-level OOM

The entire system ran out of memory. The OOM killer picks a victim from ALL processes:

```shell
# System memory state
free -h
# If 'available' is near zero, the system is under pressure

# What processes are using the most memory?
ps aux --sort=-%mem | head -10

# Check for swap activity (si/so columns)
vmstat 1 5
# si > 0 = swapping in, so > 0 = swapping out
# Heavy swap = system thrashing before OOM
```
### Container/cgroup OOM

The container hit its own memory limit, even though the host might have plenty of RAM:

```shell
# Check the container's cgroup limit vs usage
# Cgroups v2 (modern):
cat /sys/fs/cgroup/memory.max      # Hard limit
cat /sys/fs/cgroup/memory.current  # Current usage

# Docker:
docker stats --no-stream mycontainer
# → MEM USAGE / LIMIT   256MiB / 512MiB

# Kubernetes:
kubectl top pod myapp
kubectl describe pod myapp | grep -A5 "Limits"
```
Gotcha: `docker stats` and the Kubernetes `container_memory_usage_bytes` metric include page cache, which is reclaimable. This can make it look like your container is at its limit when it's actually fine. The metric that matters is `container_memory_working_set_bytes` — that's what the OOM killer looks at. In cgroups v2, check `memory.stat` for the breakdown between `anon` (real usage) and `file` (cache).
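The working set is essentially usage minus the reclaimable inactive file cache. A minimal sketch of that calculation, assuming the cgroups v2 file layout (the demo uses a fake directory so it runs anywhere; on a real host pass a path like `/sys/fs/cgroup/<group>`):

```shell
# working set ≈ memory.current - inactive_file from memory.stat
working_set_mib() {
  dir=$1
  current=$(cat "$dir/memory.current")
  inactive_file=$(awk '/^inactive_file / {print $2}' "$dir/memory.stat")
  echo $(( (current - inactive_file) / 1024 / 1024 ))
}

# Demo with a fabricated cgroup dir: 500MiB usage, 350MiB reclaimable cache
dir=$(mktemp -d)
echo $((500 * 1024 * 1024)) > "$dir/memory.current"
printf 'anon %s\ninactive_file %s\n' $((100 * 1024 * 1024)) $((350 * 1024 * 1024)) > "$dir/memory.stat"
ws=$(working_set_mib "$dir")
echo "working set: ${ws} MiB"   # → working set: 150 MiB
rm -rf "$dir"
```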
## Step 3: Why Is Memory High?

### Pattern A: Memory leak (linear growth)

RSS grows steadily over time, never plateaus. Restarting helps temporarily, but usage climbs again.

```shell
# Track RSS over time
while true; do
  rss=$(ps -o rss= -p "$(pgrep myapp)")
  echo "$(date +%H:%M:%S) RSS: ${rss} kB"
  sleep 60
done | tee rss-log.txt

# After a few hours, plot or grep for the trend
# If RSS is climbing linearly: memory leak
```
Gotcha: The instinct when you see OOMKilled is to raise the limit. If it's a leak, this just delays the kill: 512Mi → crashes in 4 hours. 1Gi → crashes in 8 hours. 2Gi → crashes in 16 hours. The leak is linear; more limit just buys time. Find and fix the leak, don't play whack-a-mole with limits.
War Story: A Fluentd log shipping pod had a plugin with a known memory leak — it held references to parsed buffers after parse failures. 5% of incoming logs were malformed JSON. 800 logs/minute × 5% × ~3KB/leak = 120KB/minute. Imperceptible hour to hour, but at that rate each pod crossed 500MB in about three days. No memory limit was set (BestEffort QoS), so the pods kept growing until they triggered node-level memory pressure, evicting other pods. Fix: update the plugin, set a 600Mi memory limit, add growth-rate alerting.
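The leak-rate arithmetic from the war story, worked as shell math — useful as a template for estimating how long any linear leak takes to hit a threshold:

```shell
logs_per_min=800
malformed_pct=5
leak_kb=3
# 800 x 5% = 40 failed parses/min, x 3KB leaked each
rate_kb_min=$(( logs_per_min * malformed_pct / 100 * leak_kb ))
echo "leak rate: ${rate_kb_min} kB/min"          # → leak rate: 120 kB/min

minutes_to_512mb=$(( 512 * 1024 / rate_kb_min ))
days=$(( minutes_to_512mb / 1440 ))              # 1440 minutes per day
echo "time to 512MB: ~${days} days"              # → time to 512MB: ~3 days
```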
### Pattern B: Limit too tight (transient spikes)

RSS spikes during load or garbage collection, then drops. The limit is set to the steady-state value with no headroom for spikes.

```shell
# Check current vs limit
kubectl top pod myapp
# → NAME    CPU   MEMORY
# → myapp   50m   480Mi    ← limit is 512Mi, only 32Mi headroom

# During a GC cycle or traffic spike, RSS jumps 30%
# 480Mi × 1.3 = 624Mi > 512Mi → OOMKilled
```
The fix: set limits above peak, not steady-state.

```yaml
resources:
  requests:
    memory: "400Mi"   # 95th percentile steady-state
  limits:
    memory: "768Mi"   # 1.5-2x observed peak (including GC spikes)
```
Remember: Requests = scheduling (how much to reserve). Limits = enforcement (OOM kill threshold). Setting `requests == limits` gives Guaranteed QoS (protected from eviction) but leaves zero headroom for spikes.
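The sizing rule above is simple enough to script. A hypothetical helper (the function name and the 1.5x multiplier are illustrative choices; take the peak from `kubectl top` observations over a full traffic cycle):

```shell
# Derive request/limit suggestions from observed usage, in Mi.
suggest_limits() {
  p95=$1   # 95th percentile steady-state usage
  peak=$2  # highest observed usage, including GC/traffic spikes
  echo "requests.memory: ${p95}Mi"
  echo "limits.memory: $(( peak * 3 / 2 ))Mi"   # 1.5x observed peak
}

suggest_limits 400 512
# → requests.memory: 400Mi
# → limits.memory: 768Mi
```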
### Pattern C: JVM off-heap memory

The JVM's `-Xmx` controls heap size. But the JVM uses much more than heap:

```
Total JVM memory = Heap (-Xmx)
                 + Metaspace (class metadata, 120-150MB typical)
                 + Thread stacks (1MB per thread × N threads)
                 + Code cache (JIT compiled code, 80-240MB)
                 + Direct buffers (NIO, unbounded by default!)
                 + GC overhead
                 + Native memory (JNI, mapped files)
```
A common disaster: `-Xmx512m` in a container with a 512Mi limit. The heap fits, but metaspace + threads + code cache push total memory over the limit → OOMKilled on startup.
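A back-of-envelope footprint check makes the disaster obvious. The region sizes below are assumptions for illustration (typical values from the breakdown above, with a hypothetical 512-thread service):

```shell
heap=512                 # -Xmx512m
metaspace=150            # typical class metadata
threads=$(( 512 * 1 ))   # hypothetical 512 threads x 1MiB default stacks
code_cache=240           # JIT code cache upper range
direct=128               # NIO direct buffers (unbounded by default!)

total=$(( heap + metaspace + threads + code_cache + direct ))
echo "estimated JVM total: ${total} MiB vs a 512 MiB container limit"
# → estimated JVM total: 1542 MiB vs a 512 MiB container limit
```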
```shell
# See JVM memory breakdown (requires NMT enabled)
jcmd $(pgrep java) VM.native_memory summary
# → Java Heap (reserved=524288KB, committed=524288KB)
# → Class    (reserved=180224KB, committed=32768KB)
# → Thread   (reserved=524288KB, committed=524288KB)   ← 512 threads!
# → Code     (reserved=253952KB, committed=40960KB)
# → Internal (reserved=8192KB, committed=8192KB)

# Quick thread count
ps -Lf -p $(pgrep java) | wc -l
```
The fix: use `-XX:MaxRAMPercentage=75.0` instead of `-Xmx` (it respects the container's cgroup limit), and explicitly cap every region:

```shell
# -Xss512k shrinks thread stacks (down from the 1MB default);
# MaxDirectMemorySize caps NIO direct buffers, unbounded otherwise.
java \
  -XX:MaxRAMPercentage=75.0 \
  -Xss512k \
  -XX:MaxMetaspaceSize=150M \
  -XX:ReservedCodeCacheSize=64M \
  -XX:MaxDirectMemorySize=128M \
  -jar app.jar
```
### Pattern D: Page cache filling the cgroup

File I/O inside a container causes the kernel to cache file pages in memory. This page cache counts toward the container's cgroup memory limit. If the container does heavy file I/O (log writing, temp file processing), the cache can push it over the limit — even though the process's actual heap is small.

```shell
# Check memory breakdown inside the cgroup
cat /sys/fs/cgroup/memory.stat
# → anon 104857600   ← heap: 100MB
# → file 419430400   ← page cache: 400MB (!)
# → Total is 500MB, but only 100MB is "real" usage
```

The kernel should reclaim page cache under pressure, but sometimes the reclaim isn't fast enough and the OOM killer fires. Fix: use volumes for heavy file I/O (volumes are outside the container's cgroup), or increase the memory limit with the understanding that cache is reclaimable.
## Step 4: The Swap Question

Swap changes the OOM behavior dramatically:
| Scenario | What happens |
|---|---|
| No swap (Kubernetes default) | OOM kill is instant. No warning, no degradation — just death. |
| With swap | System degrades gradually. Pages move to disk, latency spikes, then eventually OOM if swap fills too. |
Kubernetes runs without swap by default (node swap support was added as an alpha feature in 1.22 and graduated to beta in 1.28). This means every memory spike is a potential instant kill — there's no gradual degradation to warn you.
```shell
# Check if swap exists
swapon --show
# → (empty) = no swap

# Check swappiness (how aggressively the kernel uses swap)
cat /proc/sys/vm/swappiness
# → 60 (default, somewhat aggressive)
# → 10 (better for databases — prefer dropping cache over swapping)
```
Mental Model: Swap is like an overflow parking lot. Without it, when the main lot is full, the next car gets towed (killed). With it, cars overflow to a slower lot (disk I/O penalty) but nobody gets towed until both lots are full. Kubernetes decided the latency penalty of swap was worse than killing and restarting pods — which is true for most microservices but debatable for stateful workloads.
## Kubernetes QoS and Eviction Order

When a Kubernetes node runs low on memory, it evicts pods in a specific order based on Quality of Service class:

| QoS Class | Condition | oom_score_adj | Evicted |
|---|---|---|---|
| BestEffort | No requests or limits set | 1000 | First |
| Burstable | Requests < limits | 2–999 | Second |
| Guaranteed | Requests == limits | -997 | Last |
```shell
# Check a pod's QoS class
kubectl get pod myapp -o jsonpath='{.status.qosClass}'
# → Burstable

# Check node memory pressure
kubectl describe node mynode | grep -A5 "Conditions"
# → MemoryPressure   True   ...
```
Gotcha: Forgetting sidecar memory. Your app container uses 400Mi, but the Istio sidecar uses 150Mi. The pod total is 550Mi, but you only set limits on the app container. The sidecar's memory is unaccounted, leading to node-level overcommit and random evictions. Check all containers: `kubectl top pod --containers`.
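The `oom_score_adj` values in the QoS table are visible directly in `/proc`, which is handy when debugging on a node without `kubectl`. A sketch (assumes Linux and a procps `ps` that supports `--sort`):

```shell
# Show OOM-kill priority for the three largest memory consumers.
for pid in $(ps -eo pid= --sort=-rss | head -3); do
  adj=$(cat "/proc/$pid/oom_score_adj" 2>/dev/null) || continue
  printf 'pid %-7s oom_score_adj %s\n' "$pid" "$adj"
done
```

A pod container with `oom_score_adj` near 1000 is a BestEffort workload the kernel will kill first; -997 marks a Guaranteed pod.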
## The Complete Decision Tree

```
Exit code 137 / OOMKilled
│
├── Confirm: dmesg | grep "oom\|killed process"
│   └── No OOM in dmesg? → Not OOM. Check app logs for crash.
│
├── Where: Host or container?
│   ├── Host: free -h shows available ≈ 0 → system OOM
│   └── Container: docker stats / kubectl top → cgroup limit hit
│
├── Why: Leak, tight limit, JVM, or cache?
│   ├── RSS grows linearly → Leak. Profile and fix code.
│   ├── RSS spikes then drops → Tight limit. Increase to 1.5-2x peak.
│   ├── JVM: heap fine, total high → Off-heap. Cap metaspace/threads/buffers.
│   └── memory.stat shows high 'file' → Page cache. Use volumes or raise limit.
│
└── Kubernetes-specific
    ├── No limits set? → BestEffort QoS → killed first. Set limits.
    ├── requests == limits? → No headroom for spikes. Add 50% buffer.
    └── Sidecar memory? → Check all containers, not just app.
```
## Flashcard Check
Q1: `free -h` shows 512MB "free" but 11GB "available." Is the system low on memory?
No. 11GB is available (free + reclaimable cache). Low "free" is normal and healthy. Only worry when "available" approaches zero.
Q2: Exit code 137 — what happened?
128 + 9 = SIGKILL. Almost always the OOM killer. Confirm with `dmesg | grep oom`. It could also be a manual `kill -9` — dmesg disambiguates.
Q3: `docker stats` shows 510Mi / 512Mi for a container. Is it about to OOM?
Maybe not. `docker stats` includes page cache, which is reclaimable. Check `memory.stat` for the `anon` vs `file` breakdown. The OOM killer looks at the working set (`container_memory_working_set_bytes`), not total usage.
Q4: A JVM has `-Xmx512m` in a 512Mi container. Will it work?
No. The JVM uses heap + metaspace + thread stacks + code cache + direct buffers. The total is often 1.5-2x the heap. Use `-XX:MaxRAMPercentage=75.0` instead.
Q5: RSS grows from 200Mi to 800Mi over 8 hours, then OOM. You raise the limit to 2Gi. What happens?
It OOMs again, just later. The leak adds roughly 75Mi/hour, so a 2Gi limit buys about a day instead of 8 hours. More limit just delays death; find and fix the leak.
Q6: What are the three Kubernetes QoS classes?
Guaranteed (requests == limits, killed last), Burstable (requests < limits, middle), BestEffort (no limits, killed first).
## Exercises

### Exercise 1: Read the OOM evidence (investigation)

On any Linux system:

```shell
# Check for past OOM kills
sudo dmesg | grep -i "oom\|killed process"

# Check current memory state
free -h
cat /proc/meminfo | head -10

# Check memory pressure (kernel 5.2+)
cat /proc/pressure/memory 2>/dev/null
```

What does each metric tell you? Is the system healthy?
### Exercise 2: Trigger a cgroup OOM (hands-on, requires cgroups v2)

```shell
# Create a memory-limited cgroup
sudo mkdir /sys/fs/cgroup/test-oom
echo "50M" | sudo tee /sys/fs/cgroup/test-oom/memory.max

# Run a process that tries to use more than 50M
sudo sh -c 'echo $$ > /sys/fs/cgroup/test-oom/cgroup.procs && python3 -c "
a = []
while True:
    a.append(b\"x\" * 1024 * 1024)  # 1MB per iteration
"'

# In another terminal, watch:
sudo dmesg | tail -5
# → Memory cgroup out of memory: Killed process ...

# Clean up
sudo rmdir /sys/fs/cgroup/test-oom
```
### Exercise 3: The decision (think, don't code)

For each scenario, identify the root cause and fix:

1. A Python web app OOMs every 3 days. RSS grows linearly.
2. A Java service OOMs during daily traffic peaks but is fine overnight.
3. A container doing file transcoding OOMs despite the process using only 200MB of heap.
4. Random pods on a node get OOMKilled even though they're well within their limits.
5. A pod with `-Xmx4g` gets OOMKilled immediately on startup in a 4Gi container.
Answers

1. **Memory leak.** Profile with `tracemalloc` or `objgraph`. Don't raise the limit — it just delays the crash from 3 days to 6.
2. **Tight limit.** The limit matches steady-state but not peak. Check `kubectl top` during peak hours. Set the limit to 1.5-2x the observed peak.
3. **Page cache.** File I/O fills the cgroup's page cache, which counts toward the limit. Use a volume for the transcoding work directory, or increase the limit with the understanding that most of it is reclaimable cache.
4. **Node overcommit.** Total pod requests exceed node capacity, or BestEffort pods have no limits. A single pod spiking causes node memory pressure, and the kubelet evicts BestEffort/Burstable pods. Set limits on all pods; check `kubectl describe node` for allocated resources.
5. **JVM off-heap.** `-Xmx4g` is just the heap. Metaspace (150MB), thread stacks (500 threads × 1MB = 500MB), code cache (240MB), direct buffers — the total is ~5.5GB, exceeding the 4Gi container limit. Use `-XX:MaxRAMPercentage=75.0` and cap every region.

## Cheat Sheet
### Quick OOM Diagnosis

| Step | Command | What it tells you |
|---|---|---|
| Confirm OOM | `dmesg \| grep -i "oom\|killed process"` | Kernel OOM evidence |
| System memory | `free -h` | Check "available", not "free" |
| Process memory | `ps aux --sort=-%mem \| head` | Top memory consumers |
| Container memory | `docker stats` / `kubectl top pod` | Usage vs limit |
| Memory breakdown | `cat /sys/fs/cgroup/memory.stat` | `anon` vs `file` vs kernel |
| JVM memory | `jcmd PID VM.native_memory summary` | Heap, metaspace, threads, etc. |
| Swap activity | `vmstat 1 5` (si/so columns) | Is the system thrashing? |
| K8s QoS | `kubectl get pod -o jsonpath='{.status.qosClass}'` | Eviction priority |
### Kubernetes Resource Strategy

```yaml
resources:
  requests:
    memory: "400Mi"   # 95th percentile steady-state (scheduling)
  limits:
    memory: "768Mi"   # 1.5-2x observed peak (OOM threshold)
```
### JVM in Containers

```shell
java -XX:MaxRAMPercentage=75.0 \
  -Xss512k \
  -XX:MaxMetaspaceSize=150M \
  -XX:ReservedCodeCacheSize=64M \
  -XX:MaxDirectMemorySize=128M \
  -jar app.jar
```
## Takeaways

- **"Free" memory being low is normal.** Linux uses RAM for caching. Check "available," not "free." Unused RAM is wasted RAM.
- **Exit code 137 = SIGKILL.** Confirm it's OOM with `dmesg`, not just the exit code.
- **Container OOM ≠ host OOM.** A container can OOM even with plenty of host RAM — it hit its cgroup limit. Check the limit, not just the host.
- **Don't raise limits for leaks.** If RSS grows linearly, more memory just delays death. Profile and fix the code.
- **JVM total ≠ heap.** Metaspace, threads, code cache, and direct buffers can double the heap size. Use `-XX:MaxRAMPercentage` and cap everything explicitly.
- **Set limits on everything.** BestEffort pods (no limits) are killed first. A single unleashed pod can trigger node-wide evictions.
## Related Lessons
- The Hanging Deploy — processes, signals, and what happens when systemd kills things
- The Disk That Filled Up — the other resource exhaustion emergency