Pattern: OOM Without Swap Buffer

ID: FP-004
Family: Resource Exhaustion
Frequency: Common
Blast Radius: Single Pod to Multi-Service
Detection Difficulty: Obvious

The Shape

When physical memory is exhausted and swap is disabled (or absent), the OOM killer fires immediately and terminates processes without warning. With swap, the system degrades gradually (latency spikes as it pages), giving operators time to respond. Without swap, the transition from "running fine" to "process killed" is instantaneous and leaves no evidence of what caused the memory pressure.
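The kernel's choice of victim can be inspected before anything fires: every process exposes a badness score under /proc, which the OOM killer uses to pick what dies. A minimal sketch, reading the score for the current shell:

```shell
#!/bin/sh
# The OOM killer kills the process with the highest badness score.
# /proc/<pid>/oom_score shows the current score; /proc/<pid>/oom_score_adj
# (-1000..1000) biases it, e.g. -1000 makes a process effectively unkillable.
score=$(cat /proc/self/oom_score)
adj=$(cat /proc/self/oom_score_adj)
echo "oom_score=${score} oom_score_adj=${adj}"
```

Raising oom_score_adj on expendable workers (and lowering it on critical daemons) shapes which process the killer picks under pressure.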

How You'll See It

In Kubernetes

Pod status is OOMKilled. kubectl describe pod shows:

Last State: Terminated
  Reason: OOMKilled
  Exit Code: 137

The pod restarts immediately (if restartPolicy allows), potentially entering CrashLoopBackOff. A common contributing factor: the memory limit was set equal to the request (see also FP-005), leaving no headroom for traffic spikes. Kubernetes nodes intentionally run without swap by default.
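Against a live cluster, the terminated-state fields come from the pod's containerStatuses; the sketch below shows the jsonpath query as a comment (pod name my-pod is hypothetical) and, so it runs anywhere, greps the same fields out of a captured describe snippet:

```shell
#!/bin/sh
# On a real cluster (pod name is a placeholder):
#   kubectl get pod my-pod \
#     -o jsonpath='{.status.containerStatuses[0].lastState.terminated.reason}'
#
# Simulated `kubectl describe pod` output, matching the snippet above:
describe_output='Last State: Terminated
  Reason: OOMKilled
  Exit Code: 137'

# Pull out the two fields that identify an OOM kill.
printf '%s\n' "$describe_output" | grep -E 'Reason|Exit Code'
```

If Reason is OOMKilled and Exit Code is 137, the container exceeded its memory limit; no application log will explain why, because the kill came from the kernel.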

In Linux/Infrastructure

/var/log/kern.log or dmesg shows:

Out of memory: Kill process 12345 (java) score 500 or sacrifice child
Killed process 12345 (java) total-vm:16777216kB, anon-rss:14680064kB

A JVM with -Xmx set close to physical RAM: heap plus JVM overhead (metaspace, code cache, native memory) exceeded available physical memory. Without swap, the kill is instant.
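The kernel log line records the victim's resident set size at kill time. A small awk pass converts anon-rss from kB to GiB; the sketch uses the sample line above so it is self-contained, but on a real host you would pipe in dmesg or /var/log/kern.log:

```shell
#!/bin/sh
# Sample OOM-kill line from the kernel log (see above); on a real host:
#   dmesg | grep -i 'killed process' | awk ...
line='Killed process 12345 (java) total-vm:16777216kB, anon-rss:14680064kB'

printf '%s\n' "$line" | awk '{
  for (i = 1; i <= NF; i++)
    if ($i ~ /^anon-rss:/) {
      gsub(/[^0-9]/, "", $i)             # strip "anon-rss:" prefix and "kB"
      printf "killed with RSS ~%.1f GiB\n", $i / (1024 * 1024)
    }
}'
```

For the sample line this reports roughly 14 GiB resident, which is the number to compare against the host's physical RAM (or the pod's memory limit).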

In CI/CD

Parallel test runners, each with a JVM or Python process, collectively exceed node RAM. One runner is killed mid-test. CI reports a flaky test rather than an OOM event because the process exit code (137 = SIGKILL) is not always surfaced in test framework output.
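A wrapper in the CI script can surface the kill explicitly instead of letting the framework swallow it. A minimal sketch — the subshell `(exit 137)` stands in for a test runner killed by SIGKILL, and ./run-tests.sh is a placeholder:

```shell
#!/bin/sh
# Run the (placeholder) test command and inspect its exit status.
# 137 = 128 + 9: the process died from SIGKILL, the OOM killer's signal.
( exit 137 )   # stand-in for: ./run-tests.sh
status=$?

if [ "$status" -eq 137 ]; then
  echo "test runner exited 137 (SIGKILL) - suspect OOM kill, check dmesg on the runner"
fi
echo "exit status: $status"
```

Logging this in the CI job turns "flaky test" reports into an actionable OOM signal.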

The Tell

Exit code 137 (128 + 9, where 9 is SIGKILL's signal number) or kernel log "Out of memory: Kill process...". No gradual degradation: the process simply disappears. With swap disabled, the kill is instant; with swap enabled, latency spikes precede it.
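The 137 decodes mechanically: shells report 128 plus the signal number for a signal death, and `kill -l` maps the number back to a name:

```shell
#!/bin/sh
code=137
sig=$((code - 128))   # 9
kill -l "$sig"        # prints the signal name for number 9: KILL
```

The same arithmetic identifies other signal deaths, e.g. 139 is 128 + 11 (SIGSEGV).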

Common Misdiagnosis

| Looks Like        | But Actually         | How to Tell the Difference                                                  |
| ----------------- | -------------------- | --------------------------------------------------------------------------- |
| Flaky test        | OOM kill             | Exit code 137; dmesg shows OOM killer activity                               |
| Application crash | OOM kill             | No application-level exception; kernel log has the record                    |
| Memory leak       | Memory limit too low | RSS grows to the limit, then the process is killed; the limit may be correct for peak load |

The Fix (Generic)

  1. Immediate: Identify the killed process and its peak RSS; increase memory limit or reduce concurrency.
  2. Short-term: Add swap as a safety valve on non-K8s hosts; in Kubernetes, set limits.memory at 1.5x–2x of requests.memory to absorb spikes.
  3. Long-term: Profile application memory under peak load; use memory profilers (pprof, jmap, tracemalloc); set alerts on memory pressure before the OOM fires.
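In Kubernetes terms, step 2 above looks like the following resources stanza (values illustrative; here the limit is 2x the request, leaving headroom for spikes):

```yaml
# Illustrative only: request what the app needs at steady state,
# and set the limit above it so transient spikes don't trigger an OOM kill.
resources:
  requests:
    memory: "256Mi"
  limits:
    memory: "512Mi"
```

Note the trade-off: a limit above the request means the pod is Burstable rather than Guaranteed QoS, so it is evicted earlier under node-level memory pressure.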

Real-World Examples

  • Example 1: JVM application with -Xmx12g on a 16GB host. JVM overhead (metaspace, thread stacks, code cache) consumed 2–3GB beyond the heap, totaling 14–15GB. Occasional GC pressure pushed usage above 16GB; the OOM killer fired, killing the JVM instantly.
  • Example 2: Kubernetes pod with limits.memory: 256Mi and requests.memory: 256Mi. Normal load: 230Mi. One large payload request caused GC pressure to 260Mi; pod OOMKilled mid-request.

War Story

A team was dealing with "random" Java crashes — the process would vanish, no stack trace, no log. They were convinced it was a JVM bug or a stray signal from the OS. dmesg | grep -i kill instantly showed "Out of memory: Kill process 23441 (java)". The JVM heap was set to 12GB on a 16GB machine, and the GC itself needed extra memory during a full collection. Adding 8GB of swap gave them visibility: the next time it happened, they saw latency spike for 30 seconds before recovery, and they could fix the memory leak rather than just watch the crash.

Cross-References