# Pattern: OOM Without Swap Buffer
| ID | Family | Frequency | Blast Radius | Detection Difficulty |
|---|---|---|---|---|
| FP-004 | Resource Exhaustion | Common | Single Pod to Multi-Service | Obvious |
## The Shape
When physical memory is exhausted and swap is disabled (or absent), the OOM killer fires immediately and terminates processes without warning. With swap, the system degrades gradually (latency spikes as it pages), giving operators time to respond. Without swap, the transition from "running fine" to "process killed" is instantaneous and leaves no evidence of what caused the memory pressure.
## How You'll See It
### In Kubernetes
Pod status is `OOMKilled`. `kubectl describe pod` reports the termination reason and exit code in the container's `Last State`.
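The key fields in that output look like the excerpt below (exact layout varies by kubectl version):

```
Last State:     Terminated
  Reason:       OOMKilled
  Exit Code:    137
```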
### In Linux/Infrastructure
`/var/log/kern.log` or `dmesg` shows:

```
Out of memory: Kill process 12345 (java) score 500 or sacrifice child
Killed process 12345 (java) total-vm:16777216kB, anon-rss:14680064kB
```
The cause: `-Xmx` was set close to physical RAM, and JVM overhead (metaspace, code cache, native memory) pushed the total beyond available physical memory. Without swap, the kill is instant.
### In CI/CD
Parallel test runners, each with a JVM or Python process, collectively exceed node RAM. One runner is killed mid-test. CI reports a flaky test rather than an OOM event because the process exit code (137 = SIGKILL) is not always surfaced in test framework output.
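One way to surface that 137 in CI is to classify the runner's exit code before reporting a failure. A minimal Python sketch (the helper name and the demo wrapper are illustrative, not part of any CI framework):

```python
import signal
import subprocess
import sys

OOM_EXIT_CODE = 137  # 128 + SIGKILL (9): how shells report a SIGKILL'd process


def classify_exit(returncode: int) -> str:
    """Map a test-runner exit code to a likely cause.

    subprocess reports signal deaths as negative numbers (-9 for SIGKILL),
    while shells report them as 128 + signum (137 for SIGKILL).
    """
    if returncode == OOM_EXIT_CODE or returncode == -signal.SIGKILL:
        return "SIGKILL (possible OOM kill: check dmesg, not the test logs)"
    if returncode == 0:
        return "success"
    return f"failure (exit {returncode})"


# Demo: a child that SIGKILLs itself looks exactly like an OOM-killed runner.
child = subprocess.run(
    [sys.executable, "-c", "import os, signal; os.kill(os.getpid(), signal.SIGKILL)"]
)
print(classify_exit(child.returncode))  # SIGKILL (possible OOM kill: ...)
```

Printing this classification next to the test report turns "flaky test" into "this runner was killed by the kernel", which is a very different investigation.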
## The Tell
Exit code 137 (SIGKILL) or kernel log "Out of memory: Kill process...". No gradual degradation — the process simply disappears. With swap disabled: the kill is instant; with swap enabled: latency spikes precede it.
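A quick triage sketch: pull OOM-killer records out of a kernel log programmatically. The regex assumes the older message wording quoted in the kernel-log example above; newer kernels phrase it slightly differently, so treat this as a starting point rather than a complete parser:

```python
import re

# Matches "Out of memory: Kill process <pid> (<name>) ..." (older kernel
# wording; adjust the pattern for your kernel version).
OOM_RE = re.compile(r"Out of memory: Kill process (?P<pid>\d+) \((?P<name>[^)]+)\)")


def find_oom_kills(kernel_log: str) -> list[tuple[int, str]]:
    """Return (pid, process name) for each OOM-killer victim in a kernel log."""
    return [(int(m.group("pid")), m.group("name")) for m in OOM_RE.finditer(kernel_log)]


sample = "Out of memory: Kill process 12345 (java) score 500 or sacrifice child"
print(find_oom_kills(sample))  # [(12345, 'java')]
```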
## Common Misdiagnosis
| Looks Like | But Actually | How to Tell the Difference |
|---|---|---|
| Flaky test | OOM kill | Exit code 137; dmesg shows OOM killer activity |
| Application crash | OOM kill | No application-level exception; kernel log has the record |
| Memory leak | Memory limit too low | RSS grows to the limit, then the process is killed; the usage may be legitimate for peak load and the limit simply set too low |
## The Fix (Generic)
- Immediate: Identify the killed process and its peak RSS; increase memory limit or reduce concurrency.
- Short-term: Add swap as a safety valve on non-K8s hosts; in Kubernetes, set `limits.memory` at 1.5x–2x of `requests.memory` to absorb spikes.
- Long-term: Profile application memory under peak load; use memory profilers (pprof, jmap, tracemalloc); set alerts on memory pressure before the OOM fires.
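The short-term Kubernetes guidance above can be expressed as a pod resource stanza. The values here are illustrative (a hypothetical service whose steady-state usage fits in 256Mi), not a sizing recommendation:

```yaml
resources:
  requests:
    memory: "256Mi"   # what the scheduler reserves for the container
  limits:
    memory: "512Mi"   # 2x requests: headroom for spikes before the cgroup OOM kill
```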
## Real-World Examples
- Example 1: JVM application with `-Xmx=12g` on a 16GB host. JVM overhead (metaspace, thread stacks, code cache) consumed 2–3GB beyond the heap, for a total of 14–15GB. Occasional GC pressure pushed usage above 16GB; the OOM killer fired, killing the JVM instantly.
- Example 2: Kubernetes pod with `limits.memory: 256Mi` and `requests.memory: 256Mi`. Normal load sat around 230Mi; one large payload pushed usage to 260Mi and the pod was OOMKilled mid-request.
## War Story
A team was dealing with "random" Java crashes — the process would vanish, no stack trace, no log. They were convinced it was a JVM bug or a signal from the OS.
`dmesg | grep -i kill` instantly showed "Out of memory: Kill process 23441 (java)". The JVM heap was set to 12GB on a 16GB machine, and a full GC itself needed extra memory while scanning. Adding 8GB of swap gave the team visibility: the next time it happened, they saw latency spike for 30 seconds before recovery, and could fix the memory leak rather than just watching the crash.
## Cross-References
- Topic Packs: linux-memory-management, k8s-ops
- Case Studies: linux_ops/oom-killer-events/, cross-domain/pod-oomkilled-sidecar-helm/
- Footguns: linux-memory-management/footguns.md
- Related Patterns: FP-005 (cgroup soft/hard confusion — K8s memory limits are cgroups), FP-035 (tight memory limit — the configuration that causes this)