Solution
Triage
- Read OOM killer messages from the kernel log.
- Check current memory state.
- Check the Java process memory footprint.
- Check if cgroup limits are in play.
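The four triage steps above could be run with standard Linux tools roughly as follows. This is a minimal sketch, not necessarily the commands originally used: the function names are hypothetical, it assumes a procps `ps`, a util-linux `dmesg`, and a process named `java`, and reading the kernel log may need root.

```shell
# Triage helpers (sketch). Function names are illustrative, not a standard tool.

# Keep only kernel log lines that relate to OOM killer activity (reads stdin).
oom_filter() {
  grep -iE 'out of memory|oom-kill|killed process'
}

# 1. OOM killer messages. dmesg may need root when kernel.dmesg_restrict=1.
oom_events() {
  dmesg -T | oom_filter
}

# 2. Current memory state in human-readable units.
mem_state() {
  free -h
}

# 3. Resident (RSS) and virtual (VSZ) size in KiB of every process named java.
java_footprint() {
  ps -C java -o pid,rss,vsz,args --sort=-rss
}

# 4. Memory limit of the cgroup that a PID belongs to (cgroup v2, then v1).
cgroup_limit() {
  local pid="$1" path
  # cgroup v2 entries look like "0::/system.slice/app.service"
  path="$(awk -F: '$1 == "0" { print $3 }' "/proc/${pid}/cgroup")"
  if [ -n "$path" ] && [ -r "/sys/fs/cgroup${path}/memory.max" ]; then
    cat "/sys/fs/cgroup${path}/memory.max"   # "max" means no limit
    return
  fi
  # cgroup v1 keeps the memory controller in its own hierarchy
  path="$(awk -F: '$2 == "memory" { print $3 }' "/proc/${pid}/cgroup")"
  if [ -r "/sys/fs/cgroup/memory${path}/memory.limit_in_bytes" ]; then
    cat "/sys/fs/cgroup/memory${path}/memory.limit_in_bytes"
  else
    echo "no cgroup memory limit file found for PID ${pid}" >&2
    return 1
  fi
}
```

A typical pass would be `oom_events; mem_state; java_footprint; cgroup_limit "$(pgrep -o java)"`, where `pgrep -o` picks the oldest matching process.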
Root Cause
The JVM is configured with -Xmx12g on a host with 16 GB total RAM. The heap alone can consume 12 GB, but the JVM also uses memory outside the heap, and the host has to fit all of the following:
- Heap: up to 12 GB
- Metaspace: ~256 MB
- Thread stacks: ~1 GB (1 MB per thread x ~1000 threads)
- Code cache: ~240 MB
- Native memory (NIO buffers, JNI): variable, can be several hundred MB
- OS kernel and other processes: ~1-2 GB
Total JVM consumption can reach 14-15 GB, leaving 1-2 GB at most for the OS and other processes, which is roughly what they already need. Any further growth exhausts the host. When the OS runs out of memory with no swap available, the OOM killer selects the process with the highest oom_score (typically the Java process, due to its size) and kills it with SIGKILL.
The steady memory climb suggests either a native memory leak or growing metaspace/code cache.
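The budget above can be checked with simple arithmetic. The sketch below sums the listed components in MiB. The function name is hypothetical, and the 512 MiB native memory figure is an assumption standing in for "several hundred MB".

```shell
# Sum the JVM memory components (all in MiB).
jvm_total_mib() {
  local heap="$1" metaspace="$2" stacks="$3" code_cache="$4" native="$5"
  echo $(( heap + metaspace + stacks + code_cache + native ))
}

host_mib=16384                                  # 16 GB host
# heap 12 GB, metaspace 256 MB, ~1000 threads x 1 MB, code cache 240 MB,
# native memory 512 MB (assumed value)
total="$(jvm_total_mib 12288 256 1000 240 512)"
echo "JVM total: ${total} MiB, left for OS: $(( host_mib - total )) MiB"
```

With these inputs the script reports a JVM total of 14296 MiB and 2088 MiB left over, which is already at the edge of the 1-2 GB the OS and other processes need. Any native memory growth beyond the assumed figure closes that gap.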
Fix
- Immediate: Reduce the JVM heap size to leave headroom. This leaves ~4 GB for JVM overhead + OS + other processes.
- Enable native memory tracking to find the leak source, then check its output periodically.
- Set a cgroup memory limit via the systemd unit to make failures more predictable. This triggers cgroup OOM (predictable, logged) before the system OOM killer (indiscriminate).
- Set up monitoring alerts for memory usage crossing 80% and 90% thresholds.
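The four fixes above might look like the following in practice. This is a sketch under stated assumptions: the 8 GB heap (4 GB below the current setting), the jar path `/opt/myapp/app.jar`, the unit name `myapp.service`, and the 12G limit are all illustrative values, and the function names are hypothetical. `MemoryMax=` applies to cgroup v2 hosts; `systemctl set-property` needs root.

```shell
# 1. Start the JVM with a smaller heap and native memory tracking (NMT) enabled.
#    Heap size and jar path are assumed values.
start_app() {
  java -Xmx8g -XX:NativeMemoryTracking=summary -jar /opt/myapp/app.jar
}

# 2. Record an NMT baseline once, then diff against it periodically to see
#    which JVM memory category is growing. Argument: PID of the JVM.
nmt_baseline() {
  jcmd "$1" VM.native_memory baseline
}
nmt_diff() {
  jcmd "$1" VM.native_memory summary.diff
}

# 3. Cap the service's cgroup so it hits a cgroup OOM before the host does.
#    set-property persists the setting as a drop-in for the unit.
set_cgroup_limit() {
  systemctl set-property "${1:-myapp.service}" MemoryMax="${2:-12G}"
}

# 4. Percentage of RAM in use, from MemTotal and MemAvailable in /proc/meminfo.
mem_used_pct() {
  awk '/^MemTotal:/ { t = $2 } /^MemAvailable:/ { a = $2 }
       END { if (t > 0) printf "%d\n", (t - a) * 100 / t }' "${1:-/proc/meminfo}"
}

# Return 0 below 80%, 1 at or above 80% (warning), 2 at or above 90% (critical).
mem_alert_level() {
  local pct
  pct="$(mem_used_pct "$@")"
  if [ "$pct" -ge 90 ]; then
    return 2
  elif [ "$pct" -ge 80 ]; then
    return 1
  fi
  return 0
}
```

The alert check is only a stand-in for a real monitoring system. It shows where the 80% and 90% thresholds would be evaluated, for example from a cron job that pages on a non-zero exit code.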
Rollback / Safety
- Reducing -Xmx may cause the application to run out of heap and throw OutOfMemoryError. Monitor GC logs after the change.
- If the application genuinely needs 12 GB of heap, increase the server RAM to 32 GB.
- Do not enable swap on servers running latency-sensitive Java applications; GC pauses will become extreme.
Common Traps
- Assuming -Xmx is the total JVM memory. It is only the heap. Real JVM memory can be 20-40% higher.
- Blaming the OOM killer instead of the memory configuration. The OOM killer is a symptom, not the cause.
- Enabling swap as a fix. Swap masks the problem and causes severe latency spikes when the garbage collector touches heap pages that have been swapped out.
- Not checking oom_score_adj. If another critical process (like sshd) gets killed instead of the Java process, the server becomes unreachable.
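- Ignoring native memory leaks. Direct ByteBuffers and JNI allocations are not limited by -Xmx. Direct buffers are capped separately by -XX:MaxDirectMemorySize (which defaults to roughly the maximum heap size), and JNI allocations have no JVM-enforced cap, so they can grow without bound.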
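The oom_score_adj trap above can be checked directly from /proc. The sketch below lists the processes the OOM killer would currently pick first and shows one way to exempt a critical process. The function names are hypothetical, and writing to oom_score_adj needs root.

```shell
# List the top N processes by oom_score (default 10).
# Columns: oom_score, oom_score_adj, PID, command name.
oom_ranking() {
  local d
  for d in /proc/[0-9]*; do
    [ -r "$d/oom_score" ] || continue
    # Processes can exit mid-loop, so read errors are silenced.
    printf '%s %s %s %s\n' \
      "$(cat "$d/oom_score" 2>/dev/null)" \
      "$(cat "$d/oom_score_adj" 2>/dev/null)" \
      "${d#/proc/}" \
      "$(cat "$d/comm" 2>/dev/null)"
  done | sort -rn | head -n "${1:-10}"
}

# Exempt a PID from the OOM killer. -1000 disables OOM killing for it.
protect_pid() {
  echo -1000 > "/proc/$1/oom_score_adj"
}
```

If the Java process is not at the top of `oom_ranking`, whatever sits above it is the more likely victim. For a service managed by systemd, the same exemption can be set persistently with `OOMScoreAdjust=` in the unit rather than by writing to /proc after each start.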