Solution

Triage

Read OOM killer messages from the kernel log:

```
dmesg -T | grep -i "oom\|out of memory\|killed process"
```

Check current memory state:

```
grep -E "MemTotal|MemFree|MemAvailable|SwapTotal|SwapFree" /proc/meminfo
```

Check the Java process memory footprint:

```
# Top memory consumers, largest first
ps aux --sort=-%mem | head -10
# Assumes exactly one process matches "order-processor"
grep -E "VmRSS|VmSize|VmSwap" "/proc/$(pgrep -f order-processor)/status"
```

Check whether cgroup limits are in play (the paths below are for cgroup v1):

```
cat /sys/fs/cgroup/memory/system.slice/order-processor.service/memory.limit_in_bytes
cat /sys/fs/cgroup/memory/system.slice/order-processor.service/memory.usage_in_bytes
```
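On hosts running cgroup v2 (the unified hierarchy), the same information lives in `memory.max` and `memory.current`, and there is no `memory/` path component. The helper below is a minimal sketch that picks the layout by checking for `cgroup.controllers` at the cgroup root. The function name `cgroup_mem` and its optional root argument are hypothetical, added only for illustration.

```shell
# Print the memory limit and current usage for a systemd unit's cgroup,
# handling both cgroup v1 and cgroup v2 layouts.
# Usage: cgroup_mem <unit> [cgroup-root]   (root defaults to /sys/fs/cgroup)
cgroup_mem() {
    unit=$1
    root=${2:-/sys/fs/cgroup}
    if [ -f "$root/cgroup.controllers" ]; then
        # cgroup v2: memory.max / memory.current
        dir="$root/system.slice/$unit"
        limit_file="$dir/memory.max"
        usage_file="$dir/memory.current"
    else
        # cgroup v1: memory.limit_in_bytes / memory.usage_in_bytes
        dir="$root/memory/system.slice/$unit"
        limit_file="$dir/memory.limit_in_bytes"
        usage_file="$dir/memory.usage_in_bytes"
    fi
    echo "limit=$(cat "$limit_file") usage=$(cat "$usage_file")"
}

# Example: cgroup_mem order-processor.service
```

On cgroup v2 an unlimited cgroup reports the literal string `max` rather than a byte count.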

Root Cause

The JVM is configured with -Xmx12g on a host with 16 GB total RAM. The heap alone can consume 12 GB, but total JVM memory also includes several non-heap areas:

- Heap: up to 12 GB
- Metaspace: ~256 MB
- Thread stacks: ~1 GB (1 MB per thread x ~1000 threads)
- Code cache: ~240 MB
- Native memory (NIO buffers, JNI): variable, can be several hundred MB

On top of the JVM, the OS kernel and other processes need ~1-2 GB.

Total JVM consumption can reach 14-15 GB, leaving 1-2 GB at most for the OS and other processes, and less as native memory grows. When the OS runs out of memory with no swap available, the OOM killer selects the process with the highest oom_score (the Java process, due to its size) and kills it with SIGKILL.
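The budget can be worked through with shell arithmetic, using the estimates listed above. This is a sketch only: the 512 MB native figure is an assumed stand-in for "several hundred MB", and real values should come from measurement.

```shell
# Worst-case JVM memory budget in MB, from the estimates in this section.
heap=$((12 * 1024))           # -Xmx12g
metaspace=256
thread_stacks=$((1000 * 1))   # ~1000 threads x 1 MB each
code_cache=240
native=512                    # assumption: placeholder for "several hundred MB"
host_ram=$((16 * 1024))

jvm_total=$((heap + metaspace + thread_stacks + code_cache + native))
headroom=$((host_ram - jvm_total))
echo "JVM total: ${jvm_total} MB, left for OS and other processes: ${headroom} MB"
# prints: JVM total: 14296 MB, left for OS and other processes: 2088 MB
```

With these numbers the remaining ~2 GB is about what the OS and other processes already need, so any further native growth pushes the host into OOM.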

The steady memory climb suggests either a native memory leak or growing metaspace/code cache.

Fix

Immediate: Reduce the JVM heap size to leave headroom:
```
# In the systemd unit or startup script, change:
-Xmx12g  →  -Xmx10g
```
This leaves ~6 GB of the 16 GB host for JVM non-heap overhead, the OS, and other processes.

Enable native memory tracking to find the leak source. This is a JVM startup flag, so it takes effect only after a restart:

```
-XX:NativeMemoryTracking=summary
```

Then periodically check:

```
jcmd $(pgrep -f order-processor) VM.native_memory summary
```
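To trend the total over time, the committed figure can be pulled out of each summary and logged. The sketch below assumes the summary contains a line of the form `Total: reserved=...KB, committed=...KB`; check the output of your JDK version and adjust the pattern if it differs. The function name `nmt_committed_kb` is hypothetical.

```shell
# Extract the committed total (in KB) from NMT summary output read on stdin.
# Assumes a "Total: reserved=<n>KB, committed=<n>KB" line; prints nothing
# if no such line is found.
nmt_committed_kb() {
    sed -n 's/^[[:space:]]*Total: reserved=[0-9]*KB, committed=\([0-9]*\)KB.*/\1/p' | head -n 1
}

# Example (appends a timestamped sample to a log file):
# echo "$(date -Is) $(jcmd "$(pgrep -f order-processor)" VM.native_memory summary | nmt_committed_kb)" >> nmt.log
```

For a per-category view of what grew, `jcmd <pid> VM.native_memory baseline` followed later by `jcmd <pid> VM.native_memory summary.diff` reports the change since the baseline.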

Set a cgroup memory limit via the systemd unit to make failures more predictable:
```
[Service]
MemoryMax=14G
MemoryHigh=13G
```
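This triggers a cgroup OOM kill (predictable, logged, confined to the service) before the system-wide OOM killer (which may pick any process) gets involved. MemoryHigh is the soft threshold at which the service is throttled and its memory reclaimed; MemoryMax is the hard limit. MemoryHigh takes effect only on cgroup v2.

Set up monitoring alerts for memory usage crossing 80% and 90% thresholds.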
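Where no monitoring agent is available yet, the two thresholds can be approximated with a small check against /proc/meminfo. This is a minimal sketch, assuming "usage" means MemTotal minus MemAvailable; the function name `mem_pressure` and the OK/WARN/CRIT labels are hypothetical.

```shell
# Report memory pressure from a meminfo-format file (defaults to /proc/meminfo).
# Prints OK, WARN (>= 80% used) or CRIT (>= 90% used) followed by the percentage.
mem_pressure() {
    awk '
        /^MemTotal:/     { total = $2 }
        /^MemAvailable:/ { avail = $2 }
        END {
            if (total == 0) { print "UNKNOWN"; exit 1 }
            pct = int((total - avail) * 100 / total)
            if (pct >= 90)      level = "CRIT"
            else if (pct >= 80) level = "WARN"
            else                level = "OK"
            print level " " pct "%"
        }
    ' "${1:-/proc/meminfo}"
}

# Example: mem_pressure
```

Run from cron or a systemd timer, the output can feed whatever alerting channel is already in place.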

Rollback / Safety

- Reducing -Xmx may cause the application to run out of heap and throw OutOfMemoryError. Monitor GC logs after the change.
- If the application genuinely needs 12 GB of heap, increase the server RAM to 32 GB.
- Do not enable swap on servers running latency-sensitive Java applications; GC pauses will become extreme.

Common Traps

- Assuming -Xmx is the total JVM memory. It is only the heap. Real JVM memory can be 20-40% higher.
- Blaming the OOM killer instead of the memory configuration. The OOM killer is a symptom, not the cause.
- Enabling swap as a fix. Swap masks the problem and causes severe latency spikes when the garbage collector has to touch heap pages that have been swapped out.
- Not checking oom_score_adj. If another critical process (like sshd) gets killed instead of the Java process, the server becomes unreachable.
- Ignoring native memory leaks. Direct ByteBuffers and JNI allocations are not limited by -Xmx and can grow without bound.
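The oom_score_adj trap above can be checked directly from /proc. The helper below is a small sketch; the function name `oom_info` and its optional proc-root argument are hypothetical.

```shell
# Print the OOM killer score and adjustment for a PID.
# Usage: oom_info <pid> [proc-root]   (proc root defaults to /proc)
oom_info() {
    pid=$1
    proc=${2:-/proc}
    score=$(cat "$proc/$pid/oom_score")
    adj=$(cat "$proc/$pid/oom_score_adj")
    echo "pid=$pid oom_score=$score oom_score_adj=$adj"
}

# Example: compare the Java process with sshd
# oom_info "$(pgrep -f order-processor)"
# oom_info "$(pgrep -o sshd)"
```

A higher oom_score makes a process a more likely victim. oom_score_adj ranges from -1000 to 1000, and for systemd-managed services it can be set with `OOMScoreAdjust=` in the unit file.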