Solution

Triage

  1. Read OOM killer messages from the kernel log:
    dmesg -T | grep -i "oom\|out of memory\|killed process"
    
  2. Check current memory state:
    cat /proc/meminfo | grep -E "MemTotal|MemFree|MemAvailable|SwapTotal|SwapFree"
    
  3. Check the Java process memory footprint:
    ps aux --sort=-%mem | head -10
    grep -E "VmRSS|VmSize|VmSwap" "/proc/$(pgrep -f -n order-processor)/status"  # -n: newest match, in case pgrep finds more than one PID
    
  4. Check if cgroup limits are in play (paths differ between cgroup v1 and v2):
    # cgroup v1:
    cat /sys/fs/cgroup/memory/system.slice/order-processor.service/memory.limit_in_bytes
    cat /sys/fs/cgroup/memory/system.slice/order-processor.service/memory.usage_in_bytes
    # cgroup v2 (most modern distributions):
    cat /sys/fs/cgroup/system.slice/order-processor.service/memory.max
    cat /sys/fs/cgroup/system.slice/order-processor.service/memory.current
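The triage steps above can be wrapped in one script for repeat incidents. A sketch; the service name order-processor comes from this incident, and the dmesg/pgrep checks are guarded so the script still runs when the process is down:

```shell
# One-shot triage summary combining the checks above.

# Percentage of RAM the kernel considers available to new allocations.
mem_available_pct() {
    awk '/MemTotal:/ {t=$2} /MemAvailable:/ {a=$2} END {printf "%d", a * 100 / t}' /proc/meminfo
}

echo "Available memory: $(mem_available_pct)%"

# Recent OOM-killer activity, if any (dmesg may need root).
dmesg -T 2>/dev/null | grep -iE "oom|out of memory|killed process" | tail -5 || true

# Footprint of the suspect process, if it is running.
pid=$(pgrep -f -n order-processor || true)
if [ -n "$pid" ]; then
    grep -E "VmRSS|VmSize|VmSwap" "/proc/$pid/status"
else
    echo "order-processor is not running"
fi
```

Run it from cron or an alert handler so the numbers are captured close to the kill, not hours later.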
    

Root Cause

The JVM is configured with -Xmx12g on a host with 16 GB total RAM. The heap alone can consume 12 GB, but total JVM memory is more than the heap:

  • Heap: up to 12 GB
  • Metaspace: ~256 MB
  • Thread stacks: ~1 GB (1 MB per thread x ~1000 threads)
  • Code cache: ~240 MB
  • Native memory (NIO buffers, JNI): variable, can be several hundred MB

On top of the JVM, the OS kernel and other processes need ~1-2 GB.

Total JVM consumption can reach 14-15 GB, leaving less than 1 GB for the OS and other processes. When the OS runs out of memory with no swap available, the OOM killer selects the process with the highest oom_score (the Java process, due to its size) and kills it with SIGKILL.
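The arithmetic behind that estimate can be checked directly. The per-component numbers below are the estimates from the list above, not measurements, and the native figure is an assumed midpoint of "several hundred MB":

```shell
# Back-of-the-envelope JVM footprint, in MB, using the estimates above.
heap=$((12 * 1024))      # -Xmx12g
metaspace=256
thread_stacks=1024       # ~1000 threads x 1 MB default stack
code_cache=240
native=512               # NIO buffers + JNI: "several hundred MB", assumed

jvm_total=$((heap + metaspace + thread_stacks + code_cache + native))
ram=$((16 * 1024))
left_for_os=$((ram - jvm_total))

echo "JVM total:   ${jvm_total} MB"   # 14320 MB, ~14 GB
echo "Left for OS: ${left_for_os} MB" # 2064 MB; a growing native leak pushes this toward 0
```

With native growth at the high end, the JVM crosses 15 GB and the OS is starved, which is exactly when the OOM killer fires.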

The steady memory climb suggests either a native memory leak or growing metaspace/code cache.
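If the climb is in native memory, NMT can localize it, assuming the JVM was started with -XX:NativeMemoryTracking=summary (the flag used in the Fix section). A guarded sketch; the interval is an arbitrary default:

```shell
# Snapshot native memory now, wait, then diff. Growth between snapshots
# points at the leaking NMT category (Thread, Class, Internal, ...).
# Requires a JDK (for jcmd) and NMT enabled at JVM startup.
nmt_diff() {
    pid=$1
    if ! command -v jcmd >/dev/null 2>&1; then
        echo "jcmd not found (install a JDK)"; return 1
    fi
    jcmd "$pid" VM.native_memory baseline
    sleep "${2:-300}"   # default: 5 minutes between snapshots
    jcmd "$pid" VM.native_memory summary.diff
}

# Usage: nmt_diff "$(pgrep -f -n order-processor)" 600
```

A category that grows on every diff while the heap stays flat is strong evidence of a native leak rather than a heap-sizing problem.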

Fix

  1. Immediate: Reduce the JVM heap size to leave headroom:

    # In the systemd unit or startup script, change:
    -Xmx12g  →  -Xmx10g
    
    This leaves ~6 GB for JVM overhead + OS + other processes.

  2. Enable native memory tracking to find the leak source (a startup flag, so this requires a restart, and NMT adds a small runtime overhead):

    -XX:NativeMemoryTracking=summary
    
    Then periodically check:
    jcmd $(pgrep -f -n order-processor) VM.native_memory summary
    

  3. Set a cgroup memory limit via the systemd unit to make failures more predictable:

    [Service]
    MemoryMax=14G
    MemoryHigh=13G
    
    This triggers cgroup OOM (predictable, logged) before the system OOM killer (indiscriminate).

  4. Set up monitoring alerts for memory usage crossing 80% and 90% thresholds.
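Step 4's thresholds can be evaluated from the same /proc/meminfo fields used in triage. A sketch for an alert hook; "used" is total minus MemAvailable, the kernel's own estimate of what is reclaimable without swapping:

```shell
# Classify current memory pressure against the 80%/90% alert thresholds.
mem_used_pct() {
    awk '/MemTotal:/ {t=$2} /MemAvailable:/ {a=$2} END {printf "%d", (t - a) * 100 / t}' /proc/meminfo
}

classify() {
    if   [ "$1" -ge 90 ]; then echo "CRIT"
    elif [ "$1" -ge 80 ]; then echo "WARN"
    else                       echo "OK"
    fi
}

classify "$(mem_used_pct)"
```

Wire the WARN level to a page-during-business-hours alert and CRIT to an immediate page, so the climb is caught before the OOM killer acts.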

Rollback / Safety

  • Reducing -Xmx may cause the application to run out of heap and throw OutOfMemoryError. Monitor GC logs after the change.
  • If the application genuinely needs 12 GB of heap, increase the server RAM to 32 GB.
  • Do not enable swap on servers running latency-sensitive Java applications; GC pauses will become extreme.

Common Traps

  • Assuming -Xmx is the total JVM memory. It is only the heap. Real JVM memory can be 20-40% higher.
  • Blaming the OOM killer instead of the memory configuration. The OOM killer is a symptom, not the cause.
  • Enabling swap as a fix. Swap masks the problem and causes severe latency spikes when the JVM GCs pages that are swapped out.
  • Not checking oom_score_adj. If another critical process (like sshd) gets killed instead of the Java process, the server becomes unreachable.
  • Ignoring native memory leaks. Direct ByteBuffers and JNI allocations are not tracked by -Xmx and can grow without bound.
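For the oom_score_adj trap, critical daemons such as sshd can be exempted from the OOM killer. A sketch; -1000 means "never kill", and OOMScoreAdjust is the systemd spelling of the same knob:

```shell
# Inspect OOM-killer preferences: higher oom_score = killed first.
show_oom_score() {
    pid=$1
    echo "oom_score:     $(cat /proc/$pid/oom_score)"
    echo "oom_score_adj: $(cat /proc/$pid/oom_score_adj)"
}

show_oom_score $$   # this shell's own scores, as a demonstration

# To protect sshd persistently, prefer the systemd directive over a raw
# write to /proc (which does not survive a service restart):
#   [Service]
#   OOMScoreAdjust=-1000
```

Checking these scores during triage also tells you whether the Java process really is the OOM killer's first choice on this host.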