Anti-Primer: OOMKilled

Everything that can go wrong, will — and in this story, it does.

The Setup

A Java application on Kubernetes keeps getting OOMKilled after a seemingly innocent dependency upgrade. The team has 4 hours before the SLA breach timer starts. Nobody on the team fully understands JVM memory management in containers.

The Timeline

Hour 0: JVM Ignores Container Limits

The team runs Java 8 without container-aware JVM flags, so the JVM sees host memory, not the container limit. The deadline was looming, and this seemed like the fastest path forward. The result: the JVM sizes its heap at 25% of node memory (32GB), but the container limit is 2GB; instant OOMKill.

Footgun #1: JVM Ignores Container Limits — Java 8 runs without container-aware JVM flags, so the JVM sees host memory instead of the container limit; it allocates 25% of node memory (32GB) against a 2GB container limit, causing an instant OOMKill.
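A minimal sketch of the fix the primer recommends, assuming a Java 8u191+ (or Java 10+) runtime; the pod, container, and image names are placeholders. Passing the flags via JAVA_TOOL_OPTIONS makes the JVM size its heap against the cgroup limit rather than host memory:

```yaml
# Illustrative pod spec fragment; names and values are placeholders.
apiVersion: v1
kind: Pod
metadata:
  name: app
spec:
  containers:
    - name: app
      image: example/app:latest
      env:
        # -XX:+UseContainerSupport is on by default in Java 10+ and was
        # backported to 8u191; -XX:MaxRAMPercentage sizes the heap against
        # the container's memory limit instead of the host's RAM.
        - name: JAVA_TOOL_OPTIONS
          value: "-XX:+UseContainerSupport -XX:MaxRAMPercentage=75.0"
```

JAVA_TOOL_OPTIONS is picked up automatically by the JVM at startup, so no image rebuild is needed to apply the flags.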

Nobody notices yet. The engineer moves on to the next task.

Hour 1: Memory Limit Equals Request

The team sets the memory request and limit to exactly the same value, with no headroom. Under time pressure, the team chose speed over caution. The result: any traffic spike that causes a temporary allocation bump triggers an immediate OOMKill.

Footgun #2: Memory Limit Equals Request — the memory request and limit are set to exactly the same value with no headroom, so any traffic spike that causes a temporary allocation bump triggers an immediate OOMKill.
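The shape the primer suggests can be sketched as a resources block; the sizes below are illustrative, not recommendations for any specific workload:

```yaml
# Illustrative resources block: request at typical steady-state usage,
# limit with roughly 25% headroom for transient allocation spikes.
resources:
  requests:
    memory: "2Gi"
  limits:
    memory: "2560Mi"
```

Note that request < limit puts the pod in the Burstable QoS class rather than Guaranteed; that trade-off is the price of the headroom.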

The first mistake is still invisible, making the next shortcut feel justified.

Hour 2: Off-Heap Memory Ignored

The team sets -Xmx to match the container limit, forgetting thread stacks, metaspace, and native memory. Nobody pushed back because the shortcut looked harmless in the moment. The result: the JVM heap fits, but total process memory exceeds the container limit; the pod is OOMKilled under load.

Footgun #3: Off-Heap Memory Ignored — -Xmx is set to match the container limit, ignoring thread stacks, metaspace, and native memory; the heap fits, but total process memory exceeds the container limit, and the pod is OOMKilled under load.
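Under the primer's ~60-70% rule, a 2Gi container limit leaves roughly 1.3Gi for the heap, with the remainder budgeted for everything the JVM allocates outside it. A sketch of that budget, with illustrative component sizes:

```yaml
# Illustrative memory budget for a 2Gi container limit; component sizes
# vary by application and are assumptions, not measurements.
#   heap (-Xmx):              ~1300Mi  (~65% of the limit)
#   metaspace:                 ~256Mi
#   thread stacks:             ~200Mi  (e.g. 200 threads x 1Mi default stack)
#   code cache, native, NIO:   ~200Mi+
env:
  - name: JAVA_TOOL_OPTIONS
    value: "-Xmx1300m -XX:MaxMetaspaceSize=256m"
```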

Pressure is mounting. The team is behind schedule and cutting more corners.

Hour 3: Monitoring Only Heap

The Grafana dashboard shows JVM heap at 60% utilization, so the team concludes memory is fine. The team had gotten away with similar shortcuts before, so nobody raised a flag. The result: RSS (resident set size) is 3x the heap, and the pod is OOMKilled despite 'healthy' metrics.

Footgun #4: Monitoring Only Heap — the Grafana dashboard shows JVM heap at 60% utilization and the team concludes memory is fine, but RSS (resident set size) is 3x the heap; the pod is OOMKilled despite 'healthy' metrics.
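Watching the right signal can be sketched as a Prometheus alerting rule. The metrics come from cAdvisor (container_memory_working_set_bytes) and kube-state-metrics (kube_pod_container_resource_limits); the container label, threshold, and rule names are assumptions:

```yaml
# Illustrative Prometheus alerting rule; label selectors are placeholders.
groups:
  - name: container-memory
    rules:
      - alert: ContainerNearMemoryLimit
        # The working set is what the kernel's OOM killer acts on,
        # unlike JVM heap gauges, which miss off-heap usage entirely.
        expr: |
          container_memory_working_set_bytes{container="app"}
            / on (namespace, pod, container)
          kube_pod_container_resource_limits{resource="memory", container="app"}
            > 0.9
        for: 5m
        labels:
          severity: warning
```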

By hour 3, the compounding failures have reached critical mass. Pages fire. The war room fills up. The team scrambles to understand what went wrong while the system burns.

The Postmortem

Root Cause Chain

| # | Mistake | Consequence | Could Have Been Prevented By |
|---|---------|-------------|------------------------------|
| 1 | JVM Ignores Container Limits | JVM allocates 25% of node memory (32GB) but the container limit is 2GB; instant OOMKill | Primer: use Java 11+ or add -XX:+UseContainerSupport and -XX:MaxRAMPercentage |
| 2 | Memory Limit Equals Request | Any traffic spike that causes a temporary allocation bump triggers an immediate OOMKill | Primer: set the request at typical usage and the limit with 20-30% headroom |
| 3 | Off-Heap Memory Ignored | JVM heap fits but total process memory exceeds the container limit; OOMKilled under load | Primer: account for off-heap memory; -Xmx should be ~60-70% of the container memory limit |
| 4 | Monitoring Only Heap | RSS (resident set size) is 3x the heap; OOMKilled despite 'healthy' metrics | Primer: monitor container RSS, not just JVM heap; use container_memory_working_set_bytes |
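The whole chain can also be sanity-checked at startup: comparing the JVM's configured max heap against the container's memory limit catches footguns #1 and #3 before traffic does. A minimal sketch, assuming cgroup v2 (the /sys/fs/cgroup/memory.max path; cgroup v1 uses a different file) and an illustrative 70% threshold:

```java
import java.nio.file.Files;
import java.nio.file.Path;

public class MemoryBudgetCheck {
    public static void main(String[] args) throws Exception {
        long maxHeap = Runtime.getRuntime().maxMemory();
        // cgroup v2 exposes the container memory limit here; the literal
        // string "max" means the container is unlimited.
        Path limitFile = Path.of("/sys/fs/cgroup/memory.max");
        if (Files.exists(limitFile)) {
            String raw = Files.readString(limitFile).trim();
            if (!raw.equals("max")) {
                long limit = Long.parseLong(raw);
                // A heap above ~70% of the limit leaves too little room
                // for metaspace, thread stacks, and native allocations.
                if (maxHeap > limit * 0.7) {
                    System.err.println("WARN: max heap is " + maxHeap + " of "
                            + limit + " bytes; off-heap usage may trigger an OOMKill");
                }
            }
        }
        System.out.println("Max heap: " + maxHeap / (1024 * 1024) + " MiB");
    }
}
```

Outside a container the limit file is absent and the check degrades to printing the heap size, so the same binary is safe everywhere.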

Damage Report

  • Downtime: 2-4 hours of pod-level or cluster-wide disruption
  • Data loss: Risk of volume data loss if StatefulSets were affected
  • Customer impact: Intermittent 5xx errors, dropped connections, or full service outage
  • Engineering time to remediate: 10-20 engineer-hours for incident response, rollback, and postmortem
  • Reputation cost: On-call fatigue; delayed feature work; possible SLA breach notification

What the Primer Teaches

  • Footgun #1: Had the engineer read the primer's section on JVM Ignores Container Limits, they would have learned: use Java 11+ or add -XX:+UseContainerSupport and -XX:MaxRAMPercentage.
  • Footgun #2: Had the engineer read the primer's section on Memory Limit Equals Request, they would have learned: set the request at typical usage and the limit with 20-30% headroom.
  • Footgun #3: Had the engineer read the primer's section on Off-Heap Memory Ignored, they would have learned: account for off-heap memory; -Xmx should be ~60-70% of the container memory limit.
  • Footgun #4: Had the engineer read the primer's section on Monitoring Only Heap, they would have learned: monitor container RSS, not just JVM heap, via container_memory_working_set_bytes.

Cross-References