
Runbook: OOM Killer Activated

Field Value
Domain Linux
Alert node_vmstat_oom_kill > 0 or dmesg shows "Out of memory: Killed process"
Severity P1
Est. Resolution Time 20-40 minutes
Escalation Timeout 30 minutes — page if not resolved
Last Tested 2026-03-19
Prerequisites SSH access to the node, sudo or root access, access to process metrics (Prometheus or top)

Quick Assessment (30 seconds)

# Run this first — it tells you the scope of the problem
free -h && dmesg | grep -i "oom\|killed process" | tail -20
If output shows recent OOM kill events in dmesg AND available memory near zero → active memory pressure; proceed from Step 1 urgently.
If output shows OOM kills in dmesg but memory looks OK now → a process was killed and freed memory, and the system recovered; investigate the root cause from Step 2 to prevent recurrence.

Step 1: Check dmesg for OOM Kill Events

Why: The kernel logs exactly which process was killed, its PID, how much memory it used, and what the system-wide memory state was at the time of the kill.

# Show all OOM events from the current boot
sudo dmesg | grep -i "out of memory\|oom_kill\|killed process" | tail -30

# More detailed: show the full OOM block including memory map
sudo dmesg | grep -B5 -A20 "Out of memory"

# Check the systemd journal for OOM events
# (journalctl -k shows the current boot only; add -b -1 or --boot=all for prior boots)
sudo journalctl -k | grep -i "oom\|killed process" | tail -30
Expected output:
[12345.678901] Out of memory: Killed process 12345 (java) total-vm:4096000kB, anon-rss:3900000kB, file-rss:10000kB
[12345.678902] oom_reaper: reaped process 12345 (java), now anon-rss:0kB, file-rss:0kB
If this fails: If dmesg is empty or shows nothing, OOM kills may have happened before last boot. Check /var/log/syslog or /var/log/kern.log for historical OOM messages.

Step 2: Identify Which Process Was Killed

Why: The OOM killer picks a process using a score based on memory usage and other factors. Knowing which process was killed tells you which service needs attention.

# Extract the PID and process name from dmesg OOM events
# (with default dmesg timestamps, the PID is field 7 and the name field 8)
sudo dmesg | grep -i "killed process" | awk '{print $7, $8}'

# Check if the service is currently running (it may have been restarted by systemd)
systemctl status <SERVICE_NAME>

# For Kubernetes pods: find the OOMKilled container
kubectl get pods -A -o json | jq -r \
  '.items[] | select(.status.containerStatuses[]?.lastState.terminated.reason == "OOMKilled") |
   "\(.metadata.namespace)/\(.metadata.name): \(.status.containerStatuses[].lastState.terminated.reason)"'
Expected output:
# Example: process 12345 (java) was killed
# In Kubernetes context:
production/myapp-6d8fb9cc4-xxxx: OOMKilled
If this fails: If the process name is truncated in dmesg (kernel truncates at 15 chars), use the PID to look it up in /proc — though the PID may have been reused. Check service logs for crash times.
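Because the awk field positions shift depending on whether dmesg prints timestamps, a pattern match on the message itself is more robust than fixed column numbers. A minimal sketch (the sample line is illustrative):

```shell
# Sample OOM line (illustrative; matches the typical kernel message format)
line='[12345.678901] Out of memory: Killed process 4321 (java) total-vm:4096000kB, anon-rss:3900000kB, file-rss:10000kB'

# Pull out the PID and process name with sed, independent of column position
pid=$(printf '%s\n' "$line" | sed -n 's/.*Killed process \([0-9]*\) (\([^)]*\)).*/\1/p')
name=$(printf '%s\n' "$line" | sed -n 's/.*Killed process \([0-9]*\) (\([^)]*\)).*/\2/p')
echo "pid=$pid name=$name"
```

The same sed expression works on a live `sudo dmesg | grep -i "killed process"` stream.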

Step 3: Check Current Memory Usage

Why: Understanding current memory pressure shows whether the system is still at risk of more OOM kills or whether it has recovered.

# Overall memory summary
free -h

# Detailed memory breakdown by process
ps aux --sort=-%mem | head -20

# Check for memory pressure via vmstat
vmstat -s | head -20

# Check cgroup memory limits (for containerized workloads)
cat /sys/fs/cgroup/memory/memory.usage_in_bytes 2>/dev/null
cat /sys/fs/cgroup/memory/memory.limit_in_bytes 2>/dev/null

# For cgroup v2 (systemd 232+, kernel 4.5+)
cat /sys/fs/cgroup/memory.current 2>/dev/null
cat /sys/fs/cgroup/memory.max 2>/dev/null
Expected output:
              total        used        free      shared  buff/cache   available
Mem:            31G        28G       500M       1.2G       2.5G       1.8G
Swap:            0B          0B          0B
If this fails: If available is under 500MB with no swap, the system is at immediate risk of more OOM kills. Consider restarting the memory-hungry process immediately before continuing investigation.
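To turn the free output into a single risk number, MemAvailable can be computed as a percentage of MemTotal straight from /proc/meminfo (the MemAvailable field exists on kernels 3.14+). A sketch, with the 5% threshold as an assumption rather than a hard rule:

```shell
# Percentage of RAM still available to new allocations (fields are in kB)
avail_pct=$(awk '/^MemTotal/ {t=$2} /^MemAvailable/ {a=$2} END {printf "%d", a*100/t}' /proc/meminfo)
echo "available: ${avail_pct}%"
if [ "$avail_pct" -lt 5 ]; then
  echo "WARNING: immediate risk of further OOM kills"
fi
```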

Step 4: Check for Memory Leaks

Why: A process that was killed may restart and begin leaking again. Catching an upward trend before the next OOM kill is critical.

# Watch memory usage of the suspect process over time
watch -n 5 'ps -o pid,rss,vsz,comm -p <PID>'

# Check the resident set size (VmRSS) the kernel reports for the process
grep -i vmrss /proc/<PID>/status

# Check if RSS is growing over time (run these 60 seconds apart)
ps -o pid,rss,comm -p <PID>
sleep 60
ps -o pid,rss,comm -p <PID>

# Check smaps for heap size (requires root; the mapping is labeled "[heap]")
sudo grep -A1 "\[heap\]" /proc/<PID>/smaps | grep Size
Expected output:
# Healthy: RSS stays flat or grows slowly and plateaus
# Memory leak: RSS grows by hundreds of MB per minute
If this fails: If the process does not have a stable PID (keeps restarting), attach a memory profiler before it crashes again. For Java: jmap -histo:live <PID> (or jhsdb jmap --heap --pid <PID> on JDK 9+). For Python: use tracemalloc or memory_profiler.
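The two ps samples above can be wrapped in a small helper that prints the RSS delta directly. A sketch (rss_growth is a hypothetical name; the interval defaults to 60 seconds):

```shell
# Report RSS growth of a process over an interval, reading VmRSS from /proc
rss_growth() {
  pid=$1
  interval=${2:-60}
  r1=$(awk '/^VmRSS/ {print $2}' "/proc/$pid/status")
  sleep "$interval"
  r2=$(awk '/^VmRSS/ {print $2}' "/proc/$pid/status")
  echo "rss_start=${r1}kB rss_end=${r2}kB delta=$((r2 - r1))kB"
}

# Example: sample the current shell for 1 second (should show ~0 growth)
rss_growth $$ 1
```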

Step 5: Tune vm.overcommit_memory if Needed

Why: The Linux kernel's memory overcommit setting affects how aggressively memory is allocated. The wrong setting can make OOM kills more frequent.

# Check current overcommit setting
cat /proc/sys/vm/overcommit_memory
# 0 = heuristic (default), 1 = always overcommit, 2 = never overcommit

cat /proc/sys/vm/overcommit_ratio
# Only relevant when overcommit_memory=2

# Check OOM score adjustments for critical processes
cat /proc/<CRITICAL_PID>/oom_score_adj

# Protect a critical process from OOM kill (set to -1000 to make unkillable)
# WARNING: Only do this for truly critical system processes
# (a plain "sudo echo ... >" fails because the redirect runs unprivileged)
echo -1000 | sudo tee /proc/<CRITICAL_PID>/oom_score_adj

# Make oom_score_adj permanent via systemd unit override
sudo systemctl edit <SERVICE_NAME>
# Add: [Service]
#      OOMScoreAdjust=-500
Expected output:
# overcommit_memory = 0 is appropriate for most workloads
0
If this fails: Do not set overcommit_memory=2 (never overcommit) without understanding all memory consumers on the node. It can cause processes to fail to allocate memory even when plenty appears available.
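To see which processes the kernel would sacrifice next, the live oom_score of every process can be ranked (highest score is killed first). A minimal sketch:

```shell
# Rank running processes by current oom_score, highest first
for p in /proc/[0-9]*; do
  score=$(cat "$p/oom_score" 2>/dev/null) || continue  # process may have exited
  comm=$(cat "$p/comm" 2>/dev/null) || continue
  printf '%s %s %s\n' "$score" "${p#/proc/}" "$comm"
done | sort -rn | head -10
```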

Step 6: Add Swap or Increase RAM

Why: Swap gives the kernel somewhere to page anonymous memory rather than killing processes. It slows things down but prevents immediate crashes.

# Check if swap exists
swapon --show

# Create a 4GB swapfile (temporary fix — add RAM for permanent fix)
# (if fallocate is unsupported on the filesystem, use:
#  sudo dd if=/dev/zero of=/swapfile bs=1M count=4096)
sudo fallocate -l 4G /swapfile
sudo chmod 600 /swapfile
sudo mkswap /swapfile
sudo swapon /swapfile

# Verify swap is active
free -h

# Make it permanent (add to /etc/fstab)
echo '/swapfile none swap sw 0 0' | sudo tee -a /etc/fstab

# For cloud nodes — resize the instance type for more RAM
# (AWS example; the instance must be stopped first)
aws ec2 modify-instance-attribute --instance-id <INSTANCE_ID> --instance-type Value=<NEW_TYPE>
Expected output:
              total        used        free
Swap:          4.0G          0B       4.0G
If this fails: If a swapfile cannot be created (disk full), free disk space first — see Disk Full runbook. Note: Kubernetes nodes often have swap disabled by design. Adding swap to a kube node requires kubelet config changes to allow it.
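Swap sizing is a judgment call; one common rule of thumb (an assumption, not a hard requirement) can be computed from total RAM:

```shell
# Rule-of-thumb swap sizing (assumption: <2G RAM -> 2x RAM,
# 2-8G RAM -> equal to RAM, >8G RAM -> 4G floor for headroom)
ram_kb=$(awk '/^MemTotal/ {print $2}' /proc/meminfo)
ram_gb=$(( (ram_kb + 1048575) / 1048576 ))   # round up to whole GB
if [ "$ram_gb" -lt 2 ]; then
  swap_gb=$(( ram_gb * 2 ))
elif [ "$ram_gb" -le 8 ]; then
  swap_gb=$ram_gb
else
  swap_gb=4
fi
echo "RAM ~${ram_gb}G -> suggested swap: ${swap_gb}G"
```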

Verification

# Confirm the issue is resolved
free -h
sudo dmesg | grep -c "Out of memory"  # this count should not increase after the fix
Success looks like: available memory above 20% of total; no new OOM events in dmesg since the fix; the killed service running stably.
If still broken: escalate — see below.
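Verification can be made mechanical by snapshotting the OOM event count at fix time and re-checking after a soak period. A sketch (/tmp/oom_baseline is a hypothetical scratch path; run with sudo if dmesg is restricted):

```shell
# Count OOM events; awk keeps the exit status clean even when the count is 0
count_oom() { dmesg 2>/dev/null | awk '/Out of memory/ {n++} END {print n+0}'; }

count_oom > /tmp/oom_baseline

# ...later, once the fix has had time to soak:
now=$(count_oom)
before=$(cat /tmp/oom_baseline)
if [ "$now" -gt "$before" ]; then
  echo "new OOM events since the fix: escalate"
else
  echo "stable: no new OOM events"
fi
```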

Escalation

Condition Who to Page What to Say
Not resolved in 30 min Infrastructure on-call "OOM killer repeatedly firing on node <NODE>, memory exhausted, critical processes being killed"
Data loss suspected Application team lead "OOM kill may have corrupted in-flight writes for <SERVICE_NAME> on <NODE>"
Scope expanding to multiple nodes SRE lead "OOM kills across multiple nodes, possible memory leak in shared workload or cluster memory misconfiguration"

Post-Incident

  • Update monitoring if alert was noisy or missing
  • File postmortem if P1/P2
  • Update this runbook if steps were wrong or incomplete

Common Mistakes

  1. Restarting the killed process without understanding why it used so much memory: If you just restart the service without investigating, it will grow to the same size and get killed again — usually faster the second time. Always capture a memory profile or heap dump before restarting.
  2. Not setting cgroup memory limits: In Kubernetes, always set resources.limits.memory on containers. Without limits, one leaky container can starve all others on the node and trigger node-level OOM kills.
  3. Forgetting to check swap: Some production nodes have swap disabled intentionally (e.g., Kubernetes requirement). But if the intent was to have swap and it is missing, adding it can prevent OOM kills while a permanent fix is implemented.
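Mistake 2 can be avoided by declaring limits on every container. A minimal pod-spec fragment (values are illustrative; tune per workload):

```yaml
# Per-container memory request/limit (illustrative values)
resources:
  requests:
    memory: "512Mi"   # what the scheduler reserves
  limits:
    memory: "1Gi"     # only this container is OOMKilled above this, not the node
```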

Cross-References


Wiki Navigation

  • Linux Memory Flashcards (CLI) (flashcard_deck, L1) — Linux Memory Management
  • Linux Memory Management (Topic Pack, L1) — Linux Memory Management