# Runbook: OOM Killer Activated
| Field | Value |
|---|---|
| Domain | Linux |
| Alert | node_vmstat_oom_kill > 0 or dmesg shows "Out of memory: Killed process" |
| Severity | P1 |
| Est. Resolution Time | 20-40 minutes |
| Escalation Timeout | 30 minutes — page if not resolved |
| Last Tested | 2026-03-19 |
| Prerequisites | SSH access to the node, sudo or root access, access to process metrics (Prometheus or top) |
## Quick Assessment (30 seconds)

```bash
# Run this first — it tells you the scope of the problem
free -h && dmesg | grep -i "oom\|killed process" | tail -20
```
## Step 1: Check dmesg for OOM Kill Events
Why: The kernel logs exactly which process was killed, its PID, how much memory it used, and what the system-wide memory state was at the time of the kill.
```bash
# Show all OOM events from the current boot
sudo dmesg | grep -i "out of memory\|oom_kill\|killed process" | tail -30

# More detailed: show the full OOM block including memory map
sudo dmesg | grep -B5 -A20 "Out of memory"

# Check the systemd journal (add -b -1 to inspect the previous boot after a restart)
sudo journalctl -k | grep -i "oom\|killed process" | tail -30
```

Example output:

```text
[12345.678901] Out of memory: Killed process 12345 (java) total-vm:4096000kB, anon-rss:3900000kB, file-rss:10000kB
[12345.678902] oom_reaper: reaped process 12345 (java), now anon-rss:0kB, file-rss:0kB
```

If the dmesg ring buffer has rotated, check /var/log/syslog or /var/log/kern.log for historical OOM messages.
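The OOM log line packs the key facts into one string. A small helper can pull out the PID, process name, and anonymous RSS; this is a sketch (the `parse_oom_line` name is ours, and the field handling assumes the timestamped format shown above):

```shell
# Hypothetical helper: extract "pid name anon-rss" from a dmesg OOM kill line.
# Assumes the timestamped format shown above; adjust if your kernel's wording differs.
parse_oom_line() {
  awk '{
    for (i = 1; i <= NF; i++) {
      if ($i == "process")   { pid = $(i + 1); name = $(i + 2) }
      if ($i ~ /^anon-rss:/) { rss = $i; sub(/^anon-rss:/, "", rss); sub(/,$/, "", rss) }
    }
    gsub(/[()]/, "", name)   # strip the parentheses around the process name
    print pid, name, rss
  }'
}

# Example with the sample line from the output above:
echo '[12345.678901] Out of memory: Killed process 12345 (java) total-vm:4096000kB, anon-rss:3900000kB, file-rss:10000kB' \
  | parse_oom_line
# → 12345 java 3900000kB
```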
## Step 2: Identify Which Process Was Killed
Why: The OOM killer picks a process using a score based on memory usage and other factors. Knowing which process was killed tells you which service needs attention.
```bash
# Extract PID and process name from dmesg OOM events
# (field positions assume the timestamped dmesg format shown above)
sudo dmesg | grep -i "killed process" | awk '{print $7, $8}'

# Check if the service is currently running (it may have been restarted by systemd)
systemctl status <SERVICE_NAME>

# For Kubernetes pods: find the OOMKilled containers
kubectl get pods -A -o json | jq -r \
  '.items[] | select(.status.containerStatuses[]?.lastState.terminated.reason == "OOMKilled") |
  "\(.metadata.namespace)/\(.metadata.name): \(.status.containerStatuses[].lastState.terminated.reason)"'
```

Example output:

```text
# process 12345 (java) was killed
# In Kubernetes context:
production/myapp-6d8fb9cc4-xxxx: OOMKilled
```

The killed process is gone from /proc, and even if the PID exists again it may have been reused. Check service logs for crash times.
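To see who the killer would target next, you can rank live processes by their current oom_score. A sketch (the paths are standard procfs; the 40-character cmdline cut is an arbitrary choice for readability):

```shell
# Rank the top OOM-kill candidates right now (higher score = killed first)
for p in /proc/[0-9]*; do
  score=$(cat "$p/oom_score" 2>/dev/null) || continue   # process may have exited
  cmd=$(tr '\0' ' ' < "$p/cmdline" 2>/dev/null | cut -c1-40)
  printf '%6s  %6s  %s\n' "$score" "${p#/proc/}" "${cmd:-[kernel thread]}"
done | sort -rn | head -10
```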
## Step 3: Check Current Memory Usage
Why: Understanding current memory pressure shows whether the system is still at risk of more OOM kills or whether it has recovered.
```bash
# Overall memory summary
free -h

# Detailed memory breakdown by process
ps aux --sort=-%mem | head -20

# Check for memory pressure via vmstat
vmstat -s | head -20

# Check cgroup memory limits (cgroup v1, for containerized workloads)
cat /sys/fs/cgroup/memory/memory.usage_in_bytes 2>/dev/null
cat /sys/fs/cgroup/memory/memory.limit_in_bytes 2>/dev/null

# For cgroup v2 (systemd 232+, kernel 4.5+)
cat /sys/fs/cgroup/memory.current 2>/dev/null
cat /sys/fs/cgroup/memory.max 2>/dev/null
```

If `available` (the last column of `free -h`) is under 500MB with no swap, the system is at immediate risk of more OOM kills. Consider restarting the memory-hungry process immediately before continuing investigation.
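The "available memory above 20% of total" target in the Verification section can be computed directly from /proc/meminfo, using the kernel's own MemAvailable estimate (a sketch):

```shell
# Percentage of total memory the kernel estimates is available for new work
awk '/^MemTotal:/     { total = $2 }
     /^MemAvailable:/ { avail = $2 }
     END { printf "%.1f%% available (%d of %d kB)\n", 100 * avail / total, avail, total }' /proc/meminfo
```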
## Step 4: Check for Memory Leaks
Why: A process that was killed may restart and begin leaking again. Catching an upward trend before the next OOM kill is critical.
```bash
# Watch memory usage of the suspect process over time
watch -n 5 'ps -o pid,rss,vsz,comm -p <PID>'

# Resident set size as the kernel accounts it, straight from /proc
grep -i vmrss /proc/<PID>/status

# Check if RSS is growing over time (run these 60 seconds apart)
ps -o pid,rss,comm -p <PID>
sleep 60
ps -o pid,rss,comm -p <PID>

# Check smaps for heap growth (requires root; the heap mapping is labeled [heap])
sudo grep -A1 "\[heap\]" /proc/<PID>/smaps | grep ^Size

# Healthy: RSS stays flat or grows slowly and plateaus
# Memory leak: RSS grows by hundreds of MB per minute
```

For Java: capture a heap summary with `jmap -heap <PID>`. For Python: use tracemalloc or memory_profiler.
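The two `ps` snapshots above generalize into a tiny sampler that makes a growth trend easy to spot; `sample_rss` is a hypothetical helper, not a standard tool:

```shell
# Hypothetical helper: print a timestamped RSS sample for a PID every
# <interval> seconds, <count> times.
sample_rss() {
  pid=$1 interval=$2 count=$3
  i=0
  while [ "$i" -lt "$count" ]; do
    rss=$(awk '/^VmRSS:/ {print $2}' "/proc/$pid/status" 2>/dev/null)
    [ -n "$rss" ] || { echo "process $pid gone" >&2; return 1; }
    printf '%s %s kB\n' "$(date +%s)" "$rss"
    i=$((i + 1))
    sleep "$interval"
  done
}

# Example: sample this shell's own RSS three times, one second apart
sample_rss $$ 1 3
```

Graph the samples (or just eyeball the deltas): a flat or plateauing series is healthy, a steady climb is a leak.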
## Step 5: Tune vm.overcommit_memory if Needed
Why: The Linux kernel's memory overcommit setting affects how aggressively memory is allocated. The wrong setting can make OOM kills more frequent.
```bash
# Check current overcommit setting
cat /proc/sys/vm/overcommit_memory
# 0 = heuristic (default), 1 = always overcommit, 2 = never overcommit

cat /proc/sys/vm/overcommit_ratio
# Only relevant when overcommit_memory=2

# Check OOM score adjustments for critical processes
cat /proc/<CRITICAL_PID>/oom_score_adj

# Protect a critical process from OOM kill (set to -1000 to make it unkillable)
# WARNING: Only do this for truly critical system processes
# (note: `sudo echo -1000 > file` fails because the redirection runs as your user)
echo -1000 | sudo tee /proc/<CRITICAL_PID>/oom_score_adj

# Make oom_score_adj permanent via systemd unit override
sudo systemctl edit <SERVICE_NAME>
# Add: [Service]
#      OOMScoreAdjust=-500
```

Warning: do not set overcommit_memory=2 (never overcommit) without understanding all memory consumers on the node. It can cause processes to fail to allocate memory even when plenty appears available.
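Under overcommit_memory=2 the kernel refuses allocations beyond CommitLimit, which is roughly SwapTotal + MemTotal × overcommit_ratio / 100 (ignoring hugetlb reservations and the overcommit_kbytes alternative). A sketch of what that cap would be on the current node:

```shell
# Approximate the CommitLimit this node would have under vm.overcommit_memory=2:
#   CommitLimit ≈ SwapTotal + MemTotal * overcommit_ratio / 100
ratio=$(cat /proc/sys/vm/overcommit_ratio)
awk -v ratio="$ratio" '
  /^MemTotal:/  { mem = $2 }
  /^SwapTotal:/ { swap = $2 }
  END { printf "approx CommitLimit: %d kB (swap %d kB + %d%% of %d kB)\n",
               swap + mem * ratio / 100, swap, ratio, mem }
' /proc/meminfo
```

If currently committed memory (Committed_AS in /proc/meminfo) is already near that number, switching to mode 2 would start failing allocations immediately.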
## Step 6: Add Swap or Increase RAM
Why: Swap gives the kernel somewhere to page anonymous memory rather than killing processes. It slows things down but prevents immediate crashes.
```bash
# Check if swap exists
swapon --show

# Create a 4GB swapfile (temporary fix — add RAM for a permanent fix)
sudo fallocate -l 4G /swapfile   # use dd if the filesystem doesn't support fallocate
sudo chmod 600 /swapfile
sudo mkswap /swapfile
sudo swapon /swapfile

# Verify swap is active
free -h

# Make it permanent (add to /etc/fstab)
echo '/swapfile none swap sw 0 0' | sudo tee -a /etc/fstab

# For cloud nodes, resize the instance type for more RAM
# (AWS example; the instance must be stopped first)
aws ec2 modify-instance-attribute --instance-id <INSTANCE_ID> --instance-type Value=<NEW_TYPE>
```
## Verification

Success looks like: Available memory above 20% of total. No new OOM events in dmesg since the fix. The killed service is running stably.

If still broken: escalate (see below).

## Escalation
| Condition | Who to Page | What to Say |
|---|---|---|
| Not resolved in 30 min | Infrastructure on-call | "OOM killer repeatedly firing on node <NODE>" |
| Data loss suspected | Application team lead | "OOM kill may have corrupted in-flight writes for <SERVICE_NAME>" |
| Scope expanding to multiple nodes | SRE lead | "OOM kills across multiple nodes, possible memory leak in shared workload or cluster memory misconfiguration" |
## Post-Incident
- Update monitoring if alert was noisy or missing
- File postmortem if P1/P2
- Update this runbook if steps were wrong or incomplete
## Common Mistakes
- Restarting the killed process without understanding why it used so much memory: If you just restart the service without investigating, it will grow to the same size and get killed again — usually faster the second time. Always capture a memory profile or heap dump before restarting.
- Not setting cgroup memory limits: In Kubernetes, always set `resources.limits.memory` on containers. Without limits, one leaky container can starve all others on the node and trigger node-level OOM kills.
- Forgetting to check swap: Some production nodes have swap disabled intentionally (e.g., Kubernetes requirement). But if the intent was to have swap and it is missing, adding it can prevent OOM kills while a permanent fix is implemented.
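A minimal example of the limits bullet above (the pod and image names are placeholders, and the memory values are illustrative; size them from observed usage):

```yaml
# Pod spec fragment: requests/limits keep one leaky container from
# starving the node (names and values are illustrative placeholders)
apiVersion: v1
kind: Pod
metadata:
  name: myapp
spec:
  containers:
    - name: myapp
      image: myapp:latest
      resources:
        requests:
          memory: "256Mi"
        limits:
          memory: "512Mi"   # the container is OOM-killed at this cap, not the node
```

With a limit set, an OOM kill is scoped to the offending container's cgroup, which is exactly the containment this runbook is trying to restore.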
## Cross-References
- Topic Pack: Linux Memory Management (deep background)
- Related Runbook: High CPU (Runaway Process)
## Related Content
- Linux Memory Flashcards (CLI) (flashcard_deck, L1) — Linux Memory Management
- Linux Memory Management (Topic Pack, L1) — Linux Memory Management