Incident Replay: OOM Killer Events

Setup

  • System context: Production application server with 32GB RAM running a Java application, Redis cache, and system services. The OOM killer has been triggered multiple times, killing critical processes.
  • Time: Friday 01:30 UTC
  • Your role: On-call SRE

Round 1: Alert Fires

[Pressure cue: "PagerDuty fires — Redis on app-prod-06 was killed by OOM killer. Application is failing cache lookups and hitting the database directly. Database load spiking."]

What you see: dmesg | grep -i "oom\|killed" shows "Out of memory: Killed process 2847 (redis-server) score 120." Redis was chosen by the OOM killer because it had a high OOM score. free -h shows only 200MB free with 28GB used.

Choose your action:

  • A) Restart Redis immediately
  • B) Check what is consuming 28GB of memory
  • C) Add more swap space to prevent future OOM kills
  • D) Set oom_score_adj = -1000 on Redis to prevent it from being killed

[Result: ps aux --sort=-%mem | head -10 shows: java (16GB), redis-server (4GB, before kill), nginx (500MB), and a mysterious data-processor script consuming 10GB — it is a batch job that loaded an entire dataset into memory. Proceed to Round 2.]
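The triage above can be reproduced with a few standard Linux commands; a sketch (the grep pattern is illustrative, and reading the full dmesg buffer may require root):

```shell
# Which processes did the OOM killer reap, and when?
dmesg -T 2>/dev/null | grep -iE "out of memory|oom-kill" | tail -n 5

# Current headroom (MemAvailable is what the kernel can actually reclaim)
free -h

# Top resident-memory consumers
ps aux --sort=-%mem | head -n 10

# The kernel's current "badness" score for a process (here: the cat itself)
cat /proc/self/oom_score
```

Checking `ps` output before acting is what surfaces the 10GB data-processor here; `free -h` alone only tells you that memory is gone, not where.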

If you chose A:

[Result: Redis restarts but the system is still out of memory. Redis gets killed again within 5 minutes. Sisyphean.]

If you chose C:

[Result: Swap slows everything down to a crawl instead of killing processes. Trading quick death for slow agony.]

If you chose D:

[Result: Protecting Redis means the OOM killer targets the Java application instead — worse outcome. You need to fix the memory consumption, not redirect the kill.]
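For context on option D: oom_score_adj ranges from -1000 (never kill) to +1000 (kill first). A minimal sketch of how it behaves — this example only raises the score, since lowering it below the current value needs CAP_SYS_RESOURCE:

```shell
# Read this shell's current adjustment (typically 0)
cat /proc/self/oom_score_adj

# An unprivileged process may raise (deprioritize) its own score...
echo 100 > /proc/self/oom_score_adj
cat /proc/self/oom_score_adj

# ...but lowering it (e.g. pinning Redis at -1000) is a privileged, and in
# this incident a counterproductive, move: the killer just picks Java next.
```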

Round 2: First Triage Data

[Pressure cue: "Database load is 5x normal from lost cache. Need Redis back AND memory usage fixed."]

What you see: The data-processor batch script (10GB) was started by a cron job 2 hours ago. It loads a full CSV dataset into a Python dictionary. On a 32GB server, this plus Java (16GB) plus Redis (4GB) = 30GB, leaving almost nothing for the OS.
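The arithmetic above can be sanity-checked in one line (all figures are this incident's numbers, including a 2GB OS reserve):

```shell
# Committed memory vs. capacity, in GB
awk 'BEGIN {
  total = 32; java = 16; batch = 10; redis = 4; os = 2
  used = java + batch + redis + os
  printf "committed: %d GB of %d GB; headroom: %d GB\n", used, total, total - used
}'
# -> committed: 32 GB of 32 GB; headroom: 0 GB
```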

Choose your action:

  • A) Kill the data-processor script and restart Redis
  • B) Kill only the data-processor, then restart Redis with a memory limit
  • C) Reduce the Java heap size to leave more room for other services
  • D) Add 16GB more RAM to the server

[Result: kill $(pgrep data-processor) frees 10GB. Redis restarted with maxmemory 4gb config. System memory usage drops to 21GB. Stable. Proceed to Round 3.]
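The recovery steps can be sketched as follows (data-processor and the 4gb limit come from this incident; the redis-cli calls are guarded in case no Redis is reachable where you try this):

```shell
# Stop the runaway batch job, if present
pkill -f data-processor || true

# Confirm the kernel sees the memory back
grep -E 'MemTotal|MemAvailable' /proc/meminfo

# Restart/reconfigure Redis with a hard ceiling so it cannot grow unbounded
if command -v redis-cli >/dev/null && redis-cli ping >/dev/null 2>&1; then
  redis-cli CONFIG SET maxmemory 4gb
  redis-cli CONFIG SET maxmemory-policy allkeys-lru
  redis-cli CONFIG GET maxmemory
fi
```

Note that CONFIG SET does not survive a restart; the same maxmemory and maxmemory-policy values should also land in redis.conf.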

If you chose A:

[Result: Works but Redis without a memory limit could grow unbounded and cause OOM again.]

If you chose C:

[Result: The Java heap is already sized for the workload. Reducing it would cause GC pressure and application slowdowns.]

If you chose D:

[Result: Hardware changes take hours or days. Not an immediate fix.]

Round 3: Root Cause Identification

[Pressure cue: "Memory stabilized. Prevent recurrence."]

What you see: Root cause: The data-processor batch job was added to the server cron without a memory impact assessment. It loads a 2GB CSV file into Python, which expands to ~10GB in-memory (Python object overhead). The server's memory budget was: Java 16GB + Redis 4GB + OS 2GB = 22GB of 32GB. The batch job pushed it to 32GB.

Choose your action:

  • A) Move the batch job to a separate server or container
  • B) Rewrite the batch job to use streaming/chunked processing
  • C) Set memory limits (cgroups/systemd) for the batch job
  • D) All of the above (move now, rewrite later, cgroups as safety net)

[Result: Immediate: move batch job to a non-production server. Medium-term: rewrite to stream data. Safety net: add systemd MemoryMax for all cron jobs. Proceed to Round 4.]
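The safety net from this round can be applied with a systemd drop-in; a sketch (the unit name data-processor.service and the 8G cap are illustrative — pick a limit below your real headroom):

```shell
# Cap memory for the batch job's unit with a systemd drop-in.
# DROPIN_DIR is overridable so the sketch can be tried outside /etc.
dropin_dir="${DROPIN_DIR:-/etc/systemd/system/data-processor.service.d}"
mkdir -p "$dropin_dir"
cat > "$dropin_dir/memory.conf" <<'EOF'
[Service]
MemoryMax=8G
MemorySwapMax=0
EOF

# Apply with: systemctl daemon-reload && systemctl restart data-processor
cat "$dropin_dir/memory.conf"
```

For ad-hoc runs, `systemd-run --scope -p MemoryMax=8G <command>` applies the same cap without a unit file.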

If you chose A:

[Result: Solves the resource contention but the batch job still has unbounded memory usage.]

If you chose B:

[Result: Best long-term fix but requires development time.]
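Streaming in practice means never materializing the whole file: process one record at a time so resident memory stays O(1), not O(file size). A toy sketch (the three-row CSV and the column sum are illustrative stand-ins for the real job's aggregation):

```shell
# A tiny stand-in for the 2GB CSV
printf 'id,value\n1,10\n2,20\n3,12\n' > /tmp/sample.csv

# Aggregate line-by-line: awk holds one row in memory at a time
awk -F, 'NR > 1 { sum += $2 } END { print "total:", sum }' /tmp/sample.csv
# -> total: 42
```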

If you chose C:

[Result: cgroups limit prevents OOM kill of other processes but the batch job itself gets killed.]

Round 4: Remediation

[Pressure cue: "System stable. Close the incident."]

Actions:

  1. Verify Redis is running with its memory limit: redis-cli CONFIG GET maxmemory
  2. Verify memory usage is healthy: free -h shows adequate free memory
  3. Verify database load has returned to normal
  4. Move the batch job to a separate server
  5. Add memory usage alerting at 85% system memory
  6. Add cgroup memory limits to all cron jobs via systemd slices
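The verification steps can be scripted; a sketch (the 85% threshold matches the new alert, and the redis-cli check is guarded in case Redis lives elsewhere):

```shell
# Steps 1-2: Redis limit and reachability (skipped if no local Redis)
if command -v redis-cli >/dev/null && redis-cli ping >/dev/null 2>&1; then
  redis-cli CONFIG GET maxmemory
fi

# Step 5's threshold, as a check: exit nonzero above 85% memory used
awk '/MemTotal/ {t=$2} /MemAvailable/ {a=$2}
     END { used = 100 * (t - a) / t
           printf "memory used: %.0f%%\n", used
           exit (used > 85) }' /proc/meminfo
```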

Damage Report

  • Total downtime: 0 (Redis was killed but application fell back to database)
  • Blast radius: 5x database load for 30 minutes; application latency increased 3x
  • Optimal resolution time: 10 minutes (identify memory hog -> kill batch job -> restart Redis with limits)
  • If every wrong choice was made: 2+ hours with repeated OOM kills and cascading database overload

Cross-References