
Thinking Out Loud: Linux Memory Management

A senior SRE's internal monologue while working through a real memory management issue. This isn't a tutorial — it's a window into how experienced engineers actually think.

The Situation

An overnight batch job on a 32GB worker VM was killed by the OOM killer. The job processes CSV files, transforms them, and loads them into a database. The team says "it worked fine with last month's data but this month it got OOM-killed." The data set grew from 8GB to 12GB.

The Monologue

OOM-killed batch job. First thing — let me find the OOM event and see exactly what happened.

dmesg -T | grep -i "oom\|killed" | tail -20

"Out of memory: Killed process 28451 (python3) total-vm:48229376kB, anon-rss:30982144kB, file-rss:1245696kB oom_score_adj:0." So the Python process had 29.5GB RSS (anonymous — heap memory) plus 1.2GB file-backed pages. On a 32GB machine, that's pretty much everything.

Wait — 48GB virtual memory on a 32GB machine? That's fine — virtual memory doesn't mean physical memory. The important number is anon-rss at 29.5GB. The Python process was using almost all physical RAM.
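Those dmesg fields are regular enough to pull apart programmatically. A minimal sketch, using the kill line from this incident as the sample input:

```python
import re

# The OOM-killer line from this incident, as it appears in dmesg.
line = ("Out of memory: Killed process 28451 (python3) total-vm:48229376kB, "
        "anon-rss:30982144kB, file-rss:1245696kB oom_score_adj:0")

def parse_oom_line(line):
    """Extract the memory fields (in kB) from an OOM kill message."""
    fields = re.findall(r"(total-vm|anon-rss|file-rss):(\d+)kB", line)
    return {name: int(value) for name, value in fields}

mem = parse_oom_line(line)
anon_gb = mem["anon-rss"] / (1024 * 1024)  # kB -> GB
print(f"anon-rss: {anon_gb:.1f} GB")  # anon-rss: 29.5 GB
```

Handy when you're correlating OOM kills across a fleet rather than reading one by hand.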

Let me check what else was running that competed for memory.

dmesg -T | grep -A 20 "Mem-Info" | head -30

Mental Model: Understanding the OOM Killer's Decision

The OOM killer doesn't randomly kill processes. It calculates an oom_score for each process based on RSS usage, oom_score_adj, and whether it's a root process. It kills the process with the highest score. To understand why YOUR process was killed, you need to know what the memory landscape looked like at the moment of the kill — not after.
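The kernel's badness heuristic (mm/oom_kill.c) boils down to: memory footprint normalized to per-mille of total memory, shifted by oom_score_adj. A simplified sketch, omitting page-table pages and other refinements the real kernel includes:

```python
def oom_badness(rss_pages, swap_pages, total_pages, oom_score_adj=0):
    """Simplified version of the kernel's badness score: memory footprint
    as a per-mille fraction of total memory, shifted by oom_score_adj."""
    points = (rss_pages + swap_pages) * 1000 // total_pages
    return max(points + oom_score_adj, 0)

# This incident: ~29.5GB anon RSS on a 32GB machine, 4kB pages, no swap in use.
rss_pages = 30982144 // 4            # anon-rss in kB -> 4kB pages
total_pages = 32 * 1024 * 1024 // 4  # 32GB in kB -> 4kB pages
print(oom_badness(rss_pages, 0, total_pages))  # 923 (out of a possible 1000)
```

This is why a process consuming nearly all RAM is almost always the one killed: nothing else on the box comes close to its score.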

The OOM event shows total memory: 32GB, free: 45MB, cached: 280MB. So at the moment of the kill, the machine was completely out of memory. The Python process was the biggest consumer by far, so the OOM killer targeted it. Makes sense.

Now, why is a 12GB CSV file causing 29.5GB of RAM usage? The data grew by 50% (8GB to 12GB), but that multiplier between file size and RSS isn't new: the job probably always consumed far more than the file size and just barely fit before. Let me look at the script.

cat /opt/etl/transform.py | head -40

df = pd.read_csv(sys.argv[1]) — they're loading the entire CSV into a pandas DataFrame. A 12GB CSV file in pandas will easily consume 3-5x the file size in memory because: (1) string columns create Python objects per cell, (2) pandas stores metadata, indexes, and type info, (3) any transformations create copies.

So 12GB CSV x 3-5x multiplier = 36-60GB memory. On a 32GB machine. That's the problem. It was probably marginal with the 8GB file too (24-40GB), but just barely fit.
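Claim (1) above is easy to demonstrate: every string cell becomes a full Python object, not just its bytes. A quick sanity check (the cell value is a made-up CSV field):

```python
import sys

cell = "2024-06-01,ORD-1029384,SHIPPED"  # a ~30-byte CSV field (hypothetical)
raw_bytes = len(cell.encode())           # what the file on disk stores: 30
as_object = sys.getsizeof(cell)          # object header + layout on top of that

print(raw_bytes, as_object)  # the in-memory object is roughly 2-3x the raw text
```

And that's before the 8-byte pointer per cell in the column's object array, or any copies made by transformations.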

Let me verify by checking if the job had swap available.

free -h
swapon --show

No swap configured. So once physical memory was exhausted, the OOM killer had no choice. If there had been swap, the job might have completed (slowly) instead of dying. But adding swap is a bandaid — the real fix is to process the data in chunks.

Let me think about what to recommend. The quick fix and the proper fix are different:

Quick fix: Add swap and increase the VM size. Gets the job running tonight. Proper fix: Rewrite to use chunked processing — pd.read_csv(..., chunksize=100000).

I'll do both. Let me set up swap first to unblock the team.

sudo fallocate -l 16G /swapfile
sudo chmod 600 /swapfile
sudo mkswap /swapfile
sudo swapon /swapfile
echo '/swapfile none swap sw 0 0' | sudo tee -a /etc/fstab

16GB swap gives us a 48GB total memory space. That should handle the current data size. But I'm also going to set vm.overcommit_memory to be explicit about what happens.

cat /proc/sys/vm/overcommit_memory

It's 0 (heuristic overcommit). That's fine for most cases. If I set it to 2 (strict), the kernel would refuse allocations beyond RAM+swap, which would cause the Python process to get a MemoryError instead of being OOM-killed. That's actually more useful — a MemoryError is catchable, an OOM kill is not.

Mental Model: Overcommit Modes

Linux has three overcommit modes: 0 (heuristic — kernel guesses), 1 (always — never refuse, YOLO), 2 (strict — refuse allocations beyond RAM+swap, based on overcommit_ratio). For batch processing servers, mode 2 with a sensible ratio is better because you get catchable errors instead of OOM kills. For web servers, mode 0 is usually fine because fork() temporarily doubles virtual memory and mode 2 would block it.

echo 'vm.overcommit_memory=2' | sudo tee -a /etc/sysctl.d/99-memory.conf
echo 'vm.overcommit_ratio=90' | sudo tee -a /etc/sysctl.d/99-memory.conf
sudo sysctl -p /etc/sysctl.d/99-memory.conf

That allows allocation up to 90% of RAM plus all swap: (32 × 0.9) + 16 = 44.8GB of commit limit. Should be enough for the current data size.
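Spelled out, this is the same arithmetic the kernel reports as CommitLimit in /proc/meminfo when mode 2 is active:

```python
def commit_limit_gb(ram_gb, swap_gb, overcommit_ratio):
    """CommitLimit under vm.overcommit_memory=2:
    swap + ram * overcommit_ratio / 100."""
    return swap_gb + ram_gb * overcommit_ratio / 100

print(round(commit_limit_gb(32, 16, 90), 1))  # 44.8
```

Worth checking /proc/meminfo after applying the sysctl to confirm the kernel agrees with your math.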

Now let me also look at what the Python script is actually doing with the data. Maybe the chunked processing fix is simpler than I think.

cat /opt/etl/transform.py

It reads the whole CSV, does a groupby, a merge with a reference table, and writes to the database. The groupby requires the whole dataset in memory, but the merge and write could be chunked. The core issue is the groupby — you can't group-by on a streaming chunk unless you use a different algorithm.

For the groupby use case, the proper approach is either: use dask (pandas-like but out-of-core), use SQLite as an intermediate store, or do the groupby in the database itself (load raw, aggregate in SQL). Let me recommend the database approach since they're loading into PostgreSQL anyway.
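For sum- and count-style aggregations specifically, there's also a middle path that needs neither dask nor the database: group each chunk, then combine the partial results. A sketch with a toy in-memory CSV (the column names are hypothetical, not from the real job):

```python
import io
import pandas as pd

# Tiny in-memory CSV standing in for the real 12GB file.
csv_data = io.StringIO(
    "region,amount\n"
    "east,10\neast,5\nwest,3\nwest,7\neast,1\n"
)

# Streaming group-by: partial per-chunk sums can be combined, so the
# whole file never has to fit in memory at once.
partial = None
for chunk in pd.read_csv(csv_data, chunksize=2):
    agg = chunk.groupby("region")["amount"].sum()
    partial = agg if partial is None else partial.add(agg, fill_value=0)

print({k: int(v) for k, v in partial.items()})  # {'east': 16, 'west': 10}
```

This works because sum is associative. For non-decomposable aggregations (median, nunique), you'd still need one of the approaches above.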

For tonight though, the swap should be enough. Let me re-run the job and monitor it.

sudo -u etluser /opt/etl/run.sh &
watch -n 5 'free -h; echo; ps -p $(pgrep -f transform.py) -o rss,vsz,%mem --no-headers 2>/dev/null || echo "Process not found"'

Mental Model: Monitor the Re-Run

Never re-run a failed job without watching it. You need to verify your fix works AND understand the actual memory profile. Is it a steady climb (leak) or a spike (peak allocation)? Does it stabilize or keep growing? This data shapes your long-term recommendation.
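When watch output isn't enough, the same numbers can be sampled from /proc directly (Linux only) and logged for later analysis. A minimal sketch that samples its own process, the way you'd sample the transform.py pid during the re-run:

```python
import os
import time

def rss_kb(pid):
    """Resident set size (VmRSS) in kB, read from /proc/<pid>/status (Linux)."""
    with open(f"/proc/{pid}/status") as f:
        for line in f:
            if line.startswith("VmRSS:"):
                return int(line.split()[1])
    return None  # kernel threads have no VmRSS line

samples = []
for _ in range(3):
    samples.append(rss_kb(os.getpid()))
    time.sleep(0.1)
print(samples)  # three RSS samples in kB; a steady climb suggests a leak
```

Timestamped samples like these are what let you say "spike at the merge step" instead of "it used a lot of memory."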

RSS climbing... 8GB... 15GB... 22GB... starting to use swap at 28GB... 32GB RSS, 8GB swap. It's working but grinding through swap. Let me check if it's making progress.

The job is processing. Slow due to swap, but it'll finish. Estimated 45 minutes instead of the usual 15. For tonight, that's acceptable. Tomorrow we resize the VM to 64GB and start the chunked processing rewrite.

What Made This Senior-Level

  • Junior would: See "OOM killed" and immediately increase the VM size. Senior does: Investigate the memory profile to understand WHY the job uses so much memory. Why: Throwing hardware at it works but hides the real problem (unbounded memory usage).
  • Junior would: Not read the dmesg OOM details. Senior does: Parse the OOM killer output to understand anon-rss, file-rss, and oom_score. Why: The kernel tells you exactly what happened — you just have to read it.
  • Junior would: Add swap and call it done. Senior does: Add swap as a stopgap, change the overcommit mode for better error handling, AND recommend a code change for the proper fix. Why: Each layer addresses a different failure mode.
  • Junior would: Re-run the job without monitoring. Senior does: Watch memory usage during the re-run to verify the fix and understand the profile. Why: You need to validate your fix and gather data for the long-term solution.

Key Heuristics Used

  1. Read the OOM Log: The kernel logs the exact memory state at the moment of the kill — RSS, virtual size, oom_score, and system-wide memory statistics. This is your primary diagnostic data.
  2. Overcommit Mode Matters: Mode 0 (default) allows overcommit but OOM-kills. Mode 2 (strict) refuses allocations, giving catchable errors. Choose based on workload type.
  3. Quick Fix Then Proper Fix: Add swap and resize to unblock today. Rewrite to use chunked processing or in-database aggregation for the long-term solution.

Cross-References

  • Primer — Virtual memory, page cache, swap, and the OOM killer fundamentals
  • Street Ops — Memory debugging commands and OOM analysis workflows
  • Footguns — pandas loading entire files into memory, no swap on worker VMs, and default overcommit behavior