Linux Memory Management — Footguns

[!WARNING] These mistakes cause OOM kills, cascading slowness, and mysterious latency. Every item here has taken down production systems or created problems that took days to diagnose.


1. Disabling swap entirely without understanding the consequences

You read that Kubernetes requires swap off, or that "swap is bad for databases," so you disable it on everything. Now when a memory spike hits, instead of gracefully swapping out idle pages, the OOM killer fires immediately and takes down your most important process.

# What you did:
swapoff -a
# Removed swap entries from /etc/fstab

# What happens when memory fills:
# Without swap: OOM killer fires immediately. No buffer.
# With swap: Idle pages get swapped, buying time to investigate.

# When disabling swap is correct:
# - Kubernetes nodes (kubelet traditionally requires swap off; 1.28+ adds beta swap support)
# - Systems where latency from swap is worse than a restart

# When disabling swap is wrong:
# - General-purpose servers
# - Batch processing systems that have occasional memory spikes
# - Dev machines

# Compromise — keep swap but use it minimally:
sysctl vm.swappiness=10    # Swap only under real pressure
# And monitor swap usage:
swapon --show
free -h | grep Swap
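Swap being occupied is not the same as swap being a problem; what hurts is sustained swap traffic. A minimal sketch that samples the cumulative pswpin/pswpout counters in /proc/vmstat to tell the two apart:

```shell
# Sample swap activity over one second. pswpin/pswpout are cumulative
# counts of pages swapped in/out since boot.
read_swap_io() {
    awk '$1 == "pswpin" || $1 == "pswpout" {sum += $2} END {print sum + 0}' /proc/vmstat
}

before=$(read_swap_io)
sleep 1
after=$(read_swap_io)
echo "pages swapped in+out over 1s: $((after - before))"
# Near zero: swap is just holding idle pages, harmless.
# Consistently high: the box is actively thrashing; investigate now.
```

Run it during a suspected incident; a steady nonzero delta means real memory pressure, not merely "swap is in use."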

2. Setting overcommit_memory=2 in the wrong context

Strict overcommit (vm.overcommit_memory=2) means the kernel never promises more memory than the commit limit (swap + RAM * overcommit_ratio / 100). Sounds safe. But many applications, especially anything that calls fork(), will fail to allocate memory even when the system has plenty free. Redis background saves, Apache/Nginx worker forks, and any process that uses fork+exec can break.

# The problem:
sysctl vm.overcommit_memory=2
sysctl vm.overcommit_ratio=50

# A 4GB RSS Redis process tries to fork for BGSAVE.
# fork() must commit ~4GB of address space for the child
# (copy-on-write shares the pages, but strict accounting charges them anyway).
# With strict overcommit, the kernel refuses: "Cannot allocate memory"

# Redis log:
# Can't save in background: fork: Cannot allocate memory

# Fix for Redis specifically:
sysctl vm.overcommit_memory=1    # Redis recommendation

# For most servers, the default is fine:
sysctl vm.overcommit_memory=0    # Heuristic overcommit (default)

# If you must use mode=2, increase the ratio:
sysctl vm.overcommit_ratio=80    # Commit limit = swap + 80% of RAM

# Check how close you are to the commit limit:
grep -E "CommitLimit|Committed_AS" /proc/meminfo
# CommitLimit:    50000000 kB
# Committed_AS:   30000000 kB   <-- if this exceeds CommitLimit, new allocs fail
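Those two numbers fold into a single headroom figure. A sketch (the percentage is only a hard ceiling under mode 2, but it is worth graphing in any mode):

```shell
# Percentage of the commit limit currently promised out.
awk '/^CommitLimit:/  {limit = $2}
     /^Committed_AS:/ {used  = $2}
     END {printf "committed: %.0f%% of CommitLimit\n", used * 100 / limit}' /proc/meminfo
```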

3. Ignoring cgroup memory limits

You set resources.limits.memory=512Mi on a Kubernetes pod or MemoryMax=512M in a systemd unit. The app allocates 600MB. You expect the kernel to reject the allocation. Instead, the virtual allocation succeeds; when the app actually touches the pages and the cgroup's usage crosses the limit, the kernel first tries to reclaim, and if that fails the OOM killer terminates a process inside the cgroup. Cgroup limits don't prevent allocation; they trigger OOM when usage exceeds them.

# Common misunderstanding:
# "I set a 512MB limit, so the app can't use more than 512MB"
# Reality: the app CAN allocate more. When it tries to USE it, OOM fires.

# What actually helps — use memory.high as a soft limit:
# systemd:
# [Service]
# MemoryHigh=400M    # Throttle at 400MB (slow down, don't kill)
# MemoryMax=512M     # Hard kill at 512MB

# Kubernetes:
# resources:
#   requests:
#     memory: "400Mi"   # Scheduling hint
#   limits:
#     memory: "512Mi"   # OOM kill threshold

# Monitor cgroup usage BEFORE it hits the limit:
cat /sys/fs/cgroup/.../memory.current
cat /sys/fs/cgroup/.../memory.max
cat /sys/fs/cgroup/.../memory.events
# Check 'high' and 'max' event counts — they tell you how often
# the cgroup hit its limits.

# The kernel memory trap:
# Kernel memory (page tables, slab, socket buffers) counts toward the limit.
# An app that opens thousands of network connections can hit its limit
# from kernel memory alone, not application heap.
grep -E "kernel|slab|sock|pagetables" /sys/fs/cgroup/.../memory.stat
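A small helper can turn those raw files into an early warning before the OOM killer acts. This is a sketch: the function name and the 80% threshold are arbitrary choices, and it assumes cgroup v2 file names:

```shell
# check_cgroup_mem: warn when a cgroup v2's usage crosses 80% of memory.max.
# Pass the cgroup directory, e.g. /sys/fs/cgroup/system.slice/myapp.service
check_cgroup_mem() {
    cur=$(cat "$1/memory.current") || return 1
    max=$(cat "$1/memory.max")     || return 1
    if [ "$max" = "max" ]; then
        echo "no hard limit set on $1"
        return 0
    fi
    pct=$((cur * 100 / max))
    echo "usage: ${pct}% of memory.max"
    if [ "$pct" -ge 80 ]; then
        echo "WARNING: approaching limit (OOM kill risk)"
    fi
}
```

Wire it into whatever runs your periodic checks; the point is to alert on the trend, not on the kill.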

4. Misreading free output — panicking about low "free" memory

The free column in free -h shows truly unused memory. On a healthy Linux system, this is low — because the kernel uses free memory for page cache, which speeds up disk reads. Low "free" is normal. "Available" is what matters.

free -h
#               total   used   free   shared  buff/cache  available
# Mem:           64G    12G    2G     256M    50G         48G

# WRONG reaction: "Only 2G free! We need more RAM!"
# RIGHT reading: "48G available. The 50G in buff/cache is reclaimable. We're fine."

# When to actually worry:
# - 'available' < 10% of total
# - 'available' declining steadily over time
# - Swap 'used' growing while 'available' shrinks

# This confusion leads to:
# - Buying more RAM unnecessarily
# - Adding monitoring alerts that fire constantly on healthy systems
# - Running 'echo 3 > /proc/sys/vm/drop_caches' in cron (terrible idea)

# Correct monitoring threshold:
# Alert on: (available / total) < 10%
# Not on: (free / total) < some_threshold
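That threshold as a runnable check, a sketch that reads MemAvailable straight from /proc/meminfo instead of parsing free output:

```shell
# Alert on low MemAvailable; ignore MemFree entirely.
awk '/^MemTotal:/     {total = $2}
     /^MemAvailable:/ {avail = $2}
     END {
         pct = avail * 100 / total
         printf "available: %.0f%% of total\n", pct
         if (pct < 10) print "ALERT: genuine memory pressure"
     }' /proc/meminfo
```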

5. Killing the wrong process after OOM

The OOM killer log says "Killed process 12345 (java)." You see Java is eating all the memory and kill all Java processes. But the Java process was actually your mission-critical application server. The OOM was triggered by a runaway log parser that allocated 50GB, but the kernel killed Java because it had the highest OOM score. You just made the outage worse.

# The OOM killer doesn't necessarily kill the process that caused
# the memory exhaustion. It kills the process with the highest
# oom_score (which considers RSS, oom_score_adj, and other factors).

# Read the FULL OOM dump, not just "Killed process X":
dmesg -T | grep -B20 "Killed process"
# Look at the process table dump — which process had the most RSS?
# Which process was growing?

# Check oom_score for running processes:
for pid in /proc/[0-9]*/; do
    p=$(basename "$pid")
    score=$(cat "${pid}oom_score" 2>/dev/null)
    name=$(cat "${pid}comm" 2>/dev/null)
    [ -n "$score" ] && [ "$score" -gt 100 ] && echo "$score $p $name"
done | sort -rn | head -10

# Protect critical processes:
# In the systemd unit:
# [Service]
# OOMScoreAdjust=-900

# Make the known-leaky process die first (assumes pgrep matches exactly one PID):
echo 1000 > /proc/$(pgrep -x log-parser)/oom_score_adj

6. Transparent Huge Pages (THP) latency spikes in databases

THP is enabled by default on most Linux distributions. It causes the kernel to automatically use 2MB pages instead of 4KB pages. This is great for throughput-oriented workloads but causes unpredictable latency spikes for databases (Redis, MongoDB, PostgreSQL) because the kernel must find contiguous 2MB regions, triggering memory compaction that stalls allocations.

# Check THP status:
cat /sys/kernel/mm/transparent_hugepage/enabled
# [always] madvise never
# 'always' = THP active for all processes (bad for databases)

# Check compaction stalls:
grep compact_stall /proc/vmstat
# compact_stall 8234   <-- processes stalled waiting for compaction

# Symptoms:
# - Redis: periodic 50-200ms latency spikes
# - MongoDB: write pauses every few seconds
# - PostgreSQL: random query latency spikes under load

# Fix — disable THP:
echo never > /sys/kernel/mm/transparent_hugepage/enabled
echo never > /sys/kernel/mm/transparent_hugepage/defrag

# Make persistent via kernel parameter (in /etc/default/grub):
# GRUB_CMDLINE_LINUX="transparent_hugepage=never"
# Then: update-grub && reboot

# Or via systemd unit:
cat <<'EOF' > /etc/systemd/system/disable-thp.service
[Unit]
Description=Disable Transparent Huge Pages
DefaultDependencies=no
After=sysinit.target local-fs.target
Before=basic.target

[Service]
Type=oneshot
ExecStart=/bin/sh -c 'echo never > /sys/kernel/mm/transparent_hugepage/enabled && echo never > /sys/kernel/mm/transparent_hugepage/defrag'

[Install]
WantedBy=basic.target
EOF
systemctl daemon-reload && systemctl enable disable-thp.service

7. Not monitoring memory.swap.current in cgroups

You set memory limits on containers but don't monitor swap usage inside the cgroup. The container hits its memory limit, starts swapping, and performance collapses. The container doesn't get OOM-killed because it's technically within limits (swap counts separately unless memory.swap.max=0).

# Check swap usage in a cgroup:
cat /sys/fs/cgroup/.../memory.swap.current
# 2147483648   <-- 2GB in swap! Performance is terrible.

cat /sys/fs/cgroup/.../memory.swap.max
# max          <-- unlimited swap! Should probably be limited.

# The problem: container appears "healthy" (not OOM-killed)
# but is swapping heavily and extremely slow.

# Fix: limit swap in the cgroup:
echo 0 > /sys/fs/cgroup/.../memory.swap.max    # No swap allowed

# Kubernetes:
# Set memory limit = memory request to prevent swap:
# resources:
#   requests:
#     memory: "512Mi"
#   limits:
#     memory: "512Mi"

# Docker:
docker run --memory=512m --memory-swap=512m myapp
# --memory-swap equal to --memory means no swap

# Monitor both:
echo "Memory: $(cat /sys/fs/cgroup/.../memory.current) / $(cat /sys/fs/cgroup/.../memory.max)"
echo "Swap:   $(cat /sys/fs/cgroup/.../memory.swap.current) / $(cat /sys/fs/cgroup/.../memory.swap.max)"

8. Setting vm.min_free_kbytes too low or too high

vm.min_free_kbytes controls the minimum amount of memory the kernel keeps free for emergency allocations (network packets, interrupt handlers, etc.). Too low: the system can deadlock under memory pressure because the kernel can't allocate memory to free memory. Too high: you waste RAM that applications could use, and the kernel becomes overly aggressive about reclaiming.

# Check current value:
cat /proc/sys/vm/min_free_kbytes
# 67584   (66MB — default on a 64GB system)

# Too low (causes issues):
sysctl vm.min_free_kbytes=1024    # 1MB — dangerously low
# Risk: network driver can't allocate buffers -> dropped packets
# Risk: kernel can't allocate memory to perform reclaim -> deadlock

# Too high (wastes memory):
sysctl vm.min_free_kbytes=4194304  # 4GB — way too high on a 64GB system
# This forces the kernel to keep 4GB free at all times
# Triggers aggressive reclaim, more swapping, worse cache performance

# Reasonable range:
# 64GB RAM:  65536-262144  (64MB-256MB)
# 16GB RAM:  32768-131072  (32MB-128MB)
# Increase if you see network packet drops under memory pressure

# For 10Gbps+ networking, increase:
sysctl vm.min_free_kbytes=262144   # 256MB

# Persist:
echo "vm.min_free_kbytes=262144" >> /etc/sysctl.d/99-memory.conf

9. Dropping page cache in production on a schedule

You add echo 3 > /proc/sys/vm/drop_caches to a cron job because you want to "free up memory." Every time it runs, the system dumps all page cache. The database has to re-read every file from disk. I/O spikes. Latency spikes. Throughput drops. The cache rebuilds over minutes, then you drop it again.

# WRONG — never put this in cron:
# */30 * * * * sync && echo 3 > /proc/sys/vm/drop_caches

# Page cache is SUPPOSED to use all free memory. That's the design.
# "Free" memory is wasted memory — the kernel keeps it as cache for speed.

# The only legitimate reasons to drop caches:
# 1. Benchmarking (need cold cache for accurate measurements)
# 2. One-time emergency (need to free memory NOW)
# 3. Debugging (investigating memory behavior)

# If you're running out of memory, the real fixes are:
# - Find and fix the memory leak
# - Add more RAM
# - Set cgroup limits on greedy processes
# - Tune vm.swappiness and vm.vfs_cache_pressure

10. Misunderstanding OOM killer protection (-1000)

You protect everything important with OOMScoreAdjust=-1000. Database: -1000. App server: -1000. Cache: -1000. When OOM fires, the kernel has nothing it's "allowed" to kill, so it kills sshd, systemd-journald, or something else critical to system management. Now you can't SSH in to fix the problem.

# Bad: protecting everything
# [Service]
# OOMScoreAdjust=-1000    # on database
# OOMScoreAdjust=-1000    # on app server
# OOMScoreAdjust=-1000    # on cache
# OOMScoreAdjust=-1000    # on worker

# The OOM killer MUST kill something. If all the big processes
# are protected, it kills small ones — sshd, cron, logging.
# Now you can't manage the server.

# Better: create a priority order
# [Service] for database:     OOMScoreAdjust=-900    (last to die)
# [Service] for app server:   OOMScoreAdjust=-500    (protected but expendable)
# [Service] for cache:        OOMScoreAdjust=0       (default — can be killed)
# [Service] for batch worker: OOMScoreAdjust=500     (kill this first)

# Protect SSH access:
# [Service] for sshd:         OOMScoreAdjust=-900    (always keep management access)

# Verify what will die first:
for pid in /proc/[0-9]*/; do
    p=$(basename "$pid")
    score=$(cat "${pid}oom_score" 2>/dev/null)
    adj=$(cat "${pid}oom_score_adj" 2>/dev/null)
    name=$(cat "${pid}comm" 2>/dev/null)
    rss=$(awk '/VmRSS/{print $2}' "${pid}status" 2>/dev/null)
    [ -n "$score" ] && [ "${rss:-0}" -gt 0 ] && \
        printf "%5s %5s %8s kB  %s\n" "$score" "$adj" "$rss" "$name"
done | sort -rn | head -20

11. Memory oversubscription in virtual environments

You provision 10 VMs with 8GB each on a host with 64GB RAM, thinking "they won't all use 8GB at once." They do. The hypervisor starts ballooning or swapping at the host level. Every VM slows to a crawl. The guest OS inside each VM has no idea why — free shows plenty of memory, but every memory access is slow because the hypervisor is paging.

# Symptoms inside the VM:
# - High steal time in top/vmstat (st column)
# - Processes are slow but CPU/memory metrics look normal
# - High %wa (I/O wait) with no obvious disk I/O

vmstat 1 5
# Look at 'st' column — stolen time from hypervisor
#  r  b   swpd   free   buff  cache   si   so    bi    bo   in   cs us sy id wa st
#  2  0      0 2048000 32768 4096000   0    0    50   100  500  800 10  2 70  3 15
#                                                                              ^^
#                                                            15% stolen!

# From the hypervisor, check memory overcommit:
# VMware: esxtop, check MCTLSZ (balloon) and SWCUR (swap)
# KVM: virsh domstats <vm> | grep balloon
# Proxmox: pvesh get /nodes/<node>/status

# Fix: don't oversubscribe memory on production hypervisors.
# Or use memory ballooning with monitoring and alerts.
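The same steal-time check without vmstat, as a sketch that samples the aggregate cpu line of /proc/stat (the 8th value after the label is steal time):

```shell
# Measure steal percentage over a one-second window.
# /proc/stat "cpu" fields: user nice system idle iowait irq softirq steal ...
read_cpu() { awk '$1 == "cpu" {print $9, $2+$3+$4+$5+$6+$7+$8+$9}' /proc/stat; }

set -- $(read_cpu); steal1=$1; total1=$2
sleep 1
set -- $(read_cpu); steal2=$1; total2=$2

awk -v ds="$((steal2 - steal1))" -v dt="$((total2 - total1))" 'BEGIN {
    if (dt > 0) printf "steal: %.1f%%\n", ds * 100 / dt
}'
# Anything consistently above a few percent means the hypervisor
# is taking CPU away from this VM.
```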

12. Forgetting that tmpfs counts against memory

tmpfs (used for /tmp, /dev/shm, /run) stores data in RAM. Writing large files to tmpfs consumes physical memory. If an application dumps 10GB of temp files to a tmpfs mount, that's 10GB less memory for everything else.

# Check tmpfs usage:
df -h -t tmpfs
# Filesystem      Size  Used Avail Use% Mounted on
# tmpfs            32G   15G   17G  47% /dev/shm
# tmpfs            32G  1.5G   31G   5% /tmp

# 15GB used in /dev/shm — that's 15GB of RAM!

# Find what's in there:
du -sh /dev/shm/* 2>/dev/null | sort -rh | head -10
du -sh /tmp/* 2>/dev/null | sort -rh | head -10

# Limit tmpfs size:
# In /etc/fstab:
tmpfs /tmp tmpfs defaults,size=2G,noexec,nosuid 0 0
tmpfs /dev/shm tmpfs defaults,size=4G 0 0

# Or remount with a size limit:
mount -o remount,size=2G /tmp

# Common culprits:
# - Postgres temp files during large sorts
# - Application log files written to /tmp
# - Shared memory segments (posix shm) in /dev/shm
# - Docker's tmpfs mounts inside containers
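To get a single number for how much RAM tmpfs is holding right now, a sketch using GNU df (the --output flag is coreutils-specific):

```shell
# Sum "used" kilobytes across all tmpfs mounts.
df -k -t tmpfs --output=used 2>/dev/null |
awk 'NR > 1 {sum += $1} END {printf "tmpfs total: %d MB of RAM\n", sum / 1024}'
```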