Linux Memory Management — Street Ops¶
Real-world operational scenarios for memory problems. These are the situations that wake you up at 3 AM: OOM kills, memory leaks, swap storms, and NUMA imbalances.
Diagnosing OOM Kills¶
Scenario: Your application keeps getting killed and restarting¶
Step 1: Confirm it was an OOM kill
# Check dmesg for OOM messages
$ dmesg -T | grep -i "out of memory\|oom\|killed process"
[Thu Mar 19 03:22:15 2026] Out of memory: Killed process 12345 (java) total-vm:8388608kB, anon-rss:4194304kB, file-rss:32768kB, shmem-rss:0kB, UID:1000 pgtables:8192kB oom_score_adj:0
# Check journal for OOM events
$ journalctl -k --grep="oom|killed process" --since "24 hours ago"
# Get the full OOM dump (shows what triggered it and memory state)
$ dmesg -T | grep -A 30 "invoked oom-killer"
Step 2: Understand the OOM dump
# Key lines from the OOM dump:
Mar 19 03:22:15 server01 kernel: java invoked oom-killer: gfp_mask=0x6200ca(GFP_HIGHUSER_MOVABLE), order=0, oom_score_adj=0
Mar 19 03:22:15 server01 kernel: Mem-Info:
Mar 19 03:22:15 server01 kernel: active_anon:1048576 inactive_anon:524288 ...
Mar 19 03:22:15 server01 kernel: Node 0 Normal free:2048kB min:16384kB low:20480kB high:24576kB
Mar 19 03:22:15 server01 kernel: oom-kill:constraint=CONSTRAINT_NONE,nodemask=(null),cpuset=/,...
Mar 19 03:22:15 server01 kernel: [ pid ] uid tgid total_vm rss pgtables_bytes swapents oom_score_adj name
Mar 19 03:22:15 server01 kernel: [12345] 1000 12345 8388608 4194304 8388608 0 0 java
Mar 19 03:22:15 server01 kernel: [ 567] 0 567 45678 12345 98304 0 -1000 sshd
Mar 19 03:22:15 server01 kernel: Out of memory: Killed process 12345 (java) total-vm:8388608kB, anon-rss:4194304kB
Key things to extract:

- Which process was killed and its memory usage (RSS)
- What triggered the kill (the process that made the failing allocation request)
- System memory state at the time (free, active, inactive)
- oom_score_adj — was the process protected?
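That extraction can be scripted for recurring incidents; a minimal sketch (the `oom_summary` helper name is made up, and the field positions assume the dump format shown above):

```shell
# Hypothetical helper: summarize OOM kills from dmesg-style input on stdin.
# Prints "<pid> (<name>) anon-rss:<kB>" per kill, per the dump format above.
oom_summary() {
  awk '/Out of memory: Killed process/ {
    for (i = 1; i <= NF; i++) {
      if ($i == "process") { pid = $(i + 1); name = $(i + 2) }
      if ($i ~ /^anon-rss:/) { rss = $i; sub(/,$/, "", rss) }
    }
    print pid, name, rss
  }'
}

# Usage: dmesg -T | oom_summary
```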
Step 3: Prevent recurrence
# Option A: Give the system more memory/swap
$ sudo fallocate -l 4G /swapfile2
$ sudo chmod 600 /swapfile2
$ sudo mkswap /swapfile2
$ sudo swapon /swapfile2
# Option B: Limit the application's memory (cgroups/systemd)
$ sudo systemctl edit myapp.service
# Add:
[Service]
MemoryMax=4G
MemoryHigh=3G
# Option C: Protect critical services from OOM
$ sudo systemctl edit sshd.service
[Service]
OOMScoreAdjust=-900
# Option D: Fix the application's memory leak (see below)
Tuning Swappiness¶
Scenario: Database server has high latency during memory pressure¶
# Check current swappiness
$ cat /proc/sys/vm/swappiness
60 # Default — too aggressive for database servers
# Check if the system is actively swapping
$ vmstat 1 5
procs -----------memory---------- ---swap-- -----io---- -system-- ------cpu-----
r b swpd free buff cache si so bi bo in cs us sy id wa st
2 1 524288 65536 32768 2097152 256 512 1024 2048 1500 3000 30 5 50 15 0
# ^^^ ^^^
# si/so > 0 means active swapping
# Check swap usage per process
$ for pid in /proc/[0-9]*; do
p=$(basename $pid)
swap=$(awk '/Swap:/ {sum+=$2} END {print sum}' "$pid/smaps" 2>/dev/null)
[ "$swap" -gt 0 ] 2>/dev/null && echo "$swap kB PID=$p $(cat $pid/comm 2>/dev/null)"
done | sort -rn | head -10
# Tune swappiness
$ sudo sysctl vm.swappiness=10
# Make permanent
$ echo "vm.swappiness=10" | sudo tee /etc/sysctl.d/60-swappiness.conf
$ sudo sysctl --system
# For databases, also consider:
$ sudo sysctl vm.vfs_cache_pressure=50 # Less aggressive inode/dentry reclaim
Monitoring Memory Pressure (PSI)¶
Scenario: Need an early warning system for memory problems¶
# Check current memory pressure
$ cat /proc/pressure/memory
some avg10=2.50 avg60=1.20 avg300=0.80 total=98765432
full avg10=0.10 avg60=0.05 avg300=0.02 total=1234567
# Interpretation:
# some avg10=2.50 → 2.5% of the last 10s, at least one task was stalled on memory
# full avg10=0.10 → 0.1% of the last 10s, ALL tasks were stalled
# Warning thresholds: some > 10%, full > 1% indicates significant pressure
# Per-cgroup pressure (containers)
$ cat /sys/fs/cgroup/system.slice/docker-abc123.scope/memory.pressure
some avg10=15.00 avg60=8.50 avg300=3.20 total=456789012
# This specific container is under memory pressure
# Monitor in real-time
$ watch -n 1 'echo "=== System ===" && cat /proc/pressure/memory && echo && echo "=== CPU ===" && cat /proc/pressure/cpu && echo && echo "=== IO ===" && cat /proc/pressure/io'
# Set up PSI-based monitoring trigger (kernel 5.2+)
# This creates a file descriptor that triggers when pressure exceeds threshold
# Useful for custom monitoring scripts
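A true PSI trigger registers a threshold by writing to the pressure file and waiting with poll(2), which needs more than shell. As a polling approximation, the avg10 values can simply be sampled; a minimal sketch (the `psi_check` helper is made up, thresholds mirror the rule of thumb above):

```shell
# Sample avg10 from a pressure file and warn past the rule-of-thumb thresholds.
psi_check() {  # usage: psi_check [pressure-file]
  awk -v some_max=10 -v full_max=1 '
    /^some/ { split($2, a, "="); some = a[2] }
    /^full/ { split($2, a, "="); full = a[2] }
    END {
      bad = 0
      if (some + 0 > some_max) { print "WARN some avg10=" some; bad = 1 }
      if (full + 0 > full_max) { print "WARN full avg10=" full; bad = 1 }
      exit bad
    }' "${1:-/proc/pressure/memory}"
}

# psi_check || notify "memory pressure high"   # hypothetical notify hook
```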
Finding Memory Leaks¶
Scenario: Application RSS keeps growing, suspected memory leak¶
# Track RSS growth over time
$ while true; do
rss=$(ps -o rss= -p 12345)
echo "$(date +%H:%M:%S) RSS: ${rss} kB"
sleep 60
done | tee /tmp/rss-track.log
# Get detailed memory map
$ pmap -x 12345 | sort -k2 -rn | head -20
# Look for anonymous mappings that are growing
# Detailed view with smaps
$ cat /proc/12345/smaps | awk '
/^[0-9a-f]/ { region=$0 }
/^Rss:/ { rss=$2; if(rss > 10240) print rss " kB " region }
' | sort -rn | head -20
# Compare smaps over time to find growing regions
$ cat /proc/12345/smaps_rollup
Rss: 2097152 kB
Pss: 2000000 kB
Shared_Clean: 65536 kB
Shared_Dirty: 4096 kB
Private_Clean: 16384 kB
Private_Dirty: 2011136 kB # <-- High private dirty = heap growth
Swap: 0 kB
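The over-time comparison can be automated; a minimal sketch (the `rss_delta` helper name is made up; it assumes `smaps_rollup` is available, i.e. kernel 4.14+):

```shell
# Snapshot Private_Dirty from smaps_rollup twice; growth between
# snapshots usually means heap growth, per the note above.
rss_delta() {  # usage: rss_delta <pid> [interval-seconds]
  pid=$1 interval=${2:-60}
  a=$(awk '/^Private_Dirty:/ {print $2}' "/proc/$pid/smaps_rollup")
  sleep "$interval"
  b=$(awk '/^Private_Dirty:/ {print $2}' "/proc/$pid/smaps_rollup")
  echo "Private_Dirty: $a kB -> $b kB (delta $((b - a)) kB)"
}

# rss_delta 12345 300   # compare over 5 minutes
```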
# For Java apps, use jmap/jcmd
$ jcmd 12345 GC.heap_info
$ jmap -histo 12345 | head -20
# For native leaks, use valgrind (development) or heaptrack
$ heaptrack -p 12345
# Let it run for a while, then analyze
$ heaptrack_gui heaptrack.12345.*.zst
# Quick check: is it actually leaked or just cached?
$ cat /proc/12345/status | grep -E "VmRSS|VmSwap|VmData|VmStk"
VmRSS: 2097152 kB
VmSwap: 0 kB
VmData: 4194304 kB # <-- Virtual data segment (heap)
VmStk: 8192 kB
# If VmData >> VmRSS, the process allocated a lot but isn't using it all
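That comparison reduces to a single ratio; a minimal sketch (`vm_ratio` is a made-up helper name):

```shell
# How much of the data segment (heap) is actually resident?
# A ratio well above 1x points at allocated-but-untouched (or
# allocator-retained) address space rather than a classic leak.
vm_ratio() {  # usage: vm_ratio <pid>
  awk '/^VmRSS:/ {rss = $2} /^VmData:/ {data = $2}
       END { if (rss) printf "VmData/VmRSS = %.1fx (%d kB / %d kB)\n",
                             data / rss, data, rss }' "/proc/$1/status"
}
```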
cgroup Memory Debugging in Containers¶
Scenario: Container keeps getting OOMKilled despite having "enough" memory¶
# Check container's memory limit and current usage
$ docker stats --no-stream mycontainer
CONTAINER ID NAME CPU % MEM USAGE / LIMIT MEM %
abc123 mycontainer 15.2% 480MiB / 512MiB 93.75%
# Detailed cgroup stats
# First find the container's cgroup path. On cgroup v1, grep the memory
# controller line in /proc/<pid>/cgroup; on cgroup v2 there is a single 0:: entry:
$ docker inspect mycontainer --format '{{.State.Pid}}'
12345
$ cat /proc/12345/cgroup
0::/system.slice/docker-abc123.scope
$ cat /sys/fs/cgroup/system.slice/docker-abc123.scope/memory.current
503316480 # ~480 MB
$ cat /sys/fs/cgroup/system.slice/docker-abc123.scope/memory.max
536870912 # 512 MB limit
# Check what's using the memory INSIDE the cgroup
$ cat /sys/fs/cgroup/system.slice/docker-abc123.scope/memory.stat
anon 450000000 # Anonymous (heap) memory
file 50000000 # Page cache
kernel 3000000 # Kernel memory (slab, page tables, etc.)
slab 2000000 # Slab allocations
sock 500000 # Socket buffers
# ... more fields
# The kernel memory counts toward the limit too!
# anon + file + kernel = total → if close to limit, OOM risk
# Check for OOM events in this cgroup
$ cat /sys/fs/cgroup/system.slice/docker-abc123.scope/memory.events
low 0
high 15 # Hit memory.high 15 times (throttled)
max 3 # Usage hit memory.max 3 times (reclaim forced at the limit)
oom 2 # OOM killer invoked 2 times
oom_kill 2 # Processes killed 2 times
oom_group_kill 0
# Fix: increase the limit or reduce usage
$ docker update --memory=1g --memory-swap=1g mycontainer
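A cheap alert on new OOM kills can diff the `oom_kill` counter; a minimal sketch (the cgroup path is the example scope from above, and the watch loop is shown commented out):

```shell
# Parse the oom_kill counter from a cgroup v2 memory.events file.
oom_kills() { awk '$1 == "oom_kill" {print $2}' "$1"; }

# Hypothetical watch loop against the example container's scope:
# events=/sys/fs/cgroup/system.slice/docker-abc123.scope/memory.events
# last=$(oom_kills "$events")
# while sleep 30; do
#   now=$(oom_kills "$events")
#   [ "$now" -gt "$last" ] && echo "ALERT: $((now - last)) new OOM kill(s)"
#   last=$now
# done
```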
Common gotcha: kernel memory eating into the container limit¶
# Containers with many small files or network connections
# can have high kernel (slab) memory usage
$ cat /sys/fs/cgroup/system.slice/docker-abc123.scope/memory.stat | grep kernel
kernel 52428800 # 50 MB of kernel memory!
# This counts toward the memory.max limit
# If your app uses 460 MB and kernel uses 50 MB, you need > 510 MB limit
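To see how close the anon + file + kernel split is to the limit, the stat file can be summarized; a minimal sketch (`mem_breakdown` is a made-up helper; it assumes `memory.max` holds a number rather than the string `max`):

```shell
# Print anon/file/kernel as a share of the cgroup's memory.max.
mem_breakdown() {  # usage: mem_breakdown <memory.stat> <limit-bytes>
  awk -v max="$2" '
    $1 == "anon" || $1 == "file" || $1 == "kernel" {
      printf "%-7s %11d bytes (%4.1f%% of limit)\n", $1, $2, $2 * 100 / max
    }' "$1"
}

# cg=/sys/fs/cgroup/system.slice/docker-abc123.scope
# mem_breakdown "$cg/memory.stat" "$(cat "$cg/memory.max")"
```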
NUMA Imbalance Troubleshooting¶
Scenario: Database performance degrades on a multi-socket server¶
# Check NUMA topology
$ numactl --hardware
available: 2 nodes (0-1)
node 0 cpus: 0 1 2 3 4 5 6 7 16 17 18 19 20 21 22 23
node 0 size: 65536 MB
node 0 free: 8192 MB
node 1 cpus: 8 9 10 11 12 13 14 15 24 25 26 27 28 29 30 31
node 1 size: 65536 MB
node 1 free: 45056 MB # <-- Huge imbalance! Node 1 is barely used
# Check NUMA stats
$ numastat
node0 node1
numa_hit 89012345 12345678
numa_miss 0 5678901 # <-- Remote allocations!
numa_foreign 5678901 0
# Check per-process NUMA allocation
$ numastat -p $(pgrep postgres)
Per-node process memory usage (in MBs) for PID 12345 (postgres)
Node 0 Node 1 Total
--------------- --------------- ---------------
Huge 0.00 0.00 0.00
Heap 2048.00 128.00 2176.00
Stack 0.12 0.00 0.12
Private 50000.00 5000.00 55000.00
# Almost all memory on Node 0 — if postgres CPUs span both nodes, remote access is slow
# Fix: bind the database to one NUMA node
$ sudo numactl --cpunodebind=0 --membind=0 /usr/bin/postgres ...
# Or in systemd:
$ sudo systemctl edit postgresql.service
[Service]
ExecStart=
ExecStart=/usr/bin/numactl --cpunodebind=0 --membind=0 /usr/bin/postgres -D /var/lib/postgresql/data
# For a more balanced approach with interleaving:
$ sudo numactl --interleave=all /usr/bin/postgres ...
Clearing Page Cache Safely¶
Gotcha: High `buff/cache` in `free -h` is not a memory problem. Linux deliberately uses free RAM for page cache because unused RAM is wasted RAM. The `available` column is what matters -- it shows how much memory can be reclaimed for applications. Operators who routinely drop caches "to free memory" are actively hurting performance.
Scenario: Need to benchmark disk I/O without page cache interference¶
# ALWAYS sync first to flush dirty pages to disk
$ sync
# Drop page cache only (safest)
$ echo 1 | sudo tee /proc/sys/vm/drop_caches
# Drop dentries and inodes too
$ echo 2 | sudo tee /proc/sys/vm/drop_caches
# Drop everything
$ echo 3 | sudo tee /proc/sys/vm/drop_caches
# Verify
$ free -h
# buff/cache should be much lower now
When it's safe:

- Benchmarking (need a cold cache)
- Investigating memory usage without cache noise
- Emergency: the system is thrashing and you need to free memory fast
When it's NOT safe:

- Routinely in production (the cache will rebuild, causing an I/O storm)
- As a "fix" for high memory usage (the cache is supposed to use memory)
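When benchmarking, it can be worth recording how much cache a drop actually reclaims; a minimal sketch (`cache_kb` is a made-up helper; the drop itself needs root, so it is shown commented out):

```shell
# Read the page-cache size in kB from /proc/meminfo.
cache_kb() { awk '/^Cached:/ {print $2}' /proc/meminfo; }

# As root, measure the effect of a drop:
# before=$(cache_kb)
# sync && echo 1 > /proc/sys/vm/drop_caches
# echo "freed $((before - $(cache_kb))) kB of page cache"
```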
Emergency: System is Thrashing (Swap Storm)¶
Scenario: System is extremely slow, load average is 50+, everything swapping¶
# Confirm it's a swap storm
$ vmstat 1
procs -----------memory---------- ---swap-- -----io---- -system-- ------cpu-----
r b swpd free buff cache si so bi bo in cs us sy id wa st
3 45 4194304 16384 1024 32768 8192 4096 8192 4096 5000 2000 5 10 0 85 0
# ^^ ^^^^ ^^^^ ^^
# blocked processes heavy swapping I/O wait
# sar can show historical swap activity
$ sar -W 1 5
12:00:01 AM pswpin/s pswpout/s
12:00:02 AM 8192.00 4096.00 # <-- Thousands of pages/sec
# Find which processes are consuming the most memory
$ ps aux --sort=-%mem | head -10
# Find which processes have the most swap
$ for pid in /proc/[0-9]*; do
p=$(basename $pid)
swap=$(awk '/VmSwap:/ {print $2}' "$pid/status" 2>/dev/null)
[ "${swap:-0}" -gt 0 ] && echo "$swap kB $p $(cat $pid/comm 2>/dev/null)"
done | sort -rn | head -10
# Emergency actions (in order of aggression):
# 1. Kill the obvious memory hog
$ kill 12345 # Graceful
$ kill -9 12345 # If graceful doesn't work
# 2. Drop page cache to free some breathing room
$ sync && echo 3 > /proc/sys/vm/drop_caches
# 3. Temporarily reduce swappiness to curb further swap-out churn
#    (already-swapped pages still fault back in on access)
$ sysctl vm.swappiness=0
# 4. If you can identify the process, use cgroup to limit it
$ systemctl set-property myapp.service MemoryMax=2G
# 5. Last resort: add emergency swap
$ sudo fallocate -l 4G /emergency-swap
$ sudo chmod 600 /emergency-swap
$ sudo mkswap /emergency-swap
$ sudo swapon -p -1 /emergency-swap # Low priority
Transparent Huge Pages Causing Database Latency¶
Scenario: Redis/MongoDB has periodic latency spikes (every few seconds to minutes)¶
# Check THP status
$ cat /sys/kernel/mm/transparent_hugepage/enabled
[always] madvise never
# Check compaction activity (THP needs contiguous memory)
$ grep -i compact /proc/vmstat
compact_stall 12345 # <-- Process was stalled waiting for compaction
compact_success 8000
compact_fail 4345
compact_migrate_scanned 5678901
# If compact_stall is high, THP compaction is causing latency
# Check specific to Redis: it logs a warning at startup when THP is enabled
$ grep -i "transparent huge pages" /var/log/redis/redis-server.log # log path varies by distro
# WARNING you have Transparent Huge Pages (THP) support enabled in your kernel
# Disable THP
$ echo never | sudo tee /sys/kernel/mm/transparent_hugepage/enabled
$ echo never | sudo tee /sys/kernel/mm/transparent_hugepage/defrag
# Make persistent across reboots
$ cat <<'EOF' | sudo tee /etc/systemd/system/disable-thp.service
[Unit]
Description=Disable Transparent Huge Pages (THP)
DefaultDependencies=no
After=sysinit.target local-fs.target
Before=basic.target
[Service]
Type=oneshot
ExecStart=/bin/sh -c 'echo never > /sys/kernel/mm/transparent_hugepage/enabled && echo never > /sys/kernel/mm/transparent_hugepage/defrag'
[Install]
WantedBy=basic.target
EOF
$ sudo systemctl daemon-reload
$ sudo systemctl enable disable-thp.service
# Verify latency improvement
$ redis-cli --latency -h localhost
# Before: avg 0.5ms, max 150ms
# After: avg 0.3ms, max 2ms
Reading /proc/meminfo Under Pressure¶
Scenario: Monitoring alert says "low memory" — is it real?¶
# Quick assessment
$ free -h
total used free shared buff/cache available
Mem: 62Gi 58Gi 256Mi 128Mi 4Gi 3.5Gi
Swap: 7.8Gi 2.1Gi 5.7Gi
# available = 3.5Gi → system has 3.5 GB before it's truly in trouble
# Swap used = 2.1 GB → some swapping, but could be idle pages
# Is the swap usage active or stale?
$ vmstat 1 5
# If si/so are 0, swap contains stale pages — not a problem
# If si/so are nonzero, active swapping — investigate
# Detailed breakdown
$ awk '
/MemTotal:/ {total=$2}
/MemAvailable:/ {avail=$2}
/Buffers:/ {buf=$2}
/^Cached:/ {cache=$2}
/SwapTotal:/ {stotal=$2}
/SwapFree:/ {sfree=$2}
/Slab:/ {slab=$2}
/SReclaimable:/ {srec=$2}
END {
printf "Total: %8d MB\n", total/1024
printf "Available: %8d MB (%.1f%%)\n", avail/1024, avail*100/total
printf "Buffers: %8d MB\n", buf/1024
printf "Cache: %8d MB\n", cache/1024
printf "Slab: %8d MB (reclaimable: %d MB)\n", slab/1024, srec/1024
printf "Swap used: %8d MB / %d MB\n", (stotal-sfree)/1024, stotal/1024
}
' /proc/meminfo
# If available is > 10% of total, the system is fine
# If available is 1-5%, start investigating
# If available is < 1%, take action now
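Those thresholds can be turned into a check suitable for cron or a monitoring agent; a minimal sketch (`mem_check` is a made-up name; it treats 1-10% as the investigate band):

```shell
# Exit 0 when fine (>10% available), 1 to investigate (1-10%),
# 2 when action is needed now (<1%).
mem_check() {  # usage: mem_check [meminfo-file]
  awk '/^MemTotal:/ {t = $2} /^MemAvailable:/ {a = $2}
       END {
         pct = a * 100 / t
         printf "available: %.1f%%\n", pct
         if (pct < 1) exit 2
         if (pct < 10) exit 1
       }' "${1:-/proc/meminfo}"
}

# mem_check || alert "low memory on $(hostname)"   # hypothetical alert hook
```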