Linux Memory Management — Primer¶
Why This Matters¶
Memory is the most common resource constraint in production systems. When a container gets OOM-killed at 3 AM, when your database performance drops because Linux reclaimed its page cache, when your application's latency spikes because of transparent huge page compaction — these are all memory management problems. Understanding how Linux manages memory is not optional for anyone who operates production systems.
This primer covers the full memory landscape: how virtual memory works, what the kernel is doing behind the scenes, how to read the numbers correctly, and how the various tuning knobs affect system behavior.
Virtual Memory Fundamentals¶
Physical vs Virtual Addressing¶
Every process sees its own virtual address space. On x86_64, 48-bit virtual addressing gives 256 TB of address space in theory (split between user space and the kernel). The kernel's page tables map virtual addresses to physical RAM pages.
Process A sees: Physical RAM: Process B sees:
┌──────────────┐ ┌──────────────┐ ┌──────────────┐
│ 0x00400000 │────────▶│ Frame 1234 │◀──────│ 0x00400000 │
│ (code) │ │ │ │ (code) │
├──────────────┤ ├──────────────┤ ├──────────────┤
│ 0x7fff0000 │────────▶│ Frame 5678 │ │ 0x7fff0000 │────▶Frame 9012
│ (stack) │ │ │ │ (stack) │
└──────────────┘ └──────────────┘ └──────────────┘
Key benefits:

- Isolation — processes can't read each other's memory (without explicit sharing)
- Overcommit — you can allocate more virtual memory than physical RAM exists
- Shared pages — multiple processes can share the same physical page (shared libraries, CoW after fork)
Pages¶
Memory is managed in pages — fixed-size blocks (4 KB on x86_64). Everything the kernel does with memory operates at page granularity. Even malloc(1) ultimately consumes at least one 4 KB page: the userspace allocator requests whole pages from the kernel and subdivides them.
Page Tables¶
The page table is a hierarchical data structure that maps virtual page numbers to physical frame numbers. On x86_64, it's 4 levels deep (PGD → PUD → PMD → PTE). Each level is a page of pointers.
A TLB (Translation Lookaside Buffer) caches recent translations in hardware. TLB misses are expensive, which is why huge pages exist.
Fun fact: A typical x86_64 CPU has only 1,024-2,048 TLB entries. With 4 KB pages, that covers just 4-8 MB of memory. A process using 4 GB of RAM needs 1 million page table entries but can only cache a tiny fraction in the TLB. This is why 2 MB huge pages (covering 2-4 GB in the same TLB) can improve performance by 5-20% for memory-intensive workloads.
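The arithmetic behind that fun fact is easy to reproduce. A quick sketch — the entry count below is a representative mid-range figure, not a hardware constant:

```shell
# TLB "reach" = number of entries x page size.
# 1536 entries is a typical figure for illustration, not a spec value.
entries=1536
reach_4k_mb=$(( entries * 4 / 1024 ))       # with 4 KB pages
reach_2m_mb=$(( entries * 2048 / 1024 ))    # with 2 MB huge pages
echo "4 KB pages: reach = ${reach_4k_mb} MB"      # prints: 4 KB pages: reach = 6 MB
echo "2 MB pages: reach = ${reach_2m_mb} MB"      # prints: 2 MB pages: reach = 3072 MB
```

Same TLB, 512x the coverage — that is the entire argument for huge pages.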
Reading /proc/meminfo¶
This is the most important file for understanding memory state:
$ cat /proc/meminfo
MemTotal: 65536000 kB # Total physical RAM
MemFree: 2048000 kB # Truly unused RAM
MemAvailable: 45056000 kB # Estimated available for new allocations
Buffers: 512000 kB # Block device I/O buffers
Cached: 40960000 kB # Page cache (file data in RAM)
SwapCached: 32000 kB # Swap pages also in RAM
Active: 30720000 kB # Recently accessed pages
Inactive: 25600000 kB # Not recently accessed (reclaim candidates)
Active(anon): 15360000 kB # Active anonymous pages (heap, stack)
Inactive(anon): 5120000 kB # Inactive anonymous pages
Active(file): 15360000 kB # Active file-backed pages (page cache)
Inactive(file): 20480000 kB # Inactive file-backed pages
SwapTotal: 8192000 kB # Total swap space
SwapFree: 8000000 kB # Free swap space
Dirty: 12000 kB # Pages waiting to be written to disk
Writeback: 0 kB # Pages being written right now
Slab: 3072000 kB # Kernel slab allocator
SReclaimable: 2560000 kB # Slab pages that can be reclaimed
SUnreclaim: 512000 kB # Slab pages that cannot be reclaimed
HugePages_Total: 0 # Configured huge pages
HugePages_Free: 0 # Unused huge pages
The Crucial Distinction: Free vs Available¶
MemFree is RAM that's literally unused — no data, no cache, nothing. On a healthy system, this is often very low. That's fine.
MemAvailable is the kernel's estimate of how much memory is available for new allocations without swapping. It includes MemFree plus reclaimable page cache and slab. This is the number you should monitor.
A system with 64 GB RAM, 2 GB MemFree, and 45 GB MemAvailable is healthy — the kernel is using RAM for caching, which makes the system faster.
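If you want to monitor this ratio yourself, computing MemAvailable as a share of MemTotal takes one awk call. A minimal sketch, run here against a hypothetical snapshot so the numbers are fixed — point it at /proc/meminfo on a live system:

```shell
# avail_pct FILE — MemAvailable as a whole-number percentage of MemTotal.
avail_pct() {
  awk '/^MemTotal:/ {t=$2} /^MemAvailable:/ {a=$2}
       END {printf "%.0f\n", 100 * a / t}' "$1"
}

# Sample snapshot (illustrative values matching the meminfo listing above)
cat > /tmp/meminfo.sample <<'EOF'
MemTotal:       65536000 kB
MemFree:         2048000 kB
MemAvailable:   45056000 kB
EOF

pct=$(avail_pct /tmp/meminfo.sample)
echo "available: ${pct}%"    # prints: available: 69%
```

69% available despite 3% "free" — exactly the healthy pattern described above.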
Buffers vs Cached¶
- Buffers — metadata buffers for block devices (directory entries, inode tables). Small, rarely significant.
- Cached — the page cache. When you read a file, the kernel keeps the data in RAM in case you read it again. This is why free shows most of your RAM as "used" — it's used for caching, and will be freed on demand.
$ free -h
total used free shared buff/cache available
Mem: 62Gi 15Gi 1.9Gi 256Mi 45Gi 44Gi
Swap: 7.8Gi 0B 7.8Gi
The available column is what matters. Ignore free. The system above has 44 GB available despite only 1.9 GB "free."
Memory Zones¶
Physical RAM is divided into zones based on address ranges:
| Zone | Address Range | Purpose |
|---|---|---|
| ZONE_DMA | 0-16 MB | Legacy ISA DMA |
| ZONE_DMA32 | 0-4 GB | 32-bit DMA devices |
| ZONE_NORMAL | 4 GB+ | Normal allocations (most memory) |
| ZONE_MOVABLE | Configurable | Pages that can be migrated (for memory hotplug) |
$ cat /proc/zoneinfo | grep -E "^Node|pages free|managed"
Node 0, zone DMA
pages free 3845
managed 3976
Node 0, zone DMA32
pages free 208456
managed 520844
Node 0, zone Normal
pages free 512000
managed 15728640
For most operators, zones are invisible — the kernel manages them automatically. They matter when debugging "out of memory in DMA zone" errors on systems that technically have plenty of RAM.
Page Reclaim and Swapping¶
How the Kernel Reclaims Memory¶
When free memory gets low, the kernel reclaims pages. The page reclaim algorithm:
- File-backed pages (page cache) — can be dropped (clean pages) or written back and dropped (dirty pages). This is cheap.
- Anonymous pages (heap, stack, mmap'd anonymous) — must be written to swap before being freed. This is expensive.
The kernel maintains two LRU (Least Recently Used) lists — active and inactive — per NUMA node (per zone on kernels before 4.8, and per memory cgroup when cgroups are in use). Pages move between them based on access patterns:
┌───────────────┐
Access ──────▶│ Active List │
│ (recently used)│
└───────┬───────┘
│ aging
┌───────▼───────┐
│ Inactive List │
│ (reclaim cand.)│
└───────┬───────┘
│ reclaim
┌───────▼───────┐
│ Free Pool │
└───────────────┘
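Using the Inactive(file) and Inactive(anon) figures from the /proc/meminfo sample earlier, you can see at a glance how much memory is cheap to reclaim versus expensive. A sketch with those illustrative values:

```shell
# Values taken from the illustrative /proc/meminfo sample above.
inactive_file_kb=20480000   # clean file pages: droppable at low cost
inactive_anon_kb=5120000    # anonymous pages: must be swapped out first
cheap_mb=$(( inactive_file_kb / 1024 ))
costly_mb=$(( inactive_anon_kb / 1024 ))
echo "cheap to reclaim:     ${cheap_mb} MB (file-backed)"    # prints: 20000 MB
echo "expensive to reclaim: ${costly_mb} MB (anonymous)"     # prints: 5000 MB
```

A system with a large inactive file list has plenty of easy reclaim headroom before it ever touches swap.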
Swappiness¶
The vm.swappiness parameter (default 60; range 0-100, extended to 0-200 on kernels 5.8 and later) controls the kernel's tendency to swap anonymous pages vs reclaim file-backed pages.
# View current swappiness
$ cat /proc/sys/vm/swappiness
60
# Set temporarily
$ sudo sysctl vm.swappiness=10
# Set permanently
$ echo "vm.swappiness=10" | sudo tee -a /etc/sysctl.d/99-memory.conf
$ sudo sysctl --system
| Value | Behavior |
|---|---|
| 0 | Avoid swapping anonymous pages as much as possible (still swaps to avoid OOM) |
| 10 | Low swap tendency (common for database servers) |
| 60 | Default balance |
| 100 | Equal treatment of anonymous and file-backed pages |
| 200 | (kernels 5.8+) Maximum preference for swapping anonymous pages — intended for very fast swap backends (zram, zswap) |
For database servers, set swappiness to 10 or lower — databases manage their own caching and the page cache is less valuable.
Swap Space¶
Swap provides overflow for anonymous memory. When physical RAM is exhausted, the kernel writes anonymous pages to swap, freeing RAM for active use.
# View swap usage
$ swapon --show
NAME TYPE SIZE USED PRIO
/swapfile file 8G 192M -2
# Create a swap file
$ sudo fallocate -l 4G /swapfile
$ sudo chmod 600 /swapfile
$ sudo mkswap /swapfile
$ sudo swapon /swapfile
# Add to fstab for persistence
$ echo '/swapfile none swap sw 0 0' | sudo tee -a /etc/fstab
The OOM Killer¶
Under the hood: The OOM killer was added to the Linux kernel by Rik van Riel in the early 2000s. It is intentionally a last resort — the kernel tries hard to avoid it by reclaiming page cache, writing dirty pages, and swapping. When the OOM killer fires, it means all those mechanisms were exhausted.
When the system is truly out of memory (RAM + swap), the kernel invokes the OOM (Out Of Memory) killer. It selects a process to kill based on a scoring algorithm.
OOM Score¶
Every process has an OOM score in /proc/<pid>/oom_score (0-1000). Higher = more likely to be killed. On modern kernels the score is based on:
- Memory usage — RSS, swap, and page tables; effectively the only input on recent kernels
- The oom_score_adj offset set by the operator or service manager (see below)
Older kernels also weighed CPU time, child-process count, and privilege (root got a discount), but those heuristics have since been removed.
# View OOM scores for all processes, highest first
# (ps has no oom_score column; read /proc directly)
$ for p in /proc/[0-9]*; do
    printf '%s %s %s\n' "${p#/proc/}" "$(cat $p/oom_score)" "$(cat $p/comm)"
  done 2>/dev/null | sort -k2 -rn | head -3
1234 450 java
2345 200 python3
3456 150 node
Adjusting OOM Score¶
# Make a process less likely to be killed (-1000 to 1000)
$ echo -500 | sudo tee /proc/1234/oom_score_adj
# Make a process immune to OOM killer (DANGEROUS)
$ echo -1000 | sudo tee /proc/1234/oom_score_adj
# In a systemd unit:
[Service]
OOMScoreAdjust=-900
Detecting OOM Kills¶
# Check dmesg for OOM events
$ dmesg | grep -i "out of memory\|oom-killer\|killed process"
[12345.678901] Out of memory: Killed process 1234 (java) total-vm:8388608kB, anon-rss:4194304kB
# Check journal
$ journalctl -k --grep="oom|killed process"
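When scripting alerts, you can pull the victim's PID and command name out of the kernel log line. A sketch against the sample line above — the sed pattern assumes the standard "Killed process" message format:

```shell
# Extract pid and command name from a kernel OOM log line.
line='[12345.678901] Out of memory: Killed process 1234 (java) total-vm:8388608kB, anon-rss:4194304kB'
victim=$(echo "$line" | sed -n 's/.*Killed process \([0-9]*\) (\([^)]*\)).*/pid=\1 comm=\2/p')
echo "$victim"    # prints: pid=1234 comm=java
```

Feed the same pattern from `journalctl -k` output to build a simple OOM-kill alerter.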
Overcommit Settings¶
By default, Linux allows processes to allocate more virtual memory than physical RAM + swap. This is called overcommit, controlled by vm.overcommit_memory:
| Value | Behavior |
|---|---|
| 0 | Heuristic (default) — kernel uses heuristics to decide if allocation is "reasonable." Most allocations succeed. |
| 1 | Always — allow every allocation, never refuse. malloc() essentially never fails (short of address-space exhaustion). The OOM killer handles the consequences. |
| 2 | Never — total virtual memory is limited to swap + (RAM × overcommit_ratio/100). malloc() fails when limit reached. |
With mode 2, the overcommit ratio controls the limit:
$ cat /proc/sys/vm/overcommit_ratio
50
# Meaning: max virtual memory = swap + (RAM × 50%)
# With 64 GB RAM and 8 GB swap: max = 8 + 32 = 40 GB
Mode 0 (default) is correct for most workloads. Mode 2 is used in scientific computing where you'd rather have malloc() fail than have the OOM killer randomly kill processes.
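The mode-2 arithmetic is worth sanity-checking — the kernel exposes the live numbers as CommitLimit and Committed_AS in /proc/meminfo. A sketch of the formula with the figures above:

```shell
# CommitLimit (mode 2) = swap + RAM * overcommit_ratio / 100
ram_kb=$(( 64 * 1024 * 1024 ))    # 64 GB
swap_kb=$(( 8 * 1024 * 1024 ))    # 8 GB
ratio=50
limit_gb=$(( (swap_kb + ram_kb * ratio / 100) / 1024 / 1024 ))
echo "CommitLimit: ${limit_gb} GB"    # prints: CommitLimit: 40 GB
```

If Committed_AS approaches CommitLimit under mode 2, new allocations will start failing.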
Huge Pages¶
Standard 4 KB pages mean a 1 GB mapping requires 262,144 page table entries and TLB entries. Huge pages use larger page sizes to reduce this overhead.
Transparent Huge Pages (THP)¶
The kernel automatically merges contiguous 4 KB pages into 2 MB huge pages:
# Check THP status
$ cat /sys/kernel/mm/transparent_hugepage/enabled
[always] madvise never
# Disable THP (recommended for databases)
$ echo never | sudo tee /sys/kernel/mm/transparent_hugepage/enabled
$ echo never | sudo tee /sys/kernel/mm/transparent_hugepage/defrag
THP provides automatic performance improvement for many workloads, but can cause latency spikes in databases (Redis, MongoDB, PostgreSQL) due to page compaction. Most database documentation recommends disabling THP.
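To see whether THP is actually in use, sum the AnonHugePages lines — /proc/meminfo carries a system-wide total, and /proc/<pid>/smaps carries per-mapping values. A sketch run against a hypothetical smaps excerpt so the result is fixed:

```shell
# thp_kb FILE — total anonymous memory backed by huge pages, in kB.
thp_kb() { awk '/^AnonHugePages:/ {s += $2} END {print s + 0}' "$1"; }

# Hypothetical smaps excerpt (real files carry many more fields per mapping)
cat > /tmp/smaps.sample <<'EOF'
AnonHugePages:      2048 kB
AnonHugePages:         0 kB
AnonHugePages:      4096 kB
EOF

thp_kb /tmp/smaps.sample    # prints: 6144
```

On a live system, `thp_kb /proc/<pid>/smaps` shows how much of a process's heap is huge-page backed.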
Explicit Huge Pages¶
Pre-allocated, reserved huge pages:
# Allocate 1024 × 2MB huge pages (2 GB total)
$ echo 1024 | sudo tee /proc/sys/vm/nr_hugepages
# Or at boot via kernel parameter
$ cat /proc/cmdline
... hugepages=1024 ...
# View huge page status
$ cat /proc/meminfo | grep -i huge
HugePages_Total: 1024
HugePages_Free: 1024
HugePages_Rsvd: 0
HugePages_Surp: 0
Hugepagesize: 2048 kB
# Mount a hugetlbfs for application use
$ sudo mkdir -p /mnt/hugepages
$ sudo mount -t hugetlbfs none /mnt/hugepages
Applications like DPDK, large JVMs, and databases use explicit huge pages for predictable performance.
NUMA (Non-Uniform Memory Access)¶
On multi-socket servers, each CPU socket has its own local memory. Accessing local memory is fast; accessing remote memory (on another socket) is slower.
# Check NUMA topology
$ numactl --hardware
available: 2 nodes (0-1)
node 0 cpus: 0 1 2 3 4 5 6 7
node 0 size: 32768 MB
node 0 free: 16384 MB
node 1 cpus: 8 9 10 11 12 13 14 15
node 1 size: 32768 MB
node 1 free: 8192 MB
node distances:
node 0 1
0: 10 21
1: 21 10
# View per-NUMA-node memory stats
$ numastat
node0 node1
numa_hit 15234567 8901234
numa_miss 0 234567
numa_foreign 234567 0
interleave_hit 12345 12345
local_node 15234567 8901234
other_node 0 234567
numa_miss > 0 means processes are allocating memory on a remote node. This can significantly impact latency-sensitive workloads.
# Pin a process to specific NUMA nodes
$ numactl --cpunodebind=0 --membind=0 ./my_application
# Interleave memory across all nodes (good for general workloads)
$ numactl --interleave=all ./my_application
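A useful derived metric is the share of a node's allocations that landed remotely. Using the illustrative numastat figures for node 1 above:

```shell
# Remote-allocation share for node 1 (numbers from the numastat sample above).
local_hits=8901234
misses=234567
pct=$(awk -v h="$local_hits" -v m="$misses" 'BEGIN { printf "%.1f", 100 * m / (h + m) }')
echo "remote allocations: ${pct}%"    # prints: remote allocations: 2.6%
```

A few percent is usually tolerable; double-digit remote shares on a latency-sensitive workload are worth fixing with `numactl` pinning.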
cgroups Memory Limits¶
cgroups (control groups) allow limiting memory usage per process group. This is how containers enforce memory limits.
cgroups v2 (modern)¶
# View a container's memory limit
$ cat /sys/fs/cgroup/system.slice/docker-<hash>.scope/memory.max
536870912 # 512 MB
# View current memory usage
$ cat /sys/fs/cgroup/system.slice/docker-<hash>.scope/memory.current
268435456 # 256 MB
# Key files
memory.max # Hard limit (OOM kill if exceeded)
memory.high # Throttle point (reclaim pressure)
memory.low # Best-effort protection (try not to reclaim)
memory.min # Hard protection (guaranteed minimum)
memory.current # Current usage
memory.swap.max # Swap limit
memory.swap.current # Current swap usage
memory.stat # Detailed statistics
memory.pressure # PSI metrics for this cgroup
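Note that memory.max contains either a byte count or the literal string max (meaning no limit), so scripts need to handle both cases. A small helper sketch:

```shell
# fmt_limit VALUE — render a cgroup v2 memory.max value human-readably.
# memory.max holds either a byte count or the literal "max" (unlimited).
fmt_limit() {
  case "$1" in
    max) echo "unlimited" ;;
    *)   echo "$(( $1 / 1024 / 1024 )) MiB" ;;
  esac
}

fmt_limit 536870912    # prints: 512 MiB
fmt_limit max          # prints: unlimited
```

On a live system: `fmt_limit "$(cat /sys/fs/cgroup/<path>/memory.max)"`.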
cgroups v1 (legacy)¶
# Docker containers with cgroup v1
$ cat /sys/fs/cgroup/memory/docker/<container-id>/memory.limit_in_bytes
536870912
$ cat /sys/fs/cgroup/memory/docker/<container-id>/memory.usage_in_bytes
268435456
Memory Pressure (PSI)¶
Pressure Stall Information gives you a direct measure of how much memory pressure a system (or cgroup) is under:
$ cat /proc/pressure/memory
some avg10=0.00 avg60=0.00 avg300=0.00 total=12345678
full avg10=0.00 avg60=0.00 avg300=0.00 total=1234567
- some — percentage of time at least one task is stalled on memory
- full — percentage of time all tasks are stalled on memory
Non-zero values indicate memory pressure. This is more reliable than monitoring MemFree.
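PSI files are easy to parse for alerting. A sketch that extracts the some avg10 value, run here against a fixed sample — point it at /proc/pressure/memory or a cgroup's memory.pressure on a live system:

```shell
# psi_some10 FILE — the "some" avg10 percentage from a PSI-format file.
psi_some10() { awk '/^some/ { sub("avg10=", "", $2); print $2 }' "$1"; }

# Sample PSI output (illustrative values)
cat > /tmp/psi.sample <<'EOF'
some avg10=3.50 avg60=1.20 avg300=0.40 total=12345678
full avg10=0.75 avg60=0.20 avg300=0.05 total=1234567
EOF

psi_some10 /tmp/psi.sample    # prints: 3.50
```

Alert when this crosses a threshold (say, 10) and you will catch memory pressure long before the OOM killer does.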
Slab Allocator¶
The kernel uses the slab allocator for its own internal data structures (inodes, dentries, task structs, etc.):
# View slab usage
$ sudo slabtop -o | head -15
Active / Total Objects (% used) : 3456789 / 4000000 (86.4%)
Active / Total Slabs (% used) : 123456 / 130000 (95.0%)
Active / Total Caches (% used) : 100 / 150 (66.7%)
Active / Total Size (% used) : 1200.00M / 1400.00M (85.7%)
OBJS ACTIVE USE OBJ SIZE SLABS OBJ/SLAB CACHE SIZE NAME
890123 889000 99% 0.19K 42387 21 169548K dentry
567890 560000 98% 0.58K 40563 14 324504K inode_cache
234567 230000 98% 0.12K 6899 34 27596K kernfs_node_cache
Large dentry and inode_cache numbers are normal — the kernel caches directory entries and inode metadata. This memory is reclaimable (counted in SReclaimable).
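Using the Slab and SReclaimable figures from the /proc/meminfo sample earlier, you can check what share of slab memory would come back under pressure:

```shell
# Values from the illustrative /proc/meminfo sample above.
slab_kb=3072000
sreclaimable_kb=2560000
pct=$(( 100 * sreclaimable_kb / slab_kb ))
echo "${pct}% of slab memory is reclaimable"    # prints: 83% of slab memory is reclaimable
```

A low reclaimable share (mostly SUnreclaim) is the pattern to investigate — it can indicate a kernel-side memory leak.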
Process Memory Investigation¶
Understanding Process Memory¶
# Overall memory map summary
$ pmap -x 1234 | tail -5
total kB 4194304 2097152 1048576
# Columns: virtual size, RSS (resident), dirty
# Detailed view of memory regions
$ cat /proc/1234/smaps_rollup
Rss: 2097152 kB
Pss: 1048576 kB # Proportional Set Size (shared pages divided by sharers)
Shared_Clean: 1024000 kB
Shared_Dirty: 24576 kB
Private_Clean: 1024 kB
Private_Dirty: 1023552 kB
Swap: 0 kB
Key metrics:

- VSZ (Virtual Size) — total virtual address space. Often huge, often meaningless (Java loves to map huge virtual spaces).
- RSS (Resident Set Size) — physical RAM used. Overcounts shared pages.
- PSS (Proportional Set Size) — RSS with shared pages divided by the number of sharing processes. Most accurate per-process measure.
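The RSS/PSS relationship is simple arithmetic: a page shared by N processes counts fully in each process's RSS but only 1/N in each PSS. A worked sketch with made-up numbers:

```shell
# Made-up example: 1 GB private plus 512 MB shared among 4 processes.
private_kb=$(( 1024 * 1024 ))
shared_kb=$(( 512 * 1024 ))
sharers=4
rss_kb=$(( private_kb + shared_kb ))
pss_kb=$(( private_kb + shared_kb / sharers ))
echo "RSS = ${rss_kb} kB, PSS = ${pss_kb} kB"    # prints: RSS = 1572864 kB, PSS = 1179648 kB
```

Summing RSS across many processes double-counts shared libraries; summing PSS gives a figure that actually adds up to real RAM use.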
Finding Memory Leaks¶
# Watch a process's memory growth over time
$ while true; do
ps -o pid,rss,vsz,comm -p 1234
sleep 60
done
# Detailed per-mapping breakdown
$ cat /proc/1234/smaps | grep -E "^[0-9a-f]|Rss|Pss|Private" | head -40
# Sum RSS across all mappings (matches Rss in smaps_rollup)
$ awk '/^Rss:/ {sum+=$2} END {print sum " kB"}' /proc/1234/smaps
Clearing Page Cache¶
Sometimes you need to drop caches for benchmarking or freeing memory in an emergency:
# Drop page cache only (safe)
$ echo 1 | sudo tee /proc/sys/vm/drop_caches
# Drop dentries and inodes (safe but may slow filesystem operations briefly)
$ echo 2 | sudo tee /proc/sys/vm/drop_caches
# Drop all (page cache + dentries + inodes)
$ echo 3 | sudo tee /proc/sys/vm/drop_caches
# Sync first to flush dirty pages to disk
$ sync && echo 3 | sudo tee /proc/sys/vm/drop_caches
This is generally safe for production — it only drops clean, reclaimable caches. Dirty pages are flushed first if you run sync. Expect a temporary performance dip while the cache rebuilds.
Key sysctl Parameters¶
# Memory overcommit behavior
vm.overcommit_memory = 0 # 0=heuristic, 1=always, 2=strict
vm.overcommit_ratio = 50 # Used with overcommit_memory=2
# Swap behavior
vm.swappiness = 60 # 0-200, how aggressively to swap
vm.vfs_cache_pressure = 100 # Tendency to reclaim dentry/inode cache (default 100)
# Dirty page writeback
vm.dirty_ratio = 20 # % of RAM that can be dirty before BLOCKING writes
vm.dirty_background_ratio = 10 # % of RAM that triggers background writeback
vm.dirty_expire_centisecs = 3000 # Age (30s) before dirty pages must be written
# OOM behavior
vm.panic_on_oom = 0 # 0=kill process, 1=kernel panic on OOM
vm.oom_kill_allocating_task = 0 # 1=kill the task that triggered OOM instead of the highest-scored one
# Zone reclaim
vm.zone_reclaim_mode = 0 # 0=disabled (default on most), 1=reclaim from local zone first
vm.min_free_kbytes = 65536 # Minimum free memory to maintain (kernel reserve)
Summary¶
Linux memory management is a layered system: virtual memory provides isolation and overcommit, page tables and TLB handle translation, the page cache accelerates I/O, reclaim and swap handle memory pressure, and the OOM killer is the last resort. The most common operational mistake is misreading free output — a system with low "free" but high "available" memory is healthy. Monitor MemAvailable, watch for OOM kills, understand your swap configuration, and disable THP if you run latency-sensitive databases.
Wiki Navigation¶
Prerequisites¶
- Linux Ops (Topic Pack, L0)
Related Content¶
- /proc Filesystem (Topic Pack, L2) — Linux Fundamentals, Linux Performance Tuning
- Linux Performance Tuning (Topic Pack, L2) — Linux Fundamentals, Linux Performance Tuning
- Advanced Bash for Ops (Topic Pack, L1) — Linux Fundamentals
- Adversarial Interview Gauntlet (30 sequences) (Scenario, L2) — Linux Fundamentals
- Bash Exercises (Quest Ladder) (CLI) (Exercise Set, L0) — Linux Fundamentals
- Case Study: CI Pipeline Fails — Docker Layer Cache Corruption (Case Study, L2) — Linux Fundamentals
- Case Study: Container Vuln Scanner False Positive Blocks Deploy (Case Study, L2) — Linux Fundamentals
- Case Study: Disk Full Root Services Down (Case Study, L1) — Linux Fundamentals
- Case Study: Disk Full — Runaway Logs, Fix Is Loki Retention (Case Study, L2) — Linux Fundamentals
- Case Study: HPA Flapping — Metrics Server Clock Skew, Fix Is NTP (Case Study, L2) — Linux Fundamentals
Pages that link here¶
- Anti-Primer: Linux Memory Management
- Certification Prep: CKA — Certified Kubernetes Administrator
- Incident Replay: Memory ECC Errors Increasing
- Incident Replay: OOM Killer Events
- Incident Replay: Server Intermittent Reboots
- Linux Memory Management
- Linux Performance Tuning
- Production Readiness Review: Answer Key
- Production Readiness Review: Study Plans
- Symptoms
- Symptoms: Disk Full Alert, Cause Is Runaway Logs, Fix Is Loki Retention
- Symptoms: HPA Flapping, Metrics Server Clock Skew, Fix Is NTP Config