Linux Memory Management — Primer¶
Why This Matters¶
Memory is the most common resource constraint in production systems. When a container gets OOM-killed at 3 AM, when your database performance drops because Linux reclaimed its page cache, when your application's latency spikes because of transparent huge page compaction — these are all memory management problems. Understanding how Linux manages memory is not optional for anyone who operates production systems.
This primer covers the full memory landscape: how virtual memory works, what the kernel is doing behind the scenes, how to read the numbers correctly, and how the various tuning knobs affect system behavior.
Virtual Memory Fundamentals¶
Physical vs Virtual Addressing¶
Every process sees its own virtual address space. On x86_64, 48-bit virtual addressing gives 256 TB of address space in theory (split between user space and the kernel). The kernel's page tables map virtual addresses to physical RAM pages.
Process A sees: Physical RAM: Process B sees:
┌──────────────┐ ┌──────────────┐ ┌──────────────┐
│ 0x00400000 │────────▶│ Frame 1234 │◀──────│ 0x00400000 │
│ (code) │ │ │ │ (code) │
├──────────────┤ ├──────────────┤ ├──────────────┤
│ 0x7fff0000 │────────▶│ Frame 5678 │ │ 0x7fff0000 │────▶Frame 9012
│ (stack) │ │ │ │ (stack) │
└──────────────┘ └──────────────┘ └──────────────┘
Key benefits:

- Isolation — processes can't read each other's memory (without explicit sharing)
- Overcommit — you can allocate more virtual memory than physical RAM exists
- Shared pages — multiple processes can share the same physical page (shared libraries, CoW after fork)
Pages¶
Memory is managed in pages — fixed-size blocks (4 KB on x86_64). Everything the kernel does with memory operates at page granularity. Even malloc(1) ultimately consumes at least one 4 KB page: the userspace allocator requests whole pages from the kernel and subdivides them.
Page Tables¶
The page table is a hierarchical data structure that maps virtual page numbers to physical frame numbers. On x86_64, it's 4 levels deep (PGD → PUD → PMD → PTE). Each level is a page of pointers.
A TLB (Translation Lookaside Buffer) caches recent translations in hardware. TLB misses are expensive, which is why huge pages exist.
Fun fact: A typical x86_64 CPU has only 1,024-2,048 TLB entries. With 4 KB pages, that covers just 4-8 MB of memory. A process using 4 GB of RAM needs 1 million page table entries but can only cache a tiny fraction in the TLB. This is why 2 MB huge pages (covering 2-4 GB in the same TLB) can improve performance by 5-20% for memory-intensive workloads.
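The arithmetic behind that fun fact is easy to reproduce. A quick sketch — the entry count below is a representative mid-range figure, not a hardware constant:

```shell
# TLB "reach" = number of entries x page size.
# 1536 entries is a typical figure for illustration, not a spec value.
entries=1536
reach_4k_mb=$(( entries * 4 / 1024 ))       # with 4 KB pages
reach_2m_mb=$(( entries * 2048 / 1024 ))    # with 2 MB huge pages
echo "4 KB pages: reach = ${reach_4k_mb} MB"      # prints: 4 KB pages: reach = 6 MB
echo "2 MB pages: reach = ${reach_2m_mb} MB"      # prints: 2 MB pages: reach = 3072 MB
```

Same TLB, 512x the coverage — that is the entire argument for huge pages.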
Reading /proc/meminfo¶
This is the most important file for understanding memory state:
$ cat /proc/meminfo
MemTotal: 65536000 kB # Total physical RAM
MemFree: 2048000 kB # Truly unused RAM
MemAvailable: 45056000 kB # Estimated available for new allocations
Buffers: 512000 kB # Block device I/O buffers
Cached: 40960000 kB # Page cache (file data in RAM)
SwapCached: 32000 kB # Swap pages also in RAM
Active: 30720000 kB # Recently accessed pages
Inactive: 25600000 kB # Not recently accessed (reclaim candidates)
Active(anon): 15360000 kB # Active anonymous pages (heap, stack)
Inactive(anon): 5120000 kB # Inactive anonymous pages
Active(file): 15360000 kB # Active file-backed pages (page cache)
Inactive(file): 20480000 kB # Inactive file-backed pages
SwapTotal: 8192000 kB # Total swap space
SwapFree: 8000000 kB # Free swap space
Dirty: 12000 kB # Pages waiting to be written to disk
Writeback: 0 kB # Pages being written right now
Slab: 3072000 kB # Kernel slab allocator
SReclaimable: 2560000 kB # Slab pages that can be reclaimed
SUnreclaim: 512000 kB # Slab pages that cannot be reclaimed
HugePages_Total: 0 # Configured huge pages
HugePages_Free: 0 # Unused huge pages
The Crucial Distinction: Free vs Available¶
MemFree is RAM that's literally unused — no data, no cache, nothing. On a healthy system, this is often very low. That's fine.
MemAvailable is the kernel's estimate of how much memory is available for new allocations without swapping. It includes MemFree plus reclaimable page cache and slab. This is the number you should monitor.
A system with 64 GB RAM, 2 GB MemFree, and 45 GB MemAvailable is healthy — the kernel is using RAM for caching, which makes the system faster.
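If you want to monitor this ratio yourself, computing MemAvailable as a share of MemTotal takes one awk call. A minimal sketch, run here against a hypothetical snapshot so the numbers are fixed — point it at /proc/meminfo on a live system:

```shell
# avail_pct FILE — MemAvailable as a whole-number percentage of MemTotal.
avail_pct() {
  awk '/^MemTotal:/ {t=$2} /^MemAvailable:/ {a=$2}
       END {printf "%.0f\n", 100 * a / t}' "$1"
}

# Sample snapshot (illustrative values matching the meminfo listing above)
cat > /tmp/meminfo.sample <<'EOF'
MemTotal:       65536000 kB
MemFree:         2048000 kB
MemAvailable:   45056000 kB
EOF

pct=$(avail_pct /tmp/meminfo.sample)
echo "available: ${pct}%"    # prints: available: 69%
```

69% available despite 3% "free" — exactly the healthy pattern described above.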
Buffers vs Cached¶
- Buffers — metadata buffers for block devices (directory entries, inode tables). Small, rarely significant.
- Cached — the page cache. When you read a file, the kernel keeps the data in RAM in case you read it again. This is why free shows most of your RAM as "used" — it's used for caching, and will be freed on demand.
$ free -h
total used free shared buff/cache available
Mem: 62Gi 15Gi 1.9Gi 256Mi 45Gi 44Gi
Swap: 7.8Gi 0B 7.8Gi
The available column is what matters. Ignore free. The system above has 44 GB available despite only 1.9 GB "free."
Memory Zones¶
Physical RAM is divided into zones based on address ranges:
| Zone | Address Range | Purpose |
|---|---|---|
| ZONE_DMA | 0-16 MB | Legacy ISA DMA |
| ZONE_DMA32 | 0-4 GB | 32-bit DMA devices |
| ZONE_NORMAL | 4 GB+ | Normal allocations (most memory) |
| ZONE_MOVABLE | Configurable | Pages that can be migrated (for memory hotplug) |
$ cat /proc/zoneinfo | grep -E "^Node|pages free|managed"
Node 0, zone DMA
pages free 3845
managed 3976
Node 0, zone DMA32
pages free 208456
managed 520844
Node 0, zone Normal
pages free 512000
managed 15728640
For most operators, zones are invisible — the kernel manages them automatically. They matter when debugging "out of memory in DMA zone" errors on systems that technically have plenty of RAM.
Page Reclaim and Swapping¶
How the Kernel Reclaims Memory¶
When free memory gets low, the kernel reclaims pages. The page reclaim algorithm:
- File-backed pages (page cache) — can be dropped (clean pages) or written back and dropped (dirty pages). This is cheap.
- Anonymous pages (heap, stack, mmap'd anonymous) — must be written to swap before being freed. This is expensive.
The kernel maintains two LRU (Least Recently Used) lists — active and inactive — per NUMA node (per zone on kernels before 4.8, and per memory cgroup when cgroups are in use). Pages move between them based on access patterns:
┌───────────────┐
Access ──────▶│ Active List │
│ (recently used)│
└───────┬───────┘
│ aging
┌───────▼───────┐
│ Inactive List │
│ (reclaim cand.)│
└───────┬───────┘
│ reclaim
┌───────▼───────┐
│ Free Pool │
└───────────────┘
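Using the Inactive(file) and Inactive(anon) figures from the /proc/meminfo sample earlier, you can see at a glance how much memory is cheap to reclaim versus expensive. A sketch with those illustrative values:

```shell
# Values taken from the illustrative /proc/meminfo sample above.
inactive_file_kb=20480000   # clean file pages: droppable at low cost
inactive_anon_kb=5120000    # anonymous pages: must be swapped out first
cheap_mb=$(( inactive_file_kb / 1024 ))
costly_mb=$(( inactive_anon_kb / 1024 ))
echo "cheap to reclaim:     ${cheap_mb} MB (file-backed)"    # prints: 20000 MB
echo "expensive to reclaim: ${costly_mb} MB (anonymous)"     # prints: 5000 MB
```

A system with a large inactive file list has plenty of easy reclaim headroom before it ever touches swap.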
Swappiness¶
The vm.swappiness parameter (default 60; range 0-100, extended to 0-200 on kernels 5.8 and later) controls the kernel's tendency to swap anonymous pages vs reclaim file-backed pages.
# View current swappiness
$ cat /proc/sys/vm/swappiness
60
# Set temporarily
$ sudo sysctl vm.swappiness=10
# Set permanently
$ echo "vm.swappiness=10" | sudo tee -a /etc/sysctl.d/99-memory.conf
$ sudo sysctl --system
| Value | Behavior |
|---|---|
| 0 | Avoid swapping anonymous pages as much as possible (still swaps to avoid OOM) |
| 10 | Low swap tendency (common for database servers) |
| 60 | Default balance |
| 100 | Equal treatment of anonymous and file-backed pages |
| 200 | (kernels 5.8+) Maximum preference for swapping anonymous pages — intended for very fast swap backends (zram, zswap) |
For database servers, set swappiness to 10 or lower — databases manage their own caching and the page cache is less valuable.
Swap Space¶
Swap provides overflow for anonymous memory. When physical RAM is exhausted, the kernel writes anonymous pages to swap, freeing RAM for active use.
# View swap usage
$ swapon --show
NAME TYPE SIZE USED PRIO
/swapfile file 8G 192M -2
# Create a swap file
$ sudo fallocate -l 4G /swapfile
$ sudo chmod 600 /swapfile
$ sudo mkswap /swapfile
$ sudo swapon /swapfile
# Add to fstab for persistence
$ echo '/swapfile none swap sw 0 0' | sudo tee -a /etc/fstab
The OOM Killer¶
Under the hood: The OOM killer was added to the Linux kernel by Rik van Riel in the early 2000s. It is intentionally a last resort — the kernel tries hard to avoid it by reclaiming page cache, writing dirty pages, and swapping. When the OOM killer fires, it means all those mechanisms were exhausted.
When the system is truly out of memory (RAM + swap), the kernel invokes the OOM (Out Of Memory) killer. It selects a process to kill based on a scoring algorithm.
OOM Score¶
Every process has an OOM score in /proc/<pid>/oom_score (0-1000). Higher = more likely to be killed. On modern kernels the score is based on:
- Memory usage — RSS, swap, and page tables; effectively the only input on recent kernels
- The oom_score_adj offset set by the operator or service manager (see below)
Older kernels also weighed CPU time, child-process count, and privilege (root got a discount), but those heuristics have since been removed.
# View OOM scores for all processes, highest first
# (ps has no oom_score column; read /proc directly)
$ for p in /proc/[0-9]*; do
    printf '%s %s %s\n' "${p#/proc/}" "$(cat $p/oom_score)" "$(cat $p/comm)"
  done 2>/dev/null | sort -k2 -rn | head -3
1234 450 java
2345 200 python3
3456 150 node
Adjusting OOM Score¶
# Make a process less likely to be killed (-1000 to 1000)
$ echo -500 | sudo tee /proc/1234/oom_score_adj
# Make a process immune to OOM killer (DANGEROUS)
$ echo -1000 | sudo tee /proc/1234/oom_score_adj
# In a systemd unit:
[Service]
OOMScoreAdjust=-900
Detecting OOM Kills¶
# Check dmesg for OOM events
$ dmesg | grep -i "out of memory\|oom-killer\|killed process"
[12345.678901] Out of memory: Killed process 1234 (java) total-vm:8388608kB, anon-rss:4194304kB
# Check journal
$ journalctl -k --grep="oom|killed process"
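When scripting alerts, you can pull the victim's PID and command name out of the kernel log line. A sketch against the sample line above — the sed pattern assumes the standard "Killed process" message format:

```shell
# Extract pid and command name from a kernel OOM log line.
line='[12345.678901] Out of memory: Killed process 1234 (java) total-vm:8388608kB, anon-rss:4194304kB'
victim=$(echo "$line" | sed -n 's/.*Killed process \([0-9]*\) (\([^)]*\)).*/pid=\1 comm=\2/p')
echo "$victim"    # prints: pid=1234 comm=java
```

Feed the same pattern from `journalctl -k` output to build a simple OOM-kill alerter.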
Overcommit Settings¶
By default, Linux allows processes to allocate more virtual memory than physical RAM + swap. This is called overcommit, controlled by vm.overcommit_memory:
| Value | Behavior |
|---|---|
| 0 | Heuristic (default) — kernel uses heuristics to decide if allocation is "reasonable." Most allocations succeed. |
| 1 | Always — allow every allocation, never refuse. malloc() essentially never fails (short of address-space exhaustion). The OOM killer handles the consequences. |
| 2 | Never — total virtual memory is limited to swap + (RAM × overcommit_ratio/100). malloc() fails when limit reached. |
With mode 2, the overcommit ratio controls the limit:
$ cat /proc/sys/vm/overcommit_ratio
50
# Meaning: max virtual memory = swap + (RAM × 50%)
# With 64 GB RAM and 8 GB swap: max = 8 + 32 = 40 GB
Mode 0 (default) is correct for most workloads. Mode 2 is used in scientific computing where you'd rather have malloc() fail than have the OOM killer randomly kill processes.
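The mode-2 arithmetic is worth sanity-checking — the kernel exposes the live numbers as CommitLimit and Committed_AS in /proc/meminfo. A sketch of the formula with the figures above:

```shell
# CommitLimit (mode 2) = swap + RAM * overcommit_ratio / 100
ram_kb=$(( 64 * 1024 * 1024 ))    # 64 GB
swap_kb=$(( 8 * 1024 * 1024 ))    # 8 GB
ratio=50
limit_gb=$(( (swap_kb + ram_kb * ratio / 100) / 1024 / 1024 ))
echo "CommitLimit: ${limit_gb} GB"    # prints: CommitLimit: 40 GB
```

If Committed_AS approaches CommitLimit under mode 2, new allocations will start failing.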
Huge Pages¶
Standard 4 KB pages mean a 1 GB mapping requires 262,144 page table entries and TLB entries. Huge pages use larger page sizes to reduce this overhead.
Transparent Huge Pages (THP)¶
The kernel automatically merges contiguous 4 KB pages into 2 MB huge pages:
# Check THP status
$ cat /sys/kernel/mm/transparent_hugepage/enabled
[always] madvise never
# Disable THP (recommended for databases)
$ echo never | sudo tee /sys/kernel/mm/transparent_hugepage/enabled
$ echo never | sudo tee /sys/kernel/mm/transparent_hugepage/defrag
THP provides automatic performance improvement for many workloads, but can cause latency spikes in databases (Redis, MongoDB, PostgreSQL) due to page compaction. Most database documentation recommends disabling THP.
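To see whether THP is actually in use, sum the AnonHugePages lines — /proc/meminfo carries a system-wide total, and /proc/<pid>/smaps carries per-mapping values. A sketch run against a hypothetical smaps excerpt so the result is fixed:

```shell
# thp_kb FILE — total anonymous memory backed by huge pages, in kB.
thp_kb() { awk '/^AnonHugePages:/ {s += $2} END {print s + 0}' "$1"; }

# Hypothetical smaps excerpt (real files carry many more fields per mapping)
cat > /tmp/smaps.sample <<'EOF'
AnonHugePages:      2048 kB
AnonHugePages:         0 kB
AnonHugePages:      4096 kB
EOF

thp_kb /tmp/smaps.sample    # prints: 6144
```

On a live system, `thp_kb /proc/<pid>/smaps` shows how much of a process's heap is huge-page backed.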
Explicit Huge Pages¶
Pre-allocated, reserved huge pages:
# Allocate 1024 × 2MB huge pages (2 GB total)
$ echo 1024 | sudo tee /proc/sys/vm/nr_hugepages
# Or at boot via kernel parameter
$ cat /proc/cmdline
... hugepages=1024 ...
# View huge page status
$ cat /proc/meminfo | grep -i huge
HugePages_Total: 1024
HugePages_Free: 1024
HugePages_Rsvd: 0
HugePages_Surp: 0
Hugepagesize: 2048 kB
# Mount a hugetlbfs for application use
$ sudo mkdir -p /mnt/hugepages
$ sudo mount -t hugetlbfs none /mnt/hugepages
Applications like DPDK, large JVMs, and databases use explicit huge pages for predictable performance.
NUMA (Non-Uniform Memory Access)¶
On multi-socket servers, each CPU socket has its own local memory. Accessing local memory is fast; accessing remote memory (on another socket) is slower.
# Check NUMA topology
$ numactl --hardware
available: 2 nodes (0-1)
node 0 cpus: 0 1 2 3 4 5 6 7
node 0 size: 32768 MB
node 0 free: 16384 MB
node 1 cpus: 8 9 10 11 12 13 14 15
node 1 size: 32768 MB
node 1 free: 8192 MB
node distances:
node 0 1
0: 10 21
1: 21 10
# View per-NUMA-node memory stats
$ numastat
node0 node1
numa_hit 15234567 8901234
numa_miss 0 234567
numa_foreign 234567 0
interleave_hit 12345 12345
local_node 15234567 8901234
other_node 0 234567
numa_miss > 0 means processes are allocating memory on a remote node. This can significantly impact latency-sensitive workloads.
# Pin a process to specific NUMA nodes
$ numactl --cpunodebind=0 --membind=0 ./my_application
# Interleave memory across all nodes (good for general workloads)
$ numactl --interleave=all ./my_application
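A useful derived metric is the share of a node's allocations that landed remotely. Using the illustrative numastat figures for node 1 above:

```shell
# Remote-allocation share for node 1 (numbers from the numastat sample above).
local_hits=8901234
misses=234567
pct=$(awk -v h="$local_hits" -v m="$misses" 'BEGIN { printf "%.1f", 100 * m / (h + m) }')
echo "remote allocations: ${pct}%"    # prints: remote allocations: 2.6%
```

A few percent is usually tolerable; double-digit remote shares on a latency-sensitive workload are worth fixing with `numactl` pinning.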
cgroups Memory Limits¶
cgroups (control groups) allow limiting memory usage per process group. This is how containers enforce memory limits.
cgroups v2 (modern)¶
# View a container's memory limit
$ cat /sys/fs/cgroup/system.slice/docker-<hash>.scope/memory.max
536870912 # 512 MB
# View current memory usage
$ cat /sys/fs/cgroup/system.slice/docker-<hash>.scope/memory.current
268435456 # 256 MB
# Key files
memory.max # Hard limit (OOM kill if exceeded)
memory.high # Throttle point (reclaim pressure)
memory.low # Best-effort protection (try not to reclaim)
memory.min # Hard protection (guaranteed minimum)
memory.current # Current usage
memory.swap.max # Swap limit
memory.swap.current # Current swap usage
memory.stat # Detailed statistics
memory.pressure # PSI metrics for this cgroup
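Note that memory.max contains either a byte count or the literal string max (meaning no limit), so scripts need to handle both cases. A small helper sketch:

```shell
# fmt_limit VALUE — render a cgroup v2 memory.max value human-readably.
# memory.max holds either a byte count or the literal "max" (unlimited).
fmt_limit() {
  case "$1" in
    max) echo "unlimited" ;;
    *)   echo "$(( $1 / 1024 / 1024 )) MiB" ;;
  esac
}

fmt_limit 536870912    # prints: 512 MiB
fmt_limit max          # prints: unlimited
```

On a live system: `fmt_limit "$(cat /sys/fs/cgroup/<path>/memory.max)"`.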
cgroups v1 (legacy)¶
# Docker containers with cgroup v1
$ cat /sys/fs/cgroup/memory/docker/<container-id>/memory.limit_in_bytes
536870912
$ cat /sys/fs/cgroup/memory/docker/<container-id>/memory.usage_in_bytes
268435456
Memory Pressure (PSI)¶
Pressure Stall Information gives you a direct measure of how much memory pressure a system (or cgroup) is under:
$ cat /proc/pressure/memory
some avg10=0.00 avg60=0.00 avg300=0.00 total=12345678
full avg10=0.00 avg60=0.00 avg300=0.00 total=1234567
- some — percentage of time at least one task is stalled on memory
- full — percentage of time all tasks are stalled on memory
Non-zero values indicate memory pressure. This is more reliable than monitoring MemFree.
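PSI files are easy to parse for alerting. A sketch that extracts the some avg10 value, run here against a fixed sample — point it at /proc/pressure/memory or a cgroup's memory.pressure on a live system:

```shell
# psi_some10 FILE — the "some" avg10 percentage from a PSI-format file.
psi_some10() { awk '/^some/ { sub("avg10=", "", $2); print $2 }' "$1"; }

# Sample PSI output (illustrative values)
cat > /tmp/psi.sample <<'EOF'
some avg10=3.50 avg60=1.20 avg300=0.40 total=12345678
full avg10=0.75 avg60=0.20 avg300=0.05 total=1234567
EOF

psi_some10 /tmp/psi.sample    # prints: 3.50
```

Alert when this crosses a threshold (say, 10) and you will catch memory pressure long before the OOM killer does.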
Slab Allocator¶
The kernel uses the slab allocator for its own internal data structures (inodes, dentries, task structs, etc.):
# View slab usage
$ sudo slabtop -o | head -15
Active / Total Objects (% used) : 3456789 / 4000000 (86.4%)
Active / Total Slabs (% used) : 123456 / 130000 (95.0%)
Active / Total Caches (% used) : 100 / 150 (66.7%)
Active / Total Size (% used) : 1200.00M / 1400.00M (85.7%)
OBJS ACTIVE USE OBJ SIZE SLABS OBJ/SLAB CACHE SIZE NAME
890123 889000 99% 0.19K 42387 21 169548K dentry
567890 560000 98% 0.58K 40563 14 324504K inode_cache
234567 230000 98% 0.12K 6899 34 27596K kernfs_node_cache
Large dentry and inode_cache numbers are normal — the kernel caches directory entries and inode metadata. This memory is reclaimable (counted in SReclaimable).
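Using the Slab and SReclaimable figures from the /proc/meminfo sample earlier, you can check what share of slab memory would come back under pressure:

```shell
# Values from the illustrative /proc/meminfo sample above.
slab_kb=3072000
sreclaimable_kb=2560000
pct=$(( 100 * sreclaimable_kb / slab_kb ))
echo "${pct}% of slab memory is reclaimable"    # prints: 83% of slab memory is reclaimable
```

A low reclaimable share (mostly SUnreclaim) is the pattern to investigate — it can indicate a kernel-side memory leak.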
Process Memory Investigation¶
Understanding Process Memory¶
# Overall memory map summary
$ pmap -x 1234 | tail -5
total kB 4194304 2097152 1048576
# Columns: virtual size, RSS (resident), dirty
# Detailed view of memory regions
$ cat /proc/1234/smaps_rollup
Rss: 2097152 kB
Pss: 1048576 kB # Proportional Set Size (shared pages divided by sharers)
Shared_Clean: 1024000 kB
Shared_Dirty: 24576 kB
Private_Clean: 1024 kB
Private_Dirty: 1023552 kB
Swap: 0 kB
Key metrics:

- VSZ (Virtual Size) — total virtual address space. Often huge, often meaningless (Java loves to map huge virtual spaces).
- RSS (Resident Set Size) — physical RAM used. Overcounts shared pages.
- PSS (Proportional Set Size) — RSS with shared pages divided by the number of sharing processes. Most accurate per-process measure.
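The RSS/PSS relationship is simple arithmetic: a page shared by N processes counts fully in each process's RSS but only 1/N in each PSS. A worked sketch with made-up numbers:

```shell
# Made-up example: 1 GB private plus 512 MB shared among 4 processes.
private_kb=$(( 1024 * 1024 ))
shared_kb=$(( 512 * 1024 ))
sharers=4
rss_kb=$(( private_kb + shared_kb ))
pss_kb=$(( private_kb + shared_kb / sharers ))
echo "RSS = ${rss_kb} kB, PSS = ${pss_kb} kB"    # prints: RSS = 1572864 kB, PSS = 1179648 kB
```

Summing RSS across many processes double-counts shared libraries; summing PSS gives a figure that actually adds up to real RAM use.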
Finding Memory Leaks¶
# Watch a process's memory growth over time
$ while true; do
ps -o pid,rss,vsz,comm -p 1234
sleep 60
done
# Detailed per-mapping breakdown
$ cat /proc/1234/smaps | grep -E "^[0-9a-f]|Rss|Pss|Private" | head -40
# Sum RSS across all mappings (matches Rss in smaps_rollup)
$ awk '/^Rss:/ {sum+=$2} END {print sum " kB"}' /proc/1234/smaps
Clearing Page Cache¶
Sometimes you need to drop caches for benchmarking or freeing memory in an emergency:
# Drop page cache only (safe)
$ echo 1 | sudo tee /proc/sys/vm/drop_caches
# Drop dentries and inodes (safe but may slow filesystem operations briefly)
$ echo 2 | sudo tee /proc/sys/vm/drop_caches
# Drop all (page cache + dentries + inodes)
$ echo 3 | sudo tee /proc/sys/vm/drop_caches
# Sync first to flush dirty pages to disk
$ sync && echo 3 | sudo tee /proc/sys/vm/drop_caches
This is generally safe for production — it only drops clean, reclaimable caches. Dirty pages are flushed first if you run sync. Expect a temporary performance dip while the cache rebuilds.
Key sysctl Parameters¶
# Memory overcommit behavior
vm.overcommit_memory = 0 # 0=heuristic, 1=always, 2=strict
vm.overcommit_ratio = 50 # Used with overcommit_memory=2
# Swap behavior
vm.swappiness = 60 # 0-200, how aggressively to swap
vm.vfs_cache_pressure = 100 # Tendency to reclaim dentry/inode cache (default 100)
# Dirty page writeback
vm.dirty_ratio = 20 # % of RAM that can be dirty before BLOCKING writes
vm.dirty_background_ratio = 10 # % of RAM that triggers background writeback
vm.dirty_expire_centisecs = 3000 # Age (30s) before dirty pages must be written
# OOM behavior
vm.panic_on_oom = 0 # 0=kill process, 1=kernel panic on OOM
vm.oom_kill_allocating_task = 0 # 1=kill the task that triggered OOM instead of the highest-scored one
# Zone reclaim
vm.zone_reclaim_mode = 0 # 0=disabled (default on most), 1=reclaim from local zone first
vm.min_free_kbytes = 65536 # Minimum free memory to maintain (kernel reserve)
Summary¶
Linux memory management is a layered system: virtual memory provides isolation and overcommit, page tables and TLB handle translation, the page cache accelerates I/O, reclaim and swap handle memory pressure, and the OOM killer is the last resort. The most common operational mistake is misreading free output — a system with low "free" but high "available" memory is healthy. Monitor MemAvailable, watch for OOM kills, understand your swap configuration, and disable THP if you run latency-sensitive databases.
Wiki Navigation¶
Prerequisites¶
- Linux Ops (Topic Pack, L0)
Related Content¶
- /proc Filesystem (Topic Pack, L2) — Linux Fundamentals, Linux Performance Tuning
- Linux Performance Tuning (Topic Pack, L2) — Linux Fundamentals, Linux Performance Tuning
- Advanced Bash for Ops (Topic Pack, L1) — Linux Fundamentals
- Adversarial Interview Gauntlet (30 sequences) (Scenario, L2) — Linux Fundamentals
- Bash Exercises (Quest Ladder) (CLI) (Exercise Set, L0) — Linux Fundamentals
- Case Study: CI Pipeline Fails — Docker Layer Cache Corruption (Case Study, L2) — Linux Fundamentals
- Case Study: Container Vuln Scanner False Positive Blocks Deploy (Case Study, L2) — Linux Fundamentals
- Case Study: Disk Full Root Services Down (Case Study, L1) — Linux Fundamentals
- Case Study: Disk Full — Runaway Logs, Fix Is Loki Retention (Case Study, L2) — Linux Fundamentals
- Case Study: HPA Flapping — Metrics Server Clock Skew, Fix Is NTP (Case Study, L2) — Linux Fundamentals
Pages that link here¶
- Anti-Primer: Linux Memory Management
- Certification Prep: CKA — Certified Kubernetes Administrator
- Incident Replay: Memory ECC Errors Increasing
- Incident Replay: OOM Killer Events
- Incident Replay: Server Intermittent Reboots
- Linux Memory Management
- Linux Performance Tuning
- Production Readiness Review: Answer Key
- Production Readiness Review: Study Plans
- Symptoms
- Symptoms: Disk Full Alert, Cause Is Runaway Logs, Fix Is Loki Retention
- Symptoms: HPA Flapping, Metrics Server Clock Skew, Fix Is NTP Config