Linux Memory Management — Trivia & Interesting Facts

Surprising, historical, and little-known facts about Linux memory management.


The OOM Killer is Linux's last resort — and it has killed production databases

The Out-of-Memory Killer (OOM Killer) activates when the system is critically low on memory and must sacrifice a process to prevent a total system hang. It selects its victim by badness score (exposed as /proc/&lt;pid&gt;/oom_score): older kernels factored in runtime and nice value, but since the OOM rewrite in kernel 2.6.36 the score is based almost entirely on memory footprint, shifted by the per-process oom_score_adj knob. In production it has famously killed MySQL, PostgreSQL, and other databases, precisely because they tend to have the largest memory footprint.


Linux overcommits memory by default — your malloc can lie to you

By default (vm.overcommit_memory=0), Linux allows processes to allocate more virtual memory than is physically available; the kernel bets that most processes will never touch everything they allocate. As a result, malloc() almost never returns NULL on Linux. Instead, a process may be killed by the OOM Killer when it actually touches the memory. Setting vm.overcommit_memory=2 switches to strict accounting: total commitments are capped at swap plus vm.overcommit_ratio percent of RAM, so allocations beyond that fail up front.


The page cache makes Linux appear to use all your RAM

Running free on a healthy Linux system often shows nearly 100% RAM utilization, alarming new administrators. Most of this memory is the page cache — the kernel caching filesystem data in RAM for faster access. This memory is immediately reclaimable when applications need it. The "available" column in free (backed by the MemAvailable field added to /proc/meminfo in kernel 3.14) shows the memory actually free for applications to use.


Transparent Huge Pages caused mysterious latency spikes for years

THP (Transparent Huge Pages), enabled by default in most distributions, automatically collapses runs of 4 KB pages into 2 MB huge pages. While this reduces TLB misses, the background collapsing done by khugepaged, and the memory compaction needed to find contiguous 2 MB regions, can cause latency spikes of 10-100 ms. Redis, MongoDB, and Oracle all recommend disabling THP in production because the latency impact outweighs the TLB benefit.


Linux uses a four-level (now five-level) page table hierarchy

The Linux virtual memory system originally used three-level page tables. It was extended to four levels (PGD, PUD, PMD, PTE) in kernel 2.6.11 (2005) to support larger address spaces, and to five levels in kernel 4.14 (2017) to support 57-bit virtual addressing, enough for 128 PiB of virtual address space per process (with roughly half available to user space).


NUMA architecture means not all RAM is equally fast

On multi-socket servers, each CPU socket has its own local memory. Accessing memory attached to a remote socket (a "NUMA miss") takes 1.5-2x longer than local access. The Linux NUMA policy (set via numactl or the mbind() syscall) controls where memory is allocated. Database servers improperly configured for NUMA can lose 30-40% of their theoretical memory bandwidth.


Swappiness is one of the most misunderstood kernel parameters

The vm.swappiness parameter (default 60) does not control when swapping starts. It controls the ratio of reclaiming memory from the page cache versus swapping out anonymous pages. A value of 0 tells the kernel to strongly prefer reclaiming cache, while 100 treats cache and anonymous pages equally. Setting swappiness=0 does not disable swap — only swapoff does that.


cgroups memory limits predate containers by years

Linux control groups (cgroups) memory accounting was merged in kernel 2.6.25 (2008), five years before Docker launched. cgroups v1 tracked memory per group and could enforce hard limits, but its accounting had well-known quirks: a shared page was charged only to the first cgroup that touched it, so per-group numbers could be misleading. cgroups v2 (kernel 4.5, 2016) reworked the memory controller under a unified hierarchy with more predictable accounting.


The slab allocator was replaced to save memory on modern systems

The original slab allocator (introduced in Linux 2.2) wasted memory on metadata for each cache. SLUB (the Unqueued Slab Allocator), which became default in kernel 2.6.23 (2007), reduced per-cache overhead and improved debugging. SLOB (Simple List of Blocks) exists as a third option for memory-constrained embedded systems with less than 64 MB RAM.


Copy-on-write is why fork() is fast even for processes using gigabytes of RAM

When a process calls fork(), the kernel does not copy the parent's memory. Instead, both parent and child share the same physical pages marked read-only. Only when either process writes to a page does the kernel copy that specific page. This means forking a 10 GB process is nearly instantaneous — but Redis's background save (BGSAVE) can double memory usage if the dataset is being modified during the fork.