Linux Performance — Trivia & Interesting Facts¶
Surprising, historical, and little-known facts about Linux performance tuning and analysis.
The Linux scheduler has been rewritten three times¶
Linux has had four major CPU schedulers: the original O(n) scheduler (1991), the O(1) scheduler by Ingo Molnar (2002, shipped in Linux 2.6), the Completely Fair Scheduler (CFS), also by Molnar (2007, Linux 2.6.23), and EEVDF (Earliest Eligible Virtual Deadline First), which replaced CFS in Linux 6.6 (2023). Each rewrite addressed the previous scheduler's failure modes on emerging workloads.
The page cache consumes most of your "used" memory — and that is correct¶
Linux aggressively caches file data in RAM (the page cache). A system showing 90% memory "used" often has 60-70% as reclaimable cache. The free command's "available" column (added in 2014 in procps-ng 3.3.10) finally showed the correct amount of memory available for applications. Before that, administrators routinely panicked at high memory usage that was actually healthy caching.
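The arithmetic is easy to check by hand. A minimal sketch parsing a made-up /proc/meminfo snapshot (the field names are real, the numbers are invented):

```python
# Why naive "used" overstates memory pressure: most of it is reclaimable cache.
SAMPLE_MEMINFO = """\
MemTotal:       16384000 kB
MemFree:         1200000 kB
MemAvailable:   11500000 kB
Buffers:          400000 kB
Cached:          9800000 kB
"""

def parse_meminfo(text):
    """Return a dict of /proc/meminfo field -> value in kB."""
    info = {}
    for line in text.splitlines():
        key, rest = line.split(":", 1)
        info[key] = int(rest.split()[0])
    return info

info = parse_meminfo(SAMPLE_MEMINFO)
used_naive = info["MemTotal"] - info["MemFree"]   # what a naive "used" shows
reclaimable = info["Buffers"] + info["Cached"]    # mostly droppable cache

print(f"naive used: {used_naive / info['MemTotal']:.0%}")
print(f"actually available: {info['MemAvailable'] / info['MemTotal']:.0%}")
```

With these numbers the box looks 93% "used" while roughly 70% of RAM is still available to applications, which is exactly the confusion MemAvailable was added to resolve.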
Context switches cost 1-5 microseconds each, plus hidden overhead¶
A context switch — saving one process's CPU registers and loading another's — costs roughly 1-5 microseconds on modern hardware. But the indirect cost is much higher: TLB flushes, cold CPU caches, and pipeline stalls can add 10-100 microseconds of effective overhead. A server doing 100,000 context switches per second is spending significant CPU time just switching.
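Back-of-envelope arithmetic makes the point. A sketch, with per-switch costs assumed from the ranges above and a hypothetical 32-core server:

```python
# Assumed per-switch costs, taken from the ranges in the text.
direct_us = 3        # register save/restore plus scheduler work
indirect_us = 20     # TLB flushes, cold caches, pipeline stalls (workload-dependent)
switches_per_sec = 100_000
cores = 32           # hypothetical server

# CPU-seconds burned per second of wall-clock time, machine-wide.
cpu_seconds = switches_per_sec * (direct_us + indirect_us) / 1_000_000
print(f"{cpu_seconds:.1f} CPU-seconds/s spent switching, "
      f"{cpu_seconds / cores:.1%} of a {cores}-core machine")
```

Even with modest assumptions, 100,000 switches per second consumes more than two full cores' worth of CPU time.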
Huge pages reduce TLB misses by 99%¶
Standard Linux pages are 4 KB. With huge pages (2 MB or 1 GB), a database with a 64 GB working set needs 32,768 page mappings instead of 16,777,216. Since TLBs typically hold only 1,000-2,000 entries, huge pages can turn constant TLB misses into near-zero. Transparent Huge Pages (THP) attempts to do this automatically, but its background compaction can cause latency spikes, which is why many database vendors recommend disabling it.
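The page-count arithmetic from the paragraph above, checked in a few lines (the TLB size of 1,536 entries is an assumed typical value):

```python
GiB = 1024**3
working_set = 64 * GiB   # the 64 GB database from the text

pages_4k = working_set // (4 * 1024)       # 4 KB standard pages
pages_2m = working_set // (2 * 1024**2)    # 2 MB huge pages
pages_1g = working_set // GiB              # 1 GB gigantic pages
tlb_entries = 1536                         # assumed typical TLB capacity

print(pages_4k, pages_2m, pages_1g)
# Fraction of the working set the TLB can map at once:
print(f"2 MB pages: {tlb_entries / pages_2m:.1%}, "
      f"4 KB pages: {tlb_entries / pages_4k:.3%}")
```

With 4 KB pages the TLB covers well under a tenth of a percent of the working set; 1 GB pages would let it cover all of it.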
NUMA awareness can make or break database performance¶
Non-Uniform Memory Access means each CPU socket has "local" and "remote" memory, with remote accesses taking roughly 1.5-2x longer. A database bound to one NUMA node but accessing memory on another can lose 30-40% throughput. numactl --interleave=all spreads allocations evenly across nodes, and numastat shows whether they are balanced. Many puzzling performance regressions after hardware upgrades turn out to be NUMA-related.
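A sketch of checking balance from numastat-style counters; the layout mirrors numastat's default output, but every number here is invented:

```python
SAMPLE = """\
                           node0           node1
numa_hit                 9000000         1000000
numa_miss                      0         3000000
numa_foreign             3000000               0
interleave_hit                 0               0
local_node               8900000          950000
other_node                100000         3050000
"""

def parse_numastat(text):
    """Return {node: {counter: value}} from numastat-style output."""
    lines = text.splitlines()
    nodes = lines[0].split()
    stats = {n: {} for n in nodes}
    for line in lines[1:]:
        counter, *values = line.split()
        for node, v in zip(nodes, values):
            stats[node][counter] = int(v)
    return stats

stats = parse_numastat(SAMPLE)
for node, s in stats.items():
    total = s["numa_hit"] + s["numa_miss"]
    print(f"{node}: numa_miss ratio {s['numa_miss'] / total:.0%}")
```

A consistently high numa_miss ratio on one node is the kind of imbalance that numactl policies are meant to fix.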
The Linux swappiness parameter is widely misunderstood¶
vm.swappiness (default 60) does not set a memory threshold for swapping. It controls the balance between reclaiming page cache and swapping out anonymous pages. A value of 0 does not disable swap; it tells the kernel to prefer dropping cache over swapping, which can cause OOM kills when there is little cache left to drop. The optimal value depends on the workload: database hosts often run with 10 or lower so that swap becomes a last resort.
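A typical way the tuning gets applied persistently (a sketch; the value 10 is the database-host convention mentioned above, not universal advice):

```ini
# /etc/sysctl.d/99-swappiness.conf
# Prefer reclaiming page cache over swapping anonymous memory.
vm.swappiness = 10
```

Apply with sysctl --system and confirm the running value with sysctl vm.swappiness.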
CPU frequency scaling was originally for laptops but now affects servers¶
CPU frequency governors (ondemand, performance, powersave, schedutil) were designed with laptop battery life in mind. On servers, a conservative governor such as ondemand can add milliseconds of latency while the CPU ramps up from its idle frequency. High-frequency trading firms and other latency-sensitive operators set the performance governor to lock CPUs at maximum frequency, trading power efficiency for consistent response times.
cgroups were invented at Google¶
Process containers (later renamed cgroups) were developed by Paul Menage and Rohit Seth at Google in 2006 and merged into Linux 2.6.24 (2008). Google needed to isolate and limit resources for the thousands of different workloads sharing their server fleet. cgroups became the foundation of Docker containers, Kubernetes pods, and systemd resource management.
The perf tool samples at 4,000 Hz by default¶
perf record samples stacks at roughly 4,000 Hz per CPU by default. Higher rates give finer resolution but more overhead. At 99 Hz, a common production setting, you can profile for hours with minimal impact; the odd frequency is deliberate, since sampling at a round number like 100 Hz can run in lockstep with timer interrupts and produce aliasing artifacts.
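The data-volume tradeoff is easy to quantify (a sketch; the 16-CPU host is hypothetical, and per-sample CPU cost is not modeled):

```python
def samples_collected(freq_hz, seconds, cpus):
    """Stack samples gathered by perf record -F <freq_hz> across all CPUs."""
    return freq_hz * seconds * cpus

cpus = 16  # hypothetical host
print(samples_collected(4000, 60, cpus))    # one minute at the default rate
print(samples_collected(99, 3600, cpus))    # a full hour at the production rate
```

A full hour at 99 Hz produces roughly the same sample volume as ninety seconds at the 4,000 Hz default, which is why the low rate is viable for always-on profiling.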
BPF-based tools can measure latency that was previously invisible¶
Before eBPF, measuring the time between a process issuing a syscall and the kernel completing it required kernel modification or unreliable strace timing. eBPF tools like biolatency, runqlat, and tcplife measure kernel-internal latency distributions with nanosecond precision and minimal overhead. This revealed that storage latency is often bimodal — most I/Os complete in microseconds, but a small percentage take milliseconds.
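The power-of-two latency histograms these tools print are simple to reproduce in user space. A sketch of just the bucketing, with invented latencies arranged to be bimodal (none of the eBPF machinery):

```python
import random
random.seed(1)

# Invented microsecond latencies: a fast path plus a small tail of slow outliers.
lat_us = [random.uniform(50, 200) for _ in range(980)]
lat_us += [random.uniform(4000, 16000) for _ in range(20)]

def log2_histogram(samples):
    """Count samples into power-of-two buckets, biolatency-style."""
    buckets = {}
    for v in samples:
        b = 0
        while (1 << (b + 1)) <= v:
            b += 1
        buckets[b] = buckets.get(b, 0) + 1
    return buckets

hist = log2_histogram(lat_us)
for b in sorted(hist):
    print(f"{1 << b:>6} -> {(1 << (b + 1)) - 1:>6} us : {hist[b]}")
```

The printout shows two separated clusters of buckets, the signature of bimodal I/O latency that averages completely hide.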
I/O schedulers have evolved from elevator algorithms to near-nothing¶
Early Linux used elaborate I/O schedulers (anticipatory, deadline, CFQ) to minimize seek times on spinning disks. With SSDs, where seeks are essentially free, the best scheduler is often none (the multi-queue successor to noop) or the lightweight mq-deadline. The blk-mq multi-queue block layer, merged in Linux 3.13 (2014), redesigned the I/O stack around modern multi-core, multi-queue NVMe hardware.
Flame graphs were invented in 2011 and changed performance analysis forever¶
Brendan Gregg invented flame graphs while at Joyent to visualize stack traces from DTrace and perf profiles. The visualization — where the x-axis is alphabetically sorted (not time-based) and width represents time spent in that function — made it instantly obvious which code paths consumed CPU. Flame graphs are now built into Chrome DevTools, IntelliJ, and most APM tools.
Brendan Gregg's USE method changed how the industry thinks about performance¶
Brendan Gregg, a performance engineer at Netflix (previously Sun/Joyent), created the USE methodology: for every resource, check Utilization, Saturation, and Errors. Published in 2012, USE provides a systematic checklist that prevents the "streetlight effect" — looking only where the tools make it easy to look rather than where the problem actually is.
Load average is the most misunderstood metric in Linux¶
Load average counts both running AND uninterruptible sleeping (D-state) processes, unlike most other Unix systems that only count running processes. This means a system with zero CPU usage but heavy NFS or disk I/O can show a high load average. The three numbers (1, 5, 15 minutes) are exponentially damped moving averages, not simple arithmetic means.
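The damping is easy to reproduce. A sketch of the update rule (the real kernel uses fixed-point arithmetic on roughly 5-second ticks, but the math is the same):

```python
import math

def step(load, n_active, period_s, tick_s=5):
    """One load-average update: exponential decay toward the active-task count."""
    decay = math.exp(-tick_s / period_s)
    return load * decay + n_active * (1 - decay)

# 4 runnable/D-state tasks appear on an idle box; watch the 1-minute average climb.
load1 = 0.0
for _ in range(12):               # 12 ticks of 5 s = 60 seconds
    load1 = step(load1, 4, period_s=60)
print(f"1-min load after 60 s of 4 active tasks: {load1:.2f}")  # about 2.53
```

Note that after a full minute of constant load the 1-minute average still reads only about 63% of the true value, exactly the 1 - 1/e factor of an exponential average.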
The D-state (uninterruptible sleep) cannot be killed, not even by kill -9¶
Processes in D-state (shown as "D" in ps/top) are waiting for I/O to complete and cannot be interrupted by any signal, including SIGKILL. This design exists because interrupting a disk write mid-operation could corrupt data. A process stuck in D-state usually indicates a hardware problem, NFS hang, or kernel bug. The only fix is often a reboot.
IO wait percentage lies more than any other metric¶
%iowait in top/vmstat is the percentage of time a CPU was idle while I/O was outstanding. It is only accounted on idle CPUs: if the system is CPU-busy, iowait drops toward zero even when I/O is slow. It also swings wildly between measurement intervals. The reliable way to assess I/O performance is iostat -x, reading the await, r_await, and w_await columns (average I/O completion times in milliseconds).
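A sketch of pulling the await columns out programmatically; the table here is a simplified, made-up subset of iostat -x output, not the full column set of any particular sysstat version:

```python
SAMPLE = """\
Device   r/s   w/s  r_await  w_await  %util
nvme0n1  850   120     0.21     0.35   42.0
sda       12    88     4.80    95.20   99.8
"""

def parse_iostat(text):
    """Return {device: {column: float}} from an iostat -x style table."""
    lines = text.splitlines()
    header = lines[0].split()
    rows = {}
    for line in lines[1:]:
        parts = line.split()
        rows[parts[0]] = dict(zip(header[1:], map(float, parts[1:])))
    return rows

devs = parse_iostat(SAMPLE)
for dev, s in devs.items():
    if s["w_await"] > 10:       # milliseconds; arbitrary alert threshold
        print(f"{dev}: writes averaging {s['w_await']} ms")
```

In this made-up sample, sda would never show up in %iowait on a CPU-busy box, yet its writes are averaging nearly 100 ms.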
dstat was the universal performance tool — and then it was abandoned¶
dstat, which combined vmstat, iostat, netstat, and ifstat into one color-coded real-time display, was one of the most beloved Linux tools. Its creator, Dag Wieers, stopped maintaining it around 2019, and the Python 2-to-3 transition finished it off. dool is a community fork, and many admins have since switched to glances or btop.
The first 60 seconds of triage follow a known script¶
Brendan Gregg's famous "Linux Performance Analysis in 60 Seconds" prescribes exactly 10 commands: uptime, dmesg | tail, vmstat 1, mpstat -P ALL 1, pidstat 1, iostat -xz 1, free -m, sar -n DEV 1, sar -n TCP,ETCP 1, and top. This checklist has been adopted by SRE teams worldwide and is taught in Netflix's internal training.