Portal | Level: L2: Operations | Topics: Linux Performance Tuning, Linux Fundamentals, eBPF, Linux Ops Performance Triage | Domain: Linux
Linux Performance Tuning - Primer¶
Why This Matters¶
Every outage post-mortem eventually lands on the same question: "Why was the system slow?" If you run infrastructure, performance tuning is not optional — it is the difference between a 3 AM page and a quiet night. You do not need to be a kernel developer to tune Linux effectively, but you do need a methodology. Guessing is not a methodology.
This primer gives you a structured way to find bottlenecks, understand what the kernel is doing, and turn the right knobs. We cover CPU, memory, disk, and network — the four food groups of systems performance.
The USE Method¶
Who made it: Brendan Gregg developed the USE method while at Joyent (acquired by Samsung in 2016) in 2012, and later popularized it in his book Systems Performance: Enterprise and the Cloud (2013, Prentice Hall). He is also the creator of flame graphs and the DTraceToolkit, a collection of DTrace scripts (DTrace itself came from Sun Microsystems).
Brendan Gregg's USE method is the single most effective framework for performance analysis. For every resource, check three things:
| Metric | What It Means | Example Tool |
|---|---|---|
| Utilization | How busy is the resource (% time busy)? | mpstat, iostat |
| Saturation | How much extra work is queued? | vmstat (r col) |
| Errors | Are there error events? | dmesg, ethtool |
Work through this checklist for CPU, memory, disk, and network. If you find high utilization, dig deeper. If you find saturation, the resource is the bottleneck. If you find errors, fix them first — tuning a broken system is pointless.
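To make the checklist concrete, here is a minimal sketch of one USE pass over the CPU resource (the script framing and the sampling windows are illustrative, not a standard tool):
#!/usr/bin/env bash
# use-check-cpu.sh: hypothetical helper, one USE pass for the CPU
# Utilization: % busy per core
mpstat -P ALL 1 1
# Saturation: run queue ('r', column 1) vs core count
vmstat 1 2 | tail -1 | awk -v c="$(nproc)" '{print "runq=" $1, "cores=" c}'
# Errors: recent kernel messages
dmesg -T | tail -5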
Utilization vs Saturation¶
This distinction trips people up:
Utilization = 80% CPU busy
-> System is working hard but possibly fine
Saturation = 12 processes in run queue on 4-core box
-> Work is waiting. Users feel this as latency.
High utilization is not automatically a problem. Saturation always is.
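A quick way to see the distinction on a live box ("r persistently greater than core count" is the usual rule of thumb, not a hard limit):
# Flag CPU saturation: compare run queue ('r', column 1) against core count
vmstat 1 5 | tail -n +3 | awk -v c="$(nproc)" '$1 > c+0 {print "saturated: runq=" $1 " > " c " cores"}'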
CPU Performance¶
Key Metrics¶
# Overall CPU utilization per core
mpstat -P ALL 1
# Process-level CPU usage with threads
pidstat -t 1
# Run queue depth (first column 'r')
vmstat 1
# Scheduler latency (record first; 'perf sched latency' reads the recorded perf.data)
perf sched record -- sleep 10 && perf sched latency
Understanding CPU States¶
%usr — user-space code execution
%sys — kernel-space (syscalls, interrupts)
%iowait — CPU idle, waiting for I/O completion
%steal — hypervisor took your cycles (VMs only)
%idle — genuinely idle
High %sys often means excessive syscalls — strace the suspect process. High %iowait is misleading: it means the CPU had nothing else to do while waiting for I/O, not that I/O is necessarily slow.
Gotcha:
`%iowait` is the most misunderstood metric in Linux. It does NOT measure I/O pressure — it measures CPU idle time during which I/O was pending. If you add a CPU-bound process, `%iowait` drops even though I/O has not improved. Always cross-reference with `iostat -x` (`await`, `%util`) to confirm actual I/O pressure.
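A minimal cross-check, assuming a device named sda (substitute your own):
# Step 1: note %iowait
mpstat 1 1
# Step 2: confirm (or refute) with device-level latency; await and %util are the real signal
iostat -x 1 2 | grep -E '^(Device|sda)'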
CPU Tuning Knobs¶
# Check current CPU governor
cat /sys/devices/system/cpu/cpu0/cpufreq/scaling_governor
# Set performance governor (skip frequency scaling)
echo performance | tee /sys/devices/system/cpu/cpu*/cpufreq/scaling_governor
# Disable transparent hugepages if they cause latency spikes (databases)
echo never > /sys/kernel/mm/transparent_hugepage/enabled
# Pin process to specific CPUs (avoid NUMA cross-node traffic)
taskset -c 0-3 /usr/bin/myapp
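After turning a knob, verify it actually took effect; a quick sketch (myapp is a placeholder process name):
# Governor on every core (prints filename:value per line)
grep . /sys/devices/system/cpu/cpu*/cpufreq/scaling_governor
# THP state: the bracketed word is the active setting
cat /sys/kernel/mm/transparent_hugepage/enabled
# Effective CPU affinity of a running process
taskset -cp $(pidof myapp)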
NUMA Awareness¶
On multi-socket servers, memory access is not uniform:
┌──────────┐ ┌──────────┐
│ CPU 0 │─local──▶│ Memory 0 │ ~80ns
│ Socket │ │ Node 0 │
└────┬─────┘ └──────────┘
│ interconnect (~150ns)
┌────┴─────┐ ┌──────────┐
│ CPU 1 │─local──▶│ Memory 1 │ ~80ns
│ Socket │ │ Node 1 │
└──────────┘ └──────────┘
# Check NUMA topology
numactl --hardware
# Check NUMA memory allocation stats
numastat -c
# Run process on node 0 only
numactl --cpunodebind=0 --membind=0 /usr/bin/myapp
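To see whether a specific process is actually allocating locally, numastat can report per-process, per-node memory (myapp is a placeholder):
# Per-node memory for one process; large numbers on the "wrong" node = cross-node traffic
numastat -p $(pidof myapp)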
Memory Performance¶
Key Metrics¶
# Memory overview (free vs available is critical)
free -h
# Page faults, swap activity
vmstat 1
# Per-process memory breakdown
pmap -x $(pidof myapp)
# Slab allocator (kernel memory)
slabtop
The "Available" vs "Free" Trap¶
free = truly unused (512M — looks alarming)
available = free + reclaimable cache (11.1G — actually fine)
Linux uses spare memory for page cache. This is good. Do not panic when free is low.
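The "available" figure comes straight from the kernel's MemAvailable estimate in /proc/meminfo (added in kernel 3.14); reading it directly is handy in scripts:
# MemAvailable: the kernel's estimate of memory allocatable without swapping
awk '/^(MemTotal|MemFree|MemAvailable)/' /proc/meminfo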
Swap and Swappiness¶
# Check current swappiness (0-100)
cat /proc/sys/vm/swappiness
# Reduce swap tendency for databases/latency-sensitive workloads
sysctl vm.swappiness=10
# Make persistent
echo "vm.swappiness=10" >> /etc/sysctl.d/99-tuning.conf
Swappiness does not mean "swap at X% memory usage." It controls how aggressively the kernel reclaims anonymous pages vs page cache. Lower values prefer keeping application memory resident.
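To confirm whether the box is actually swapping (as opposed to merely having stale pages parked in swap), watch si/so; a rough sketch for ranking swapped processes follows:
# Sustained nonzero si/so = active swapping right now
vmstat 1 5
# Rank processes by swapped-out memory (VmSwap)
grep VmSwap /proc/[0-9]*/status 2>/dev/null | sort -t: -k3 -rn | head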
Huge Pages¶
# Check huge page allocation
grep Huge /proc/meminfo
# Reserve 1024 x 2MB huge pages
sysctl vm.nr_hugepages=1024
# Verify
grep HugePages /proc/meminfo
Huge pages reduce TLB misses for large-memory applications (databases, JVMs). Transparent Huge Pages (THP) attempt to do this automatically but can cause latency spikes — most database vendors say disable THP.
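A sketch for making the reservation survive a reboot (the sysctl.d file name is a convention, not required):
# Persist the explicit huge page reservation
echo "vm.nr_hugepages=1024" >> /etc/sysctl.d/99-hugepages.conf
sysctl --system
# Confirm the pool was actually reserved (memory fragmentation can prevent it)
grep HugePages_Total /proc/meminfo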
Disk / I/O Performance¶
Key Metrics¶
# Per-device I/O stats (util%, await, queue depth)
iostat -xz 1
# Per-process I/O
iotop -oP
# Block layer latency
biolatency-bpfcc # BCC tools
Reading iostat Output¶
Device rrqm/s wrqm/s r/s w/s rMB/s wMB/s await %util
sda 0.2 12.4 3.1 145.2 0.01 18.3 2.1 67.3
nvme0n1 0.0 0.0 12.5 89.4 0.80 45.2 0.3 12.1
- `%util` > 70% on a spinning disk = saturated. On NVMe, `%util` is misleading (parallel queues).
- `await` = average I/O latency in ms. This is what applications feel.
- `rrqm/s`, `wrqm/s` = merged requests. Low merging on sequential workloads = possible misalignment.
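`await` is an average, and averages hide tail latency; a BCC histogram from the block layer shows the full distribution (here: one 10-second interval):
# Latency histogram at the block layer; look for a long tail, not just the mean
biolatency-bpfcc 10 1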
I/O Scheduler Tuning¶
# Check current scheduler
cat /sys/block/sda/queue/scheduler
# Set mq-deadline for databases on spinning disk (older non-blk-mq kernels: deadline)
echo mq-deadline > /sys/block/sda/queue/scheduler
# For NVMe, use none (mq-deadline or none)
echo none > /sys/block/nvme0n1/queue/scheduler
# Increase queue depth for high-throughput workloads
echo 256 > /sys/block/sda/queue/nr_requests
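Writes to /sys/block/*/queue do not survive a reboot. One common way to persist them is a udev rule; the file name below is a convention, the match patterns are standard udev syntax:
# /etc/udev/rules.d/60-iosched.rules
ACTION=="add|change", KERNEL=="sd[a-z]", ATTR{queue/scheduler}="mq-deadline"
ACTION=="add|change", KERNEL=="nvme[0-9]n[0-9]", ATTR{queue/scheduler}="none"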
Filesystem Tuning¶
# Mount with noatime (skip access time updates)
mount -o remount,noatime /data
# In /etc/fstab:
/dev/sda1 /data ext4 defaults,noatime,discard 0 2
# XFS: check allocation group count
xfs_info /data
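After a remount or an fstab edit, confirm the options are actually live:
# Show effective mount options for the mount point
findmnt -no OPTIONS /data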
Network Performance¶
Key Metrics¶
# Interface stats (errors, drops, overruns)
ip -s link show eth0
# Socket buffer usage
ss -s
# TCP retransmits (packet loss indicator)
netstat -s | grep -i retrans
# Network bandwidth test
iperf3 -c target-host -t 30
Sysctl Network Tuning¶
# Increase socket buffer sizes
sysctl net.core.rmem_max=16777216
sysctl net.core.wmem_max=16777216
sysctl net.ipv4.tcp_rmem="4096 87380 16777216"
sysctl net.ipv4.tcp_wmem="4096 65536 16777216"
# Enable TCP BBR congestion control (better than cubic for most cases)
sysctl net.core.default_qdisc=fq
sysctl net.ipv4.tcp_congestion_control=bbr
# Increase connection backlog
sysctl net.core.somaxconn=65535
sysctl net.ipv4.tcp_max_syn_backlog=65535
# Widen ephemeral port range (careful: a 1024 floor can collide with listening services)
sysctl net.ipv4.ip_local_port_range="1024 65535"
# Enable TCP timestamps and window scaling
sysctl net.ipv4.tcp_timestamps=1
sysctl net.ipv4.tcp_window_scaling=1
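These sysctl invocations are runtime-only. A sketch of persisting the subset you have validated (file name illustrative), plus a check that BBR is even available on this kernel:
cat > /etc/sysctl.d/99-network-tuning.conf <<'EOF'
net.core.default_qdisc = fq
net.ipv4.tcp_congestion_control = bbr
net.core.rmem_max = 16777216
net.core.wmem_max = 16777216
EOF
sysctl --system
# bbr must appear here (kernel 4.9+, tcp_bbr module)
sysctl net.ipv4.tcp_available_congestion_control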
Interrupt Coalescing and RSS¶
# Check interrupt affinity
grep eth0 /proc/interrupts
# Check / tune interrupt coalescing (batch packets per interrupt to cut CPU overhead)
ethtool -c eth0
ethtool -C eth0 rx-usecs 50   # the 50 µs value is illustrative; tune per workload
# Distribute interrupts across CPUs (Receive Side Scaling)
ethtool -l eth0 # show channels
ethtool -L eth0 combined 8 # set 8 queues
# Check ring buffer sizes
ethtool -g eth0
ethtool -G eth0 rx 4096 tx 4096
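Interrupt affinity can also be pinned manually via procfs; the IRQ number 42 below is illustrative (take real numbers from /proc/interrupts), and note that a running irqbalance daemon may rewrite manual pins:
# Pin IRQ 42 to CPUs 0-3
echo 0-3 > /proc/irq/42/smp_affinity_list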
Perf and Flamegraphs¶
perf is the Swiss Army knife of Linux profiling.
# Sample CPU stacks at 99Hz for 30 seconds
perf record -F 99 -ag -- sleep 30
# Generate report
perf report
# For flamegraphs (Brendan Gregg's tool)
perf script | stackcollapse-perf.pl | flamegraph.pl > flame.svg
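The stackcollapse-perf.pl and flamegraph.pl scripts are not packaged with perf; they come from Brendan Gregg's FlameGraph repository:
git clone https://github.com/brendangregg/FlameGraph
perf script | ./FlameGraph/stackcollapse-perf.pl | ./FlameGraph/flamegraph.pl > flame.svg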
Reading a Flamegraph¶
┌─────────────────────────────────────────────────┐
│ application_handler │ <- top of stack
├───────────────────────┬─────────────────────────┤ (where CPU time
│ json_parse │ db_query │ is spent)
├──────────┬────────────┼────────┬────────────────┤
│ malloc │ str_copy │ socket │ query_compile │
├──────────┴────────────┴────────┴────────────────┤
│ main │ <- bottom of stack
└─────────────────────────────────────────────────┘
Width = proportion of CPU time
Wide plateaus at the top = where the CPU is burning. Look for these first.
Strace for Syscall Analysis¶
# Trace a running process (summary mode)
strace -cp $(pidof myapp)
# Trace specific syscalls with timing
strace -e trace=open,read,write -T -p $(pidof myapp)
# Trace a new process with full paths
strace -f -e trace=file -o /tmp/trace.log ./myapp
The -c flag gives you a summary table showing which syscalls consume the most time — invaluable for narrowing down where an app is slow.
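Be aware that strace stops the target at every syscall, so it can slow a busy process dramatically; practice on a throwaway command first. perf trace is a lower-overhead alternative where available (myapp is a placeholder):
# Learn the -c summary format on something harmless (summary prints to stderr)
strace -c ls /tmp > /dev/null
# Lower-overhead syscall view for production processes
perf trace -p $(pidof myapp)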
Sysctl Tuning Methodology¶
Do not cargo-cult sysctl values from blog posts. Follow this process:
- Baseline — measure current performance with a realistic workload
- Identify bottleneck — USE method across CPU, memory, disk, network
- Research the knob — read the kernel documentation, not a Medium post
- Change one thing — single variable at a time
- Measure again — same workload, same measurement method
- Persist or revert — if improvement confirmed, add to `/etc/sysctl.d/`
# Apply all sysctl configs
sysctl --system
# Verify a specific setting
sysctl net.ipv4.tcp_congestion_control
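A worked example of the loop above, changing exactly one variable (the file names and the BBR choice are illustrative):
iperf3 -c target-host -t 30 > baseline.txt   # baseline with a realistic workload
sysctl net.ipv4.tcp_congestion_control=bbr   # change one thing
iperf3 -c target-host -t 30 > after-bbr.txt  # same workload, same measurement
diff baseline.txt after-bbr.txt              # persist if better, revert if not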
SAR for Historical Analysis¶
# Install sysstat and enable collection
systemctl enable --now sysstat
# CPU history for today
sar -u
# Memory history
sar -r
# Disk I/O history
sar -d
# Network history
sar -n DEV
# Specific time range
sar -u -s 14:00:00 -e 15:00:00
# Yesterday's data
sar -u -f /var/log/sa/sa$(date -d yesterday +%d)
SAR is your time machine. When someone asks "was the system slow yesterday at 2 PM?" — SAR answers that.
Remember: Mnemonic for the first 60 seconds checklist: "Up, Dmesg, VM, IO, Free, Top, Sockets, SAR" — or just remember: `uptime; dmesg -T | tail; vmstat 1 5; iostat -xz 1 5; free -h; top -bn1 | head; ss -s; sar -n DEV 1 5`. These eight commands cover all four resource categories (CPU, memory, disk, network) and take under two minutes.
Quick Reference: Tool Selection¶
"CPU is high" -> mpstat, perf top, pidstat
"Memory is low" -> free -h, vmstat, slabtop
"Disk is slow" -> iostat -xz, iotop, biolatency
"Network is slow" -> ss, netstat -s, iperf3, ethtool
"App is slow" -> strace -cp, perf record, flamegraph
"What happened?" -> sar, dmesg, journalctl
"Everything is slow" -> USE method, start with vmstat 1
Performance tuning is detective work. The tools give you evidence. The USE method gives you a systematic way to work through it. Do not skip the methodology and jump straight to sysctl knobs — that is how you make things worse.
Performance Triage¶
The Four Resources¶
Every performance problem is ultimately a bottleneck in one of four resources:
| Resource | Saturated means | Key indicator |
|---|---|---|
| CPU | Processes waiting for compute time | Load average, %us+%sy in top |
| Memory | System swapping or OOM killing | si/so in vmstat, MemAvailable in /proc/meminfo |
| Disk I/O | Processes blocked on read/write | await in iostat, %iowait in top |
| Network | Bandwidth saturated or connections dropped | ss queue depths, sar -n DEV |
First 60 Seconds Checklist¶
When you land on a slow box, run these in order:
uptime # load average trend (1/5/15 min)
dmesg -T | tail -20 # kernel errors, OOM kills, hardware issues
vmstat 1 5 # CPU, memory, swap, I/O overview
iostat -xz 1 5 # per-disk I/O stats
free -h # memory overview (look at "available")
top -bn1 | head -30 # top processes by CPU
ss -s # socket summary (connection counts)
sar -n DEV 1 5 # network throughput per interface
These eight commands tell you which resource is the bottleneck in under two minutes.
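If you run this checklist often, wrapping it in a script saves fumbling at 3 AM; a trivial sketch (the script name is made up):
#!/usr/bin/env bash
# triage.sh: first-60-seconds checklist, echoing each command as it runs
set -x
uptime
dmesg -T | tail -20
vmstat 1 5
iostat -xz 1 5
free -h
top -bn1 | head -30
ss -s
sar -n DEV 1 5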
Triage Decision Tree¶
Symptom: "The system is slow"
|
+-> CPU bound? -> top shows high %CPU, load average > core count
+-> Memory bound? -> free shows low available, vmstat shows swap activity
+-> I/O bound? -> iostat shows high await/util, top shows %iowait
+-> Network bound? -> sar shows high throughput, ss shows full queues
+-> None of these? -> It's probably not this machine. Check upstream dependencies.
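The "load average > core count" branch is easy to script (the 1-minute load is the first field of /proc/loadavg):
awk -v c="$(nproc)" '{ if ($1 > c+0) print "CPU saturated: load", $1, "> cores", c; else print "load OK:", $1, "<= cores", c }' /proc/loadavg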
Triage Tool Quick Reference¶
| Tool | What it shows | When to use it |
|---|---|---|
| `top`/`htop` | Process-level CPU/memory | First look, find the offending process |
| `vmstat` | System-wide CPU/memory/swap/IO | Quick overview, detect swap pressure |
| `iostat` | Per-disk I/O performance | I/O bottleneck investigation |
| `free` | Memory usage summary | Memory pressure check |
| `sar` | Historical performance data | Trending, after-the-fact analysis |
| `perf` | CPU profiling, flame graphs | Deep CPU investigation |
| `strace` | System call tracing | "What is this process doing?" |
| `/proc/*` | Kernel-exported runtime data | When tools aren't installed |
Triage Heuristics¶
- Check `dmesg` early. Hardware errors, OOM kills, and kernel panics show up here. Five seconds can save an hour of investigation.
- Load average is a blunt instrument. Always decompose it: is it CPU (`r` in vmstat) or I/O (`b` in vmstat)?
- Never trust a single metric. High CPU + high `await` = the CPU might look busy but it's actually waiting for disk.
- The "available" memory in `free -h` is the only memory number that matters for "are we running out of memory?"
- If everything looks fine on the box, the problem is elsewhere. Check upstream services, DNS, network path, and the calling application.
- Performance problems are often caused by recent changes. Ask "what changed?" before deep-diving into metrics.
- `perf` is the nuclear option for CPU issues. If top shows a process at 100% CPU but you don't know why, `perf record -p <pid> -g` for 10 seconds then `perf report` will show you the code path.
Wiki Navigation¶
Prerequisites¶
- Linux Ops (Topic Pack, L0)
Next Steps¶
- Kernel Troubleshooting (Topic Pack, L3)
- Linux Kernel Tuning (Topic Pack, L2)
Related Content¶
- /proc Filesystem (Topic Pack, L2) — Linux Fundamentals, Linux Performance Tuning
- Linux Memory Management (Topic Pack, L1) — Linux Fundamentals, Linux Performance Tuning
- eBPF & Modern Linux Observability (Topic Pack, L3) — eBPF, Linux Fundamentals
- Advanced Bash for Ops (Topic Pack, L1) — Linux Fundamentals
- Adversarial Interview Gauntlet (30 sequences) (Scenario, L2) — Linux Fundamentals
- Bash Exercises (Quest Ladder) (CLI) (Exercise Set, L0) — Linux Fundamentals
- Case Study: CI Pipeline Fails — Docker Layer Cache Corruption (Case Study, L2) — Linux Fundamentals
- Case Study: Container Vuln Scanner False Positive Blocks Deploy (Case Study, L2) — Linux Fundamentals
- Case Study: Disk Full Root Services Down (Case Study, L1) — Linux Fundamentals
- Case Study: Disk Full — Runaway Logs, Fix Is Loki Retention (Case Study, L2) — Linux Fundamentals
Pages that link here¶
- Anti-Primer: Linux Performance
- Incident Replay: Kernel Soft Lockup
- Kernel Troubleshooting
- Linux Kernel Tuning
- Linux Memory Management
- Linux Performance Tuning
- Master Curriculum: 40 Weeks
- Production Readiness Review: Answer Key
- Production Readiness Review: Study Plans
- Symptoms
- Symptoms: Disk Full Alert, Cause Is Runaway Logs, Fix Is Loki Retention
- Thinking Out Loud: Linux Performance
- eBPF & Modern Linux Observability