
Portal | Level: L2: Operations | Topics: Linux Performance Tuning, Linux Fundamentals, eBPF, Linux Ops Performance Triage | Domain: Linux

Linux Performance Tuning - Primer

Why This Matters

Every outage post-mortem eventually lands on the same question: "Why was the system slow?" If you run infrastructure, performance tuning is not optional — it is the difference between a 3 AM page and a quiet night. You do not need to be a kernel developer to tune Linux effectively, but you do need a methodology. Guessing is not a methodology.

This primer gives you a structured way to find bottlenecks, understand what the kernel is doing, and turn the right knobs. We cover CPU, memory, disk, and network — the four food groups of systems performance.


The USE Method

Who made it: Brendan Gregg developed the USE method while at Joyent (later acquired by Samsung) in 2012, and popularized it in his book Systems Performance: Enterprise and the Cloud (2013, Prentice Hall). He is also the creator of flame graphs and the DTraceToolkit.

Brendan Gregg's USE method is the single most effective framework for performance analysis. For every resource, check three things:

Metric        What It Means                          Example Tool
Utilization   How busy is the resource (% busy)?     mpstat, iostat
Saturation    How much extra work is queued?         vmstat ('r' column)
Errors        Are there error events?                dmesg, ethtool

Work through this checklist for CPU, memory, disk, and network. If you find high utilization, dig deeper. If you find saturation, the resource is the bottleneck. If you find errors, fix them first — tuning a broken system is pointless.

Utilization vs Saturation

This distinction trips people up:

Utilization = 80% CPU busy
  -> System is working hard but possibly fine

Saturation = 12 processes in run queue on 4-core box
  -> Work is waiting. Users feel this as latency.

High utilization is not automatically a problem. Saturation always is.
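A quick first-pass saturation check can be scripted from the shell. A minimal sketch using /proc/loadavg and nproc (note that the Linux load average also counts uninterruptible I/O waiters, so treat this as a signal to dig deeper, not a verdict):

```shell
#!/bin/sh
# Saturation check: compare the 1-minute load average to the core count.
# High utilization alone is not flagged here -- queued work is.
cores=$(nproc)
load1=$(cut -d ' ' -f1 /proc/loadavg)
echo "cores=$cores load_1min=$load1"
# awk does the float comparison; shell arithmetic is integer-only
if awk -v l="$load1" -v c="$cores" 'BEGIN { exit !(l > c) }'; then
    echo "saturated: more runnable work than cores -- users feel latency"
else
    echo "not saturated: load is within core capacity"
fi
```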


CPU Performance

Key Metrics

# Overall CPU utilization per core
mpstat -P ALL 1

# Process-level CPU usage with threads
pidstat -t 1

# Run queue depth (first column 'r')
vmstat 1

# Scheduler latency (record first, then report)
perf sched record -- sleep 10
perf sched latency

Understanding CPU States

%usr    — user-space code execution
%sys    — kernel-space (syscalls, interrupts)
%iowait — CPU idle, waiting for I/O completion
%steal  — hypervisor took your cycles (VMs only)
%idle   — genuinely idle

High %sys often means excessive syscalls — strace the suspect process. High %iowait is misleading: it means the CPU had nothing else to do while waiting for I/O, not that I/O is necessarily slow.

Gotcha: %iowait is the most misunderstood metric in Linux. It does NOT measure I/O pressure — it measures CPU idle time during which I/O was pending. If you add a CPU-bound process, %iowait drops even though I/O has not improved. Always cross-reference with iostat -x (await, %util) to confirm actual I/O pressure.
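The iowait accounting can be inspected directly. A sketch that samples the aggregate cpu line of /proc/stat twice, one second apart, and reports what share of the interval was iowait vs plain idle (field positions per proc(5)):

```shell
#!/bin/sh
# Fields of the "cpu" line (after the label):
# user nice system idle iowait irq softirq steal
read_cpu() {
    awk '/^cpu / { print $5, $6, $2+$3+$4+$5+$6+$7+$8+$9 }' /proc/stat
}
set -- $(read_cpu); idle1=$1 iow1=$2 tot1=$3
sleep 1
set -- $(read_cpu); idle2=$1 iow2=$2 tot2=$3
# Both iowait and idle are "CPU had nothing to run" -- the split only
# tells you whether I/O was pending at the time.
awk -v i=$((iow2 - iow1)) -v d=$((idle2 - idle1)) -v t=$((tot2 - tot1)) \
    'BEGIN { if (t > 0) printf "iowait %.1f%%  idle %.1f%%\n", 100*i/t, 100*d/t }'
```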

CPU Tuning Knobs

# Check current CPU governor
cat /sys/devices/system/cpu/cpu0/cpufreq/scaling_governor

# Set performance governor (skip frequency scaling)
echo performance | tee /sys/devices/system/cpu/cpu*/cpufreq/scaling_governor

# Disable transparent hugepages if they cause latency spikes (databases)
echo never > /sys/kernel/mm/transparent_hugepage/enabled

# Pin process to specific CPUs (avoid NUMA cross-node traffic)
taskset -c 0-3 /usr/bin/myapp

NUMA Awareness

On multi-socket servers, memory access is not uniform:

┌──────────┐         ┌──────────┐
│  CPU 0   │─local──▶│ Memory 0 │  ~80ns
│  Socket  │         │  Node 0  │
└────┬─────┘         └──────────┘
     │ interconnect (~150ns)
┌────┴─────┐         ┌──────────┐
│  CPU 1   │─local──▶│ Memory 1 │  ~80ns
│  Socket  │         │  Node 1  │
└──────────┘         └──────────┘

# Check NUMA topology
numactl --hardware

# Check NUMA memory allocation stats
numastat -c

# Run process on node 0 only
numactl --cpunodebind=0 --membind=0 /usr/bin/myapp
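Cross-node traffic can also be spotted without numactl by reading the per-node counters the kernel exports in sysfs. A sketch (paths per the standard sysfs node layout):

```shell
#!/bin/sh
# Per-node allocation counters: a numa_miss that keeps growing means
# allocations are landing on a remote node (the slow interconnect path).
for node in /sys/devices/system/node/node[0-9]*; do
    [ -d "$node" ] || continue   # glob may not match in odd environments
    echo "== ${node##*/} =="
    grep -E '^numa_(hit|miss|foreign)' "$node/numastat"
done
```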

Memory Performance

Key Metrics

# Memory overview (free vs available is critical)
free -h

# Page faults, swap activity
vmstat 1

# Per-process memory breakdown
pmap -x $(pidof myapp)

# Slab allocator (kernel memory)
slabtop

The "Available" vs "Free" Trap

$ free -h
              total   used   free   shared  buff/cache  available
Mem:           16G    4.2G   512M     32M       11.3G      11.1G

free      = truly unused (512M — looks alarming)
available = free + reclaimable cache (11.1G — actually fine)

Linux uses spare memory for page cache. This is good. Do not panic when free is low.
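The same check can be scripted straight from /proc/meminfo. A minimal sketch of an "are we actually low?" test (the 10% threshold is an illustrative cutoff, not a standard):

```shell
#!/bin/sh
# Judge memory pressure by MemAvailable, not by "free".
total=$(awk '/^MemTotal:/ { print $2 }' /proc/meminfo)
avail=$(awk '/^MemAvailable:/ { print $2 }' /proc/meminfo)
pct=$((100 * avail / total))
echo "MemAvailable: ${pct}% of MemTotal (${avail} of ${total} kB)"
if [ "$pct" -lt 10 ]; then
    echo "WARNING: genuinely low on memory -- expect reclaim/swap activity"
fi
```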

Swap and Swappiness

# Check current swappiness (0-100)
cat /proc/sys/vm/swappiness

# Reduce swap tendency for databases/latency-sensitive workloads
sysctl vm.swappiness=10

# Make persistent
echo "vm.swappiness=10" >> /etc/sysctl.d/99-tuning.conf

Swappiness does not mean "swap at X% memory usage." It controls how aggressively the kernel reclaims anonymous pages vs page cache. Lower values prefer keeping application memory resident.

Huge Pages

# Check huge page allocation
grep Huge /proc/meminfo

# Reserve 1024 x 2MB huge pages
sysctl vm.nr_hugepages=1024

# Verify
grep HugePages /proc/meminfo

Huge pages reduce TLB misses for large-memory applications (databases, JVMs). Transparent Huge Pages (THP) attempt to do this automatically but can cause latency spikes — most database vendors say disable THP.
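The `echo never` shown earlier is lost on reboot. One common way to persist it is a small oneshot systemd unit — a sketch; the unit name `disable-thp.service` is illustrative:

```
# /etc/systemd/system/disable-thp.service  (illustrative name)
[Unit]
Description=Disable transparent hugepages
After=sysinit.target local-fs.target

[Service]
Type=oneshot
ExecStart=/bin/sh -c 'echo never > /sys/kernel/mm/transparent_hugepage/enabled'

[Install]
WantedBy=basic.target
```

Enable it with `systemctl daemon-reload && systemctl enable --now disable-thp.service`.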


Disk / I/O Performance

Key Metrics

# Per-device I/O stats (util%, await, queue depth)
iostat -xz 1

# Per-process I/O
iotop -oP

# Block layer latency
biolatency-bpfcc   # BCC tools

Reading iostat Output

Device   rrqm/s wrqm/s  r/s   w/s  rMB/s  wMB/s await  %util
sda        0.2    12.4  3.1 145.2   0.01   18.3   2.1   67.3
nvme0n1    0.0     0.0 12.5  89.4   0.80   45.2   0.3   12.1
  • %util > 70% on spinning disk = saturated. On NVMe, %util is misleading (parallel queues).
  • await = average I/O latency in ms. This is what applications feel.
  • rrqm/s, wrqm/s = merged requests. Low merging on sequential workloads = possible misalignment.

I/O Scheduler Tuning

# Check current scheduler
cat /sys/block/sda/queue/scheduler

# Set mq-deadline for databases on spinning disk
# (older non-blk-mq kernels call it "deadline")
echo mq-deadline > /sys/block/sda/queue/scheduler

# For NVMe, use none — the device's hardware queues handle ordering
echo none > /sys/block/nvme0n1/queue/scheduler

# Increase queue depth for high-throughput workloads
echo 256 > /sys/block/sda/queue/nr_requests
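Settings written via /sys reset on reboot or device re-plug. A udev rule applies them whenever the device appears — a sketch with an illustrative filename; adjust the device matches to your hardware:

```
# /etc/udev/rules.d/60-iosched.rules  (illustrative filename)
# Rotational disks get mq-deadline; NVMe devices get none
ACTION=="add|change", KERNEL=="sd[a-z]", ATTR{queue/rotational}=="1", \
    ATTR{queue/scheduler}="mq-deadline"
ACTION=="add|change", KERNEL=="nvme[0-9]n[0-9]", ATTR{queue/scheduler}="none"
```

Reload with `udevadm control --reload` and apply to existing devices with `udevadm trigger`.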

Filesystem Tuning

# Mount with noatime (skip access time updates)
mount -o remount,noatime /data

# In /etc/fstab:
/dev/sda1  /data  ext4  defaults,noatime,discard  0 2

# XFS: check allocation group count
xfs_info /data

Network Performance

Key Metrics

# Interface stats (errors, drops, overruns)
ip -s link show eth0

# Socket buffer usage
ss -s

# TCP retransmits (packet loss indicator)
netstat -s | grep retransmit

# Network bandwidth test
iperf3 -c target-host -t 30

Sysctl Network Tuning

# Increase socket buffer sizes
sysctl net.core.rmem_max=16777216
sysctl net.core.wmem_max=16777216
sysctl net.ipv4.tcp_rmem="4096 87380 16777216"
sysctl net.ipv4.tcp_wmem="4096 65536 16777216"

# Enable TCP BBR congestion control (kernel 4.9+; often outperforms
# cubic on lossy or high-latency paths)
sysctl net.core.default_qdisc=fq
sysctl net.ipv4.tcp_congestion_control=bbr

# Increase connection backlog
sysctl net.core.somaxconn=65535
sysctl net.ipv4.tcp_max_syn_backlog=65535

# Increase ephemeral port range
sysctl net.ipv4.ip_local_port_range="1024 65535"

# Enable TCP timestamps and window scaling
sysctl net.ipv4.tcp_timestamps=1
sysctl net.ipv4.tcp_window_scaling=1
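Before selecting BBR, confirm the kernel actually offers it — on many distros the module must be loaded first. A defensive sketch:

```shell
#!/bin/sh
# Check the list of congestion control algorithms the kernel advertises.
avail_file=/proc/sys/net/ipv4/tcp_available_congestion_control
if [ -r "$avail_file" ] && grep -qw bbr "$avail_file"; then
    status="bbr available: $(cat "$avail_file")"
else
    status="bbr not available -- try: modprobe tcp_bbr"
fi
echo "$status"
```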

Interrupt Coalescing and RSS

# Check interrupt affinity
grep eth0 /proc/interrupts

# Distribute interrupts across CPUs (Receive Side Scaling)
ethtool -l eth0              # show channels
ethtool -L eth0 combined 8   # set 8 queues

# Check ring buffer sizes
ethtool -g eth0
ethtool -G eth0 rx 4096 tx 4096

Perf and Flamegraphs

perf is the Swiss Army knife of Linux profiling.

# Sample CPU stacks at 99Hz for 30 seconds
perf record -F 99 -ag -- sleep 30

# Generate report
perf report

# For flamegraphs (Brendan Gregg's tool)
perf script | stackcollapse-perf.pl | flamegraph.pl > flame.svg

Reading a Flamegraph

┌──────────┬────────────┬────────┬────────────────┐
│  malloc  │  str_copy  │ socket │ query_compile  │ <- top of stack
├──────────┴────────────┼────────┴────────────────┤    (leaf frames, where
│      json_parse       │        db_query         │     CPU time is spent)
├───────────────────────┴─────────────────────────┤
│              application_handler                │
├─────────────────────────────────────────────────┤
│                      main                       │ <- bottom of stack
└─────────────────────────────────────────────────┘
  Width = proportion of CPU time

Wide plateaus at the top = where the CPU is burning. Look for these first.


Strace for Syscall Analysis

# Trace a running process (summary mode)
strace -cp $(pidof myapp)

# Trace specific syscalls with timing (modern glibc opens files via openat)
strace -e trace=openat,read,write -T -p $(pidof myapp)

# Trace a new process with full paths
strace -f -e trace=file -o /tmp/trace.log ./myapp

The -c flag gives you a summary table showing which syscalls consume the most time — invaluable for narrowing down where an app is slow.


Sysctl Tuning Methodology

Do not cargo-cult sysctl values from blog posts. Follow this process:

  1. Baseline — measure current performance with a realistic workload
  2. Identify bottleneck — USE method across CPU, memory, disk, network
  3. Research the knob — read the kernel documentation, not a Medium post
  4. Change one thing — single variable at a time
  5. Measure again — same workload, same measurement method
  6. Persist or revert — if improvement confirmed, add to /etc/sysctl.d/

# Apply all sysctl configs
sysctl --system

# Verify a specific setting
sysctl net.ipv4.tcp_congestion_control

SAR for Historical Analysis

# Install sysstat and enable collection
systemctl enable --now sysstat

# CPU history for today
sar -u

# Memory history
sar -r

# Disk I/O history
sar -d

# Network history
sar -n DEV

# Specific time range
sar -u -s 14:00:00 -e 15:00:00

# Yesterday's data
sar -u -f /var/log/sa/sa$(date -d yesterday +%d)

SAR is your time machine. When someone asks "was the system slow yesterday at 2 PM?" — SAR answers that.

Remember: Mnemonic for the first 60 seconds checklist: "Up, Dmesg, VM, IO, Free, Top, Sockets, SAR" — or just remember uptime; dmesg -T | tail; vmstat 1 5; iostat -xz 1 5; free -h; top -bn1 | head; ss -s; sar -n DEV 1 5. These eight commands cover all four resource categories (CPU, memory, disk, network) and take under two minutes.


Quick Reference: Tool Selection

"CPU is high"         -> mpstat, perf top, pidstat
"Memory is low"       -> free -h, vmstat, slabtop
"Disk is slow"        -> iostat -xz, iotop, biolatency
"Network is slow"     -> ss, netstat -s, iperf3, ethtool
"App is slow"         -> strace -cp, perf record, flamegraph
"What happened?"      -> sar, dmesg, journalctl
"Everything is slow"  -> USE method, start with vmstat 1

Performance tuning is detective work. The tools give you evidence. The USE method gives you a systematic way to work through it. Do not skip the methodology and jump straight to sysctl knobs — that is how you make things worse.


Performance Triage

The Four Resources

Every performance problem is ultimately a bottleneck in one of four resources:

Resource   Saturated means                              Key indicator
CPU        Processes waiting for compute time           Load average; %us+%sy in top
Memory     System swapping or OOM killing               si/so in vmstat; MemAvailable in /proc/meminfo
Disk I/O   Processes blocked on read/write              await in iostat; %iowait in top
Network    Bandwidth saturated or connections dropped   ss queue depths; sar -n DEV

First 60 Seconds Checklist

When you land on a slow box, run these in order:

uptime                  # load average trend (1/5/15 min)
dmesg -T | tail -20     # kernel errors, OOM kills, hardware issues
vmstat 1 5              # CPU, memory, swap, I/O overview
iostat -xz 1 5          # per-disk I/O stats
free -h                 # memory overview (look at "available")
top -bn1 | head -30     # top processes by CPU
ss -s                   # socket summary (connection counts)
sar -n DEV 1 5          # network throughput per interface

These eight commands tell you which resource is the bottleneck in under two minutes.
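The checklist can be wrapped in a throwaway triage script that skips tools the box doesn't have (sar and iostat come from the sysstat package) — a sketch:

```shell
#!/bin/sh
# First-60-seconds triage wrapper: run each check in order, skipping
# tools that are not installed on the box.
for cmd in 'uptime' \
           'dmesg -T | tail -20' \
           'vmstat 1 5' \
           'iostat -xz 1 5' \
           'free -h' \
           'top -bn1 | head -30' \
           'ss -s' \
           'sar -n DEV 1 5'; do
    tool=${cmd%% *}
    if command -v "$tool" >/dev/null 2>&1; then
        printf '\n===== %s =====\n' "$cmd"
        sh -c "$cmd" || printf '(%s exited non-zero)\n' "$tool"
    else
        printf '\n===== %s not installed, skipping =====\n' "$tool"
    fi
done
```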

Triage Decision Tree

Symptom: "The system is slow"
  |
  +-> CPU bound?     -> top shows high %CPU, load average > core count
  +-> Memory bound?  -> free shows low available, vmstat shows swap activity
  +-> I/O bound?     -> iostat shows high await/util, top shows %iowait
  +-> Network bound? -> sar shows high throughput, ss shows full queues
  +-> None of these? -> It's probably not this machine. Check upstream dependencies.

Triage Tool Quick Reference

Tool       What it shows                      When to use it
top/htop   Process-level CPU/memory           First look, find the offending process
vmstat     System-wide CPU/memory/swap/IO     Quick overview, detect swap pressure
iostat     Per-disk I/O performance           I/O bottleneck investigation
free       Memory usage summary               Memory pressure check
sar        Historical performance data        Trending, after-the-fact analysis
perf       CPU profiling, flame graphs        Deep CPU investigation
strace     System call tracing                "What is this process doing?"
/proc/*    Kernel-exported runtime data       When tools aren't installed

Triage Heuristics

  1. Check dmesg early. Hardware errors, OOM kills, and kernel panics show up here. Five seconds can save an hour of investigation.
  2. Load average is a blunt instrument. Always decompose it: is it CPU (r in vmstat) or I/O (b in vmstat)?
  3. Never trust a single metric. High load average + high await = processes look busy in the load figure but are actually blocked on disk.
  4. The "available" memory in free -h is the only memory number that matters for "are we running out of memory?"
  5. If everything looks fine on the box, the problem is elsewhere. Check upstream services, DNS, network path, and the calling application.
  6. Performance problems are often caused by recent changes. Ask "what changed?" before deep-diving into metrics.
  7. perf is the nuclear option for CPU issues. If top shows a process at 100% CPU but you don't know why, perf record -p <pid> -g for 10 seconds then perf report will show you the code path.
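Heuristic 2 can be automated: the kernel exports the run-queue decomposition directly in /proc/stat. A sketch:

```shell
#!/bin/sh
# Decompose "the load average": procs_running is roughly vmstat's r
# column (CPU demand); procs_blocked is vmstat's b column
# (uninterruptible sleep, almost always I/O).
running=$(awk '/^procs_running/ { print $2 }' /proc/stat)
blocked=$(awk '/^procs_blocked/ { print $2 }' /proc/stat)
echo "runnable=$running blocked_on_io=$blocked cores=$(nproc)"
if [ "$blocked" -gt 0 ]; then
    echo "load has an I/O component -- check iostat -xz next"
fi
```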
