
Portal | Level: L2: Operations | Topics: Linux Performance Tuning, Linux Fundamentals, eBPF, Linux Ops Performance Triage | Domain: Linux

Linux Performance Tuning - Primer

Why This Matters

Every outage post-mortem eventually lands on the same question: "Why was the system slow?" If you run infrastructure, performance tuning is not optional — it is the difference between a 3 AM page and a quiet night. You do not need to be a kernel developer to tune Linux effectively, but you do need a methodology. Guessing is not a methodology.

This primer gives you a structured way to find bottlenecks, understand what the kernel is doing, and turn the right knobs. We cover CPU, memory, disk, and network — the four food groups of systems performance.


The USE Method

Who made it: Brendan Gregg developed the USE method while at Joyent (later acquired by Samsung) in 2012, and popularized it in his book Systems Performance: Enterprise and the Cloud (2013, Prentice Hall). He is also the creator of flame graphs and the DTraceToolkit.

Brendan Gregg's USE method is the single most effective framework for performance analysis. For every resource, check three things:

Metric        What It Means                          Example Tool
Utilization   How busy is the resource (% busy)?     mpstat, iostat
Saturation    How much extra work is queued?         vmstat ('r' column)
Errors        Are there error events?                dmesg, ethtool

Work through this checklist for CPU, memory, disk, and network. If you find high utilization, dig deeper. If you find saturation, the resource is the bottleneck. If you find errors, fix them first — tuning a broken system is pointless.

Utilization vs Saturation

This distinction trips people up:

Utilization = 80% CPU busy
  -> System is working hard but possibly fine

Saturation = 12 processes in run queue on 4-core box
  -> Work is waiting. Users feel this as latency.

High utilization is not automatically a problem. Saturation always is.
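A quick first-pass saturation check can be scripted from the shell. A minimal sketch using /proc/loadavg and nproc (note that the Linux load average also counts uninterruptible I/O waiters, so treat this as a signal to dig deeper, not a verdict):

```shell
#!/bin/sh
# Saturation check: compare the 1-minute load average to the core count.
# High utilization alone is not flagged here -- queued work is.
cores=$(nproc)
load1=$(cut -d ' ' -f1 /proc/loadavg)
echo "cores=$cores load_1min=$load1"
# awk does the float comparison; shell arithmetic is integer-only
if awk -v l="$load1" -v c="$cores" 'BEGIN { exit !(l > c) }'; then
    echo "saturated: more runnable work than cores -- users feel latency"
else
    echo "not saturated: load is within core capacity"
fi
```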


CPU Performance

Key Metrics

# Overall CPU utilization per core
mpstat -P ALL 1

# Process-level CPU usage with threads
pidstat -t 1

# Run queue depth (first column 'r')
vmstat 1

# Scheduler latency (record first, then report)
perf sched record -- sleep 10
perf sched latency

Understanding CPU States

%usr    — user-space code execution
%sys    — kernel-space (syscalls, interrupts)
%iowait — CPU idle, waiting for I/O completion
%steal  — hypervisor took your cycles (VMs only)
%idle   — genuinely idle

High %sys often means excessive syscalls — strace the suspect process. High %iowait is misleading: it means the CPU had nothing else to do while waiting for I/O, not that I/O is necessarily slow.

Gotcha: %iowait is the most misunderstood metric in Linux. It does NOT measure I/O pressure — it measures CPU idle time during which I/O was pending. If you add a CPU-bound process, %iowait drops even though I/O has not improved. Always cross-reference with iostat -x (await, %util) to confirm actual I/O pressure.
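The iowait accounting can be inspected directly. A sketch that samples the aggregate cpu line of /proc/stat twice, one second apart, and reports what share of the interval was iowait vs plain idle (field positions per proc(5)):

```shell
#!/bin/sh
# Fields of the "cpu" line (after the label):
# user nice system idle iowait irq softirq steal
read_cpu() {
    awk '/^cpu / { print $5, $6, $2+$3+$4+$5+$6+$7+$8+$9 }' /proc/stat
}
set -- $(read_cpu); idle1=$1 iow1=$2 tot1=$3
sleep 1
set -- $(read_cpu); idle2=$1 iow2=$2 tot2=$3
# Both iowait and idle are "CPU had nothing to run" -- the split only
# tells you whether I/O was pending at the time.
awk -v i=$((iow2 - iow1)) -v d=$((idle2 - idle1)) -v t=$((tot2 - tot1)) \
    'BEGIN { if (t > 0) printf "iowait %.1f%%  idle %.1f%%\n", 100*i/t, 100*d/t }'
```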

CPU Tuning Knobs

# Check current CPU governor
cat /sys/devices/system/cpu/cpu0/cpufreq/scaling_governor

# Set performance governor (skip frequency scaling)
echo performance | tee /sys/devices/system/cpu/cpu*/cpufreq/scaling_governor

# Disable transparent hugepages if they cause latency spikes (databases)
echo never > /sys/kernel/mm/transparent_hugepage/enabled

# Pin process to specific CPUs (avoid NUMA cross-node traffic)
taskset -c 0-3 /usr/bin/myapp

NUMA Awareness

On multi-socket servers, memory access is not uniform:

┌──────────┐         ┌──────────┐
│  CPU 0   │─local──▶│ Memory 0 │  ~80ns
│  Socket  │         │  Node 0  │
└────┬─────┘         └──────────┘
     │ interconnect (~150ns)
┌────┴─────┐         ┌──────────┐
│  CPU 1   │─local──▶│ Memory 1 │  ~80ns
│  Socket  │         │  Node 1  │
└──────────┘         └──────────┘

# Check NUMA topology
numactl --hardware

# Check NUMA memory allocation stats
numastat -c

# Run process on node 0 only
numactl --cpunodebind=0 --membind=0 /usr/bin/myapp
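Cross-node traffic can also be spotted without numactl by reading the per-node counters the kernel exports in sysfs. A sketch (paths per the standard sysfs node layout):

```shell
#!/bin/sh
# Per-node allocation counters: a numa_miss that keeps growing means
# allocations are landing on a remote node (the slow interconnect path).
for node in /sys/devices/system/node/node[0-9]*; do
    [ -d "$node" ] || continue   # glob may not match in odd environments
    echo "== ${node##*/} =="
    grep -E '^numa_(hit|miss|foreign)' "$node/numastat"
done
```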

Memory Performance

Key Metrics

# Memory overview (free vs available is critical)
free -h

# Page faults, swap activity
vmstat 1

# Per-process memory breakdown
pmap -x $(pidof myapp)

# Slab allocator (kernel memory)
slabtop

The "Available" vs "Free" Trap

$ free -h
              total   used   free   shared  buff/cache  available
Mem:           16G    4.2G   512M     32M       11.3G      11.1G

free      = truly unused (512M — looks alarming)
available = free + reclaimable cache (11.1G — actually fine)

Linux uses spare memory for page cache. This is good. Do not panic when free is low.
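The same check can be scripted straight from /proc/meminfo. A minimal sketch of an "are we actually low?" test (the 10% threshold is an illustrative cutoff, not a standard):

```shell
#!/bin/sh
# Judge memory pressure by MemAvailable, not by "free".
total=$(awk '/^MemTotal:/ { print $2 }' /proc/meminfo)
avail=$(awk '/^MemAvailable:/ { print $2 }' /proc/meminfo)
pct=$((100 * avail / total))
echo "MemAvailable: ${pct}% of MemTotal (${avail} of ${total} kB)"
if [ "$pct" -lt 10 ]; then
    echo "WARNING: genuinely low on memory -- expect reclaim/swap activity"
fi
```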

Swap and Swappiness

# Check current swappiness (0-100)
cat /proc/sys/vm/swappiness

# Reduce swap tendency for databases/latency-sensitive workloads
sysctl vm.swappiness=10

# Make persistent
echo "vm.swappiness=10" >> /etc/sysctl.d/99-tuning.conf

Swappiness does not mean "swap at X% memory usage." It controls how aggressively the kernel reclaims anonymous pages vs page cache. Lower values prefer keeping application memory resident.

Huge Pages

# Check huge page allocation
grep Huge /proc/meminfo

# Reserve 1024 x 2MB huge pages
sysctl vm.nr_hugepages=1024

# Verify
grep HugePages /proc/meminfo

Huge pages reduce TLB misses for large-memory applications (databases, JVMs). Transparent Huge Pages (THP) attempt to do this automatically but can cause latency spikes — most database vendors say disable THP.
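The `echo never` shown earlier is lost on reboot. One common way to persist it is a small oneshot systemd unit — a sketch; the unit name `disable-thp.service` is illustrative:

```
# /etc/systemd/system/disable-thp.service  (illustrative name)
[Unit]
Description=Disable transparent hugepages
After=sysinit.target local-fs.target

[Service]
Type=oneshot
ExecStart=/bin/sh -c 'echo never > /sys/kernel/mm/transparent_hugepage/enabled'

[Install]
WantedBy=basic.target
```

Enable it with `systemctl daemon-reload && systemctl enable --now disable-thp.service`.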


Disk / I/O Performance

Key Metrics

# Per-device I/O stats (util%, await, queue depth)
iostat -xz 1

# Per-process I/O
iotop -oP

# Block layer latency
biolatency-bpfcc   # BCC tools

Reading iostat Output

Device   rrqm/s wrqm/s  r/s   w/s  rMB/s  wMB/s await  %util
sda        0.2    12.4  3.1 145.2   0.01   18.3   2.1   67.3
nvme0n1    0.0     0.0 12.5  89.4   0.80   45.2   0.3   12.1
  • %util > 70% on spinning disk = saturated. On NVMe, %util is misleading (parallel queues).
  • await = average I/O latency in ms. This is what applications feel.
  • rrqm/s, wrqm/s = merged requests. Low merging on sequential workloads = possible misalignment.

I/O Scheduler Tuning

# Check current scheduler
cat /sys/block/sda/queue/scheduler

# Set mq-deadline for databases on spinning disk
# (older non-blk-mq kernels call it "deadline")
echo mq-deadline > /sys/block/sda/queue/scheduler

# For NVMe, use none — the device's hardware queues handle ordering
echo none > /sys/block/nvme0n1/queue/scheduler

# Increase queue depth for high-throughput workloads
echo 256 > /sys/block/sda/queue/nr_requests
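Settings written via /sys reset on reboot or device re-plug. A udev rule applies them whenever the device appears — a sketch with an illustrative filename; adjust the device matches to your hardware:

```
# /etc/udev/rules.d/60-iosched.rules  (illustrative filename)
# Rotational disks get mq-deadline; NVMe devices get none
ACTION=="add|change", KERNEL=="sd[a-z]", ATTR{queue/rotational}=="1", \
    ATTR{queue/scheduler}="mq-deadline"
ACTION=="add|change", KERNEL=="nvme[0-9]n[0-9]", ATTR{queue/scheduler}="none"
```

Reload with `udevadm control --reload` and apply to existing devices with `udevadm trigger`.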

Filesystem Tuning

# Mount with noatime (skip access time updates)
mount -o remount,noatime /data

# In /etc/fstab:
/dev/sda1  /data  ext4  defaults,noatime,discard  0 2

# XFS: check allocation group count
xfs_info /data

Network Performance

Key Metrics

# Interface stats (errors, drops, overruns)
ip -s link show eth0

# Socket buffer usage
ss -s

# TCP retransmits (packet loss indicator)
netstat -s | grep retransmit

# Network bandwidth test
iperf3 -c target-host -t 30

Sysctl Network Tuning

# Increase socket buffer sizes
sysctl net.core.rmem_max=16777216
sysctl net.core.wmem_max=16777216
sysctl net.ipv4.tcp_rmem="4096 87380 16777216"
sysctl net.ipv4.tcp_wmem="4096 65536 16777216"

# Enable TCP BBR congestion control (kernel 4.9+; often outperforms
# cubic on lossy or high-latency paths)
sysctl net.core.default_qdisc=fq
sysctl net.ipv4.tcp_congestion_control=bbr

# Increase connection backlog
sysctl net.core.somaxconn=65535
sysctl net.ipv4.tcp_max_syn_backlog=65535

# Increase ephemeral port range
sysctl net.ipv4.ip_local_port_range="1024 65535"

# Enable TCP timestamps and window scaling
sysctl net.ipv4.tcp_timestamps=1
sysctl net.ipv4.tcp_window_scaling=1
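Before selecting BBR, confirm the kernel actually offers it — on many distros the module must be loaded first. A defensive sketch:

```shell
#!/bin/sh
# Check the list of congestion control algorithms the kernel advertises.
avail_file=/proc/sys/net/ipv4/tcp_available_congestion_control
if [ -r "$avail_file" ] && grep -qw bbr "$avail_file"; then
    status="bbr available: $(cat "$avail_file")"
else
    status="bbr not available -- try: modprobe tcp_bbr"
fi
echo "$status"
```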

Interrupt Coalescing and RSS

# Check interrupt affinity
grep eth0 /proc/interrupts

# Distribute interrupts across CPUs (Receive Side Scaling)
ethtool -l eth0              # show channels
ethtool -L eth0 combined 8   # set 8 queues

# Check ring buffer sizes
ethtool -g eth0
ethtool -G eth0 rx 4096 tx 4096

Perf and Flamegraphs

perf is the Swiss Army knife of Linux profiling.

# Sample CPU stacks at 99Hz for 30 seconds
perf record -F 99 -ag -- sleep 30

# Generate report
perf report

# For flamegraphs (Brendan Gregg's tool)
perf script | stackcollapse-perf.pl | flamegraph.pl > flame.svg

Reading a Flamegraph

┌──────────┬────────────┬────────┬────────────────┐
│  malloc  │  str_copy  │ socket │ query_compile  │ <- top of stack
├──────────┴────────────┼────────┴────────────────┤    (leaf frames, where
│      json_parse       │        db_query         │     CPU time is spent)
├───────────────────────┴─────────────────────────┤
│              application_handler                │
├─────────────────────────────────────────────────┤
│                      main                       │ <- bottom of stack
└─────────────────────────────────────────────────┘
  Width = proportion of CPU time

Wide plateaus at the top = where the CPU is burning. Look for these first.


Strace for Syscall Analysis

# Trace a running process (summary mode)
strace -cp $(pidof myapp)

# Trace specific syscalls with timing (modern glibc opens files via openat)
strace -e trace=openat,read,write -T -p $(pidof myapp)

# Trace a new process with full paths
strace -f -e trace=file -o /tmp/trace.log ./myapp

The -c flag gives you a summary table showing which syscalls consume the most time — invaluable for narrowing down where an app is slow.


Sysctl Tuning Methodology

Do not cargo-cult sysctl values from blog posts. Follow this process:

  1. Baseline — measure current performance with a realistic workload
  2. Identify bottleneck — USE method across CPU, memory, disk, network
  3. Research the knob — read the kernel documentation, not a Medium post
  4. Change one thing — single variable at a time
  5. Measure again — same workload, same measurement method
  6. Persist or revert — if improvement confirmed, add to /etc/sysctl.d/

# Apply all sysctl configs
sysctl --system

# Verify a specific setting
sysctl net.ipv4.tcp_congestion_control

SAR for Historical Analysis

# Install sysstat and enable collection
systemctl enable --now sysstat

# CPU history for today
sar -u

# Memory history
sar -r

# Disk I/O history
sar -d

# Network history
sar -n DEV

# Specific time range
sar -u -s 14:00:00 -e 15:00:00

# Yesterday's data
sar -u -f /var/log/sa/sa$(date -d yesterday +%d)

SAR is your time machine. When someone asks "was the system slow yesterday at 2 PM?" — SAR answers that.

Remember: Mnemonic for the first 60 seconds checklist: "Up, Dmesg, VM, IO, Free, Top, Sockets, SAR" — or just remember uptime; dmesg -T | tail; vmstat 1 5; iostat -xz 1 5; free -h; top -bn1 | head; ss -s; sar -n DEV 1 5. These eight commands cover all four resource categories (CPU, memory, disk, network) and take under two minutes.


Quick Reference: Tool Selection

"CPU is high"         -> mpstat, perf top, pidstat
"Memory is low"       -> free -h, vmstat, slabtop
"Disk is slow"        -> iostat -xz, iotop, biolatency
"Network is slow"     -> ss, netstat -s, iperf3, ethtool
"App is slow"         -> strace -cp, perf record, flamegraph
"What happened?"      -> sar, dmesg, journalctl
"Everything is slow"  -> USE method, start with vmstat 1

Performance tuning is detective work. The tools give you evidence. The USE method gives you a systematic way to work through it. Do not skip the methodology and jump straight to sysctl knobs — that is how you make things worse.


Performance Triage

The Four Resources

Every performance problem is ultimately a bottleneck in one of four resources:

Resource   Saturated means                              Key indicator
CPU        Processes waiting for compute time           Load average; %us+%sy in top
Memory     System swapping or OOM killing               si/so in vmstat; MemAvailable in /proc/meminfo
Disk I/O   Processes blocked on read/write              await in iostat; %iowait in top
Network    Bandwidth saturated or connections dropped   ss queue depths; sar -n DEV

First 60 Seconds Checklist

When you land on a slow box, run these in order:

uptime                  # load average trend (1/5/15 min)
dmesg -T | tail -20     # kernel errors, OOM kills, hardware issues
vmstat 1 5              # CPU, memory, swap, I/O overview
iostat -xz 1 5          # per-disk I/O stats
free -h                 # memory overview (look at "available")
top -bn1 | head -30     # top processes by CPU
ss -s                   # socket summary (connection counts)
sar -n DEV 1 5          # network throughput per interface

These eight commands tell you which resource is the bottleneck in under two minutes.
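The checklist can be wrapped in a throwaway triage script that skips tools the box doesn't have (sar and iostat come from the sysstat package) — a sketch:

```shell
#!/bin/sh
# First-60-seconds triage wrapper: run each check in order, skipping
# tools that are not installed on the box.
for cmd in 'uptime' \
           'dmesg -T | tail -20' \
           'vmstat 1 5' \
           'iostat -xz 1 5' \
           'free -h' \
           'top -bn1 | head -30' \
           'ss -s' \
           'sar -n DEV 1 5'; do
    tool=${cmd%% *}
    if command -v "$tool" >/dev/null 2>&1; then
        printf '\n===== %s =====\n' "$cmd"
        sh -c "$cmd" || printf '(%s exited non-zero)\n' "$tool"
    else
        printf '\n===== %s not installed, skipping =====\n' "$tool"
    fi
done
```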

Triage Decision Tree

Symptom: "The system is slow"
  |
  +-> CPU bound?     -> top shows high %CPU, load average > core count
  +-> Memory bound?  -> free shows low available, vmstat shows swap activity
  +-> I/O bound?     -> iostat shows high await/util, top shows %iowait
  +-> Network bound? -> sar shows high throughput, ss shows full queues
  +-> None of these? -> It's probably not this machine. Check upstream dependencies.

Triage Tool Quick Reference

Tool       What it shows                      When to use it
top/htop   Process-level CPU/memory           First look, find the offending process
vmstat     System-wide CPU/memory/swap/IO     Quick overview, detect swap pressure
iostat     Per-disk I/O performance           I/O bottleneck investigation
free       Memory usage summary               Memory pressure check
sar        Historical performance data        Trending, after-the-fact analysis
perf       CPU profiling, flame graphs        Deep CPU investigation
strace     System call tracing                "What is this process doing?"
/proc/*    Kernel-exported runtime data       When tools aren't installed

Triage Heuristics

  1. Check dmesg early. Hardware errors, OOM kills, and kernel panics show up here. Five seconds can save an hour of investigation.
  2. Load average is a blunt instrument. Always decompose it: is it CPU (r in vmstat) or I/O (b in vmstat)?
  3. Never trust a single metric. High load average + high await = processes look busy in the load figure but are actually blocked on disk.
  4. The "available" memory in free -h is the only memory number that matters for "are we running out of memory?"
  5. If everything looks fine on the box, the problem is elsewhere. Check upstream services, DNS, network path, and the calling application.
  6. Performance problems are often caused by recent changes. Ask "what changed?" before deep-diving into metrics.
  7. perf is the nuclear option for CPU issues. If top shows a process at 100% CPU but you don't know why, perf record -p <pid> -g for 10 seconds then perf report will show you the code path.
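Heuristic 2 can be automated: the kernel exports the run-queue decomposition directly in /proc/stat. A sketch:

```shell
#!/bin/sh
# Decompose "the load average": procs_running is roughly vmstat's r
# column (CPU demand); procs_blocked is vmstat's b column
# (uninterruptible sleep, almost always I/O).
running=$(awk '/^procs_running/ { print $2 }' /proc/stat)
blocked=$(awk '/^procs_blocked/ { print $2 }' /proc/stat)
echo "runnable=$running blocked_on_io=$blocked cores=$(nproc)"
if [ "$blocked" -gt 0 ]; then
    echo "load has an I/O component -- check iostat -xz next"
fi
```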
