Linux Performance Tuning - Street-Level Ops¶
Quick Diagnosis Commands¶
When you get paged, run these first. In this order.
# 1. Load average and uptime (is this new or ongoing?)
uptime
# 2. One-second snapshots of everything (CPU, memory, swap, I/O)
vmstat 1 5
# 3. Per-CPU breakdown (spot single-core saturation)
mpstat -P ALL 1 3
# 4. Per-device I/O (find the slow disk)
iostat -xz 1 3
# 5. Top memory and CPU consumers
ps aux --sort=-%mem | head -20
ps aux --sort=-%cpu | head -20
# 6. Network connections (too many? stuck in TIME_WAIT?)
ss -s
# 7. Recent kernel messages (OOM killer? hardware errors?)
dmesg -T | tail -50
Pattern: The 60-Second Performance Checklist¶
Brendan Gregg's "60-second checklist" adapted for real incidents:
uptime # load average trend
dmesg -T | tail # kernel errors, OOM
vmstat 1 5 # run queue, swap, CPU balance
mpstat -P ALL 1 3 # per-CPU imbalance
pidstat 1 3 # per-process CPU
iostat -xz 1 3 # disk utilization & latency
free -h # memory state
sar -n DEV 1 3 # network throughput
sar -n TCP,ETCP 1 3 # TCP retransmits
top # overall picture
Memorize this. Tattoo it. Whatever it takes.
Pattern: Generating Flamegraphs in Production¶
Flamegraphs turn "the app is slow" into "this function is slow."
# Install prerequisites
git clone https://github.com/brendangregg/FlameGraph.git
export PATH=$PATH:$(pwd)/FlameGraph
# CPU flamegraph — sample all CPUs for 30 seconds
perf record -F 99 -ag -- sleep 30
perf script | stackcollapse-perf.pl | flamegraph.pl > cpu.svg
# Off-CPU flamegraph (where processes WAIT)
# Requires BCC tools
offcputime-bpfcc -df -p $(pidof myapp) 30 | flamegraph.pl --color=io > offcpu.svg
# Specific process only
perf record -F 99 -g -p $(pidof myapp) -- sleep 30
perf script | stackcollapse-perf.pl | flamegraph.pl > app.svg
Transfer the SVG to your laptop and open it in a browser. The interactive zooming is the whole point.
Pattern: Strace Like a Pro¶
# Summary: which syscalls take the most time?
strace -cp "$(pidof nginx)"   # quote it: pidof can return several PIDs
# Let it run 10 seconds, then Ctrl-C. Read the table.
# Trace slow syscalls only (>10ms)
strace -T -p $(pidof myapp) 2>&1 | awk '{t=$NF; gsub(/[<>]/,"",t)} t+0 > 0.01'
# (-T prints times as <seconds>; strip the brackets or $NF+0 is always 0)
# Trace file access patterns
strace -e trace=open,openat,stat,access -p $(pidof myapp)
# Trace network calls
strace -e trace=network -p $(pidof myapp)
# Follow child processes (important for forking servers)
strace -f -p $(pidof apache2)
# Write trace to file (avoid terminal overhead)
strace -f -o /tmp/trace.log -p $(pidof myapp)
Gotcha: vmstat Columns Are Confusing¶
procs -----------memory---------- ---swap-- -----io---- -system-- ------cpu-----
r b swpd free buff cache si so bi bo in cs us sy id wa st
4 0 0 512000 32000 8192000 0 0 12 450 3200 5100 35 12 50 3 0
- r = processes waiting for CPU. If r > number of CPUs, you are saturated.
- b = processes in uninterruptible sleep (usually I/O). If b > 0 consistently, dig into disk.
- si/so = swap in/out. Should be zero. If not, you are memory-constrained.
- wa = iowait. Remember: this is misleading if the CPU has other work to do.
- cs = context switches. High values (>100K/s) suggest too many threads fighting for CPUs.
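The r check can be scripted for alerting. A minimal sketch; the helper name `check_runq` is hypothetical, and it assumes procps vmstat layout (two header lines, r in column 1):

```shell
# check_runq: count vmstat samples where the run queue (r, first column)
# exceeds the core count. vmstat prints two header lines, so data starts
# at line 3.
check_runq() {
  awk -v c="$1" 'NR > 2 && $1 + 0 > c { n++ }
                 END { print (n ? n " saturated samples" : "run queue OK") }'
}

# usage: vmstat 1 5 | check_runq "$(nproc)"
```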
Gotcha: %iowait Is a Lie¶
%iowait only increments when a CPU is idle AND waiting for I/O. If the CPU has other work to do while I/O is pending, iowait stays at 0. This means:
- iowait = 0 does NOT mean your I/O is fast
- iowait > 0 means I/O is happening AND the CPU had nothing else to do
Use iostat -xz and look at await instead — that is actual I/O latency.
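If sysstat is not installed, a rough average can be computed straight from /proc/diskstats. A sketch only: these counters are cumulative since boot, so deltas over an interval are more meaningful than this one-shot read:

```shell
# /proc/diskstats fields: $3=device, $4=reads completed, $7=ms spent
# reading, $8=writes completed, $11=ms spent writing (cumulative).
awk '$4 + $8 > 0 { printf "%-10s avg latency %.2f ms\n", $3, ($7 + $11) / ($4 + $8) }' /proc/diskstats
```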
Pattern: /proc Exploration¶
# CPU info (cores, model, flags)
cat /proc/cpuinfo | grep -c processor
cat /proc/cpuinfo | head -30
# Memory details
cat /proc/meminfo
# Per-process file descriptors
ls /proc/$(pidof myapp)/fd | wc -l
# Per-process memory map
cat /proc/$(pidof myapp)/maps | head -20
# Per-process I/O stats
cat /proc/$(pidof myapp)/io
# System-wide file handle usage
cat /proc/sys/fs/file-nr
# Kernel command line (boot parameters)
cat /proc/cmdline
# Current sysctl values (all of them)
sysctl -a 2>/dev/null | wc -l
Gotcha: perf record Overhead¶
Sampling at 99 Hz (perf record -F 99) is safe for production. Do not crank this to 9999 Hz unless you want perf itself to become your bottleneck. perf's default of 4000 Hz is often too aggressive for busy production hosts.
# Safe for production
perf record -F 99 -ag -- sleep 30
# NOT safe for production (high overhead)
perf record -F 9999 -ag -- sleep 30
Pattern: Finding the Process That Is Eating Memory¶
# Top 10 by RSS (actual physical memory)
ps aux --sort=-%mem | head -11
# More precise: show proportional set size (PSS)
# PSS accounts for shared memory correctly
smem -rs pss | head -20
# If smem is not installed, use /proc
# anchor ^Pss: so Pss_Anon/Pss_File/Pss_Shmem lines are not double-counted
for pid in $(ls /proc | grep -E '^[0-9]+$'); do
[ -r /proc/$pid/smaps_rollup ] && \
echo "$pid $(cat /proc/$pid/comm 2>/dev/null) $(awk '/^Pss:/{sum+=$2}END{print sum}' /proc/$pid/smaps_rollup 2>/dev/null) kB"
done | sort -k3 -n -r | head -10
# Watch for OOM killer activity
dmesg -T | grep -i "out of memory"
journalctl -k | grep -i oom
Gotcha: The "Free Memory" Panic¶
Operators see this and panic:
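Typically it is free -h output like this (the numbers shown in the comments are illustrative, reconstructed to match the quote):

```shell
free -h
#                total        used        free      shared  buff/cache   available
# Mem:            15Gi       3.7Gi       128Mi       456Mi        11Gi       11.2Gi
# Swap:          4.0Gi          0B       4.0Gi
```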
"We only have 128M free!" No. You have 11.2G available. Linux caches aggressively — the buff/cache column is reclaimable memory. The available column is what matters.
If available is low (say, under 10% of total), then worry.
Pattern: Quick Network Diagnosis¶
# Check for dropped packets
ip -s link show eth0 | grep -A1 "RX\|TX"
# TCP connection states (too many TIME_WAIT?)
ss -s
# Detailed TIME_WAIT count
ss -tan state time-wait | wc -l
# Check TCP retransmits (packet loss)
nstat -az | grep -i retrans
# Check socket buffer overflows
netstat -s | grep -i "buffer"
cat /proc/net/softnet_stat # third column = time_squeeze
# Measure latency to a host
mtr -rwc 10 target-host
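The softnet_stat counters are hex, one row per CPU. A small decoding loop, sketched in POSIX shell:

```shell
# Decode /proc/net/softnet_stat: column 1 = packets processed,
# column 2 = dropped (backlog full), column 3 = time_squeeze
# (softirq ran out of budget mid-poll).
cpu=0
while read -r processed dropped squeezed rest; do
  printf 'cpu%d processed=%d dropped=%d time_squeeze=%d\n' \
    "$cpu" "0x$processed" "0x$dropped" "0x$squeezed"
  cpu=$((cpu + 1))
done < /proc/net/softnet_stat
```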
Gotcha: Tuning Without a Baseline¶
The number-one mistake in performance tuning: you change sysctl values and declare victory without measuring before and after. Always:
- Run your benchmark/workload BEFORE changes
- Record the numbers (latency p50/p99, throughput, error rate)
- Make ONE change
- Run the same benchmark
- Compare
Without this discipline, you are just flipping switches and hoping.
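One way to make the discipline mechanical: snapshot a few key numbers before and after, then diff. The helper name `snapshot` and the metric list are illustrative; extend it with whatever your workload actually cares about (p99 latency, throughput, error rate):

```shell
# snapshot: record a labeled metrics snapshot so before/after can be diffed.
snapshot() {
  out="/tmp/bench-$1.txt"
  {
    date
    uptime
    free -m | awk '/^Mem:/ { print "avail_mb " $7 }'
  } > "$out"
  echo "saved $out"
}

# snapshot before
# ...make ONE change...
# snapshot after
# diff /tmp/bench-before.txt /tmp/bench-after.txt
```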
Pattern: Disk I/O Deep Dive¶
# Which process is doing I/O?
iotop -oP
# What files is a process writing to?
lsof -p $(pidof myapp) | grep -E 'REG|DIR'
# Block layer latency distribution (BCC tools)
biolatency-bpfcc -D 10
# Watch for I/O scheduler issues
cat /sys/block/sda/queue/scheduler
# Check RAID status (if applicable)
cat /proc/mdstat
megacli -LDInfo -Lall -aALL # LSI/Broadcom RAID
# Check filesystem fragmentation (ext4)
e4defrag -c /data
Pattern: Sysctl Changes That Actually Help¶
These are safe, well-understood changes for servers handling many connections:
# /etc/sysctl.d/99-server-tuning.conf
# Allow more connections
net.core.somaxconn = 65535
net.ipv4.tcp_max_syn_backlog = 65535
# Allow reuse of TIME_WAIT sockets for new outbound connections
net.ipv4.tcp_tw_reuse = 1
# Bigger socket buffers for high-bandwidth links
net.core.rmem_max = 16777216
net.core.wmem_max = 16777216
# Better congestion control
net.ipv4.tcp_congestion_control = bbr
net.core.default_qdisc = fq
# More file handles
fs.file-max = 2097152
# More ephemeral ports
net.ipv4.ip_local_port_range = 1024 65535
# Reduce swappiness for latency-sensitive workloads
vm.swappiness = 10
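Before committing the bbr line, confirm the kernel actually offers it, then load the drop-in and spot-check one value. A sketch (the sysctl commands need root):

```shell
# Is bbr listed? If not, modprobe tcp_bbr (as root) before setting it.
cat /proc/sys/net/ipv4/tcp_available_congestion_control

# Then apply the drop-in and confirm a value stuck (root required):
# sysctl --system
# sysctl net.core.somaxconn
```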
Gotcha: NUMA Effects on Database Performance¶
If your database runs on a 2-socket server and you have not pinned it to a NUMA node, half its memory accesses may be going cross-socket (80ns local vs 150ns remote).
# Check if NUMA is in play
numactl --hardware
# Check per-node memory usage
numastat -c $(pidof postgres)
# If numa_miss is high, you have a problem
numastat | grep miss
The fix is either numactl --interleave=all (spread evenly) for general workloads or numactl --cpunodebind=0 --membind=0 (pin to one node) for latency-sensitive single-instance databases.
Performance Triage: Which Tool to Grab First¶
Do NOT randomly open htop. Follow this decision path:
"Something is slow"
|
+-> uptime
| Load average tells you the magnitude.
| Load >> core count = something is wrong.
| Load ~ core count = system is busy but maybe normal.
| Load < core count = problem might not be CPU.
|
+-> vmstat 1 5
| Look at: r (run queue), b (blocked), si/so (swap in/out),
| us/sy/id/wa (CPU breakdown)
| High r -> CPU bottleneck
| High b -> I/O bottleneck
| si/so > 0 -> memory pressure
| wa > 0 -> I/O wait
|
+-> Based on vmstat, go deeper:
CPU problem -> top/htop, then perf if needed
Memory problem -> free -h, then /proc/meminfo, then check OOM
I/O problem -> iostat -xz 1, then iotop
Network problem -> ss -s, sar -n DEV 1
Triage: Reading top Like a Pro¶
The header matters more than the process list:
top - 14:23:01 up 45 days, 3:12, 2 users, load average: 8.52, 7.31, 4.15
Tasks: 256 total, 3 running, 253 sleeping, 0 stopped, 0 zombie
%Cpu(s): 65.2 us, 8.3 sy, 0.0 ni, 20.1 id, 5.8 wa, 0.0 hi, 0.6 si, 0.0 st
MiB Mem : 15926.4 total, 234.2 free, 12456.8 used, 3235.4 buff/cache
MiB Swap: 4096.0 total, 2048.0 free, 2048.0 used. 2890.1 avail Mem
Key fields:
- load average: 8.52 on an 8-core box = fully loaded. Rising trend (4.15 -> 7.31 -> 8.52) = getting worse.
- us vs sy: us = user-space CPU. sy = kernel CPU. High sy = lots of syscalls, context switches, or kernel work.
- wa (iowait): CPU is idle waiting for I/O. High wa = disk bottleneck.
- st (steal): Time stolen by hypervisor. >5% = noisy neighbor or undersized VM.
- avail Mem: The real "how much memory can I use" number. NOT "free" (free excludes reclaimable cache).
Process columns that matter:
- %CPU: percentage of a single core. On 8 cores, max is 800%.
- RES: actual physical memory used by the process.
- S (state): R=running, S=sleeping, D=uninterruptible sleep (I/O wait -- a red flag if many processes are in D state).
Deep Dive: The top CPU Line Decoded¶
Every field in the %Cpu(s) line means something specific. Most engineers only look at us and id. The others are where the real production signals hide.
CPU Metrics — What Each Field Means¶
| Metric | Full Name | What It Means | When to Care |
|---|---|---|---|
| us | User | User-space CPU (application code) | High = app is CPU-bound. Profile with perf or flamegraph. |
| sy | System | Kernel/system CPU (syscalls, drivers) | High (>20%) = excessive syscalls, context switches, or kernel lock contention. Investigate with strace -c or perf top. |
| ni | Nice | CPU used by nice'd (low-priority) processes | Rarely actionable. Shows renice'd background work. |
| id | Idle | CPU doing nothing | Low idle + low wa = genuine CPU saturation. |
| wa | I/O Wait | CPU idle because it has nothing to do except wait for I/O | High = storage bottleneck. Critical for database workloads. But see the iowait gotcha earlier. |
| hi | Hardware Interrupts | Time handling hardware interrupts (NIC, disk controller) | High = NIC saturation, device overload, or interrupt storm. Check /proc/interrupts. |
| si | Software Interrupts | Time handling softirqs (network packet processing, timers) | High = heavy network traffic, packet processing overhead. Key signal on container hosts (see below). |
| st | Steal | Time the hypervisor took from this VM for other VMs | High = noisy neighbor or overcommitted host. You cannot fix this from inside the VM. |
Container-Specific Guidance¶
On container hosts (Kubernetes nodes, Docker hosts), the top CPU line shows host-level metrics. This is critical because container-aware monitoring tools (kubectl top, cAdvisor, Prometheus container_cpu_*) operate at a different layer and miss key signals.
si and st are invisible to container monitoring. These two metrics are the most likely to reveal problems that kubectl top or cAdvisor cannot see:
- si (softirqs): All containers on a host share the kernel's softirq processing. A pod receiving a flood of network traffic drives up si on the host, degrading all other pods, but kubectl top shows each pod's own CPU usage, not the shared kernel overhead. If si is consistently above 10%, investigate with cat /proc/softirqs and look for NET_RX growth.
- st (steal time): On cloud instances backing a Kubernetes cluster, steal time means the underlying hypervisor is overcommitted. Pods report normal CPU usage via cAdvisor, but actual wall-clock performance is degraded. If st > 5% on a node, the node itself is starved; no amount of pod resource tuning fixes this.

wa (iowait) maps to container I/O throttling:

- High wa on the host often corresponds to containers hitting their blkio cgroup limits or contending for shared storage.
- A container writing heavily to an emptyDir on the node's disk drives up host wa and affects every container on that node.
- Check container I/O throttling (cgroup v1 path): cat /sys/fs/cgroup/blkio/*/blkio.throttle.io_service_bytes
Host top vs container top:

- Running top inside a container (without cgroup-aware procfs) shows the host's CPU counts and metrics, which is misleading in containers with CPU limits.
- Some runtimes mount a cgroup-aware /proc (for example via lxcfs), so top inside the container reflects its actual CPU allocation.
- When in doubt, read the cgroup counters directly from inside the container: cat /sys/fs/cgroup/cpu/cpuacct.usage (cgroup v1) or cat /sys/fs/cgroup/cpu.stat (cgroup v2).
Per-core hotspots:
When top shows moderate average CPU but the app is slow, a single core may be saturated. Press 1 inside top to toggle the per-CPU view, or run mpstat -P ALL 1. A core pinned at 100% while the others sit idle indicates a single-threaded bottleneck or an interrupt-affinity problem (common with NIC queues pinned to one CPU).
Load Average Explained¶
The three numbers are the 1-minute, 5-minute, and 15-minute exponentially damped moving averages of the number of processes in runnable (R) or uninterruptible sleep (D) state.
Interpreting load average relative to CPU count is everything. On a 4-core system:
- Load 4.0 = all cores busy, nothing waiting. At capacity but OK.
- Load 8.0 = double the cores. Processes are queuing; users feel latency.
- Load 0.5 = idle. If the app is slow, the problem is not CPU on this machine.
The key insight: High load + low CPU utilization = I/O bound. Processes are stuck in D state (waiting for disk/NFS/network), inflating the load average without consuming CPU cycles. Run vmstat 1 and check the b column (blocked processes) to confirm.
Trend tells the story:
- 1-min > 5-min > 15-min (e.g., 8.5, 7.3, 4.1): Getting worse. Something changed recently.
- 1-min < 5-min < 15-min (e.g., 2.1, 5.0, 7.3): Recovering. The spike is passing.
- All three similar (e.g., 6.2, 6.0, 6.1): Sustained load. This is the baseline.
The Memory Line — The buff/cache Trap¶
MiB Mem : 15926.4 total, 234.2 free, 12456.8 used, 3235.4 buff/cache
MiB Swap: 4096.0 total, 2048.0 free, 2048.0 used. 2890.1 avail Mem
- Do not look at "free" (234.2 MiB). Linux intentionally uses spare memory for page cache. Low free is normal and healthy.
- Look at "avail Mem" (2890.1 MiB). This is the real answer to "how much memory can applications use?" It includes reclaimable cache and buffers.
- buff/cache (3235.4 MiB): Buffers are metadata caches (block device info). Cache is the page cache (file data). Both are reclaimable under memory pressure; the kernel evicts them automatically when applications need the memory.
- Swap used (2048.0 MiB): Non-zero swap is not automatically bad. Old unused pages get swapped to make room for cache. The question is: is swap actively churning? Check vmstat 1; if si/so are zero, the swapped pages are dormant and not causing performance issues.
Process States — The D State Signal¶
In the process list, the S column shows state:
- R (running): Actively using CPU or ready to run.
- S (sleeping): Waiting for an event (normal for most processes).
- D (uninterruptible sleep): Waiting for I/O to complete. This is the red flag.
Multiple processes in D state = an I/O problem is in progress. The processes cannot be killed (not even kill -9) because they are waiting for the kernel to complete an I/O operation. Common causes:
- Disk hardware failure or high latency
- NFS server not responding
- SAN/iSCSI timeout
- Filesystem corruption forcing synchronous I/O
To investigate D-state processes:
# What is the process waiting for?
cat /proc/<pid>/wchan # kernel function it's blocked in
cat /proc/<pid>/stack # full kernel stack trace
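Those commands need a pid; to enumerate D-state candidates first, a one-liner sketch using ps:

```shell
# List processes in uninterruptible sleep and count them.
# Zero is the healthy answer; a pile of them means I/O trouble.
ps -eo pid,state,comm | awk '$2 ~ /^D/ { print; n++ } END { print (n + 0) " in D state" }'
```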
Triage: Reading iostat Correctly¶
iostat -xz 1 5
Device r/s w/s rkB/s wkB/s await r_await w_await %util
sda 0.50 45.00 4.00 512.00 15.20 1.50 15.35 72.00
sdb 0.00 0.10 0.00 0.80 0.50 0.00 0.50 0.10
What matters:
- await: Average I/O latency (ms). <1ms = great (NVMe). 1-10ms = fine (SSD). 10-20ms = OK (HDD). >50ms = problem.
- r_await/w_await: Read vs write latency separately. Helps identify whether reads or writes are the problem.
- %util: How busy the device is. On HDD, 100% = saturated. On SSD/NVMe, 100% != saturated (parallelism). Use await as the real indicator for SSDs.
- r/s + w/s: IOPS. Compare to device capability (HDD ~150, SSD ~10K-100K, NVMe ~100K-1M).
The -z flag hides devices with zero activity. Very useful on systems with many disks.
First report is an average since boot. Ignore it.
Triage: Identifying Memory Pressure vs Cache¶
The number one mistake: looking at "free" memory and panicking.
free -h
total used free shared buff/cache available
Mem: 15Gi 12Gi 234Mi 456Mi 3.2Gi 2.8Gi
Swap: 4.0Gi 2.0Gi 2.0Gi
- free (234Mi): DOES NOT mean you're out of memory. Linux uses free memory for page cache.
- available (2.8Gi): THIS is your real headroom. It includes reclaimable cache.
- buff/cache (3.2Gi): Memory used for caching. It can be reclaimed if applications need it.
- Swap used (2.0Gi): If this is non-zero AND growing, you have memory pressure. If it's stable, old unused pages were swapped and it might be fine.
The definitive check:¶
# Are we actively swapping RIGHT NOW?
vmstat 1 5 # look at si/so columns
# si/so = 0 -> no active swapping, we're fine even if swap is used
# si/so > 0 -> active memory pressure
Triage: CPU Steal Time¶
st in top/vmstat = time the hypervisor took from your VM to serve other VMs.
- 0-2%: Normal.
- 2-5%: Watch it. Your VM host is busy.
- 5-10%: Performance impact likely. Talk to your cloud provider or host admin.
- >10%: Serious. Migrate or resize.
You cannot fix steal time from inside the VM. It's a host-level problem.
Triage: Load Average Interpretation¶
Load average = number of processes in the run queue + processes waiting for I/O.
On a 4-core system:
- Load 1.0: 25% utilized. Light load.
- Load 4.0: 100% utilized. All cores busy but nothing waiting.
- Load 8.0: 200% overloaded. 4 processes running, 4 waiting. Everything is slow.
- Load 40.0: Something is very wrong. Usually I/O-related (many processes stuck in D state).
Load average includes I/O wait on Linux. A load of 40 on a 4-core system doesn't mean CPU. Check vmstat for b (blocked/IO-wait processes). If b is high, it's an I/O problem driving up the load average.
The three numbers (1/5/15 minute averages) show the trend:
- Rising (1 > 5 > 15): Getting worse.
- Falling (1 < 5 < 15): Recovering.
- Stable: Sustained load.
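The per-core normalization can be done mechanically from /proc/loadavg (a sketch; anything over 1.0 per core means queuing):

```shell
# 1-minute load average divided by core count.
awk -v c="$(nproc)" '{ printf "load/core: %.2f\n", $1 / c }' /proc/loadavg
```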
Triage: The USE Method Applied¶
CPU¶
| Check | Command |
|---|---|
| Utilization | vmstat us+sy, top %CPU |
| Saturation | vmstat r (run queue > core count) |
| Errors | perf stat for CPU errors (rare) |
Memory¶
| Check | Command |
|---|---|
| Utilization | free -h (look at available) |
| Saturation | vmstat si/so, sar -B (pgscand/s) |
| Errors | dmesg \| grep oom, /proc/meminfo (HardwareCorrupted) |
Disk I/O¶
| Check | Command |
|---|---|
| Utilization | iostat -xz %util |
| Saturation | iostat -xz avgqu-sz, await |
| Errors | dmesg \| grep error, smartctl -H |
Network¶
| Check | Command |
|---|---|
| Utilization | sar -n DEV rxkB/s txkB/s vs link speed |
| Saturation | ss -s (overflowed), netstat -s \| grep retransmit |
| Errors | ip -s link (errors, drops), ethtool -S |
Triage: strace — The "What Is This Process Doing" Tool¶
# Attach to a running process
strace -p <pid> -f -e trace=file # file operations only
strace -p <pid> -f -e trace=network # network operations only
strace -p <pid> -f -c # summary of syscall time
strace -p <pid> -f -T # show time spent in each syscall
Common findings:
- Process stuck in poll() or select() = waiting for something (network? file?).
- Process doing thousands of open()/stat() = searching for files or misconfigured paths.
- Process stuck in futex() = lock contention.
- Process stuck in read() on a socket = waiting for remote response.
Warning: strace significantly slows the traced process. Don't use on production hot paths without awareness. perf trace is a lighter alternative.
Triage: /proc Filesystem — When Tools Aren't Installed¶
Per-process info:¶
cat /proc/<pid>/status # process state, memory, threads
cat /proc/<pid>/io # I/O statistics
cat /proc/<pid>/fd # open file descriptors
ls /proc/<pid>/fd | wc -l # count of open files
cat /proc/<pid>/limits # resource limits (ulimits)
cat /proc/<pid>/cmdline | tr '\0' ' ' # full command line
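When you suspect a file-descriptor leak and lsof is not installed, a pure-/proc sweep works. A sketch; processes whose fd directory you cannot read are skipped silently:

```shell
# Top file-descriptor consumers, using only /proc.
for p in /proc/[0-9]*; do
  n=$(ls "$p/fd" 2>/dev/null | wc -l)
  [ "$n" -gt 0 ] && echo "$n ${p#/proc/} $(cat "$p/comm" 2>/dev/null)"
done | sort -rn | head
```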
System-wide:¶
cat /proc/meminfo # detailed memory breakdown
cat /proc/loadavg # load average
cat /proc/stat # CPU time counters
cat /proc/diskstats # I/O stats per device
cat /proc/net/dev # network interface stats
cat /proc/net/snmp # TCP/IP stack statistics
Triage: Common Misdiagnoses¶
"High CPU usage" that isn't a CPU problem¶
- iowait (wa) shows up in the CPU line in top. If %wa is high, it's a disk problem, not CPU.
- High software interrupts (si) = network stack processing. Might be a NIC or driver issue.
"Memory leak" that isn't¶
- Buffer/cache growth is normal. Check available, not free.
- Java/Go/Python processes with large RSS that's mostly unused heap. Check pmap -x <pid> for detail.
"Disk is slow" when it's actually¶
- Swap thrashing (memory problem causing I/O).
- Network filesystem (NFS/CIFS) miscounted as local disk.
- Log rotation or backup running in background.
"Network is slow" when it's actually¶
- DNS resolution timeout (not bandwidth).
- MTU mismatch causing fragmentation.
- TCP retransmits from packet loss (not bandwidth).
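To separate resolver latency from transport problems, time the lookup on its own. A sketch; api.example.com is a placeholder hostname:

```shell
# Slow here = DNS/resolver problem, regardless of bandwidth.
time getent hosts api.example.com || echo "lookup failed"

# Compare against a connect that skips name resolution entirely, e.g.:
# curl -o /dev/null -s -w 'connect: %{time_connect}s\n' http://<ip-address>/
```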
Triage: Decision Tree — Process Stuck in D State¶
Process in D (uninterruptible sleep)
|
+-> Check what it's waiting for:
cat /proc/<pid>/wchan # kernel function it's blocked in
cat /proc/<pid>/stack # full kernel stack trace
|
+-> If many processes in D state:
iostat -xz 1 # disk bottleneck?
dmesg -T # hardware errors? NFS hang?
|
+-> Common causes:
- NFS server not responding (nfs_wait)
- Disk hardware failure (blk_*)
- Filesystem corruption (ext4_*, xfs_*)
- iSCSI/SAN timeout
D-state processes CANNOT be killed (not even with SIGKILL). They are waiting for the kernel to complete an I/O operation. Fix the underlying I/O problem.
Quick Reference¶
- Deep Dive: Linux Performance Debugging
- Deep Dive: Linux Memory Management