Linux Performance Tuning Footguns¶
- Tuning without benchmarking first. You copy sysctl values from a blog post, apply them, and declare the system faster. You never measured before or after. You have no idea if it helped, hurt, or did nothing. Fix: Always establish a baseline with a realistic workload before changing anything. Measure latency percentiles (p50, p95, p99) and throughput. Change one variable, re-measure, compare.
Remember Brendan Gregg's USE method: for every resource, check Utilization, Saturation, and Errors. This gives you a systematic starting point instead of random top-staring. Resources: CPU, memory, disk I/O, network. Tools: vmstat (CPU/memory), iostat -xz (disk), sar -n DEV (network).
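A baseline does not need heavy tooling. A minimal sketch, assuming one latency sample per line in a file (the numbers below are invented), computes nearest-rank percentiles with sort and awk:

```shell
# Sketch: compute p50/p95/p99 from one latency sample (ms) per line.
# The sample data stands in for real benchmark output.
printf '%s\n' 12 15 11 14 90 13 16 250 12 14 > /tmp/latencies.txt

percentile() {  # percentile <p> <file> -- nearest-rank method
  sort -n "$2" | awk -v p="$1" '
    { v[NR] = $1 }
    END { r = int((p / 100) * NR + 0.999999); if (r < 1) r = 1; print v[r] }'
}

echo "p50=$(percentile 50 /tmp/latencies.txt)"
echo "p95=$(percentile 95 /tmp/latencies.txt)"
echo "p99=$(percentile 99 /tmp/latencies.txt)"
```

Run the same script against the before and after captures and compare p95/p99, not just the average: tail latency is usually what users feel.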
- Cargo-culting sysctl values from the internet. That Medium post about "10 sysctl tweaks for 10x performance" was written for a specific workload on specific hardware in 2018. Your workload is different. Your kernel version is different. Fix: Understand what each sysctl does before applying it. Read the kernel documentation (sysctl-explorer.net or Documentation/networking/ip-sysctl.rst in the kernel source). Test on staging.
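In that spirit, a hedged example of what a documented sysctl file can look like. The values and comments are illustrative, not recommendations, and defaults vary by kernel version:

```
# /etc/sysctl.d/99-tuning.conf -- illustrative entries, not recommendations.
# Each setting records why it exists and where it was validated, so a
# future reader can judge whether it still applies.

# Larger accept backlog for a server taking connection bursts.
# Validated on staging under our own load test, not copied from a blog.
net.core.somaxconn = 8192

# Wider TCP receive buffers (min/default/max) for high-bandwidth,
# high-latency links. Only useful if the path actually has a large
# bandwidth-delay product; measure before and after.
net.ipv4.tcp_rmem = 4096 131072 6291456
```

If you cannot write the comment explaining why a line is there, that is the signal you are cargo-culting it.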
- Ignoring NUMA on multi-socket servers. You run a database on a 2-socket server without NUMA awareness. Half of your memory accesses traverse the interconnect at nearly double the latency. Throughput drops 20-40% and you blame the application. Fix: Check numactl --hardware and numastat. Pin latency-sensitive workloads to a NUMA node or use --interleave=all for general-purpose workloads.
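As a sketch of what to look for, this parses numastat-style counters (the numbers are made up for illustration; on a real host, feed it `numastat` output) and estimates how often node 0 had to go off-node:

```shell
# Sketch: estimate the off-node ratio for node0 from numastat counters.
cat > /tmp/numastat.txt <<'EOF'
                           node0           node1
numa_hit                 7655775         9212651
numa_miss                 307720            3082
EOF

miss_pct=$(awk '
  $1 == "numa_hit"  { h0 = $2 }
  $1 == "numa_miss" { m0 = $2 }
  END { printf "%.1f", 100 * m0 / (h0 + m0) }' /tmp/numastat.txt)
echo "node0 off-node ratio: ${miss_pct}%"
```

If the ratio is high for a latency-sensitive service, pin it with something like numactl --cpunodebind=0 --membind=0 <cmd> and re-measure.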
- Panicking over low "free" memory. You see 128MB "free" on a 16GB box and start killing processes. Meanwhile, 11GB of page cache is perfectly reclaimable and the system is healthy. Fix: Look at the available column in free -h, not free. Linux uses spare memory for caching — this is by design and makes your system faster.
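The distinction is easy to script. A minimal sketch against an invented /proc/meminfo snapshot (on a real host, read /proc/meminfo directly):

```shell
# Sketch: "free" vs "available" from a /proc/meminfo-style snapshot.
cat > /tmp/meminfo.txt <<'EOF'
MemTotal:       16384000 kB
MemFree:          131072 kB
MemAvailable:   11534336 kB
Cached:         11000000 kB
EOF

free_mb=$(awk '$1 == "MemFree:" { print int($2 / 1024) }' /tmp/meminfo.txt)
avail_mb=$(awk '$1 == "MemAvailable:" { print int($2 / 1024) }' /tmp/meminfo.txt)
echo "free: ${free_mb} MB (looks alarming)"
echo "available: ${avail_mb} MB (what actually matters)"
```

MemAvailable is the kernel's own estimate of memory available for new workloads without swapping; alert on that, never on MemFree.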
- Treating %iowait as an I/O performance metric. You see 40% iowait and conclude your disks are slow. But iowait only means the CPU was idle AND waiting for I/O. It does not measure actual disk latency. Fix: Use iostat -xz and look at the await column for actual I/O latency. Use iotop to find which processes are doing I/O.
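A sketch of triaging await, assuming sysstat-style iostat -x columns (the sample rows are invented). It parses the header so the column positions are not hard-coded, since the layout varies across sysstat versions:

```shell
# Sketch: flag devices whose await exceeds a threshold in iostat -x output.
cat > /tmp/iostat.txt <<'EOF'
Device   r/s    w/s   r_await   w_await   %util
nvme0n1  120.0  80.0  0.40      0.90      35.0
sda      15.0   45.0  8.20      95.50     99.0
EOF

slow=$(awk '
  NR == 1 { for (i = 1; i <= NF; i++) col[$i] = i; next }
  $(col["r_await"]) > 20 || $(col["w_await"]) > 20 {
    print $1, "is slow: r_await=" $(col["r_await"]) "ms w_await=" $(col["w_await"]) "ms"
  }' /tmp/iostat.txt)
echo "${slow:-no slow devices}"
```

The 20ms threshold is an assumption for illustration; pick one that matches your storage class (NVMe should be well under a millisecond, spinning disks in the single-digit milliseconds).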
- Cranking perf sampling frequency in production. You set perf record -F 9999 on a production server and perf itself becomes the performance problem. The overhead of sampling at high frequency can spike CPU usage. Fix: Use perf record -F 99 (99 Hz) for production profiling. This gives plenty of data with negligible overhead. Save high-frequency sampling for dedicated test environments.
- Disabling swap entirely on production servers. You set vm.swappiness=0 or remove swap, thinking it will prevent latency. Instead, when memory gets tight, the OOM killer starts shooting processes with no warning. Fix: Keep a small swap partition (1-2GB) as a safety valve. Set vm.swappiness=10 to make swapping unlikely but keep the escape hatch. Monitor with sar -W.
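A hedged example of the safety-valve setup; sizes and paths are illustrative, not prescriptions:

```
# /etc/fstab: keep a small swap file or partition as an OOM buffer
/swapfile  none  swap  sw  0  0

# /etc/sysctl.d/99-swap.conf: make swapping rare, not impossible
vm.swappiness = 10
```

With this in place, occasional nonzero pgswpout in sar -W is an early warning of memory pressure rather than a surprise OOM kill.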
- Using Transparent Huge Pages with databases. THP is enabled by default on most distros, and it can cause unpredictable latency spikes with databases (PostgreSQL, MongoDB, Redis). Pages get defragmented in the background, causing stalls. Fix: Disable THP for database servers: echo never > /sys/kernel/mm/transparent_hugepage/enabled. Use explicit huge pages if the application supports them.
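The echo into sysfs does not survive a reboot. One common pattern is a small oneshot systemd unit; this is a sketch, so verify the sysfs path and unit conventions on your distro:

```
# Hypothetical /etc/systemd/system/disable-thp.service
[Unit]
Description=Disable Transparent Huge Pages
After=local-fs.target

[Service]
Type=oneshot
ExecStart=/bin/sh -c 'echo never > /sys/kernel/mm/transparent_hugepage/enabled'

[Install]
WantedBy=multi-user.target
```

Enable it with systemctl enable --now disable-thp.service and verify that cat /sys/kernel/mm/transparent_hugepage/enabled shows [never]. Some database vendors also suggest setting the sibling defrag knob the same way; check your database's documentation.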
- Tuning the wrong I/O scheduler for your storage. You set the cfq scheduler on an NVMe drive (which does not benefit from I/O sorting) or use none on a spinning disk (which needs it). Fix: Use none or mq-deadline for SSDs/NVMe. Use mq-deadline or bfq for spinning disks. Check with cat /sys/block/DEV/queue/scheduler.
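The rotational flag is enough to drive the choice. A runnable sketch that builds a throwaway fake /sys/block tree so the logic can be exercised anywhere; on a real host, point sys at /sys/block instead:

```shell
# Sketch: recommend a scheduler per device from the rotational flag.
sys=$(mktemp -d)
mkdir -p "$sys/nvme0n1/queue" "$sys/sda/queue"
echo 0 > "$sys/nvme0n1/queue/rotational"   # 0 = SSD/NVMe
echo 1 > "$sys/sda/queue/rotational"       # 1 = spinning disk

recs=$(for dev in "$sys"/*; do
  if [ "$(cat "$dev/queue/rotational")" = "0" ]; then
    echo "$(basename "$dev"): use none or mq-deadline"
  else
    echo "$(basename "$dev"): use mq-deadline or bfq"
  fi
done)
echo "$recs"
```

Apply the choice with echo mq-deadline > /sys/block/DEV/queue/scheduler, and persist it via a udev rule or your distro's mechanism so it survives reboots.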
- Running strace on a hot production process without understanding overhead. strace uses ptrace, which stops the process for every syscall. On a high-throughput process, this can add 100x+ overhead and effectively cause an outage. Fix: If you must strace in production, prefer strace -c (summary mode) with a short capture window, but remember the per-syscall ptrace stops remain. For detailed tracing, use eBPF-based tools (bpftrace, perf trace), which have far less overhead. Save full strace for development or low-traffic processes.
Under the hood: strace uses ptrace(2), which context-switches the traced process to the kernel on every syscall entry and exit. A process doing 100,000 syscalls/sec under strace effectively doubles its context switches. eBPF-based alternatives (bpftrace, perf trace) attach probes in-kernel without stopping the target process, reducing overhead from 100x to 1-5%.
Performance Triage Footguns¶
Mistakes during live performance triage that cause outages or wasted hours.
11. Looking at CPU% first — chasing the wrong metric¶
The alert says "high CPU." You open htop and see processes in the 30-40% range, nothing alarming. You declare "CPU looks fine" and start looking elsewhere. Meanwhile, iowait is at 60% — the system is spending most of its time waiting for disk I/O, which shows up in load average and latency but not in per-process %CPU. The real bottleneck is the disk.
Fix: Never start with CPU%. Start with vmstat 1 5 and uptime. Look at the load average and decompose it: r (run queue) vs b (blocked on I/O) in vmstat. If b is high, the problem is I/O, not CPU. If r is high, then look at CPU. Iowait in top's header (%wa) is your next check. CPU% per process is a late-stage tool, not a starting point.
Debug clue: Linux load average includes processes in uninterruptible sleep (D state) — waiting for disk I/O. A load average of 16 on a 4-core system looks alarming, but if vmstat shows r=3, b=13, only 3 processes want CPU. The other 13 are waiting for I/O. The fix is faster storage, not more CPU.
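The decomposition can be scripted. A sketch against a canned vmstat snapshot whose numbers mirror the r=3, b=13 scenario; on a real host, feed it `vmstat 1 5` output instead:

```shell
# Sketch: decompose load into runnable vs blocked from a vmstat snapshot.
cat > /tmp/vmstat.txt <<'EOF'
procs -----------memory---------- ---swap-- -----io---- -system-- ------cpu-----
 r  b   swpd   free   buff  cache   si   so    bi    bo   in   cs us sy id wa st
 3 13      0 131072  20480 524288    0    0  4096  8192  900 1500  5  5 20 70  0
EOF

r=$(awk 'NR == 3 { print $1 }' /tmp/vmstat.txt)
b=$(awk 'NR == 3 { print $2 }' /tmp/vmstat.txt)
echo "runnable (want CPU): $r"
echo "blocked on I/O:      $b"
if [ "$b" -gt "$r" ]; then echo "verdict: I/O-bound -- look at storage, not CPU"; fi
```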
12. Ignoring iowait — treating an I/O problem as a CPU problem¶
top shows 70% CPU usage. You look at the per-process list and nothing obvious is consuming it. The CPU line shows: us=5%, sy=5%, id=20%, wa=70%. The 70% is iowait — the CPU is idle waiting for I/O, but top's default display groups this with "busy." You add more CPU capacity (scale up, bigger instance). I/O wait remains at 70% because the disk is the bottleneck, not the CPU count.
Fix: In top, the wa field is the critical iowait number. Any wa above 10% warrants investigation. Switch to iostat -xz 1 to identify which device is saturated. Check await (I/O latency) and %util. High await with moderate %util often means requests are stacking up in the device queue. Remedies: faster storage, optimized I/O patterns, caching.
13. Not separating user vs system CPU time — misidentifying the root cause¶
A service is consuming high CPU. You see 80% CPU usage total. You assume the application is the problem and start profiling the application code. But 60% of that is sy (system/kernel time), not us (user time). High sy indicates excessive syscalls, context switches, or kernel work — often caused by too many threads, network interrupt processing, or locking contention in the kernel. Application profiling won't find this.
Fix: Check the CPU breakdown in the vmstat or top header: us (user space), sy (kernel), ni (nice), id (idle), wa (iowait). High sy: investigate context switches with perf stat -e context-switches,migrations -p <pid>, or use sar -u 1 for a rolling breakdown. High si (software interrupts): likely network stack processing — check sar -n DEV and NIC interrupt affinity.
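A sketch of pulling us and sy out of a top-style CPU line. The sample line is invented to mirror the 80%-total, mostly-kernel scenario above:

```shell
# Sketch: split user vs system time out of a top-style %Cpu(s) line.
line='%Cpu(s): 20.0 us, 60.0 sy,  0.0 ni, 15.0 id,  5.0 wa,  0.0 hi,  0.0 si,  0.0 st'

us=$(echo "$line" | awk -F'[ ,]+' '{ for (i = 1; i <= NF; i++) if ($i == "us") print $(i-1) }')
sy=$(echo "$line" | awk -F'[ ,]+' '{ for (i = 1; i <= NF; i++) if ($i == "sy") print $(i-1) }')
echo "user: ${us}%  system: ${sy}%"
awk -v sy="$sy" 'BEGIN { if (sy + 0 > 30) print "high sy: suspect syscalls/context switches, not app code" }'
```

The 30% threshold is an assumption for illustration; the point is that a mostly-sy profile sends you to the kernel side, not the application profiler.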
14. Using top instead of mpstat — missing per-CPU imbalance¶
You run top and see average CPU at 30%. Seems fine. But on a 16-core system, one core is at 100% and the rest are at 20%. A single-threaded process is CPU-bound but you can't see it in the average. The application feels slow despite "low" overall CPU. Similarly, interrupts from a NIC may all land on CPU 0, saturating it while others are idle.
Fix: Use mpstat -P ALL 1 to see per-CPU utilization. Look for any core at or near 100% while others are idle. This indicates a single-threaded bottleneck or interrupt imbalance. For interrupt distribution: cat /proc/interrupts | head -20 to see interrupt counts per CPU. Fix single-threaded bottlenecks by parallelizing or scaling vertically (faster clock speed matters more than core count for single-threaded work). Fix interrupt imbalance with irqbalance or manual CPU affinity.
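A sketch that flags pegged cores in mpstat-style output; the sample rows are invented to show one saturated core on an otherwise quiet box:

```shell
# Sketch: flag cores with almost no idle time in mpstat -P ALL style output.
cat > /tmp/mpstat.txt <<'EOF'
CPU   %usr  %sys  %idle
all   28.0   2.0  70.0
0     99.0   1.0   0.0
1     20.0   1.0  79.0
2     18.0   2.0  80.0
EOF

hot=$(awk 'NR > 1 && $1 != "all" && $4 < 10 { print "CPU " $1 " is pegged (" 100 - $4 "% busy)" }' /tmp/mpstat.txt)
echo "${hot:-no per-CPU hotspot}"
```

Note the "all" row alone would have hidden the problem, which is exactly the footgun described above.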
15. Measuring during the wrong window — collecting metrics after the problem resolves¶
The alert fires. You SSH in and run vmstat, iostat, top. Everything looks normal. You conclude "false alarm" and close the ticket. But the problem was a 30-second spike 2 minutes ago — by the time you logged in, the burst had finished. The customer-impacting latency spike has no forensic data.
Fix: Implement continuous metrics collection before problems occur. sar collects system statistics every 10 minutes by default (configurable) and stores them in /var/log/sa/. Check historical data: sar -u -f /var/log/sa/sa<day> for CPU, sar -d for disk, sar -n DEV for network. Better: ship metrics to a time-series database (Prometheus, CloudWatch). When an alert fires, you should already have graphs — not run commands and hope the problem is still happening.
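A sketch of mining sar-style history for the spike you missed. The sample intervals are invented; on a real host, feed it `sar -u -f /var/log/sa/sa<day>` output:

```shell
# Sketch: scan sar-style CPU history for iowait spikes after the fact.
cat > /tmp/sar.txt <<'EOF'
12:00:01  CPU  %user  %system  %iowait  %idle
12:10:01  all   12.0      3.0      2.0   83.0
12:20:01  all   11.0      4.0     61.0   24.0
12:30:01  all   13.0      3.0      3.0   81.0
EOF

spike=$(awk 'NR > 1 && $5 > 20 { print $1 " iowait=" $5 "%" }' /tmp/sar.txt)
echo "${spike:-no iowait spike in this window}"
```

Here the 12:20 interval is the 30-second burst the live commands never saw, which is the whole argument for continuous collection.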
16. Forgetting memory pressure causes CPU stalls — diagnosing slowness as a CPU problem¶
The system is slow. CPU is high, but mostly from many processes running rather than one hog. You add CPU. The slowness continues. The actual problem is that the system is paging — swapping pages in and out causes every process to stall waiting for disk I/O, which shows up as increased CPU because more processes are in the run queue. More CPUs don't help; more memory (or less memory pressure) does.
Fix: When CPU is elevated and distributed across many processes with no single hog, check memory: vmstat 1 5 and look at si/so (swap in/out). Any non-zero so indicates active memory pressure. Check free -h and look at "available" not "free." Check dmesg | grep -i oom for OOM killer events. Fix: reduce memory usage, add RAM, or tune the OOM killer (/proc/sys/vm/overcommit_memory, oom_score_adj).
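A sketch of the si/so check against a canned vmstat data line (values invented); the columns are vmstat's standard r b swpd free buff cache si so bi bo in cs us sy id wa st layout:

```shell
# Sketch: detect active swapping from the si/so columns of a vmstat line.
line=' 8  2 524288  65536  10240 131072  512  768  2048  4096 1200 3000 30 20 40 10  0'
set -- $line
si=$7; so=$8
echo "swap-in: ${si} KB/s  swap-out: ${so} KB/s"
if [ "$so" -gt 0 ]; then echo "verdict: memory pressure -- add RAM, do not add CPU"; fi
```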
17. Treating a performance issue as a code bug when it's infrastructure — optimizing the wrong layer¶
Response times are slow. Engineers spend two weeks optimizing database queries, adding indexes, rewriting algorithms. Response times improve 10%. The real bottleneck was a misconfigured network queue on the load balancer causing packet drops and TCP retransmissions, adding 200ms to every request. Or a cloud instance is on a noisy neighbor host with high steal time. No amount of code optimization overcomes a 200ms network penalty.
Fix: Before profiling code, do a 10-minute infrastructure sanity check: (1) tailscale netcheck or equivalent — is the network path healthy? (2) vmstat — is this machine healthy? (3) Check steal time st in top — is the hypervisor taking CPU? (4) netstat -s or nstat — are there excessive retransmits? (5) Check load balancer and upstream service response times in your monitoring. Only after ruling out infrastructure should you profile application code.
Gotcha: Cloud instance "steal time" (st in top) means the hypervisor is taking CPU cycles away from your VM for other tenants. Any st > 5% means you're on a noisy neighbor. This is invisible to application profiling — your code looks slow but the CPU is literally being stolen. The fix is to move to a dedicated instance type or file a support ticket, not to optimize your code.
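A sketch of applying the 5% rule to a top-style CPU line (the sample line is invented):

```shell
# Sketch: pull steal time out of a top-style %Cpu(s) line and apply
# the 5% noisy-neighbor rule.
line='%Cpu(s):  8.0 us,  3.0 sy,  0.0 ni, 75.0 id,  2.0 wa,  0.0 hi,  0.5 si, 11.5 st'
st=$(echo "$line" | awk -F'[ ,]+' '{ for (i = 1; i <= NF; i++) if ($i == "st") print $(i-1) }')
echo "steal: ${st}%"
awk -v st="$st" 'BEGIN { if (st + 0 > 5) print "noisy neighbor: change instance type, do not profile code" }'
```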