Linux Performance Debugging¶
Scope¶
This document explains how to debug Linux performance issues systematically. It covers:
- CPU bottlenecks
- memory pressure
- IO bottlenecks
- scheduler and run queue issues
- network bottlenecks
- perf/eBPF/ftrace tooling concepts
- practical step-by-step workflows
This is aimed at production troubleshooting, interview readiness, and learning how not to chase ghosts.
Big picture¶
Performance debugging is not "run top and guess." It is the process of identifying which subsystem is constraining useful work.
Main resource domains¶
- CPU
- memory
- storage / IO
- network
- scheduler / contention
- kernel/system call overhead
- locking
- external dependencies
A system can have low average CPU and still be slow because of:
- memory reclaim
- lock contention
- blocked IO
- run queue bursts
- packet loss/retransmits
- one hot thread on one core
- cgroup throttling
- scheduler latency
The first rule: define the symptom precisely¶
Ask:
- high latency?
- low throughput?
- periodic stalls?
- one process slow or whole node slow?
- under load only or always?
- user complaints tied to a clock/event/deploy?
- CPU-bound or waiting-bound?
"Server is slow" is not a diagnosis. It is a cry for help.
USE method mindset¶
A helpful framing is Brendan Gregg's USE method: for each resource, check Utilization, Saturation, and Errors.
CPU¶
- utilization: how busy are cores
- saturation: run queue, throttling, waiting to run
- errors: rarely reported directly, but thermal/power events can matter
Memory¶
- utilization: resident usage, page cache, slab
- saturation: reclaim pressure, swap, OOM
- errors: allocation failures, cgroup OOM
Disk¶
- utilization: device busy time
- saturation: queue depth, await, service time
- errors: IO errors, retries
Network¶
- utilization: bandwidth, packet rates
- saturation: queue drops, softirq overload, conntrack limits
- errors: drops, retransmits, checksum/path issues
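A first USE-style pass can be done with `/proc` alone; a minimal sketch (assumes a Linux host; the PSI files under `/proc/pressure` need kernel 4.20+ and may be absent):

```shell
# Quick USE-style first pass using only /proc (no extra tools needed).
echo "--- CPU: runnable load vs core count ---"
nproc
cat /proc/loadavg

echo "--- Memory: headline utilization ---"
grep -E 'MemTotal|MemAvailable|SwapTotal|SwapFree' /proc/meminfo

echo "--- Saturation: PSI, if the kernel exposes it ---"
for f in /proc/pressure/cpu /proc/pressure/memory /proc/pressure/io; do
    if [ -r "$f" ]; then echo "$f: $(head -n 1 "$f")"; fi
done
```

A load average well above `nproc`, or nonzero PSI `avg10` values, tells you which resource section below to open first.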
CPU debugging¶
Questions to ask¶
- are all CPUs busy or one core pinned?
- is time in user, system, irq, softirq, steal, iowait?
- are tasks runnable but not getting CPU?
- is cgroup quota throttling the workload?
Baseline tools¶
- `top` / `htop`
- `uptime`
- `vmstat 1`
- `mpstat -P ALL 1`
- `pidstat -u 1`
What to look for¶
- high run queue relative to CPU count
- one hot thread on a many-core box
- high system CPU due to syscall/kernel overhead
- high softirq for packet-heavy workload
- steal time in virtualized environments
- throttling in containers due to CPU quota
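One hot core is visible even without `mpstat`; a sketch reading the per-CPU counters straight from `/proc/stat` (values are cumulative jiffies since boot, so sample twice and diff for rates):

```shell
# Per-CPU time breakdown from /proc/stat.
# Fields after the cpu label: user nice system idle iowait irq softirq steal.
grep '^cpu[0-9]' /proc/stat | awk '{
    printf "%-6s user=%s sys=%s iowait=%s softirq=%s steal=%s\n",
           $1, $2, $4, $6, $8, $9
}'
```

On a many-core box, one line with a much larger `user` counter than its siblings is the single-hot-thread signature; a skewed `softirq` column points at network interrupt imbalance.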
perf¶
perf lets you sample where CPU time goes.
Examples of questions it answers:
- which functions consume CPU?
- is time in userspace or kernel?
- are we spinning on locks?
- are syscalls hot?
This is one of the strongest tools for moving beyond guesswork.
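A typical sampling workflow looks like the sketch below (assumes `perf` is installed and `kernel.perf_event_paranoid` permits sampling; `PID=$$` is a stand-in for the real target process):

```shell
# Sketch of a perf sampling workflow; substitute the real target PID.
command -v perf >/dev/null 2>&1 || { echo "perf not installed"; exit 0; }
PID=$$

# Sample on-CPU stacks at 99 Hz for 5 seconds.
perf record -F 99 -g -p "$PID" -o /tmp/perf.data -- sleep 5 \
    || { echo "perf record not permitted in this environment"; exit 0; }

# Top functions by sample count, user and kernel mixed.
perf report --stdio -i /tmp/perf.data | head -40

# Hardware counters for the same target: IPC, branch misses, etc.
perf stat -p "$PID" -- sleep 2
```

Sampling at 99 Hz (not 100) avoids lockstep with timer ticks; `-g` records call graphs so you see why a function is hot, not just that it is.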
Memory debugging¶
See the dedicated memory doc for full detail. Performance symptoms often show up as:
- swap activity
- direct reclaim
- high major faults
- OOM kills
- NUMA imbalance
- compaction stalls
- cgroup memory throttling/pressure
Baseline tools¶
- `free -h`
- `/proc/meminfo`
- `vmstat 1`
- `sar -B`
- PSI memory pressure, if available
Red flags¶
- increasing major page faults
- swap in/out churn
- elevated IO plus memory pressure
- `kswapd` or direct reclaim activity
- app latency coinciding with reclaim
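A quick check for several of these at once, assuming `vmstat` is present, with a `/proc` fallback for the fault counter:

```shell
# Swap churn: the si/so columns should sit near 0 on a healthy box;
# sustained nonzero values mean the working set does not fit in RAM.
command -v vmstat >/dev/null 2>&1 && vmstat 1 5

# Major faults for one process straight from /proc:
# field 12 of /proc/<pid>/stat is majflt (here: this shell itself).
awk '{print "majflt:", $12}' /proc/self/stat
```

A rising `majflt` count means the process is waiting on disk to page code or data back in, which shows up as latency, not CPU.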
IO debugging¶
IO pain often impersonates CPU or app bugs.
Questions to ask¶
- is storage busy?
- queue depth too high?
- latency high on reads or writes?
- random vs sequential workload?
- filesystem or block layer issue?
- networked storage path involved?
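The random-vs-sequential question can be answered directly with a short synthetic run; an illustrative `fio` invocation (parameters and the scratch path are placeholders, and `--direct=1` needs a filesystem that supports O_DIRECT):

```shell
# Random 4k reads against a scratch file; compare IOPS and latency with
# a second run using --rw=read (sequential) to characterize the device.
command -v fio >/dev/null 2>&1 || { echo "fio not installed"; exit 0; }
fio --name=randread --rw=randread --bs=4k --size=64M \
    --runtime=10 --time_based --direct=1 \
    --filename=/tmp/fio.scratch \
    || echo "fio run failed (O_DIRECT unsupported on this filesystem?)"
rm -f /tmp/fio.scratch
```

Never point a benchmark like this at a busy production volume; run it on comparable hardware or during a maintenance window.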
Baseline tools¶
- `iostat -xz 1`
- `pidstat -d 1`
- `iotop`
- `sar -d`
- `blktrace`/`btrace`/`fio` for deeper work
Red flags¶
- high `%util`
- high `await` / `svctm`-type latency indicators
- queue depth rising
- one process dominating writes
- dirty/writeback pressure in memory stats
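The first two red flags come straight out of `iostat -x`; a sketch, with a `/proc` fallback for per-process IO when `iotop` is unavailable:

```shell
# Extended device stats every second: watch %util, aqu-sz (queue depth),
# and r_await/w_await (per-request latency in ms).
command -v iostat >/dev/null 2>&1 && iostat -xz 1 3

# Per-process IO counters without iotop (here: this shell itself);
# sample twice and diff to get a rate.
if [ -r /proc/self/io ]; then
    grep -E 'read_bytes|write_bytes' /proc/self/io
fi
```

High `%util` with low `aqu-sz` means the device is busy but keeping up; high `await` with a rising queue means requests are stacking up behind it.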
Network debugging¶
A system can look "slow" because packets are delayed, dropped, or retransmitted.
Questions to ask¶
- packet drops at NIC?
- retransmits?
- conntrack full?
- one CPU handling too many RX interrupts?
- softirq saturation?
- MTU/fragmentation issue?
- load balancer / service mesh / overlay path involved?
Tools¶
- `ip -s link`
- `ethtool -S`
- `ss -s`
- `sar -n DEV 1`
- `nstat`
- `tcpdump`
- `/proc/interrupts`
Red flags¶
- RX/TX drops
- retransmit growth
- softirq-heavy CPU usage
- NIC queue imbalance
- DNS latency masquerading as app latency
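Most of these show up in standard counters; a sketch (replace `eth0` with the interface in question; `nstat` is part of iproute2 and may be absent):

```shell
# NIC-level drops/errors for one interface (falls back to all interfaces).
ip -s link show dev eth0 2>/dev/null || ip -s link

# TCP retransmissions: a steadily growing count under load is a red flag.
nstat -az TcpRetransSegs 2>/dev/null || grep -w Tcp /proc/net/snmp

# Softirq distribution: one CPU absorbing all NET_RX means IRQ imbalance.
grep -E 'NET_RX|NET_TX' /proc/softirqs
```

Counters are cumulative, so what matters is their rate of change: sample twice and diff, or use `sar -n DEV 1` for the same data as a stream.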
Scheduler and run queue issues¶
Sometimes the problem is not total CPU usage but CPU access latency.
Signs¶
- high load average with modest CPU utilization
- many runnable tasks
- latency spikes under burst load
- cgroup quota periods causing bursty throttling
- lock contention causing herd behavior
Helpful tools¶
- `vmstat 1`
- `pidstat -w`
- `perf sched`
- PSI CPU pressure
- `systemd-cgtop` for service-level resource context
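PSI is the most direct signal here; a sketch of reading it, with `pidstat -w` for per-task context-switch detail (PSI needs kernel 4.20+ and either tool may be absent):

```shell
# Involuntary context switches (nvcswch/s) point at preemption pressure.
command -v pidstat >/dev/null 2>&1 && pidstat -w 1 3 | head -15

# "some avg10=..." is the share of the last 10s in which at least one
# task was runnable but waiting for a CPU; nonzero here alongside idle
# CPUs is the classic access-latency signature.
cat /proc/pressure/cpu 2>/dev/null || echo "PSI not exposed by this kernel"
```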
Syscalls, kernel overhead, and locks¶
If an app spends too much time crossing the user/kernel boundary or fighting locks, throughput tanks even when raw CPU seems available.
perf and eBPF/ftrace can reveal¶
- hot syscalls
- mutex contention
- file descriptor churn
- network stack overhead
- allocator hotspots
- scheduler latency
Typical root causes¶
- logging too much
- tiny IO operations
- chatty network behavior
- lock-heavy multithreading
- bad polling loops
- filesystem metadata storms
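Several of these root causes show up immediately in a syscall profile; a sketch with `strace -c`, illustrated on a throwaway command (in production attach with `-p PID` instead, and note that `strace` adds real overhead, so prefer `perf trace` or eBPF tools on hot services):

```shell
# Count and time syscalls; a storm of tiny write()/stat() calls is the
# "logging too much" / "tiny IO" pattern made visible.
command -v strace >/dev/null 2>&1 || { echo "strace not installed"; exit 0; }
strace -c -f -- ls /tmp >/dev/null || echo "ptrace not permitted here"
```

The summary table (one row per syscall, sorted by time) makes "thousands of 1-byte writes" or "a stat() per request" impossible to miss.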
eBPF and tracing tools conceptually¶
You do not need to be an eBPF wizard to benefit from the concept.
eBPF lets you safely attach programs to kernel/user tracing points for visibility into:
- syscalls
- network events
- scheduler activity
- allocations
- block IO
- custom probes
Common ecosystems:
- bcc tools
- bpftrace
- perf integration patterns
ftrace offers kernel tracing with lower-level focus.
These tools matter because averages hide latency sources. Tracing shows event paths.
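As a taste of the bpftrace side, two classic examples (both need root and a kernel with BPF tracing enabled; `runqlat` is from the bcc collection and may be absent):

```shell
# Guard so the sketch is a no-op where bpftrace is unavailable.
command -v bpftrace >/dev/null 2>&1 || { echo "bpftrace not installed"; exit 0; }

# Which processes issue the most syscalls? Runs 3s, then prints the map.
timeout -s INT 3 bpftrace -e \
    'tracepoint:raw_syscalls:sys_enter { @[comm] = count(); }' \
    || echo "bpftrace needs root / BPF support here"

# Histogram of time tasks spend runnable before getting a CPU (bcc tool).
command -v runqlat >/dev/null 2>&1 && runqlat 5 1 || true
```

The value is the histogram shape, not the average: a bimodal run queue latency distribution is exactly the kind of signal that per-minute averages erase.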
Practical workflow¶
Step 1 - classify symptom domain¶
- CPU-bound?
- waiting on IO?
- memory pressure?
- network?
- lock contention?
- external dependency?
Step 2 - baseline the whole node¶
Collect:
- CPU view
- memory view
- IO view
- network view
Do not jump straight into app blame.
Step 3 - identify top offenders¶
Which:
- process
- thread
- cgroup/service
- device
- interface
- syscall/function
is actually hot or stalled?
Step 4 - zoom in with the right tool¶
- CPU hotspot -> `perf`
- reclaim/OOM -> memory stats
- disk latency -> `iostat`, block tracing
- packet loss -> network counters/capture
- scheduler weirdness -> `perf sched`, PSI, run queue tools
Step 5 - correlate with time¶
Performance debugging is temporal. Ask what changed:
- deploy?
- traffic spike?
- cron job?
- backup?
- compaction?
- GC cycle?
- noisy neighbor?
Common production failure patterns¶
1. High load average, CPU not fully busy¶
Could be:
- blocked IO
- runnable queue buildup
- lock contention
- D-state tasks
- reclaim stalls
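A one-liner helps separate these causes: tasks in uninterruptible sleep (state `D`) are almost always blocked in IO or storage-path kernel code, and they inflate load average without using any CPU:

```shell
# List D-state tasks with the kernel function they are waiting in (wchan).
ps -eo pid,stat,wchan:32,comm | awk 'NR==1 || $2 ~ /^D/'
```

If this list is non-empty while CPUs sit idle, the high load average is an IO story, not a CPU story.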
2. 100 percent CPU in one pod, node looks mostly fine¶
Likely single-thread or quota-local issue.
3. App latency spikes every few minutes¶
Possible causes:
- GC
- log rotate/compression
- flush/writeback bursts
- backup jobs
- metrics scrapes too heavy
- compaction/maintenance tasks
4. Throughput poor despite no obvious bottleneck¶
Could be:
- lock contention
- tiny synchronous IO
- RTT/network retransmits
- dependency latency
- CPU branch/cache inefficiency seen only in profiling
5. Everything fine until traffic surge¶
Likely:
- queue saturation
- conntrack limit
- NIC/IRQ imbalance
- thread pool exhaustion
- DB connection pool choke point
- memory reclaim threshold crossing
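For the conntrack case specifically, the limit is directly visible (the `/proc` paths assume the `nf_conntrack` module is loaded; reading `dmesg` may require root):

```shell
# Current vs maximum tracked connections; drops begin when count hits max,
# and the kernel logs "nf_conntrack: table full, dropping packet".
if [ -r /proc/sys/net/netfilter/nf_conntrack_count ]; then
    echo "count: $(cat /proc/sys/net/netfilter/nf_conntrack_count)"
    echo "max:   $(cat /proc/sys/net/netfilter/nf_conntrack_max)"
else
    echo "nf_conntrack not loaded on this host"
fi
```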
Golden anti-patterns to avoid¶
- diagnosing from one screenshot
- staring at load average without context
- equating low CPU with healthy system
- ignoring cgroup limits
- ignoring kernel logs
- blaming app before checking system pressure
- collecting averages only, not distributions
- changing five things at once
Interview angles¶
Good questions hidden here:
- how to approach performance debugging systematically
- difference between CPU busy and CPU saturation
- what `perf` is useful for
- what `iowait` does and does not mean
- how memory pressure can cause latency
- what PSI is conceptually
- how to debug softirq-heavy networking workloads
- why load average is not a CPU usage metric
Strong answers emphasize method, not heroics.
Mental model to keep¶
Performance debugging is bottleneck localization.
You are trying to answer:
- what resource or contention point limits useful work,
- under what conditions,
- for which workload,
- and with what evidence?
Treat the system like a pipeline. Find the narrow section. Then prove it.
References¶
- perf_event_open(2)
- bpf(2)
- bpf-helpers(7)
- Linux /proc filesystem docs
- Prometheus overview
- Prometheus storage
Wiki Navigation¶
Prerequisites¶
- Linux Ops (Topic Pack, L0)
Related Content¶
- Case Study: Disk Full Root Services Down (Case Study, L1) — Filesystems & Storage, Linux Fundamentals
- Case Study: Runaway Logs Fill Disk (Case Study, L1) — Filesystems & Storage, Linux Fundamentals
- Deep Dive: Linux Filesystem Internals (deep_dive, L2) — Filesystems & Storage, Linux Fundamentals
- Disk & Storage Ops (Topic Pack, L1) — Filesystems & Storage, Linux Fundamentals
- Kernel Troubleshooting (Topic Pack, L3) — Filesystems & Storage, Linux Fundamentals
- /proc Filesystem (Topic Pack, L2) — Linux Fundamentals
- Advanced Bash for Ops (Topic Pack, L1) — Linux Fundamentals
- Adversarial Interview Gauntlet (30 sequences) (Scenario, L2) — Linux Fundamentals
- Bash Exercises (Quest Ladder) (CLI) (Exercise Set, L0) — Linux Fundamentals
- Case Study: CI Pipeline Fails — Docker Layer Cache Corruption (Case Study, L2) — Linux Fundamentals