
Portal | Level: L2: Operations | Topics: Linux Fundamentals, Filesystems & Storage | Domain: Linux

Linux Performance Debugging

Scope

This document explains how to debug Linux performance issues systematically. It covers:

  • CPU bottlenecks
  • memory pressure
  • IO bottlenecks
  • scheduler and run queue issues
  • network bottlenecks
  • perf/eBPF/ftrace tooling concepts
  • practical step-by-step workflows

This is aimed at production troubleshooting, interview readiness, and learning how not to chase ghosts.


Big picture

Performance debugging is not "run top and guess." It is the process of identifying which subsystem is constraining useful work.

Main resource domains

  • CPU
  • memory
  • storage / IO
  • network
  • scheduler / contention
  • kernel/system call overhead
  • locking
  • external dependencies

A system can have low average CPU and still be slow because of:

  • memory reclaim
  • lock contention
  • blocked IO
  • run queue bursts
  • packet loss/retransmits
  • one hot thread on one core
  • cgroup throttling
  • scheduler latency

The first rule: define the symptom precisely

Ask:

  • high latency?
  • low throughput?
  • periodic stalls?
  • one process slow or whole node slow?
  • under load only or always?
  • user complaints tied to a clock/event/deploy?
  • CPU-bound or waiting-bound?

"Server is slow" is not a diagnosis. It is a cry for help.


USE method mindset

A helpful framing is Brendan Gregg's USE method: for each resource, check Utilization, Saturation, and Errors.

CPU

  • utilization: how busy are cores
  • saturation: run queue, throttling, waiting to run
  • errors: less common directly, but thermal or power-capping events can matter

Memory

  • utilization: resident usage, page cache, slab
  • saturation: reclaim pressure, swap, OOM
  • errors: allocation failures, cgroup OOM

Disk

  • utilization: device busy time
  • saturation: queue depth, await, service time
  • errors: IO errors, retries

Network

  • utilization: bandwidth, packet rates
  • saturation: queue drops, softirq overload, conntrack limits
  • errors: drops, retransmits, checksum/path issues

CPU debugging

Questions to ask

  • are all CPUs busy or one core pinned?
  • is time spent in user, system, irq, softirq, steal, or iowait?
  • are tasks runnable but not getting CPU?
  • is cgroup quota throttling the workload?

Baseline tools

  • top / htop
  • uptime
  • vmstat 1
  • mpstat -P ALL 1
  • pidstat -u 1

What to look for

  • high run queue relative to CPU count
  • one hot thread on a many-core box
  • high system CPU due to syscall/kernel overhead
  • high softirq for packet-heavy workload
  • steal time in virtualized environments
  • throttling in containers due to CPU quota
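Several of these signals can be checked with nothing beyond /proc and coreutils. A minimal sketch (field positions follow the /proc/stat and /proc/loadavg layouts documented in proc(5)):

```shell
#!/bin/sh
# Runnable tasks vs. CPU count: sustained runnable > cores means CPU saturation.
cpus=$(nproc)
runnable=$(awk '/^procs_running/ {print $2}' /proc/stat)
echo "runnable=${runnable} cpus=${cpus}"

# 1/5/15-minute load averages. On Linux these include tasks in uninterruptible
# D-state, so high load with idle CPUs can mean blocked IO, not CPU demand.
awk '{printf "load: 1m=%s 5m=%s 15m=%s\n", $1, $2, $3}' /proc/loadavg

# Where CPU time goes: jiffies since boot from the aggregate "cpu" line.
awk '/^cpu / {printf "user=%s sys=%s iowait=%s irq=%s softirq=%s steal=%s\n",
              $2, $4, $6, $7, $8, $9}' /proc/stat
```

These are cumulative counters, so take two samples a few seconds apart and diff them to get rates.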

perf

perf lets you sample where CPU time goes.

Examples of questions it answers:

  • which functions consume CPU?
  • is time in userspace or kernel?
  • are we spinning on locks?
  • are syscalls hot?

This is one of the strongest tools for moving beyond guesswork.
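As a sketch, on-CPU sampling with perf usually looks like the following. TARGET_PID is a placeholder for the hot process you identified, and perf typically needs root or a relaxed perf_event_paranoid setting, so the block degrades gracefully when unavailable:

```shell
#!/bin/sh
# Sample on-CPU call stacks at 99 Hz for 10 seconds, then summarize by symbol.
# TARGET_PID is hypothetical -- substitute the process you are investigating.
if command -v perf >/dev/null 2>&1 && [ -n "${TARGET_PID:-}" ]; then
  perf record -F 99 -g -p "$TARGET_PID" -- sleep 10    # sample with stacks
  perf report --stdio | head -30                       # top consumers first
else
  echo "perf not available or TARGET_PID unset; skipping"
fi
```

99 Hz (rather than 100) avoids sampling in lockstep with timer-driven work; `-g` records call graphs so you see *why* a function is hot, not just that it is.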


Memory debugging

See the dedicated memory doc for full detail. Performance symptoms often show up as:

  • swap activity
  • direct reclaim
  • high major faults
  • OOM kills
  • NUMA imbalance
  • compaction stalls
  • cgroup memory throttling/pressure

Baseline tools

  • free -h
  • /proc/meminfo
  • vmstat 1
  • sar -B
  • PSI memory pressure if available
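A quick portable pass over these sources might look like this sketch (PSI files exist only on kernels >= 4.20 built with CONFIG_PSI, hence the guard):

```shell
#!/bin/sh
# Headline memory numbers: MemAvailable is the best single "free-ish" figure,
# Dirty/Writeback hint at flush pressure feeding into IO.
awk '/^(MemTotal|MemAvailable|SwapFree|Dirty|Writeback):/ {print}' /proc/meminfo

# PSI memory pressure: "some avg10" is the share of the last 10 seconds in
# which at least one task stalled on memory (reclaim, refault, swap).
if [ -r /proc/pressure/memory ]; then
  cat /proc/pressure/memory
else
  echo "PSI not available on this kernel"
fi
```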

Red flags

  • increasing major page faults
  • swap in/out churn
  • elevated IO plus memory pressure
  • kswapd or direct reclaim activity
  • app latency coinciding with reclaim

IO debugging

IO pain often impersonates CPU or app bugs.

Questions to ask

  • is storage busy?
  • queue depth too high?
  • latency high on reads or writes?
  • random vs sequential workload?
  • filesystem or block layer issue?
  • networked storage path involved?

Baseline tools

  • iostat -xz 1
  • pidstat -d 1
  • iotop
  • sar -d
  • blktrace / btrace / fio for deeper work
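When iostat is not installed, /proc/diskstats gives the same raw counters; a sketch (field numbers follow the documented diskstats layout, and the iostat call assumes the sysstat package):

```shell
#!/bin/sh
# Per-device in-flight IOs (field 12 after major/minor/name): a queue that
# stays high while latency climbs points at device saturation.
awk '$3 !~ /^(loop|ram)/ {printf "%-10s in_flight=%s ms_doing_io=%s\n",
                          $3, $12, $13}' /proc/diskstats

# For rates, await, and %util, iostat is the usual next step if available.
command -v iostat >/dev/null 2>&1 && iostat -xz 1 2 | tail -20 || true
```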

Red flags

  • high %util
  • high await and service-time latency (svctm is unreliable and removed in newer sysstat)
  • queue depth rising
  • one process dominating writes
  • dirty/writeback pressure in memory stats

Network debugging

A system can look "slow" because packets are delayed, dropped, or retransmitted.

Questions to ask

  • packet drops at NIC?
  • retransmits?
  • conntrack full?
  • one CPU handling too many RX interrupts?
  • softirq saturation?
  • MTU/fragmentation issue?
  • load balancer / service mesh / overlay path involved?

Tools

  • ip -s link
  • ethtool -S
  • ss -s
  • sar -n DEV 1
  • nstat
  • tcpdump
  • /proc/interrupts
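Two of the highest-value counters can be pulled straight from /proc; a sketch (column positions follow the /proc/net/dev header, and retransmits are cumulative, so diff two samples for a rate):

```shell
#!/bin/sh
# Per-interface error/drop counters from /proc/net/dev.
awk 'NR>2 { sub(/^ +/, ""); split($0, f, /[: ]+/)
            printf "%-10s rx_err=%s rx_drop=%s tx_err=%s tx_drop=%s\n",
                   f[1], f[4], f[5], f[12], f[13] }' /proc/net/dev

# Cumulative TCP retransmitted segments; find the column by header name so the
# script survives layout differences between kernels.
awk '/^Tcp:/ { if (!seen) { for (i=1;i<=NF;i++) if ($i=="RetransSegs") col=i; seen=1 } else print "tcp_retrans_segs=" $col }' /proc/net/snmp
```

Growing drops at the NIC level are usually visible in `ethtool -S` before they appear here, so check both when the numbers disagree.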

Red flags

  • RX/TX drops
  • retransmit growth
  • softirq-heavy CPU usage
  • NIC queue imbalance
  • DNS latency masquerading as app latency

Scheduler and run queue issues

Sometimes the problem is not total CPU usage but CPU access latency.

Signs

  • high load average with modest CPU utilization
  • many runnable tasks
  • latency spikes under burst load
  • cgroup quota periods causing bursty throttling
  • lock contention causing herd behavior

Helpful tools

  • vmstat 1
  • pidstat -w
  • perf sched
  • PSI CPU pressure
  • systemd-cgtop for service-level resource context
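PSI and per-task schedstat are the cheapest windows into CPU access latency; a guarded sketch (PSI needs kernel >= 4.20, schedstat needs CONFIG_SCHED_INFO, both common on modern distros):

```shell
#!/bin/sh
# PSI CPU pressure: "some avg10" is the share of the last 10 seconds in which
# at least one runnable task waited for a CPU -- saturation, not utilization.
if [ -r /proc/pressure/cpu ]; then
  cat /proc/pressure/cpu
else
  echo "PSI not available on this kernel"
fi

# Per-task scheduler stats for this shell ($$): time on CPU (ns), time spent
# runnable-but-waiting (ns), and number of timeslices run.
[ -r "/proc/$$/schedstat" ] && cat "/proc/$$/schedstat" || true
```

A process with low CPU time but a fast-growing second schedstat field is starved, not idle.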

Syscalls, kernel overhead, and locks

If an app spends too much time crossing the user/kernel boundary or fighting locks, throughput tanks even when raw CPU seems available.

perf and eBPF/ftrace can reveal:

  • hot syscalls
  • mutex contention
  • file descriptor churn
  • network stack overhead
  • allocator hotspots
  • scheduler latency

Typical root causes

  • logging too much
  • tiny IO operations
  • chatty network behavior
  • lock-heavy multithreading
  • bad polling loops
  • filesystem metadata storms
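One cheap way to confirm syscall volume is the problem is a per-syscall summary; strace's counting mode illustrates the idea (strace adds heavy overhead, so on production processes prefer `perf trace` or eBPF -- this sketch only traces a trivial child command):

```shell
#!/bin/sh
# Count syscalls made by a short-lived child (-c: summary table on exit,
# -f: follow forks). The summary is written to stderr.
if command -v strace >/dev/null 2>&1; then
  strace -c -f ls / >/dev/null 2>/tmp/syscall_summary.txt || true
  cat /tmp/syscall_summary.txt
else
  echo "strace not installed; 'perf trace -s <cmd>' gives a similar summary"
fi
```

A summary dominated by tiny read/write calls or repeated open/close of the same paths is exactly the "tiny IO" and "file descriptor churn" pattern above.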

eBPF and tracing tools conceptually

You do not need to be an eBPF wizard to benefit from the concept.

eBPF lets you safely attach programs to kernel/user tracing points for visibility into:

  • syscalls
  • network events
  • scheduler activity
  • allocations
  • block IO
  • custom probes

Common ecosystems:

  • bcc tools
  • bpftrace
  • perf integration patterns
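For flavor, a classic bpftrace one-liner counts syscalls by process name. It needs root and the bpftrace package, so this sketch degrades to a message when either is missing:

```shell
#!/bin/sh
# Attach to the raw syscall-entry tracepoint, count per comm for 5 seconds;
# bpftrace prints the accumulated map when the timeout terminates it.
if command -v bpftrace >/dev/null 2>&1 && [ "$(id -u)" -eq 0 ]; then
  timeout 5 bpftrace -e 'tracepoint:raw_syscalls:sys_enter { @[comm] = count(); }'
else
  echo "bpftrace not available or not root; skipping"
fi
```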

ftrace offers kernel tracing with lower-level focus.

These tools matter because averages hide latency sources. Tracing shows event paths.


Practical workflow

Step 1 - classify symptom domain

  • CPU-bound?
  • waiting on IO?
  • memory pressure?
  • network?
  • lock contention?
  • external dependency?

Step 2 - baseline the whole node

Collect:

  • CPU view
  • memory view
  • IO view
  • network view

Do not jump straight into app blame.
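All four views can be captured in one timestamped snapshot before touching anything; a minimal sketch using only /proc (the snapshot path is illustrative):

```shell
#!/bin/sh
# One-shot node baseline: grab raw counters now so a second sample gives rates.
snap_dir="/tmp/perfsnap.$(date +%s)"    # illustrative location
mkdir -p "$snap_dir"

cat /proc/loadavg   > "$snap_dir/loadavg"    # CPU demand
cat /proc/stat      > "$snap_dir/stat"       # CPU time breakdown
cat /proc/meminfo   > "$snap_dir/meminfo"    # memory view
cat /proc/vmstat    > "$snap_dir/vmstat"     # reclaim/fault counters
cat /proc/diskstats > "$snap_dir/diskstats"  # IO view
cat /proc/net/dev   > "$snap_dir/netdev"     # network view

echo "baseline written to $snap_dir (take a second snapshot ~10s later)"
```

Diffing two snapshots turns the cumulative counters into per-interval rates, which is what every later step actually needs.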

Step 3 - identify top offenders

Which:

  • process
  • thread
  • cgroup/service
  • device
  • interface
  • syscall/function

is actually hot or stalled?

Step 4 - zoom in with the right tool

  • CPU hotspot -> perf
  • reclaim/OOM -> memory stats
  • disk latency -> iostat, block tracing
  • packet loss -> network counters/capture
  • scheduler weirdness -> perf sched, PSI, run queue tools

Step 5 - correlate with time

Performance debugging is temporal. Ask what changed:

  • deploy?
  • traffic spike?
  • cron job?
  • backup?
  • compaction?
  • GC cycle?
  • noisy neighbor?

Common production failure patterns

1. High load average, CPU not fully busy

Could be:

  • blocked IO
  • runnable queue buildup
  • lock contention
  • D-state tasks
  • reclaim stalls

2. 100 percent CPU in one pod, node looks mostly fine

Likely a single-threaded bottleneck or a cgroup CPU-quota issue local to that pod.

3. App latency spikes every few minutes

Possible causes:

  • GC
  • log rotate/compression
  • flush/writeback bursts
  • backup jobs
  • metrics scrapes too heavy
  • compaction/maintenance tasks

4. Throughput poor despite no obvious bottleneck

Could be:

  • lock contention
  • tiny synchronous IO
  • RTT/network retransmits
  • dependency latency
  • CPU branch/cache inefficiency seen only in profiling

5. Everything fine until traffic surge

Likely:

  • queue saturation
  • conntrack limit
  • NIC/IRQ imbalance
  • thread pool exhaustion
  • DB connection pool choke point
  • memory reclaim threshold crossing

Golden anti-patterns to avoid

  • diagnosing from one screenshot
  • staring at load average without context
  • equating low CPU with healthy system
  • ignoring cgroup limits
  • ignoring kernel logs
  • blaming app before checking system pressure
  • collecting averages only, not distributions
  • changing five things at once

Interview angles

Good questions hidden here:

  • how to approach performance debugging systematically
  • difference between CPU busy and CPU saturation
  • what perf is useful for
  • what iowait does and does not mean
  • how memory pressure can cause latency
  • what PSI is conceptually
  • how to debug softirq-heavy networking workloads
  • why load average is not a CPU usage metric

Strong answers emphasize method, not heroics.


Mental model to keep

Performance debugging is bottleneck localization.

You are trying to answer:

  • what resource or contention point limits useful work,
  • under what conditions,
  • for which workload,
  • and with what evidence?

Treat the system like a pipeline. Find the narrow section. Then prove it.

