
Portal | Level: L2: Operations | Topics: Linux Fundamentals, Filesystems & Storage | Domain: Linux

Linux Performance Debugging

Scope

This document explains how to debug Linux performance issues systematically. It covers:

  • CPU bottlenecks
  • memory pressure
  • IO bottlenecks
  • scheduler and run queue issues
  • network bottlenecks
  • perf/eBPF/ftrace tooling concepts
  • practical step-by-step workflows

This is aimed at production troubleshooting, interview readiness, and learning how not to chase ghosts.


Big picture

Performance debugging is not "run top and guess." It is the process of identifying which subsystem is constraining useful work.

Main resource domains

  • CPU
  • memory
  • storage / IO
  • network
  • scheduler / contention
  • kernel/system call overhead
  • locking
  • external dependencies

A system can have low average CPU and still be slow because of:

  • memory reclaim
  • lock contention
  • blocked IO
  • run queue bursts
  • packet loss/retransmits
  • one hot thread on one core
  • cgroup throttling
  • scheduler latency

The first rule: define the symptom precisely

Ask:

  • high latency?
  • low throughput?
  • periodic stalls?
  • one process slow or whole node slow?
  • under load only or always?
  • user complaints tied to a clock/event/deploy?
  • CPU-bound or waiting-bound?

"Server is slow" is not a diagnosis. It is a cry for help.


USE method mindset

A helpful framing is Brendan Gregg's USE method: for each resource, check Utilization, Saturation, and Errors.

CPU

  • utilization: how busy are cores
  • saturation: run queue, throttling, waiting to run
  • errors: less common directly, but thermal or power-capping events can matter

Memory

  • utilization: resident usage, page cache, slab
  • saturation: reclaim pressure, swap, OOM
  • errors: allocation failures, cgroup OOM

Disk

  • utilization: device busy time
  • saturation: queue depth, await, service time
  • errors: IO errors, retries

Network

  • utilization: bandwidth, packet rates
  • saturation: queue drops, softirq overload, conntrack limits
  • errors: drops, retransmits, checksum/path issues

CPU debugging

Questions to ask

  • are all CPUs busy or one core pinned?
  • is time spent in user, system, irq, softirq, steal, or iowait?
  • are tasks runnable but not getting CPU?
  • is cgroup quota throttling the workload?

Baseline tools

  • top / htop
  • uptime
  • vmstat 1
  • mpstat -P ALL 1
  • pidstat -u 1

What to look for

  • high run queue relative to CPU count
  • one hot thread on a many-core box
  • high system CPU due to syscall/kernel overhead
  • high softirq for packet-heavy workload
  • steal time in virtualized environments
  • throttling in containers due to CPU quota
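Several of these signals can be checked with nothing beyond /proc and coreutils. A minimal sketch (field positions follow the /proc/stat and /proc/loadavg layouts documented in proc(5)):

```shell
#!/bin/sh
# Runnable tasks vs. CPU count: sustained runnable > cores means CPU saturation.
cpus=$(nproc)
runnable=$(awk '/^procs_running/ {print $2}' /proc/stat)
echo "runnable=${runnable} cpus=${cpus}"

# 1/5/15-minute load averages. On Linux these include tasks in uninterruptible
# D-state, so high load with idle CPUs can mean blocked IO, not CPU demand.
awk '{printf "load: 1m=%s 5m=%s 15m=%s\n", $1, $2, $3}' /proc/loadavg

# Where CPU time goes: jiffies since boot from the aggregate "cpu" line.
awk '/^cpu / {printf "user=%s sys=%s iowait=%s irq=%s softirq=%s steal=%s\n",
              $2, $4, $6, $7, $8, $9}' /proc/stat
```

These are cumulative counters, so take two samples a few seconds apart and diff them to get rates.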

perf

perf lets you sample where CPU time goes.

Examples of questions it answers:

  • which functions consume CPU?
  • is time in userspace or kernel?
  • are we spinning on locks?
  • are syscalls hot?

This is one of the strongest tools for moving beyond guesswork.
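As a sketch, on-CPU sampling with perf usually looks like the following. TARGET_PID is a placeholder for the hot process you identified, and perf typically needs root or a relaxed perf_event_paranoid setting, so the block degrades gracefully when unavailable:

```shell
#!/bin/sh
# Sample on-CPU call stacks at 99 Hz for 10 seconds, then summarize by symbol.
# TARGET_PID is hypothetical -- substitute the process you are investigating.
if command -v perf >/dev/null 2>&1 && [ -n "${TARGET_PID:-}" ]; then
  perf record -F 99 -g -p "$TARGET_PID" -- sleep 10    # sample with stacks
  perf report --stdio | head -30                       # top consumers first
else
  echo "perf not available or TARGET_PID unset; skipping"
fi
```

99 Hz (rather than 100) avoids sampling in lockstep with timer-driven work; `-g` records call graphs so you see *why* a function is hot, not just that it is.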


Memory debugging

See the dedicated memory doc for full detail. Performance symptoms often show up as:

  • swap activity
  • direct reclaim
  • high major faults
  • OOM kills
  • NUMA imbalance
  • compaction stalls
  • cgroup memory throttling/pressure

Baseline tools

  • free -h
  • /proc/meminfo
  • vmstat 1
  • sar -B
  • PSI memory pressure if available
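A quick portable pass over these sources might look like this sketch (PSI files exist only on kernels >= 4.20 built with CONFIG_PSI, hence the guard):

```shell
#!/bin/sh
# Headline memory numbers: MemAvailable is the best single "free-ish" figure,
# Dirty/Writeback hint at flush pressure feeding into IO.
awk '/^(MemTotal|MemAvailable|SwapFree|Dirty|Writeback):/ {print}' /proc/meminfo

# PSI memory pressure: "some avg10" is the share of the last 10 seconds in
# which at least one task stalled on memory (reclaim, refault, swap).
if [ -r /proc/pressure/memory ]; then
  cat /proc/pressure/memory
else
  echo "PSI not available on this kernel"
fi
```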

Red flags

  • increasing major page faults
  • swap in/out churn
  • elevated IO plus memory pressure
  • kswapd or direct reclaim activity
  • app latency coinciding with reclaim

IO debugging

IO pain often impersonates CPU or app bugs.

Questions to ask

  • is storage busy?
  • queue depth too high?
  • latency high on reads or writes?
  • random vs sequential workload?
  • filesystem or block layer issue?
  • networked storage path involved?

Baseline tools

  • iostat -xz 1
  • pidstat -d 1
  • iotop
  • sar -d
  • blktrace / btrace / fio for deeper work
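When iostat is not installed, /proc/diskstats gives the same raw counters; a sketch (field numbers follow the documented diskstats layout, and the iostat call assumes the sysstat package):

```shell
#!/bin/sh
# Per-device in-flight IOs (field 12 after major/minor/name): a queue that
# stays high while latency climbs points at device saturation.
awk '$3 !~ /^(loop|ram)/ {printf "%-10s in_flight=%s ms_doing_io=%s\n",
                          $3, $12, $13}' /proc/diskstats

# For rates, await, and %util, iostat is the usual next step if available.
command -v iostat >/dev/null 2>&1 && iostat -xz 1 2 | tail -20 || true
```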

Red flags

  • high %util
  • high await and service-time latency (svctm is unreliable and removed in newer sysstat)
  • queue depth rising
  • one process dominating writes
  • dirty/writeback pressure in memory stats

Network debugging

A system can look "slow" because packets are delayed, dropped, or retransmitted.

Questions to ask

  • packet drops at NIC?
  • retransmits?
  • conntrack full?
  • one CPU handling too many RX interrupts?
  • softirq saturation?
  • MTU/fragmentation issue?
  • load balancer / service mesh / overlay path involved?

Tools

  • ip -s link
  • ethtool -S
  • ss -s
  • sar -n DEV 1
  • nstat
  • tcpdump
  • /proc/interrupts
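Two of the highest-value counters can be pulled straight from /proc; a sketch (column positions follow the /proc/net/dev header, and retransmits are cumulative, so diff two samples for a rate):

```shell
#!/bin/sh
# Per-interface error/drop counters from /proc/net/dev.
awk 'NR>2 { sub(/^ +/, ""); split($0, f, /[: ]+/)
            printf "%-10s rx_err=%s rx_drop=%s tx_err=%s tx_drop=%s\n",
                   f[1], f[4], f[5], f[12], f[13] }' /proc/net/dev

# Cumulative TCP retransmitted segments; find the column by header name so the
# script survives layout differences between kernels.
awk '/^Tcp:/ { if (!seen) { for (i=1;i<=NF;i++) if ($i=="RetransSegs") col=i; seen=1 } else print "tcp_retrans_segs=" $col }' /proc/net/snmp
```

Growing drops at the NIC level are usually visible in `ethtool -S` before they appear here, so check both when the numbers disagree.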

Red flags

  • RX/TX drops
  • retransmit growth
  • softirq-heavy CPU usage
  • NIC queue imbalance
  • DNS latency masquerading as app latency

Scheduler and run queue issues

Sometimes the problem is not total CPU usage but CPU access latency.

Signs

  • high load average with modest CPU utilization
  • many runnable tasks
  • latency spikes under burst load
  • cgroup quota periods causing bursty throttling
  • lock contention causing herd behavior

Helpful tools

  • vmstat 1
  • pidstat -w
  • perf sched
  • PSI CPU pressure
  • systemd-cgtop for service-level resource context
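PSI and per-task schedstat are the cheapest windows into CPU access latency; a guarded sketch (PSI needs kernel >= 4.20, schedstat needs CONFIG_SCHED_INFO, both common on modern distros):

```shell
#!/bin/sh
# PSI CPU pressure: "some avg10" is the share of the last 10 seconds in which
# at least one runnable task waited for a CPU -- saturation, not utilization.
if [ -r /proc/pressure/cpu ]; then
  cat /proc/pressure/cpu
else
  echo "PSI not available on this kernel"
fi

# Per-task scheduler stats for this shell ($$): time on CPU (ns), time spent
# runnable-but-waiting (ns), and number of timeslices run.
[ -r "/proc/$$/schedstat" ] && cat "/proc/$$/schedstat" || true
```

A process with low CPU time but a fast-growing second schedstat field is starved, not idle.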

Syscalls, kernel overhead, and locks

If an app spends too much time crossing the user/kernel boundary or fighting locks, throughput tanks even when raw CPU seems available.

perf and eBPF/ftrace can reveal:

  • hot syscalls
  • mutex contention
  • file descriptor churn
  • network stack overhead
  • allocator hotspots
  • scheduler latency

Typical root causes

  • logging too much
  • tiny IO operations
  • chatty network behavior
  • lock-heavy multithreading
  • bad polling loops
  • filesystem metadata storms
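One cheap way to confirm syscall volume is the problem is a per-syscall summary; strace's counting mode illustrates the idea (strace adds heavy overhead, so on production processes prefer `perf trace` or eBPF -- this sketch only traces a trivial child command):

```shell
#!/bin/sh
# Count syscalls made by a short-lived child (-c: summary table on exit,
# -f: follow forks). The summary is written to stderr.
if command -v strace >/dev/null 2>&1; then
  strace -c -f ls / >/dev/null 2>/tmp/syscall_summary.txt || true
  cat /tmp/syscall_summary.txt
else
  echo "strace not installed; 'perf trace -s <cmd>' gives a similar summary"
fi
```

A summary dominated by tiny read/write calls or repeated open/close of the same paths is exactly the "tiny IO" and "file descriptor churn" pattern above.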

eBPF and tracing tools conceptually

You do not need to be an eBPF wizard to benefit from the concept.

eBPF lets you safely attach programs to kernel/user tracing points for visibility into:

  • syscalls
  • network events
  • scheduler activity
  • allocations
  • block IO
  • custom probes

Common ecosystems:

  • bcc tools
  • bpftrace
  • perf integration patterns
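For flavor, a classic bpftrace one-liner counts syscalls by process name. It needs root and the bpftrace package, so this sketch degrades to a message when either is missing:

```shell
#!/bin/sh
# Attach to the raw syscall-entry tracepoint, count per comm for 5 seconds;
# bpftrace prints the accumulated map when the timeout terminates it.
if command -v bpftrace >/dev/null 2>&1 && [ "$(id -u)" -eq 0 ]; then
  timeout 5 bpftrace -e 'tracepoint:raw_syscalls:sys_enter { @[comm] = count(); }'
else
  echo "bpftrace not available or not root; skipping"
fi
```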

ftrace offers kernel tracing with lower-level focus.

These tools matter because averages hide latency sources. Tracing shows event paths.


Practical workflow

Step 1 - classify symptom domain

  • CPU-bound?
  • waiting on IO?
  • memory pressure?
  • network?
  • lock contention?
  • external dependency?

Step 2 - baseline the whole node

Collect:

  • CPU view
  • memory view
  • IO view
  • network view

Do not jump straight into app blame.
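All four views can be captured in one timestamped snapshot before touching anything; a minimal sketch using only /proc (the snapshot path is illustrative):

```shell
#!/bin/sh
# One-shot node baseline: grab raw counters now so a second sample gives rates.
snap_dir="/tmp/perfsnap.$(date +%s)"    # illustrative location
mkdir -p "$snap_dir"

cat /proc/loadavg   > "$snap_dir/loadavg"    # CPU demand
cat /proc/stat      > "$snap_dir/stat"       # CPU time breakdown
cat /proc/meminfo   > "$snap_dir/meminfo"    # memory view
cat /proc/vmstat    > "$snap_dir/vmstat"     # reclaim/fault counters
cat /proc/diskstats > "$snap_dir/diskstats"  # IO view
cat /proc/net/dev   > "$snap_dir/netdev"     # network view

echo "baseline written to $snap_dir (take a second snapshot ~10s later)"
```

Diffing two snapshots turns the cumulative counters into per-interval rates, which is what every later step actually needs.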

Step 3 - identify top offenders

Which:

  • process
  • thread
  • cgroup/service
  • device
  • interface
  • syscall/function

is actually hot or stalled?

Step 4 - zoom in with the right tool

  • CPU hotspot -> perf
  • reclaim/OOM -> memory stats
  • disk latency -> iostat, block tracing
  • packet loss -> network counters/capture
  • scheduler weirdness -> perf sched, PSI, run queue tools

Step 5 - correlate with time

Performance debugging is temporal. Ask what changed:

  • deploy?
  • traffic spike?
  • cron job?
  • backup?
  • compaction?
  • GC cycle?
  • noisy neighbor?

Common production failure patterns

1. High load average, CPU not fully busy

Could be:

  • blocked IO
  • runnable queue buildup
  • lock contention
  • D-state tasks
  • reclaim stalls

2. 100 percent CPU in one pod, node looks mostly fine

Likely a single-threaded bottleneck or a cgroup CPU-quota issue local to that pod.

3. App latency spikes every few minutes

Possible causes:

  • GC
  • log rotate/compression
  • flush/writeback bursts
  • backup jobs
  • metrics scrapes too heavy
  • compaction/maintenance tasks

4. Throughput poor despite no obvious bottleneck

Could be:

  • lock contention
  • tiny synchronous IO
  • RTT/network retransmits
  • dependency latency
  • CPU branch/cache inefficiency seen only in profiling

5. Everything fine until traffic surge

Likely:

  • queue saturation
  • conntrack limit
  • NIC/IRQ imbalance
  • thread pool exhaustion
  • DB connection pool choke point
  • memory reclaim threshold crossing

Golden anti-patterns to avoid

  • diagnosing from one screenshot
  • staring at load average without context
  • equating low CPU with healthy system
  • ignoring cgroup limits
  • ignoring kernel logs
  • blaming app before checking system pressure
  • collecting averages only, not distributions
  • changing five things at once

Interview angles

Good questions hidden here:

  • how to approach performance debugging systematically
  • difference between CPU busy and CPU saturation
  • what perf is useful for
  • what iowait does and does not mean
  • how memory pressure can cause latency
  • what PSI is conceptually
  • how to debug softirq-heavy networking workloads
  • why load average is not a CPU usage metric

Strong answers emphasize method, not heroics.


Mental model to keep

Performance debugging is bottleneck localization.

You are trying to answer:

  • what resource or contention point limits useful work,
  • under what conditions,
  • for which workload,
  • and with what evidence?

Treat the system like a pipeline. Find the narrow section. Then prove it.

