Portal | Level: L2: Operations | Topics: perf Profiling, Tracing | Domain: Linux
perf Profiling - Primer¶
Why This Matters¶
When a service is slow and logs show nothing wrong, you need to see where CPU time
is actually going. perf is the Linux kernel's built-in profiling and tracing tool.
It answers questions that no amount of log analysis can: which function is burning
cycles, whether the bottleneck is in userspace or kernel, and whether the system is
CPU-bound or waiting on I/O.
For production performance debugging, perf is often the first tool that gives you
real data instead of theories. It works on any compiled or interpreted language
running on Linux, because it samples at the CPU level rather than requiring
application-level instrumentation. A five-second perf top session can reveal
what hours of log reading cannot.
Understanding perf also builds intuition about how Linux actually executes your
workloads — cache behavior, context switch costs, and syscall overhead become
concrete numbers rather than abstract concepts.
Core Concepts¶
Who made it:
perf was written by Thomas Gleixner, Ingo Molnar, and Peter Zijlstra and merged into the Linux kernel source tree in 2009 (kernel 2.6.31). It lives in tools/perf/ within the kernel repo — it is literally part of the kernel, not a third-party tool. This gives it direct access to kernel performance counters via the perf_event_open syscall (also added in Linux 2.6.31).
Name origin:
perf is short for "performance counters." The underlying kernel subsystem was originally called "Performance Counters for Linux" (PCL) but was renamed to "perf_events" because it expanded beyond hardware counters to include software events, tracepoints, and probes.
1. perf Subcommands Overview¶
| Subcommand | Purpose | When to Use |
|---|---|---|
| perf top | Live hot-function view | Something is burning CPU right now |
| perf record | Collect samples to a file | Need to analyze offline or share |
| perf report | Read a recorded profile | After perf record |
| perf stat | Hardware counter summary | Quick characterization of a workload |
| perf trace | Syscall/event tracing | Lighter-weight alternative to strace |
2. perf top — Live Profiling¶
perf top shows a continuously-updated view of the hottest functions across the
system or a specific process. Think of it as top but for functions, not processes.
# System-wide hot functions (requires root)
sudo perf top
# Profile a specific process
sudo perf top -p <pid>
# Profile a specific process, showing self overhead only (no accumulation into callers)
sudo perf top -p <pid> --no-children
# Profile a specific process, user-space samples only
sudo perf top -p <pid> -e cycles:u
Output looks like:
Samples: 12K of event 'cycles', 4000 Hz
Overhead Shared Object Symbol
18.42% libc.so.6 [.] __memcpy_avx2
9.31% myapp [.] parse_request
7.22% [kernel] [k] copy_user_enhanced
5.10% myapp [.] serialize_response
3.44% libpthread.so.0 [.] pthread_mutex_lock
The [.] means userspace, [k] means kernel. High kernel percentages suggest
heavy I/O, syscalls, or memory management — not necessarily a kernel bug.
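When eyeballing the split is not enough, the [.]/[k] overhead can be tallied from the text. A minimal sketch: the line format below mirrors the sample output above, not a stable perf interface, so treat the parser as illustrative only.

```python
# Sum userspace ([.]) vs kernel ([k]) overhead from perf-top-style text.
# The layout mirrors the sample shown above; real perf output can differ.
import re

SAMPLE = """\
 18.42%  libc.so.6        [.] __memcpy_avx2
  9.31%  myapp            [.] parse_request
  7.22%  [kernel]         [k] copy_user_enhanced
  5.10%  myapp            [.] serialize_response
  3.44%  libpthread.so.0  [.] pthread_mutex_lock
"""

def split_overhead(text):
    """Return (userspace_pct, kernel_pct) summed from sample lines."""
    user = kernel = 0.0
    for line in text.splitlines():
        m = re.match(r"\s*([\d.]+)%\s+\S+\s+\[(.)\]", line)
        if not m:
            continue
        pct, mode = float(m.group(1)), m.group(2)
        if mode == "k":
            kernel += pct   # [k] = kernel symbol
        else:
            user += pct     # [.] = userspace symbol
    return round(user, 2), round(kernel, 2)

print(split_overhead(SAMPLE))  # (36.27, 7.22)
```

Here the kernel share (7.22%) is modest, so by the rule of thumb above this workload is dominated by its own userspace code.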
3. perf record / perf report — Offline Profiling¶
For deeper analysis, record samples to a file and analyze later:
# Record samples from a running process (Ctrl+C to stop)
sudo perf record -p <pid>
# Record with call graphs (essential for understanding call chains)
sudo perf record --call-graph dwarf -p <pid>
# Record a specific command from start to finish
sudo perf record -- ./myapp --serve
# Analyze the recorded data
perf report
# Analyze with call graphs expanded
perf report -g
The default output file is perf.data in the current directory. It contains raw
sample data and can be large for long recording sessions.
4. perf stat — Hardware Counters¶
perf stat runs a command and reports aggregate hardware performance counters.
This gives you a high-level characterization of a workload in seconds.
# Basic counter summary
perf stat ./myapp --process-batch
# Detailed counters
perf stat -d ./myapp --process-batch
# Profile a running process for 10 seconds
sudo perf stat -p <pid> sleep 10
Output example:
Performance counter stats for './myapp --process-batch':
3,412.18 msec task-clock # 0.982 CPUs utilized
218 context-switches # 63.89 /sec
4 cpu-migrations # 1.17 /sec
12,481 page-faults # 3.66 K/sec
9,847,231,004 cycles # 2.89 GHz
7,123,456,789 instructions # 0.72 insn per cycle
1,234,567,890 branches # 361.76 M/sec
45,678,901 branch-misses # 3.70% of all branches
31,204 cache-misses # 12.41% of all cache refs
Remember: A quick rubric for reading perf stat output — IPC (instructions per cycle) below 1.0 usually means the workload is memory-bound; IPC above 2.0 means the code is compute-efficient; CPUs utilized below 0.5 means the process is waiting, not computing; and high page-fault counts mean allocation churn.
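The IPC arithmetic from the sample output above is plain division. The thresholds below are the rules of thumb from this section, not cutoffs defined by perf itself:

```python
# Derive IPC and a rough classification from perf stat counters.
# Counter values are taken from the sample output above; the 1.0/2.0
# thresholds are this section's heuristics, not perf-defined limits.
cycles       = 9_847_231_004
instructions = 7_123_456_789

ipc = instructions / cycles
print(f"IPC = {ipc:.2f}")   # IPC = 0.72

if ipc < 1.0:
    verdict = "likely memory- or branch-stalled"
elif ipc > 2.0:
    verdict = "compute-efficient"
else:
    verdict = "mixed"
print(verdict)              # likely memory- or branch-stalled
```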
Key metrics to watch:
| Counter | What It Tells You |
|---|---|
| task-clock / CPUs utilized | Is the workload CPU-bound or waiting? |
| context-switches | High = contention or I/O heavy |
| page-faults | High = memory allocation churn or cold start |
| instructions per cycle (IPC) | < 1.0 = memory/branch stalls; > 2.0 = efficient |
| cache-misses | High % = data structures are cache-unfriendly |
| branch-misses | High % = unpredictable control flow |
5. perf trace — Syscall Tracing¶
perf trace is a lightweight alternative to strace for syscall tracing. It uses
the kernel's tracing infrastructure rather than ptrace, so it has lower overhead.
# Trace syscalls for a specific process
sudo perf trace -p <pid>
# Trace a command
sudo perf trace ls /tmp
# Trace only specific syscalls
sudo perf trace -e open,read,write,close ls /tmp
# Summary mode (like strace -c)
sudo perf trace -s ./myapp
Compared to strace:
| Aspect | strace | perf trace |
|---|---|---|
| Mechanism | ptrace (per-syscall stop) | Kernel tracepoints |
| Overhead | Higher (process stops per call) | Lower |
| Detail | Full argument decoding | Less argument detail |
| Attach | Any user can trace own process | Often needs root |
| Maturity | Battle-tested, ubiquitous | Newer, less portable |
6. Sampling vs Full Tracing¶
perf uses two fundamentally different data collection approaches. Choosing the wrong one wastes time or misleads you.
Sampling (perf top, perf record): A hardware timer fires at a fixed
rate (default ~4000 Hz). Each interrupt records the current instruction
pointer — which function was executing at that instant. After enough
samples, the functions that appear most often are the ones consuming the
most CPU. The key property: overhead is fixed regardless of workload
intensity, because the sample rate is constant.
Tracing (perf trace, perf stat): Hooks into kernel tracepoints or
hardware counters to record every occurrence of specific events — every
syscall, every context switch, every page fault. The data is complete
(no statistical gaps), but overhead scales linearly with event rate. A
process making 100K syscalls/sec generates 100K trace records/sec.
Sampling: ···X···X···X···X···X··· (periodic snapshots)
Tracing: XXXXX·XX··XXXX·XXX····X (every event captured)
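The overhead asymmetry is back-of-the-envelope math. In the sketch below, the sampling frequency matches perf's default mentioned above, while the syscall rate is an illustrative assumption:

```python
# Compare record counts: sampling cost is flat, tracing cost scales
# with event rate. All rates here are illustrative assumptions.
SAMPLE_HZ = 4000  # perf's default sampling frequency

def sampling_records(duration_s, sample_hz=SAMPLE_HZ):
    # Fixed cost: depends only on sample rate and wall time.
    return sample_hz * duration_s

def tracing_records(duration_s, events_per_s):
    # Linear cost: one record per event, however busy the workload is.
    return events_per_s * duration_s

# 30 seconds of a service doing 100K syscalls/sec:
print(sampling_records(30))           # 120000 records, regardless of load
print(tracing_records(30, 100_000))   # 3000000 records, grows with load
```

The 25x gap here is why sampling is the default choice for "where is CPU going" questions, and tracing is reserved for questions that need exact counts.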
When to use which:
| Question | Use | Why |
|---|---|---|
| Where is CPU time going? | Sampling (perf record) | Statistical profile of hot functions |
| How many syscalls per second? | Tracing (perf trace -s) | Need exact counts |
| Is the workload CPU or I/O bound? | Counters (perf stat) | Quick characterization |
| What syscall is blocking? | Tracing (perf trace) | Need to see the specific call |
| What function is 30% of CPU? | Sampling (perf top) | Live hot-function view |
The most common mistake is using tracing when sampling would suffice — tracing a high-throughput service can add 20-40% overhead and generate gigabytes of data, while sampling the same service adds <2% overhead.
7. Userspace vs Kernel Time¶
When profiling shows a mix of userspace [.] and kernel [k] symbols,
the ratio tells you what category of problem you have:
High userspace % (>80% in application symbols): The application's own code is the bottleneck. Look at the hottest functions — are they doing unnecessary work, using a bad algorithm, or allocating excessively? This is the straightforward case.
High kernel % (>40% in [k] symbols):
The workload is spending significant time in kernel code. This does NOT
mean the kernel is broken. Common causes, in order of likelihood:
- Heavy I/O — lots of read/write/sendmsg syscalls, each crossing into the kernel. Check perf stat for high context switches and iostat for disk utilization.
- Memory pressure — frequent page-fault, mmap, brk activity. Check perf stat for page-fault counts and vmstat for swapping.
- Lock contention — threads blocking on futex in the kernel. Look for pthread_mutex_lock or futex in the profile.
- Excessive syscalls — thousands of small reads/writes instead of buffered I/O. The fix is batching, not kernel tuning.
High idle/wait (low task-clock / CPUs utilized in perf stat):
The process is not CPU-bound — it is waiting on I/O, network, or
locks. perf record will show mostly epoll_wait, poll, or
futex at the top. Profiling the CPU further is pointless; switch
to I/O tools (iostat, ss, tcpdump) or lock analysis.
8. Prerequisites for Good Profiles¶
Before profiling, verify these are in place:
| Requirement | Why | How to Check |
|---|---|---|
| Debug symbols | Readable function names vs hex addresses | file ./myapp shows "not stripped"; install -dbg packages |
| Kernel/library symbols | Resolve kernel and shared-library frames | apt install linux-tools-$(uname -r) |
| Sufficient runtime | Enough samples for significance | Record 10-30 seconds minimum |
| Correct PID | Profile the right process | pgrep -f myapp or pidof myapp |
| Permissions | perf needs access to counters | Root, or perf_event_paranoid sysctl |
| Container awareness | Host PID namespace differs from container | Map container PID to host PID |
For containers and Kubernetes:
# Find the host PID of a container process
PID=$(crictl inspect <container-id> | jq '.info.pid')
# Profile from the host
sudo perf top -p $PID
sudo perf record -p $PID -g -- sleep 30
# Kernel perf_event_paranoid setting
# -1 = no restrictions; 0 or 1 = normal users can profile their own processes;
# 2 = normal users get user-space measurements only; 3+ = root only (distro patch)
cat /proc/sys/kernel/perf_event_paranoid
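A small helper can turn the sysctl value into the restriction it implies. The mapping below follows the summary in the comment above; exact semantics vary by kernel version, and level 3 comes from a Debian/Ubuntu patch rather than mainline:

```python
# Interpret kernel.perf_event_paranoid. The mapping follows the summary
# above; semantics vary by kernel version, and level 3 is a distro patch.
import pathlib

PARANOID = pathlib.Path("/proc/sys/kernel/perf_event_paranoid")

def paranoid_meaning(level: int) -> str:
    if level <= -1:
        return "no restrictions on unprivileged profiling"
    if level in (0, 1):
        return "unprivileged users can profile their own processes"
    if level == 2:
        return "unprivileged users: user-space measurements only"
    return "profiling effectively root-only (level 3+, distro patch)"

if PARANOID.exists():  # present on Linux hosts only
    print(paranoid_meaning(int(PARANOID.read_text())))
```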
What Experienced People Know¶
- Start with perf stat before perf record — the counters tell you whether the problem is CPU-bound, memory-bound, or I/O-bound before you invest time in detailed profiling
- perf top is the fastest path to "what is burning CPU right now" — it takes two seconds to get an answer, with no recording or analysis step needed
- Without debug symbols, perf output shows hex addresses instead of function names — install -dbg or -debuginfo packages before profiling, or the output is nearly useless
- IPC (instructions per cycle) below 1.0 usually means the workload is stalled on memory access or branch mispredictions, not limited by CPU speed
- High context-switch counts suggest lock contention, too many threads, or heavy I/O multiplexing — not necessarily a scheduling problem
- --call-graph dwarf gives accurate stack traces for most workloads but increases data file size significantly; use it when you need to understand call chains, skip it for quick checks
- Profiling JVM, Python, or Node.js requires extra setup (perf maps, frame pointers, or JIT dump files) to get readable symbols — without this, you see only interpreter internals
- perf record defaults to the current directory for perf.data — in production, use -o /tmp/perf.data to avoid filling application directories
- Flame graphs (via Brendan Gregg's tools) are the most effective way to visualize perf record data — perf report is useful but flame graphs reveal call-chain patterns much faster
- In containers, you almost always need to profile from the host using the host-namespace PID, because containers typically lack the required capabilities and kernel access

Fun fact: Flame graphs were invented by Brendan Gregg at Joyent in 2011 while debugging a MySQL performance issue. He needed a way to visualize thousands of unique stack traces at once. The key insight: collapse identical stacks, sort alphabetically (not by time), and make the width proportional to sample count. The x-axis is not time — it is the alphabetical merge of stack frames. This is the most common misreading of flame graphs.
Wiki Navigation¶
Prerequisites¶
- Linux Ops (Topic Pack, L0)
Related Content¶
- OpenTelemetry (Topic Pack, L2) — Tracing
- Performance Flashcards (CLI) (flashcard_deck, L1) — perf Profiling
- Tracing (Topic Pack, L1) — Tracing
- Tracing Flashcards (CLI) (flashcard_deck, L1) — Tracing
- strace (Topic Pack, L1) — Tracing