
Portal | Level: L2: Operations | Topics: perf Profiling, Tracing | Domain: Linux

perf Profiling - Primer

Why This Matters

When a service is slow and logs show nothing wrong, you need to see where CPU time is actually going. perf is the Linux kernel's built-in profiling and tracing tool. It answers questions that no amount of log analysis can: which function is burning cycles, whether the bottleneck is in userspace or kernel, and whether the system is CPU-bound or waiting on I/O.

For production performance debugging, perf is often the first tool that gives you real data instead of theories. It works on any compiled or interpreted language running on Linux, because it samples at the CPU level rather than requiring application-level instrumentation. A five-second perf top session can reveal what hours of log reading cannot.

Understanding perf also builds intuition about how Linux actually executes your workloads — cache behavior, context switch costs, and syscall overhead become concrete numbers rather than abstract concepts.

Core Concepts

Who made it: perf was written by Thomas Gleixner, Ingo Molnar, and Peter Zijlstra and merged into the Linux kernel source tree in 2009 (kernel 2.6.31). It lives in tools/perf/ within the kernel repo — it is literally part of the kernel, not a third-party tool. This gives it direct access to kernel performance counters via the perf_event_open syscall (added in Linux 2.6.31).

Name origin: perf is short for "performance counters." The underlying kernel subsystem was originally called "Performance Counters for Linux" (PCL) but was renamed to "perf_events" because it expanded beyond hardware counters to include software events, tracepoints, and probes.

1. perf Subcommands Overview

Subcommand Purpose When to Use
perf top Live hot-function view Something is burning CPU right now
perf record Collect samples to a file Need to analyze offline or share
perf report Read a recorded profile After perf record
perf stat Hardware counter summary Quick characterization of a workload
perf trace Syscall/event tracing Lighter-weight alternative to strace
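These subcommands chain together naturally in a triage workflow. A sketch of the usual order, with `<pid>` and durations as placeholders:

```shell
# 1. Characterize first: is the workload CPU-bound or waiting?
sudo perf stat -p <pid> sleep 10

# 2. If CPU-bound, find the hot functions live
sudo perf top -p <pid>

# 3. Capture 30 seconds with call graphs for offline analysis
sudo perf record -g -p <pid> -- sleep 30
perf report
```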

2. perf top — Live Profiling

perf top shows a continuously-updated view of the hottest functions across the system or a specific process. Think of it as top but for functions, not processes.

# System-wide hot functions (requires root)
sudo perf top

# Profile a specific process
sudo perf top -p <pid>

# Profile a specific process, counting each function's own samples
# (--no-children hides inclusive/accumulated child costs)
sudo perf top -p <pid> --no-children

Output looks like:

Samples: 12K of event 'cycles', 4000 Hz
Overhead  Shared Object        Symbol
  18.42%  libc.so.6            [.] __memcpy_avx2
   9.31%  myapp                [.] parse_request
   7.22%  [kernel]             [k] copy_user_enhanced_fast_string
   5.10%  myapp                [.] serialize_response
   3.44%  libpthread.so.0      [.] pthread_mutex_lock

The [.] means userspace, [k] means kernel. High kernel percentages suggest heavy I/O, syscalls, or memory management — not necessarily a kernel bug.
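To isolate one side of that split, event modifiers restrict where samples are taken (`:u` for user, `:k` for kernel). A sketch:

```shell
# Userspace cycles only -- the [k] rows disappear from the view
sudo perf top -p <pid> -e cycles:u

# Kernel cycles only -- shows which kernel paths the process exercises
sudo perf top -p <pid> -e cycles:k
```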

3. perf record / perf report — Offline Profiling

For deeper analysis, record samples to a file and analyze later:

# Record samples from a running process (Ctrl+C to stop)
sudo perf record -p <pid>

# Record with call graphs (essential for understanding call chains)
sudo perf record --call-graph dwarf -p <pid>

# Record a specific command from start to finish
# (-- separates perf's own options from the command's)
sudo perf record -- ./myapp --serve

# Analyze the recorded data
perf report

# Analyze with call graphs expanded (recording must have used --call-graph or -g)
perf report -g

The default output file is perf.data in the current directory. It contains raw sample data and can be large for long recording sessions.
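Housekeeping around perf.data is worth getting right before a long session. A sketch with illustrative paths:

```shell
# Write samples somewhere deliberate instead of ./perf.data
sudo perf record -o /tmp/myapp-$(date +%s).data -p <pid> -- sleep 30

# Point perf report at that file
perf report -i /tmp/myapp-*.data

# Compare two recordings, e.g. before and after a fix
perf diff /tmp/before.data /tmp/after.data
```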

4. perf stat — Hardware Counters

perf stat runs a command and reports aggregate hardware performance counters. This gives you a high-level characterization of a workload in seconds.

# Basic counter summary
perf stat ./myapp --process-batch

# Detailed counters
perf stat -d ./myapp --process-batch

# Profile a running process for 10 seconds
sudo perf stat -p <pid> sleep 10

Output example:

 Performance counter stats for './myapp --process-batch':

       3,412.18 msec  task-clock               #  0.982 CPUs utilized
            218       context-switches          #  63.89 /sec
              4       cpu-migrations            #   1.17 /sec
         12,481       page-faults               #   3.66 K/sec
  9,847,231,004       cycles                    #   2.89 GHz
  7,123,456,789       instructions              #   0.72  insn per cycle
  1,234,567,890       branches                  #  361.76 M/sec
     45,678,901       branch-misses             #   3.70% of all branches
         251,443       cache-references          #  73.69 K/sec
          31,204       cache-misses              #  12.41% of all cache refs

Remember: IPC (instructions per cycle, shown as "insn per cycle") below 1.0 usually means the workload is memory-bound; high page-fault counts mean allocation churn. If IPC > 2.0, the code is compute-efficient. If CPUs utilized < 0.5, the process is waiting, not computing.

Key metrics to watch:

Counter What It Tells You
task-clock / CPUs utilized Is the workload CPU-bound or waiting?
context-switches High = contention or I/O heavy
page-faults High = memory allocation churn or cold start
instructions per cycle (IPC) < 1.0 = memory/branch stalls; > 2.0 = efficient
cache-misses High % = data structures are cache-unfriendly
branch-misses High % = unpredictable control flow
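The ratios in this table can be derived directly from the raw counter values (`perf stat -x,` emits machine-readable CSV for scripting). A minimal sketch using the sample numbers from the output above; in practice the counters would be parsed from `perf stat` output:

```shell
# Derive IPC and branch-miss rate from raw counter values
# (values taken from the example perf stat output above)
cycles=9847231004
instructions=7123456789
branches=1234567890
branch_misses=45678901

ipc=$(awk -v i="$instructions" -v c="$cycles" 'BEGIN { printf "%.2f", i / c }')
miss_rate=$(awk -v m="$branch_misses" -v b="$branches" 'BEGIN { printf "%.2f", 100 * m / b }')

echo "IPC: $ipc"                 # below 1.0 suggests memory/branch stalls
echo "Branch miss %: $miss_rate"
```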

5. perf trace — Syscall Tracing

perf trace is a lightweight alternative to strace for syscall tracing. It uses the kernel's tracing infrastructure rather than ptrace, so it has lower overhead.

# Trace syscalls for a specific process
sudo perf trace -p <pid>

# Trace a command
sudo perf trace ls /tmp

# Trace only specific syscalls
sudo perf trace -e open,read,write,close ls /tmp

# Summary mode (like strace -c)
sudo perf trace -s ./myapp

Compared to strace:

Aspect strace perf trace
Mechanism ptrace (per-syscall stop) Kernel tracepoints
Overhead Higher (process stops per call) Lower
Detail Full argument decoding Less argument detail
Attach Any user can trace own process Often needs root
Maturity Battle-tested, ubiquitous Newer, less portable
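The two summary modes line up directly; both print per-syscall counts and timings instead of every individual call. A sketch:

```shell
# strace: per-syscall counts and time, via ptrace
strace -c -f ls /tmp

# perf trace: the same idea via kernel tracepoints, lower overhead
sudo perf trace -s ls /tmp
```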

6. Sampling vs Full Tracing

perf uses two fundamentally different data collection approaches. Choosing the wrong one wastes time or misleads you.

Sampling (perf top, perf record): A hardware timer fires at a fixed rate (default ~4000 Hz). Each interrupt records the current instruction pointer — which function was executing at that instant. After enough samples, the functions that appear most often are the ones consuming the most CPU. The key property: overhead is fixed regardless of workload intensity, because the sample rate is constant.

Tracing and counting (perf trace, perf stat): Hooks into kernel tracepoints or hardware counters to capture every occurrence of specific events — every syscall, every context switch, every page fault. The data is complete (no statistical gaps), but for per-event tracing the overhead scales linearly with event rate: a process making 100K syscalls/sec generates 100K trace records/sec. perf stat is the cheap end of this family — it only increments counters rather than recording each event, so its overhead is negligible.

Sampling:        ···X···X···X···X···X···  (periodic snapshots)
Tracing:         XXXXX·XX··XXXX·XXX····X  (every event captured)

When to use which:

Question Use Why
Where is CPU time going? Sampling (perf record) Statistical profile of hot functions
How many syscalls per second? Tracing (perf trace -s) Need exact counts
Is the workload CPU or I/O bound? Counters (perf stat) Quick characterization
What syscall is blocking? Tracing (perf trace) Need to see the specific call
What function is 30% of CPU? Sampling (perf top) Live hot-function view

The most common mistake is using tracing when sampling would suffice — tracing a high-throughput service can add 20-40% overhead and generate gigabytes of data, while sampling the same service adds <2% overhead.
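Sample rate is the overhead knob for sampling. A sketch of two common settings (99 Hz is the conventional low-overhead rate, chosen to avoid running in lockstep with 100 Hz periodic work):

```shell
# Low-overhead system-wide profile: 99 samples/sec/CPU for 30 seconds
sudo perf record -F 99 -a -g -- sleep 30

# Higher resolution for a short, targeted capture
sudo perf record -F 4000 -p <pid> -- sleep 5
```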

7. Userspace vs Kernel Time

When profiling shows a mix of userspace [.] and kernel [k] symbols, the ratio tells you what category of problem you have:

High userspace % (>80% in application symbols): The application's own code is the bottleneck. Look at the hottest functions — are they doing unnecessary work, using a bad algorithm, or allocating excessively? This is the straightforward case.

High kernel % (>40% in [k] symbols): The workload is spending significant time in kernel code. This does NOT mean the kernel is broken. Common causes, in order of likelihood:

  1. Heavy I/O — lots of read/write/sendmsg syscalls, each crossing into kernel. Check with perf stat for high context switches and iostat for disk utilization.
  2. Memory pressure — frequent page-fault, mmap, brk calls. Check perf stat for page-fault counts and vmstat for swapping.
  3. Lock contention — threads blocking on futex in the kernel. Look for pthread_mutex_lock or futex in the profile.
  4. Excessive syscalls — thousands of small reads/writes instead of buffered I/O. The fix is batching, not kernel tuning.

High idle/wait (low task-clock / CPUs utilized in perf stat): The process is not CPU-bound — it is waiting on I/O, network, or locks. perf record will show mostly epoll_wait, poll, or futex at the top. Profiling the CPU further is pointless; switch to I/O tools (iostat, ss, tcpdump) or lock analysis.
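The three cases above can be turned into a rough first-pass triage. A sketch that classifies a workload from its "CPUs utilized" figure; the thresholds are the rules of thumb from this section, not anything perf itself reports:

```shell
# Classify a workload given its "CPUs utilized" value from perf stat
classify_utilization() {
  awk -v u="$1" 'BEGIN {
    if (u < 0.5)      print "waiting: profile I/O and locks, not CPU"
    else if (u < 0.9) print "mixed: check both CPU profile and wait time"
    else              print "cpu-bound: perf record / perf top will pay off"
  }'
}

classify_utilization 0.98   # the CPU-bound case from the example output
classify_utilization 0.12   # mostly waiting on I/O or locks
```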

8. Prerequisites for Good Profiles

Before profiling, verify these are in place:

Requirement Why How to Check
Debug symbols Readable function names vs hex addresses file ./myapp shows "not stripped"
-dbg packages Kernel/library symbols e.g. apt install libc6-dbg (package names vary by distro)
Sufficient runtime Enough samples for significance Record 10-30 seconds minimum
Correct PID Profile the right process pgrep -f myapp or pidof myapp
Permissions perf needs access to counters Root, or perf_event_paranoid sysctl
Container awareness Host PID namespace differs from container Map container PID to host PID

For containers and Kubernetes:

# Find the host PID of a container process
PID=$(crictl inspect <container-id> | jq '.info.pid')

# Profile from the host
sudo perf top -p $PID
sudo perf record -p $PID -g -- sleep 30

# Kernel perf_event_paranoid setting:
# -1/0/1 = unprivileged users can profile their own processes;
# 2 = user-space measurements only; >2 (some distros) = root only
cat /proc/sys/kernel/perf_event_paranoid
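If perf_event_paranoid is too strict for an unprivileged user, it can be relaxed until the next reboot (requires root). kptr_restrict similarly controls whether kernel symbol addresses are visible in profiles:

```shell
# Allow unprivileged per-process profiling (reverts at reboot)
sudo sysctl kernel.perf_event_paranoid=1

# Make kernel addresses resolvable in profiles
sudo sysctl kernel.kptr_restrict=0
```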

What Experienced People Know

  • Start with perf stat before perf record — the counters tell you whether the problem is CPU-bound, memory-bound, or I/O-bound before you invest time in detailed profiling
  • perf top is the fastest path to "what is burning CPU right now" — it takes two seconds to get an answer, no recording or analysis step needed
  • Without debug symbols, perf output shows hex addresses instead of function names — install -dbg or -debuginfo packages before profiling, or the output is nearly useless
  • IPC (instructions per cycle) below 1.0 usually means the workload is stalled on memory access or branch mispredictions, not limited by CPU speed
  • High context-switch counts suggest lock contention, too many threads, or heavy I/O multiplexing — not necessarily a scheduling problem
  • --call-graph dwarf gives accurate stack traces for most workloads but increases data file size significantly; use it when you need to understand call chains, skip it for quick checks
  • Profiling JVM, Python, or Node.js requires extra setup (perf maps, frame pointers, or JIT dump files) to get readable symbols — without this, you see only interpreter internals
  • perf record defaults to the current directory for perf.data — in production, use -o /tmp/perf.data to avoid filling application directories

    Fun fact: Flame graphs were invented by Brendan Gregg at Joyent in 2011 while debugging a MySQL performance issue. He needed a way to visualize thousands of unique stack traces at once. The key insight: collapse identical stacks, sort alphabetically (not by time), and make the width proportional to sample count. The x-axis is not time — it is the alphabetical merge of stack frames. This is the most common misreading of flame graphs.

  • Flame graphs (via Brendan Gregg's tools) are the most effective way to visualize perf record data — perf report is useful but flame graphs reveal call-chain patterns much faster

  • In containers, you almost always need to profile from the host using the host-namespace PID, because containers typically lack the required capabilities and kernel access
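The flame-graph pipeline is three commands once Brendan Gregg's FlameGraph scripts are available; the clone destination here is illustrative:

```shell
# One-time: fetch the FlameGraph scripts
git clone https://github.com/brendangregg/FlameGraph /tmp/FlameGraph

# Record with call graphs, then fold stacks and render the SVG
sudo perf record -F 99 -g -p <pid> -- sleep 30
sudo perf script | /tmp/FlameGraph/stackcollapse-perf.pl \
                 | /tmp/FlameGraph/flamegraph.pl > /tmp/flame.svg
```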

Wiki Navigation

Prerequisites

  • OpenTelemetry (Topic Pack, L2) — Tracing
  • Performance Flashcards (CLI) (flashcard_deck, L1) — perf Profiling
  • Tracing (Topic Pack, L1) — Tracing
  • Tracing Flashcards (CLI) (flashcard_deck, L1) — Tracing
  • strace (Topic Pack, L1) — Tracing