
Portal | Level: L2: Operations | Topics: perf Profiling, Tracing | Domain: Linux

perf Profiling - Primer

Why This Matters

When a service is slow and logs show nothing wrong, you need to see where CPU time is actually going. perf is the Linux kernel's built-in profiling and tracing tool. It answers questions that no amount of log analysis can: which function is burning cycles, whether the bottleneck is in userspace or kernel, and whether the system is CPU-bound or waiting on I/O.

For production performance debugging, perf is often the first tool that gives you real data instead of theories. It works on any compiled or interpreted language running on Linux, because it samples at the CPU level rather than requiring application-level instrumentation. A five-second perf top session can reveal what hours of log reading cannot.

Understanding perf also builds intuition about how Linux actually executes your workloads — cache behavior, context switch costs, and syscall overhead become concrete numbers rather than abstract concepts.

Core Concepts

Who made it: perf was written by Thomas Gleixner, Ingo Molnar, and Peter Zijlstra and merged into the Linux kernel source tree in 2009 (kernel 2.6.31). It lives in tools/perf/ within the kernel repo — it is literally part of the kernel, not a third-party tool. This gives it direct access to kernel performance counters via the perf_event_open syscall (added in Linux 2.6.31).

Name origin: perf is short for "performance counters." The underlying kernel subsystem was originally called "Performance Counters for Linux" (PCL) but was renamed to "perf_events" because it expanded beyond hardware counters to include software events, tracepoints, and probes.

1. perf Subcommands Overview

Subcommand Purpose When to Use
perf top Live hot-function view Something is burning CPU right now
perf record Collect samples to a file Need to analyze offline or share
perf report Read a recorded profile After perf record
perf stat Hardware counter summary Quick characterization of a workload
perf trace Syscall/event tracing Lighter-weight alternative to strace
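These subcommands chain together naturally in a triage workflow. A sketch of the usual order, with `<pid>` and durations as placeholders:

```shell
# 1. Characterize first: is the workload CPU-bound or waiting?
sudo perf stat -p <pid> sleep 10

# 2. If CPU-bound, find the hot functions live
sudo perf top -p <pid>

# 3. Capture 30 seconds with call graphs for offline analysis
sudo perf record -g -p <pid> -- sleep 30
perf report
```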

2. perf top — Live Profiling

perf top shows a continuously-updated view of the hottest functions across the system or a specific process. Think of it as top but for functions, not processes.

# System-wide hot functions (requires root)
sudo perf top

# Profile a specific process
sudo perf top -p <pid>

# Profile a specific process, counting each function's own samples
# (--no-children hides inclusive/accumulated child costs)
sudo perf top -p <pid> --no-children

Output looks like:

Samples: 12K of event 'cycles', 4000 Hz
Overhead  Shared Object        Symbol
  18.42%  libc.so.6            [.] __memcpy_avx2
   9.31%  myapp                [.] parse_request
   7.22%  [kernel]             [k] copy_user_enhanced_fast_string
   5.10%  myapp                [.] serialize_response
   3.44%  libpthread.so.0      [.] pthread_mutex_lock

The [.] means userspace, [k] means kernel. High kernel percentages suggest heavy I/O, syscalls, or memory management — not necessarily a kernel bug.
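To isolate one side of that split, event modifiers restrict where samples are taken (`:u` for user, `:k` for kernel). A sketch:

```shell
# Userspace cycles only -- the [k] rows disappear from the view
sudo perf top -p <pid> -e cycles:u

# Kernel cycles only -- shows which kernel paths the process exercises
sudo perf top -p <pid> -e cycles:k
```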

3. perf record / perf report — Offline Profiling

For deeper analysis, record samples to a file and analyze later:

# Record samples from a running process (Ctrl+C to stop)
sudo perf record -p <pid>

# Record with call graphs (essential for understanding call chains)
sudo perf record --call-graph dwarf -p <pid>

# Record a specific command from start to finish
# (-- separates perf's own options from the command's)
sudo perf record -- ./myapp --serve

# Analyze the recorded data
perf report

# Analyze with call graphs expanded (recording must have used --call-graph or -g)
perf report -g

The default output file is perf.data in the current directory. It contains raw sample data and can be large for long recording sessions.
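Housekeeping around perf.data is worth getting right before a long session. A sketch with illustrative paths:

```shell
# Write samples somewhere deliberate instead of ./perf.data
sudo perf record -o /tmp/myapp-$(date +%s).data -p <pid> -- sleep 30

# Point perf report at that file
perf report -i /tmp/myapp-*.data

# Compare two recordings, e.g. before and after a fix
perf diff /tmp/before.data /tmp/after.data
```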

4. perf stat — Hardware Counters

perf stat runs a command and reports aggregate hardware performance counters. This gives you a high-level characterization of a workload in seconds.

# Basic counter summary
perf stat ./myapp --process-batch

# Detailed counters
perf stat -d ./myapp --process-batch

# Profile a running process for 10 seconds
sudo perf stat -p <pid> sleep 10

Output example:

 Performance counter stats for './myapp --process-batch':

       3,412.18 msec  task-clock               #  0.982 CPUs utilized
            218       context-switches          #  63.89 /sec
              4       cpu-migrations            #   1.17 /sec
         12,481       page-faults               #   3.66 K/sec
  9,847,231,004       cycles                    #   2.89 GHz
  7,123,456,789       instructions              #   0.72  insn per cycle
  1,234,567,890       branches                  #  361.76 M/sec
     45,678,901       branch-misses             #   3.70% of all branches
         251,443       cache-references          #  73.69 K/sec
          31,204       cache-misses              #  12.41% of all cache refs

Remember: IPC (instructions per cycle, shown as "insn per cycle") below 1.0 usually means the workload is memory-bound; high page-fault counts mean allocation churn. If IPC > 2.0, the code is compute-efficient. If CPUs utilized < 0.5, the process is waiting, not computing.

Key metrics to watch:

Counter What It Tells You
task-clock / CPUs utilized Is the workload CPU-bound or waiting?
context-switches High = contention or I/O heavy
page-faults High = memory allocation churn or cold start
instructions per cycle (IPC) < 1.0 = memory/branch stalls; > 2.0 = efficient
cache-misses High % = data structures are cache-unfriendly
branch-misses High % = unpredictable control flow
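The ratios in this table can be derived directly from the raw counter values (`perf stat -x,` emits machine-readable CSV for scripting). A minimal sketch using the sample numbers from the output above; in practice the counters would be parsed from `perf stat` output:

```shell
# Derive IPC and branch-miss rate from raw counter values
# (values taken from the example perf stat output above)
cycles=9847231004
instructions=7123456789
branches=1234567890
branch_misses=45678901

ipc=$(awk -v i="$instructions" -v c="$cycles" 'BEGIN { printf "%.2f", i / c }')
miss_rate=$(awk -v m="$branch_misses" -v b="$branches" 'BEGIN { printf "%.2f", 100 * m / b }')

echo "IPC: $ipc"                 # below 1.0 suggests memory/branch stalls
echo "Branch miss %: $miss_rate"
```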

5. perf trace — Syscall Tracing

perf trace is a lightweight alternative to strace for syscall tracing. It uses the kernel's tracing infrastructure rather than ptrace, so it has lower overhead.

# Trace syscalls for a specific process
sudo perf trace -p <pid>

# Trace a command
sudo perf trace ls /tmp

# Trace only specific syscalls
sudo perf trace -e open,read,write,close ls /tmp

# Summary mode (like strace -c)
sudo perf trace -s ./myapp

Compared to strace:

Aspect strace perf trace
Mechanism ptrace (per-syscall stop) Kernel tracepoints
Overhead Higher (process stops per call) Lower
Detail Full argument decoding Less argument detail
Attach Any user can trace own process Often needs root
Maturity Battle-tested, ubiquitous Newer, less portable
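The two summary modes line up directly; both print per-syscall counts and timings instead of every individual call. A sketch:

```shell
# strace: per-syscall counts and time, via ptrace
strace -c -f ls /tmp

# perf trace: the same idea via kernel tracepoints, lower overhead
sudo perf trace -s ls /tmp
```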

6. Sampling vs Full Tracing

perf uses two fundamentally different data collection approaches. Choosing the wrong one wastes time or misleads you.

Sampling (perf top, perf record): A hardware timer fires at a fixed rate (default ~4000 Hz). Each interrupt records the current instruction pointer — which function was executing at that instant. After enough samples, the functions that appear most often are the ones consuming the most CPU. The key property: overhead is fixed regardless of workload intensity, because the sample rate is constant.

Tracing and counting (perf trace, perf stat): Hooks into kernel tracepoints or hardware counters to capture every occurrence of specific events — every syscall, every context switch, every page fault. The data is complete (no statistical gaps), but for per-event tracing the overhead scales linearly with event rate: a process making 100K syscalls/sec generates 100K trace records/sec. perf stat is the cheap end of this family — it only increments counters rather than recording each event, so its overhead is negligible.

Sampling:        ···X···X···X···X···X···  (periodic snapshots)
Tracing:         XXXXX·XX··XXXX·XXX····X  (every event captured)

When to use which:

Question Use Why
Where is CPU time going? Sampling (perf record) Statistical profile of hot functions
How many syscalls per second? Tracing (perf trace -s) Need exact counts
Is the workload CPU or I/O bound? Counters (perf stat) Quick characterization
What syscall is blocking? Tracing (perf trace) Need to see the specific call
What function is 30% of CPU? Sampling (perf top) Live hot-function view

The most common mistake is using tracing when sampling would suffice — tracing a high-throughput service can add 20-40% overhead and generate gigabytes of data, while sampling the same service adds <2% overhead.
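Sample rate is the overhead knob for sampling. A sketch of two common settings (99 Hz is the conventional low-overhead rate, chosen to avoid running in lockstep with 100 Hz periodic work):

```shell
# Low-overhead system-wide profile: 99 samples/sec/CPU for 30 seconds
sudo perf record -F 99 -a -g -- sleep 30

# Higher resolution for a short, targeted capture
sudo perf record -F 4000 -p <pid> -- sleep 5
```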

7. Userspace vs Kernel Time

When profiling shows a mix of userspace [.] and kernel [k] symbols, the ratio tells you what category of problem you have:

High userspace % (>80% in application symbols): The application's own code is the bottleneck. Look at the hottest functions — are they doing unnecessary work, using a bad algorithm, or allocating excessively? This is the straightforward case.

High kernel % (>40% in [k] symbols): The workload is spending significant time in kernel code. This does NOT mean the kernel is broken. Common causes, in order of likelihood:

  1. Heavy I/O — lots of read/write/sendmsg syscalls, each crossing into kernel. Check with perf stat for high context switches and iostat for disk utilization.
  2. Memory pressure — frequent page-fault, mmap, brk calls. Check perf stat for page-fault counts and vmstat for swapping.
  3. Lock contention — threads blocking on futex in the kernel. Look for pthread_mutex_lock or futex in the profile.
  4. Excessive syscalls — thousands of small reads/writes instead of buffered I/O. The fix is batching, not kernel tuning.

High idle/wait (low task-clock / CPUs utilized in perf stat): The process is not CPU-bound — it is waiting on I/O, network, or locks. perf record will show mostly epoll_wait, poll, or futex at the top. Profiling the CPU further is pointless; switch to I/O tools (iostat, ss, tcpdump) or lock analysis.
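The three cases above can be turned into a rough first-pass triage. A sketch that classifies a workload from its "CPUs utilized" figure; the thresholds are the rules of thumb from this section, not anything perf itself reports:

```shell
# Classify a workload given its "CPUs utilized" value from perf stat
classify_utilization() {
  awk -v u="$1" 'BEGIN {
    if (u < 0.5)      print "waiting: profile I/O and locks, not CPU"
    else if (u < 0.9) print "mixed: check both CPU profile and wait time"
    else              print "cpu-bound: perf record / perf top will pay off"
  }'
}

classify_utilization 0.98   # the CPU-bound case from the example output
classify_utilization 0.12   # mostly waiting on I/O or locks
```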

8. Prerequisites for Good Profiles

Before profiling, verify these are in place:

Requirement Why How to Check
Debug symbols Readable function names vs hex addresses file ./myapp shows "not stripped"
-dbg packages Kernel/library symbols e.g. apt install libc6-dbg (package names vary by distro)
Sufficient runtime Enough samples for significance Record 10-30 seconds minimum
Correct PID Profile the right process pgrep -f myapp or pidof myapp
Permissions perf needs access to counters Root, or perf_event_paranoid sysctl
Container awareness Host PID namespace differs from container Map container PID to host PID

For containers and Kubernetes:

# Find the host PID of a container process
PID=$(crictl inspect <container-id> | jq '.info.pid')

# Profile from the host
sudo perf top -p $PID
sudo perf record -p $PID -g -- sleep 30

# Kernel perf_event_paranoid setting:
# -1/0/1 = unprivileged users can profile their own processes;
# 2 = user-space measurements only; >2 (some distros) = root only
cat /proc/sys/kernel/perf_event_paranoid
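If perf_event_paranoid is too strict for an unprivileged user, it can be relaxed until the next reboot (requires root). kptr_restrict similarly controls whether kernel symbol addresses are visible in profiles:

```shell
# Allow unprivileged per-process profiling (reverts at reboot)
sudo sysctl kernel.perf_event_paranoid=1

# Make kernel addresses resolvable in profiles
sudo sysctl kernel.kptr_restrict=0
```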

What Experienced People Know

  • Start with perf stat before perf record — the counters tell you whether the problem is CPU-bound, memory-bound, or I/O-bound before you invest time in detailed profiling
  • perf top is the fastest path to "what is burning CPU right now" — it takes two seconds to get an answer, no recording or analysis step needed
  • Without debug symbols, perf output shows hex addresses instead of function names — install -dbg or -debuginfo packages before profiling, or the output is nearly useless
  • IPC (instructions per cycle) below 1.0 usually means the workload is stalled on memory access or branch mispredictions, not limited by CPU speed
  • High context-switch counts suggest lock contention, too many threads, or heavy I/O multiplexing — not necessarily a scheduling problem
  • --call-graph dwarf gives accurate stack traces for most workloads but increases data file size significantly; use it when you need to understand call chains, skip it for quick checks
  • Profiling JVM, Python, or Node.js requires extra setup (perf maps, frame pointers, or JIT dump files) to get readable symbols — without this, you see only interpreter internals
  • perf record defaults to the current directory for perf.data — in production, use -o /tmp/perf.data to avoid filling application directories

    Fun fact: Flame graphs were invented by Brendan Gregg at Joyent in 2011 while debugging a MySQL performance issue. He needed a way to visualize thousands of unique stack traces at once. The key insight: collapse identical stacks, sort alphabetically (not by time), and make the width proportional to sample count. The x-axis is not time — it is the alphabetical merge of stack frames. This is the most common misreading of flame graphs.

  • Flame graphs (via Brendan Gregg's tools) are the most effective way to visualize perf record data — perf report is useful but flame graphs reveal call-chain patterns much faster

  • In containers, you almost always need to profile from the host using the host-namespace PID, because containers typically lack the required capabilities and kernel access
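The flame-graph pipeline is three commands once Brendan Gregg's FlameGraph scripts are available; the clone destination here is illustrative:

```shell
# One-time: fetch the FlameGraph scripts
git clone https://github.com/brendangregg/FlameGraph /tmp/FlameGraph

# Record with call graphs, then fold stacks and render the SVG
sudo perf record -F 99 -g -p <pid> -- sleep 30
sudo perf script | /tmp/FlameGraph/stackcollapse-perf.pl \
                 | /tmp/FlameGraph/flamegraph.pl > /tmp/flame.svg
```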

Wiki Navigation

Prerequisites

  • OpenTelemetry (Topic Pack, L2) — Tracing
  • Performance Flashcards (CLI) (flashcard_deck, L1) — perf Profiling
  • Tracing (Topic Pack, L1) — Tracing
  • Tracing Flashcards (CLI) (flashcard_deck, L1) — Tracing
  • strace (Topic Pack, L1) — Tracing