perf Profiling — Street-Level Ops¶
Quick Diagnosis Commands¶
# What is burning CPU right now? (live view, like top for functions)
sudo perf top
# Press 'q' to quit, 'e' to expand call graph
# Profile a specific process (live view; refreshes continuously)
sudo perf top -p $(pgrep -f myapp)
# Quick hardware counter summary (is it CPU-bound? memory-bound?)
sudo perf stat -p $(pgrep -f myapp) sleep 10
# Record a profile with call graphs for offline analysis
sudo perf record --call-graph dwarf -p $(pgrep -f myapp) -o /tmp/perf.data -- sleep 30
# Analyze the recorded profile
perf report -i /tmp/perf.data
# Syscall summary (lighter than strace)
sudo perf trace -s -p $(pgrep -f myapp) -- sleep 10
# Check if perf is available and what version
perf version
# If missing: apt install linux-tools-$(uname -r)
# Check perf_event_paranoid (determines who can profile)
cat /proc/sys/kernel/perf_event_paranoid
# -1 = no restrictions, 0 = any user may profile (incl. kernel samples)
# 1 = unprivileged users limited to their own processes, 2 = user-space-only for unprivileged
# To temporarily lower: echo 1 | sudo tee /proc/sys/kernel/perf_event_paranoid
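The paranoid levels are easy to forget under pressure; a small sh helper (hypothetical, with meanings per the kernel sysctl docs) translates the number:

```shell
# Hypothetical helper: translate a perf_event_paranoid value into what it permits.
explain_paranoid() {
  case "$1" in
    -1) echo "no restrictions" ;;
    0)  echo "any user may profile, including kernel samples" ;;
    1)  echo "unprivileged users may profile their own processes" ;;
    *)  echo "unprivileged users get user-space samples of their own processes only" ;;
  esac
}

# Falls back to the common restrictive default (2) when /proc is unreadable
explain_paranoid "$(cat /proc/sys/kernel/perf_event_paranoid 2>/dev/null || echo 2)"
```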
> **Default trap:** On most distributions, `perf_event_paranoid` defaults to 2 or higher, meaning only root can profile. In containers, even root may be denied because the `perf_event_open` syscall is blocked by the default seccomp profile. Profile from the host using the container's PID instead.
# Verify debug symbols are available
file /usr/bin/python3 | grep -o 'not stripped' || echo 'no symbols'
# For Debian/Ubuntu: apt install python3-dbg linux-image-$(uname -r)-dbgsym
Gotcha: perf report Shows Only Hex Addresses¶
Symptom: perf report output shows 0x00007f3a2b4c1234 instead of function names. The profile is unreadable.
Rule: Without debug symbols, perf cannot resolve addresses to function names. This makes profiles nearly useless.
# Step 1: Check if the binary has symbols
file /path/to/myapp
# "not stripped" = good, "stripped" = no symbols
# Step 2: Install debug symbol packages
# Debian/Ubuntu:
sudo apt install -y linux-tools-$(uname -r) libc6-dbg
# For Python:
sudo apt install -y python3-dbg
# For Go: symbols are kept by default; avoid stripping with -ldflags='-s -w'
# (go build -gcflags='-N -l' disables optimizations for clearer stacks; not needed for symbols)
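To audit several binaries at once, a sketch that classifies `file` output; note the grep must use the full phrase, since a plain "stripped" pattern also matches "not stripped":

```shell
# Sketch: classify `file` output read on stdin as "symbols" or "stripped".
symbol_status() {
  if grep -q 'not stripped'; then
    echo "symbols"
  else
    echo "stripped"
  fi
}

# Usage: file /path/to/myapp | symbol_status
printf 'myapp: ELF 64-bit LSB executable, x86-64, not stripped\n' | symbol_status
```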
# Step 3: For containers, profile from the host
# Container binaries may be stripped; install symbols on the host
# Or bind-mount a debug symbol directory into the container
# Step 4: For JVM (Java/Kotlin/Scala)
# Enable perf map generation:
# java -XX:+PreserveFramePointer -XX:+UnlockDiagnosticVMOptions \
#      -XX:+DumpPerfMapAtExit -jar app.jar   # DumpPerfMapAtExit needs JDK 17+
# This writes /tmp/perf-<pid>.map, which perf reads automatically
# Step 5: For Node.js
# node --perf-basic-prof app.js
# Creates /tmp/perf-<pid>.map
# Step 6: For Python
# pip install py-spy (alternative: generates its own flamegraphs)
# Or use perf with Python frame pointers:
# python3 -X perf app.py (Python 3.12+)
Remember: "No symbols = no names." Missing symbols are the number one reason perf profiles come out useless. Before recording, always check:
file /path/to/binary | grep -o 'not stripped'. For interpreted and JIT-compiled languages (Python, Node, JVM), you need a perf map file — once the right flag is enabled, the runtime writes a /tmp/perf-<pid>.map that perf reads automatically.
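To confirm the map file actually got written before burning a 30-second recording, a quick check (has_perf_map is a hypothetical helper, not a real tool):

```shell
# Hypothetical helper: check whether a runtime has emitted a perf map for a PID.
has_perf_map() {
  if [ -s "/tmp/perf-$1.map" ]; then
    echo "perf map present for pid $1"
  else
    echo "no perf map for pid $1 (enable the runtime flag and restart the process)"
  fi
}

# Usage: has_perf_map "$(pgrep -f myapp)"
```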
Gotcha: High Kernel Percentage in Profile¶
Symptom: perf top shows 40-60% of CPU time in kernel functions ([k] symbols). You suspect a kernel bug.
Rule: High kernel time almost never means a kernel bug. It means the workload is doing heavy I/O, memory allocation, or hitting lock contention.
# Step 1: Characterize with perf stat
sudo perf stat -p $(pgrep -f myapp) sleep 10
# Look at: context-switches, page-faults, CPUs utilized
# If CPUs utilized < 0.5:
# → Not CPU-bound. Process is waiting on I/O or locks.
# → Switch to: iostat, ss, strace
# If context-switches > 10K/sec:
# → Lock contention or excessive I/O multiplexing
# → Check: perf record -e sched:sched_switch
# If page-faults > 100K:
# → Memory allocation churn
# → Check: perf record -e page-faults
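The thresholds above are per-second rates, but perf stat prints raw counts for the whole window. A small awk sketch (rates_from_stat is hypothetical) converts one to the other; pass the same duration you gave sleep:

```shell
# Sketch: convert raw perf stat counts into per-second rates.
rates_from_stat() {
  awk -v d="$1" '
    /context-switches|page-faults/ {
      n = $1
      gsub(/,/, "", n)                 # strip thousands separators
      printf "%s %.0f/sec\n", $2, n / d
    }'
}

# Feed it saved perf stat output, e.g.:
rates_from_stat 10 <<'EOF'
       120,000      context-switches
         5,000      page-faults
EOF
```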
# Step 2: Identify specific kernel functions
sudo perf top -p $(pgrep -f myapp) --no-children
# Common kernel hotspots and what they mean:
# copy_user_enhanced_fast_string → heavy read/write syscalls
# __alloc_pages → frequent memory allocation
# _raw_spin_lock → kernel lock contention
# tcp_sendmsg → heavy network I/O
# ext4_readpages → heavy disk reads
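For scripted triage, the lookup above fits in a case statement; a sketch (explain_kernel_symbol is hypothetical, the mappings are the ones just listed):

```shell
# Sketch: map a hot kernel symbol to its likely cause (table from above).
explain_kernel_symbol() {
  case "$1" in
    copy_user*)      echo "heavy read/write syscalls" ;;
    __alloc_pages*)  echo "frequent memory allocation" ;;
    _raw_spin_lock*) echo "kernel lock contention" ;;
    tcp_*)           echo "heavy network I/O" ;;
    ext4_*)          echo "heavy disk I/O" ;;
    *)               echo "unknown: read the surrounding call graph" ;;
  esac
}

explain_kernel_symbol _raw_spin_lock
```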
Pattern: Generate a Flame Graph¶
Flame graphs are the most effective visualization for perf data.
# Step 1: Record with call graphs
sudo perf record --call-graph dwarf -p $(pgrep -f myapp) -o /tmp/perf.data -- sleep 30
# Step 2: Get Brendan Gregg's FlameGraph tools
git clone https://github.com/brendangregg/FlameGraph /opt/FlameGraph 2>/dev/null
# Step 3: Generate the flame graph
perf script -i /tmp/perf.data \
| /opt/FlameGraph/stackcollapse-perf.pl \
| /opt/FlameGraph/flamegraph.pl > /tmp/flamegraph.svg
# Step 4: View in a browser
# scp /tmp/flamegraph.svg to your laptop, open in Chrome/Firefox
# Or serve it: python3 -m http.server 8080 --directory /tmp
# Reading the flame graph:
# - Width = proportion of CPU time (wider = more CPU)
# - Height = stack depth (bottom = entry point, top = leaf function)
# - Click to zoom into a subtree
# - Search (Ctrl+F in the SVG) for specific function names
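The steps above collapse into one reusable function; a sketch assuming the FlameGraph checkout lives at /opt/FlameGraph (override via FG):

```shell
# Sketch: record-to-SVG pipeline as a single function.
# Assumes the FlameGraph clone from step 2; override with FG=/path.
make_flamegraph() {
  fg="${FG:-/opt/FlameGraph}"
  data="${1:-/tmp/perf.data}"
  if [ ! -r "$data" ]; then
    echo "error: no perf data at $data (run perf record first)" >&2
    return 1
  fi
  perf script -i "$data" \
    | "$fg/stackcollapse-perf.pl" \
    | "$fg/flamegraph.pl" > /tmp/flamegraph.svg \
    && echo "wrote /tmp/flamegraph.svg"
}

# Usage: make_flamegraph /tmp/perf.data
```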
Pattern: Profile a Container from the Host¶
# Step 1: Find the host PID of the container's main process
# Docker:
PID=$(docker inspect --format '{{.State.Pid}}' mycontainer)
# containerd/crictl:
PID=$(crictl inspect $(crictl ps --name myapp -q) | jq '.info.pid')
# Step 2: Profile using the host PID
sudo perf record --call-graph dwarf -p $PID -o /tmp/perf.data -- sleep 30
# Step 3: If symbols are missing, point perf to container's root filesystem
# Docker:
ROOTFS=$(docker inspect --format '{{.GraphDriver.Data.MergedDir}}' mycontainer)
perf report -i /tmp/perf.data --symfs $ROOTFS
# Step 4: For Kubernetes pods
# Find the node, SSH in, then:
PID=$(crictl inspect $(crictl ps --name myapp -q) | jq '.info.pid')
sudo perf record --call-graph dwarf -p $PID -o /tmp/perf.data -- sleep 30
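If jq is missing on the node, the pid can usually be scraped from the inspect JSON with sed; a fallback sketch (pid_from_inspect is hypothetical, and assumes a single top-level "pid" key as crictl emits under .info):

```shell
# Sketch: extract the first "pid": N value from inspect JSON on stdin.
pid_from_inspect() {
  sed -n 's/.*"pid"[[:space:]]*:[[:space:]]*\([0-9][0-9]*\).*/\1/p' | head -n 1
}

# Usage: crictl inspect <container-id> | pid_from_inspect
printf '{"info": {"pid": 4321}}\n' | pid_from_inspect
```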
Pattern: Quick IPC Check (Am I CPU-Bound or Memory-Bound?)¶
# IPC (Instructions Per Cycle) is the single best indicator
sudo perf stat -p $(pgrep -f myapp) sleep 10 2>&1 | grep 'insn per cycle'
# IPC interpretation:
# > 2.0 → CPU-efficient, compute-bound (good utilization)
# 1.0-2.0 → moderate efficiency, normal workloads
# < 1.0 → stalled, likely memory-bound (cache misses, branch misprediction)
# < 0.5 → severely memory-bound (consider data structure optimization)
> **One-liner:** IPC below 1.0 means the CPU is spending more time waiting for data than doing work. Fix the data layout before you optimize the algorithm.
# If IPC is low, check cache misses:
sudo perf stat -e cache-misses,cache-references -p $(pgrep -f myapp) sleep 10
# cache-misses > 20% of cache-references → data is not cache-friendly
# If IPC is low, check branch mispredictions:
sudo perf stat -e branch-misses,branches -p $(pgrep -f myapp) sleep 10
# branch-misses > 5% → unpredictable control flow
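Scripting the thresholds keeps triage consistent across hosts; classify_ipc below is a hypothetical helper applying exactly the bands above:

```shell
# Sketch: classify an IPC reading using the bands listed above.
classify_ipc() {
  awk -v ipc="$1" 'BEGIN {
    if      (ipc > 2.0)  print "compute-bound, good utilization"
    else if (ipc >= 1.0) print "moderate efficiency"
    else if (ipc >= 0.5) print "stalled, likely memory-bound"
    else                 print "severely memory-bound"
  }'
}

classify_ipc 0.8
```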
Pattern: Comparing Before and After a Code Change¶
# Record baseline
sudo perf stat -d -o /tmp/baseline.txt ./myapp --benchmark
# Deploy the change, then record again
sudo perf stat -d -o /tmp/after.txt ./myapp --benchmark
# Compare side by side (manual; perf stat output has no built-in diff)
paste /tmp/baseline.txt /tmp/after.txt | column -t
# For detailed comparison, use perf diff /tmp/before.data /tmp/after.data, or a differential flame graph:
# Record both profiles
sudo perf record -o /tmp/before.data --call-graph dwarf ./myapp --benchmark
sudo perf record -o /tmp/after.data --call-graph dwarf ./myapp --benchmark
# Generate differential flame graph
perf script -i /tmp/before.data | /opt/FlameGraph/stackcollapse-perf.pl > /tmp/before.folded
perf script -i /tmp/after.data | /opt/FlameGraph/stackcollapse-perf.pl > /tmp/after.folded
/opt/FlameGraph/difffolded.pl /tmp/before.folded /tmp/after.folded \
| /opt/FlameGraph/flamegraph.pl > /tmp/diff.svg
# Red = regression (more CPU), blue = improvement (less CPU)
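difffolded.pl essentially computes a per-stack sample delta between the two folded files; a minimal awk sketch of the same idea (fold_diff is hypothetical):

```shell
# Sketch: per-stack delta between two folded files ("stack count" lines).
# Positive delta = more samples after the change (regression).
# (Stacks present only in the baseline are ignored in this sketch.)
fold_diff() {
  awk '
    NR == FNR { before[$1] = $2; next }   # first file: remember baseline counts
    { delta = $2 - before[$1]
      if (delta != 0) print $1, delta }
  ' "$1" "$2"
}

# Usage: fold_diff /tmp/before.folded /tmp/after.folded
```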
Pattern: Profiling with perf When You Cannot Install Packages¶
# On minimal containers or locked-down hosts where you cannot install perf:
# Option 1: Use perf from the host (works when the host has perf for its running kernel)
sudo perf record -p $HOST_PID -- sleep 30
# Option 2: Use a sidecar debug container (Kubernetes)
kubectl debug -it pod/myapp --image=ubuntu:22.04 --target=myapp
# Inside: apt update && apt install -y linux-tools-generic
# --target shares the app container's process namespace; find the app PID with ps aux
# Then: perf top -p <pid>
# Option 3: Use /proc/timer_list for a rough profiling proxy
# Lists pending kernel timers per CPU (crude, but heavy timer churn hints at busy polling)
cat /proc/timer_list | grep -A2 'expires'
# Option 4: Use py-spy for Python (no perf needed)
pip install py-spy
sudo py-spy top --pid $(pgrep -f python)
sudo py-spy record --pid $(pgrep -f python) -o /tmp/profile.svg
Useful One-Liners¶
# List all available perf events
perf list 2>&1 | head -30
# Count context switches for 10 seconds
sudo perf stat -e context-switches -a sleep 10
# Trace page faults for a process
sudo perf record -e page-faults -p $(pgrep -f myapp) -- sleep 10
perf report
# Show top functions with percentage (no TUI, just text)
perf report -i /tmp/perf.data --stdio | head -40
# Record CPU cycles at a lower rate (less overhead for sensitive production)
sudo perf record -F 99 -p $(pgrep -f myapp) -- sleep 30
# 99 Hz instead of default 4000 Hz — much lower overhead
> **Gotcha:** Using `-F 99` instead of `-F 100` avoids lock-step sampling with periodic tasks (timers, GC cycles) that often run at round-number frequencies. The odd sample rate prevents aliasing artifacts in your profile.
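Expected sample volume is just rate x seconds x busy CPUs, which makes the overhead trade-off concrete:

```shell
# Back-of-envelope: expected samples = sample rate * seconds * busy CPUs.
expected_samples() {
  awk -v hz="$1" -v secs="$2" -v cpus="$3" 'BEGIN { print hz * secs * cpus }'
}

expected_samples 99 30 1     # 2970 samples: cheap even in production
expected_samples 4000 30 1   # 120000 samples at the default rate
```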
# Check if hardware PMU counters are available
perf stat -e cycles true 2>&1 | grep -i 'not supported'
# If in a VM: hardware counters may not be available, use -e cpu-clock instead
# Quick off-CPU analysis (where is the process waiting?)
sudo perf record -e sched:sched_switch -p $(pgrep -f myapp) -- sleep 10
perf report