perf Profiling — Street-Level Ops¶
Quick Diagnosis Commands¶
# What is burning CPU right now? (live view, like top for functions)
sudo perf top
# Press 'q' to quit, 'e' to expand call graph
# Profile a specific process (live view; refreshes continuously)
sudo perf top -p $(pgrep -f myapp)
# Quick hardware counter summary (is it CPU-bound? memory-bound?)
sudo perf stat -p $(pgrep -f myapp) sleep 10
# Record a profile with call graphs for offline analysis
sudo perf record --call-graph dwarf -p $(pgrep -f myapp) -o /tmp/perf.data -- sleep 30
# Analyze the recorded profile
perf report -i /tmp/perf.data
# Syscall summary (lighter than strace)
sudo perf trace -s -p $(pgrep -f myapp) -- sleep 10
# Check if perf is available and what version
perf version
# If missing: apt install linux-tools-$(uname -r)
# Check perf_event_paranoid (determines who can profile)
cat /proc/sys/kernel/perf_event_paranoid
# -1 = no restrictions, 0 = any user may profile (incl. kernel samples)
# 1 = unprivileged users limited to their own processes, 2 = user-space-only for unprivileged
# To temporarily lower: echo 1 | sudo tee /proc/sys/kernel/perf_event_paranoid
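The paranoid levels are easy to forget under pressure; a small sh helper (hypothetical, with meanings per the kernel sysctl docs) translates the number:

```shell
# Hypothetical helper: translate a perf_event_paranoid value into what it permits.
explain_paranoid() {
  case "$1" in
    -1) echo "no restrictions" ;;
    0)  echo "any user may profile, including kernel samples" ;;
    1)  echo "unprivileged users may profile their own processes" ;;
    *)  echo "unprivileged users get user-space samples of their own processes only" ;;
  esac
}

# Falls back to the common restrictive default (2) when /proc is unreadable
explain_paranoid "$(cat /proc/sys/kernel/perf_event_paranoid 2>/dev/null || echo 2)"
```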
> **Default trap:** On most distributions, `perf_event_paranoid` defaults to 2 or higher, meaning only root can profile. In containers, even root may be denied because the `perf_event_open` syscall is blocked by the default seccomp profile. Profile from the host using the container's PID instead.
# Verify debug symbols are available
file /usr/bin/python3 | grep -o 'not stripped' || echo 'no symbols'
# For Debian/Ubuntu: apt install python3-dbg linux-image-$(uname -r)-dbgsym
Gotcha: perf report Shows Only Hex Addresses¶
Symptom: perf report output shows 0x00007f3a2b4c1234 instead of function names. The profile is unreadable.
Rule: Without debug symbols, perf cannot resolve addresses to function names. This makes profiles nearly useless.
# Step 1: Check if the binary has symbols
file /path/to/myapp
# "not stripped" = good, "stripped" = no symbols
# Step 2: Install debug symbol packages
# Debian/Ubuntu:
sudo apt install -y linux-tools-$(uname -r) libc6-dbg
# For Python:
sudo apt install -y python3-dbg
# For Go: symbols are kept by default; avoid stripping with -ldflags='-s -w'
# (go build -gcflags='-N -l' disables optimizations for clearer stacks; not needed for symbols)
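To audit several binaries at once, a sketch that classifies `file` output; note the grep must use the full phrase, since a plain "stripped" pattern also matches "not stripped":

```shell
# Sketch: classify `file` output read on stdin as "symbols" or "stripped".
symbol_status() {
  if grep -q 'not stripped'; then
    echo "symbols"
  else
    echo "stripped"
  fi
}

# Usage: file /path/to/myapp | symbol_status
printf 'myapp: ELF 64-bit LSB executable, x86-64, not stripped\n' | symbol_status
```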
# Step 3: For containers, profile from the host
# Container binaries may be stripped; install symbols on the host
# Or bind-mount a debug symbol directory into the container
# Step 4: For JVM (Java/Kotlin/Scala)
# Enable perf map generation:
# java -XX:+PreserveFramePointer -XX:+UnlockDiagnosticVMOptions \
#      -XX:+DumpPerfMapAtExit -jar app.jar   # DumpPerfMapAtExit needs JDK 17+
# This writes /tmp/perf-<pid>.map, which perf reads automatically
# Step 5: For Node.js
# node --perf-basic-prof app.js
# Creates /tmp/perf-<pid>.map
# Step 6: For Python
# pip install py-spy (alternative: generates its own flamegraphs)
# Or use perf with Python frame pointers:
# python3 -X perf app.py (Python 3.12+)
Remember: "No symbols = no names." Missing symbols are the number one reason perf profiles come out useless. Before recording, always check:
file /path/to/binary | grep -o 'not stripped'. For interpreted and JIT-compiled languages (Python, Node, JVM), you need a perf map file — once the right flag is enabled, the runtime writes a /tmp/perf-<pid>.map that perf reads automatically.
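To confirm the map file actually got written before burning a 30-second recording, a quick check (has_perf_map is a hypothetical helper, not a real tool):

```shell
# Hypothetical helper: check whether a runtime has emitted a perf map for a PID.
has_perf_map() {
  if [ -s "/tmp/perf-$1.map" ]; then
    echo "perf map present for pid $1"
  else
    echo "no perf map for pid $1 (enable the runtime flag and restart the process)"
  fi
}

# Usage: has_perf_map "$(pgrep -f myapp)"
```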
Gotcha: High Kernel Percentage in Profile¶
Symptom: perf top shows 40-60% of CPU time in kernel functions ([k] symbols). You suspect a kernel bug.
Rule: High kernel time almost never means a kernel bug. It means the workload is doing heavy I/O, memory allocation, or hitting lock contention.
# Step 1: Characterize with perf stat
sudo perf stat -p $(pgrep -f myapp) sleep 10
# Look at: context-switches, page-faults, CPUs utilized
# If CPUs utilized < 0.5:
# → Not CPU-bound. Process is waiting on I/O or locks.
# → Switch to: iostat, ss, strace
# If context-switches > 10K/sec:
# → Lock contention or excessive I/O multiplexing
# → Check: perf record -e sched:sched_switch
# If page-faults > 100K:
# → Memory allocation churn
# → Check: perf record -e page-faults
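The thresholds above are per-second rates, but perf stat prints raw counts for the whole window. A small awk sketch (rates_from_stat is hypothetical) converts one to the other; pass the same duration you gave sleep:

```shell
# Sketch: convert raw perf stat counts into per-second rates.
rates_from_stat() {
  awk -v d="$1" '
    /context-switches|page-faults/ {
      n = $1
      gsub(/,/, "", n)                 # strip thousands separators
      printf "%s %.0f/sec\n", $2, n / d
    }'
}

# Feed it saved perf stat output, e.g.:
rates_from_stat 10 <<'EOF'
       120,000      context-switches
         5,000      page-faults
EOF
```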
# Step 2: Identify specific kernel functions
sudo perf top -p $(pgrep -f myapp) --no-children
# Common kernel hotspots and what they mean:
# copy_user_enhanced_fast_string → heavy read/write syscalls
# __alloc_pages → frequent memory allocation
# _raw_spin_lock → kernel lock contention
# tcp_sendmsg → heavy network I/O
# ext4_readpages → heavy disk reads
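For scripted triage, the lookup above fits in a case statement; a sketch (explain_kernel_symbol is hypothetical, the mappings are the ones just listed):

```shell
# Sketch: map a hot kernel symbol to its likely cause (table from above).
explain_kernel_symbol() {
  case "$1" in
    copy_user*)      echo "heavy read/write syscalls" ;;
    __alloc_pages*)  echo "frequent memory allocation" ;;
    _raw_spin_lock*) echo "kernel lock contention" ;;
    tcp_*)           echo "heavy network I/O" ;;
    ext4_*)          echo "heavy disk I/O" ;;
    *)               echo "unknown: read the surrounding call graph" ;;
  esac
}

explain_kernel_symbol _raw_spin_lock
```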
Pattern: Generate a Flame Graph¶
Flame graphs are the most effective visualization for perf data.
# Step 1: Record with call graphs
sudo perf record --call-graph dwarf -p $(pgrep -f myapp) -o /tmp/perf.data -- sleep 30
# Step 2: Get Brendan Gregg's FlameGraph tools
git clone https://github.com/brendangregg/FlameGraph /opt/FlameGraph 2>/dev/null
# Step 3: Generate the flame graph
perf script -i /tmp/perf.data \
| /opt/FlameGraph/stackcollapse-perf.pl \
| /opt/FlameGraph/flamegraph.pl > /tmp/flamegraph.svg
# Step 4: View in a browser
# scp /tmp/flamegraph.svg to your laptop, open in Chrome/Firefox
# Or serve it: python3 -m http.server 8080 --directory /tmp
# Reading the flame graph:
# - Width = proportion of CPU time (wider = more CPU)
# - Height = stack depth (bottom = entry point, top = leaf function)
# - Click to zoom into a subtree
# - Search (Ctrl+F in the SVG) for specific function names
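The steps above collapse into one reusable function; a sketch assuming the FlameGraph checkout lives at /opt/FlameGraph (override via FG):

```shell
# Sketch: record-to-SVG pipeline as a single function.
# Assumes the FlameGraph clone from step 2; override with FG=/path.
make_flamegraph() {
  fg="${FG:-/opt/FlameGraph}"
  data="${1:-/tmp/perf.data}"
  if [ ! -r "$data" ]; then
    echo "error: no perf data at $data (run perf record first)" >&2
    return 1
  fi
  perf script -i "$data" \
    | "$fg/stackcollapse-perf.pl" \
    | "$fg/flamegraph.pl" > /tmp/flamegraph.svg \
    && echo "wrote /tmp/flamegraph.svg"
}

# Usage: make_flamegraph /tmp/perf.data
```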
Pattern: Profile a Container from the Host¶
# Step 1: Find the host PID of the container's main process
# Docker:
PID=$(docker inspect --format '{{.State.Pid}}' mycontainer)
# containerd/crictl:
PID=$(crictl inspect $(crictl ps --name myapp -q) | jq '.info.pid')
# Step 2: Profile using the host PID
sudo perf record --call-graph dwarf -p $PID -o /tmp/perf.data -- sleep 30
# Step 3: If symbols are missing, point perf to container's root filesystem
# Docker:
ROOTFS=$(docker inspect --format '{{.GraphDriver.Data.MergedDir}}' mycontainer)
perf report -i /tmp/perf.data --symfs $ROOTFS
# Step 4: For Kubernetes pods
# Find the node, SSH in, then:
PID=$(crictl inspect $(crictl ps --name myapp -q) | jq '.info.pid')
sudo perf record --call-graph dwarf -p $PID -o /tmp/perf.data -- sleep 30
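If jq is missing on the node, the pid can usually be scraped from the inspect JSON with sed; a fallback sketch (pid_from_inspect is hypothetical, and assumes a single top-level "pid" key as crictl emits under .info):

```shell
# Sketch: extract the first "pid": N value from inspect JSON on stdin.
pid_from_inspect() {
  sed -n 's/.*"pid"[[:space:]]*:[[:space:]]*\([0-9][0-9]*\).*/\1/p' | head -n 1
}

# Usage: crictl inspect <container-id> | pid_from_inspect
printf '{"info": {"pid": 4321}}\n' | pid_from_inspect
```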
Pattern: Quick IPC Check (Am I CPU-Bound or Memory-Bound?)¶
# IPC (Instructions Per Cycle) is the single best indicator
sudo perf stat -p $(pgrep -f myapp) sleep 10 2>&1 | grep 'insn per cycle'
# IPC interpretation:
# > 2.0 → CPU-efficient, compute-bound (good utilization)
# 1.0-2.0 → moderate efficiency, normal workloads
# < 1.0 → stalled, likely memory-bound (cache misses, branch misprediction)
# < 0.5 → severely memory-bound (consider data structure optimization)
> **One-liner:** IPC below 1.0 means the CPU is spending more time waiting for data than doing work. Fix the data layout before you optimize the algorithm.
# If IPC is low, check cache misses:
sudo perf stat -e cache-misses,cache-references -p $(pgrep -f myapp) sleep 10
# cache-misses > 20% of cache-references → data is not cache-friendly
# If IPC is low, check branch mispredictions:
sudo perf stat -e branch-misses,branches -p $(pgrep -f myapp) sleep 10
# branch-misses > 5% → unpredictable control flow
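Scripting the thresholds keeps triage consistent across hosts; classify_ipc below is a hypothetical helper applying exactly the bands above:

```shell
# Sketch: classify an IPC reading using the bands listed above.
classify_ipc() {
  awk -v ipc="$1" 'BEGIN {
    if      (ipc > 2.0)  print "compute-bound, good utilization"
    else if (ipc >= 1.0) print "moderate efficiency"
    else if (ipc >= 0.5) print "stalled, likely memory-bound"
    else                 print "severely memory-bound"
  }'
}

classify_ipc 0.8
```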
Pattern: Comparing Before and After a Code Change¶
# Record baseline
sudo perf stat -d -o /tmp/baseline.txt ./myapp --benchmark
# Deploy the change, then record again
sudo perf stat -d -o /tmp/after.txt ./myapp --benchmark
# Compare side by side (manual; perf stat output has no built-in diff)
paste /tmp/baseline.txt /tmp/after.txt | column -t
# For detailed comparison, use perf diff /tmp/before.data /tmp/after.data, or a differential flame graph:
# Record both profiles
sudo perf record -o /tmp/before.data --call-graph dwarf ./myapp --benchmark
sudo perf record -o /tmp/after.data --call-graph dwarf ./myapp --benchmark
# Generate differential flame graph
perf script -i /tmp/before.data | /opt/FlameGraph/stackcollapse-perf.pl > /tmp/before.folded
perf script -i /tmp/after.data | /opt/FlameGraph/stackcollapse-perf.pl > /tmp/after.folded
/opt/FlameGraph/difffolded.pl /tmp/before.folded /tmp/after.folded \
| /opt/FlameGraph/flamegraph.pl > /tmp/diff.svg
# Red = regression (more CPU), blue = improvement (less CPU)
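difffolded.pl essentially computes a per-stack sample delta between the two folded files; a minimal awk sketch of the same idea (fold_diff is hypothetical):

```shell
# Sketch: per-stack delta between two folded files ("stack count" lines).
# Positive delta = more samples after the change (regression).
# (Stacks present only in the baseline are ignored in this sketch.)
fold_diff() {
  awk '
    NR == FNR { before[$1] = $2; next }   # first file: remember baseline counts
    { delta = $2 - before[$1]
      if (delta != 0) print $1, delta }
  ' "$1" "$2"
}

# Usage: fold_diff /tmp/before.folded /tmp/after.folded
```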
Pattern: Profiling with perf When You Cannot Install Packages¶
# On minimal containers or locked-down hosts where you cannot install perf:
# Option 1: Use perf from the host (works when the host has perf for its running kernel)
sudo perf record -p $HOST_PID -- sleep 30
# Option 2: Use a sidecar debug container (Kubernetes)
kubectl debug -it pod/myapp --image=ubuntu:22.04 --target=myapp
# Inside: apt update && apt install -y linux-tools-generic
# --target shares the app container's process namespace; find the app PID with ps aux
# Then: perf top -p <pid>
# Option 3: Use /proc/timer_list for a rough profiling proxy
# Lists pending kernel timers per CPU (crude, but heavy timer churn hints at busy polling)
cat /proc/timer_list | grep -A2 'expires'
# Option 4: Use py-spy for Python (no perf needed)
pip install py-spy
sudo py-spy top --pid $(pgrep -f python)
sudo py-spy record --pid $(pgrep -f python) -o /tmp/profile.svg
Useful One-Liners¶
# List all available perf events
perf list 2>&1 | head -30
# Count context switches for 10 seconds
sudo perf stat -e context-switches -a sleep 10
# Trace page faults for a process
sudo perf record -e page-faults -p $(pgrep -f myapp) -- sleep 10
perf report
# Show top functions with percentage (no TUI, just text)
perf report -i /tmp/perf.data --stdio | head -40
# Record CPU cycles at a lower rate (less overhead for sensitive production)
sudo perf record -F 99 -p $(pgrep -f myapp) -- sleep 30
# 99 Hz instead of default 4000 Hz — much lower overhead
> **Gotcha:** Using `-F 99` instead of `-F 100` avoids lock-step sampling with periodic tasks (timers, GC cycles) that often run at round-number frequencies. The odd sample rate prevents aliasing artifacts in your profile.
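Expected sample volume is just rate x seconds x busy CPUs, which makes the overhead trade-off concrete:

```shell
# Back-of-envelope: expected samples = sample rate * seconds * busy CPUs.
expected_samples() {
  awk -v hz="$1" -v secs="$2" -v cpus="$3" 'BEGIN { print hz * secs * cpus }'
}

expected_samples 99 30 1     # 2970 samples: cheap even in production
expected_samples 4000 30 1   # 120000 samples at the default rate
```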
# Check if hardware PMU counters are available
perf stat -e cycles true 2>&1 | grep -i 'not supported'
# If in a VM: hardware counters may not be available, use -e cpu-clock instead
# Quick off-CPU analysis (where is the process waiting?)
sudo perf record -e sched:sched_switch -p $(pgrep -f myapp) -- sleep 10
perf report