perf Profiling Footguns

Mistakes that produce misleading profiles, waste investigation time, or add unacceptable overhead to production systems.


1. Profiling Without Debug Symbols

You run perf record on a production binary and open perf report. Every function shows as a hex address like 0x00007f3a2b4c1234. You cannot tell which function is consuming CPU. You spend an hour trying to decode addresses manually. The binary was compiled with strip or the -s flag, removing all symbol information.

Fix: Install debug symbol packages before profiling. On Debian/Ubuntu: apt install libc6-dbg linux-image-$(uname -r)-dbgsym. For Go: build with go build -gcflags='-N -l' (disables inlining and optimization for readable profiles). For C/C++: compile with -g and do not strip. For production binaries that must be stripped, keep a separate debug symbol file and pass it to perf with --symfs.


2. Using Full Tracing When Sampling Would Suffice

You attach perf trace (full syscall tracing) to a production service doing 80,000 syscalls per second. Each syscall generates a trace record. The overhead is 20-40%. The service's latency doubles. Monitoring alerts fire. You are now causing the production issue you were trying to diagnose.

Fix: Start with sampling (perf top or perf record) for CPU profiling; at a fixed sample rate the overhead stays roughly constant, typically under 2%, regardless of workload intensity. Use perf stat for aggregate counters (near-zero overhead). Reserve full tracing (perf trace) for short bursts (5-10 seconds), filtered to the specific syscalls you care about (e.g. perf trace -e sendto,recvfrom). Never leave a trace attached to a high-throughput process for more than 30 seconds in production.
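The overhead math is worth doing before attaching anything. A back-of-envelope sketch; the 4 microsecond per-event cost is an assumed illustrative number, not a measurement:

```shell
# Back-of-envelope tracing overhead. rate comes from the scenario above;
# cost_us is an assumed per-record tracing cost, for illustration only.
awk 'BEGIN {
  rate    = 80000   # syscalls per second
  cost_us = 4       # assumed microseconds of tracing overhead per syscall
  printf "tracing burns %.0f%% of one CPU\n", rate * cost_us / 1e6 * 100
}'
```

With these numbers the answer is 32%, squarely inside the 20-40% range quoted above; the same arithmetic at 500 syscalls/second gives 0.2%, which is why tracing a quiet process is fine and tracing a busy one is not.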


3. Profiling at Default Sample Rate in Production

The default perf record sample rate is ~4000 Hz. On a latency-sensitive service, even 4000 interrupts per second can add measurable jitter. For real-time or high-frequency trading workloads, this is unacceptable.

Fix: Lower the sample rate for production profiling:

# 99 Hz — low overhead, still produces useful profiles in 30 seconds
sudo perf record -F 99 -p $PID -- sleep 30

99 Hz is a common choice because it avoids aliasing with common timer frequencies (100 Hz, 250 Hz, 1000 Hz kernel timers). Even at 99 Hz, a 30-second recording produces ~3000 samples — enough for statistical significance on hot functions.

Remember: The prime-number trick (99 Hz instead of 100 Hz) prevents lock-step sampling where every sample lands at the same point in a periodic timer handler. If your sample rate is a multiple of the kernel timer frequency, you get a biased profile that over-represents timer interrupt code. Brendan Gregg popularized this guideline — use 49, 97, or 99 Hz for production profiling.
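The lock-step effect is plain arithmetic and can be seen without running perf at all. The sketch below computes where each sample lands inside a 10 ms (100 Hz) kernel timer period, using integer arithmetic to keep the modulus exact:

```shell
# Phase of sample i inside the 10 ms timer period: sample time is i/rate,
# so phase = ((i*100) mod rate) / rate * 10 ms. At 100 Hz every sample
# hits the same phase; at 99 Hz the phase sweeps across the period.
awk 'BEGIN {
  for (i = 0; i < 5; i++) {
    p100 = ((i * 100) % 100) / 100 * 10
    p99  = ((i * 100) %  99) /  99 * 10
    printf "sample %d: phase@100Hz=%.3f ms  phase@99Hz=%.3f ms\n", i, p100, p99
  }
}'
```

The 100 Hz column is 0.000 ms on every line (always sampling the same instant in the timer's cycle), while the 99 Hz column drifts by about 0.1 ms per sample, eventually covering the whole period.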


4. Recording Without Call Graphs

You run perf record -p $PID and analyze with perf report. You see that memcpy is using 15% of CPU. But you do not know who is calling memcpy. Without call graphs, you know the leaf function but not the call chain that leads to it. You cannot fix the problem because you cannot identify the caller.

Fix: Always record with call graphs when you need to understand call chains:

sudo perf record --call-graph dwarf -p $PID -- sleep 30

The dwarf method uses DWARF unwind information and works with most compiled languages. The fp method (frame pointer) is faster but requires binaries compiled with -fno-omit-frame-pointer. For Go, dwarf is the safe default; fp only works where the Go toolchain maintains frame pointers. The tradeoff: --call-graph dwarf produces 5-10x larger data files.


5. Interpreting High Kernel Percentage as a Kernel Bug

perf top shows 50% of CPU time in kernel functions. You file a bug report against the kernel. The actual cause: your application makes 100,000 small write() calls per second instead of buffering. Each write() crosses into kernel space. The kernel is doing exactly what it was asked to do — the bug is in your application's I/O pattern.

Fix: High kernel percentage means the workload is kernel-intensive (I/O, memory management, locking), not that the kernel is broken. Investigate what is driving the kernel work:

- copy_user_enhanced → too many small read/write syscalls; fix with buffering
- _raw_spin_lock → kernel lock contention; reduce contention in the application
- __alloc_pages → frequent memory allocation; use memory pools
- tcp_sendmsg → heavy network I/O; batch sends or use sendmmsg
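The small-writes pattern can be reproduced as a toy experiment. dd issues one read()/write() pair per block, so block size directly controls syscall count; the 1 MiB size here is an arbitrary demo value:

```shell
# Toy version of the buffering fix: copy 1 MiB with 1-byte writes versus
# 64 KiB writes. Same bytes moved, wildly different syscall counts.
head -c 1048576 /dev/zero > in.bin
dd if=in.bin of=unbuffered.bin bs=1     2>/dev/null   # ~2 million syscalls
dd if=in.bin of=buffered.bin   bs=65536 2>/dev/null   # 32 syscalls
cmp unbuffered.bin buffered.bin && echo "same output, 65536x fewer syscalls"
```

Profiling the bs=1 copy would show exactly the symptom above: most CPU time in kernel entry/exit and copy_user paths, with the kernel doing nothing wrong.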


6. Profiling the Wrong Process in a Container

You profile PID 1 inside a container. PID 1 is the init process (tini, dumb-init) or the shell that launched your application. The actual application runs as a child process with a different PID. Your profile shows mostly idle/wait because the init process does almost nothing.

Fix: Profile from the host using the host-namespace PID of the actual application process:

# Find the right PID
docker top mycontainer
# Or:
PID=$(docker inspect --format '{{.State.Pid}}' mycontainer)
# This is the container's PID 1 in host namespace
# If the app is a child process, find it:
pstree -p $PID
# Profile the actual app PID, not the init wrapper
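The pstree step can also be scripted: pgrep -P lists the direct children of a PID. The demo below uses the current shell as a stand-in for the init wrapper, with a background sleep playing the application:

```shell
# Find child PIDs of an init wrapper without pstree. Here $$ (this shell)
# stands in for the container's PID 1 and the sleep is the "application".
PID=$$
sleep 30 &
app_pid=$!
pgrep -P "$PID"     # prints the wrapper's child PIDs; $app_pid is among them
kill "$app_pid"
```

Against a real container, substitute the host-namespace PID from docker inspect for $$ and profile whichever child is the actual service.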

7. Forgetting That VMs May Not Expose Hardware PMU Counters

You run perf stat inside a virtual machine and get: <not supported> cycles. Hardware Performance Monitoring Unit (PMU) counters are not virtualized by default in many hypervisors. Without hardware counters, perf stat and perf record using cycles events fail or produce empty profiles.

Fix: Check for PMU support: perf stat -e cycles true. If not supported, use software events instead:

# Software event that always works (lower precision but functional)
sudo perf record -e cpu-clock -p $PID -- sleep 30
sudo perf stat -e cpu-clock,task-clock,context-switches -p $PID sleep 10

On AWS EC2, hardware PMU counters are exposed on bare-metal instance types (and on some full-sized Nitro instances); smaller virtualized instances generally do not see them. On KVM/QEMU, launch the guest with -cpu host to pass the host PMU model through to the guest.
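The check-then-fall-back logic is easy to script. A minimal sketch; it degrades gracefully even on hosts where perf itself is missing:

```shell
# Pick a hardware event if the PMU is visible, otherwise fall back to the
# software clock. The probe runs perf against the trivial command "true".
if perf stat -e cycles true >/dev/null 2>&1; then
  EVENT=cycles       # hardware PMU counter available
else
  EVENT=cpu-clock    # software event, always supported
fi
echo "profiling with -e $EVENT"
```

Downstream commands can then use -e "$EVENT" unconditionally, so the same diagnostic script works on bare metal and inside VMs.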


8. Filling Disk with perf.data in Production

You start perf record --call-graph dwarf on a busy production process and forget about it. The perf.data file grows at 10-50 MB/second. After an hour, it has consumed 30 GB of disk space. The application's data partition fills up, causing the application itself to fail.

Fix: Always limit recording duration and specify an output path:

# Record for exactly 30 seconds, output to /tmp (not the app directory)
sudo perf record --call-graph dwarf -p $PID -o /tmp/perf.data -- sleep 30

Use -m 512 to limit the ring buffer size. Monitor disk usage during recording. Set a cron job or alarm to remind you to stop the recording. Never use perf record without a sleep duration limiter in production.
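A size guard is more reliable than remembering to stop. Sketch below, with a writer loop standing in for perf record (the byte limit is tiny for demo purposes); with real perf, a TERM or INT signal makes it flush perf.data and exit cleanly:

```shell
# Size watchdog sketch. The background loop is a stand-in for perf record;
# swap in "perf record ... -o out.data" and a limit in the gigabytes.
limit=4096
: > out.data
( while :; do head -c 1024 /dev/zero >> out.data; sleep 0.05; done ) &
writer=$!
while kill -0 "$writer" 2>/dev/null; do
  size=$(wc -c < out.data)
  if [ "$size" -ge "$limit" ]; then
    kill "$writer"            # default TERM; perf flushes its data on TERM
    break
  fi
  sleep 0.05
done
echo "stopped recorder at $size bytes"
```

This bounds the damage even if the sleep duration limiter is forgotten or the recording is started by an automated job.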


9. Comparing Profiles from Different Time Periods Without Context

You record a profile on Monday when the service is idle, and another on Wednesday during peak load. The Wednesday profile shows 5x more time in parse_json. You conclude a code regression happened. In reality, the service is just processing 5x more requests. The per-request cost is identical.

Fix: Normalize profiles to meaningful units. Use perf stat to get absolute instruction counts and divide by request count. Compare IPC (instructions per cycle) which is workload-independent. When comparing flame graphs, use differential flame graphs (difffolded.pl) which show the delta between two profiles. Always record the request rate and other context alongside the profile.
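The normalization step is simple division, shown here with hypothetical numbers matching the scenario above (5x the raw work at 5x the request rate):

```shell
# Toy normalization: raw instruction counts differ 5x between the two
# recordings, but per-request cost is identical. All numbers hypothetical.
awk 'BEGIN {
  mon_insns = 2.0e9;  mon_reqs = 1000    # idle Monday
  wed_insns = 10.0e9; wed_reqs = 5000    # peak Wednesday
  printf "Mon: %.0f instructions/request\n", mon_insns / mon_reqs
  printf "Wed: %.0f instructions/request\n", wed_insns / wed_reqs
}'
```

Both lines come out to 2000000 instructions/request: no regression, just more traffic. A real regression shows up as a changed per-request cost, not a changed total.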


10. Using perf top in a Headless Script

You try to use perf top in a CI/CD pipeline or automated diagnostic script. perf top requires an interactive terminal — it continuously updates the screen. In a non-interactive context, it either hangs waiting for terminal input or produces garbled output.

Fix: Use perf record + perf report --stdio for non-interactive profiling:

# Record for 10 seconds
sudo perf record -F 99 -p $PID -o /tmp/perf.data -- sleep 10
# Generate text report (no TUI)
perf report -i /tmp/perf.data --stdio --percent-limit 1 > /tmp/profile.txt
# First 40 lines: report header plus the hottest functions
perf report -i /tmp/perf.data --stdio | head -40

11. Not Accounting for Compiler Optimizations in Profiles

You profile a C++ application and see that function process_batch() uses 0% CPU. You know this function does heavy computation. The compiler inlined process_batch() into its caller, so the function no longer exists as a discrete entity in the profile. All its CPU time is attributed to the caller.

Fix: For accurate per-function attribution, compile with -fno-inline during profiling sessions (not in production). Alternatively, use --call-graph dwarf and look at the call chain — inlined functions may still appear in DWARF debug info even though they are not separate functions in the binary. For production profiling where you cannot recompile, accept that inlined functions are attributed to their callers and read profiles accordingly.


12. Ignoring Off-CPU Time When the Problem Is Latency

Your service has high P99 latency. You run perf record and see that CPU functions are fast — the hottest function is only 2% of total samples. The profile looks healthy. But the problem is not CPU — the process spends 90% of its time blocked on I/O, lock acquisition, or DNS resolution. On-CPU profiling misses this entirely.

Fix: For latency problems, profile off-CPU time (where the process is waiting):

# Trace scheduler switches to see what the process is waiting on
sudo perf record -e sched:sched_switch -p $PID -- sleep 10
perf report

# Or use BCC/bpftrace for off-CPU flame graphs
/usr/share/bcc/tools/offcputime -p $PID -df 10 \
  | /opt/FlameGraph/flamegraph.pl --countname=us > /tmp/offcpu.svg

If perf stat shows CPUs utilized < 0.5, the process is not CPU-bound and on-CPU profiling will not find the bottleneck.
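A quick off-CPU sanity check needs nothing more than /proc. The sketch below shows a process that is "running" for a second of wall time while consuming zero CPU ticks (a sleep, blocked the whole time, standing in for an I/O-bound service):

```shell
# Compare CPU time to wall time via /proc: fields 14 and 15 of
# /proc/PID/stat are utime and stime in clock ticks.
sleep 5 &
pid=$!
sleep 1
ticks=$(awk '{ print $14 + $15 }' /proc/$pid/stat)
echo "CPU ticks used in 1s of wall time: $ticks"
kill "$pid"
```

The output is 0 ticks: an on-CPU profile of this process would look perfectly healthy while the latency lives entirely in the waiting, which is the situation off-CPU profiling is for.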