Performance Profiling — Trivia & Interesting Facts¶
Surprising, historical, and little-known facts about performance profiling on Linux.
perf was written by Ingo Molnar and Thomas Gleixner over a weekend¶
The perf tool (originally "perf_counter") was written in 2008 and merged into Linux 2.6.31 (2009). Ingo Molnar, the kernel's scheduler maintainer, created it together with Thomas Gleixner as a response to OProfile's limitations. perf leverages hardware Performance Monitoring Counters (PMCs) built into every modern CPU to count events like cache misses, branch mispredictions, and instructions retired with near-zero overhead.
Hardware performance counters have been in CPUs since the Pentium Pro¶
Intel added Performance Monitoring Counters to the Pentium Pro in 1995. These are dedicated CPU registers that count microarchitectural events (L1 cache hits, TLB misses, branch mispredictions) without slowing the workload. Modern Intel CPUs have 4-8 general-purpose PMCs per core. Despite being available for 30 years, most developers have never directly used them.
Sampling profilers lie — and that is acceptable¶
Sampling profilers like perf periodically interrupt the CPU and record the call stack that was executing. At 4,000 samples per second (perf's default rate), short-lived functions might be missed entirely. This statistical approach means profiling results are probabilistic, not exact. Despite this limitation, sampling profilers are preferred in production because their overhead is predictable (typically under 2%) regardless of workload complexity.
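The sampling idea fits in a few lines of Python. This is a toy sketch, not how perf works internally: a CPU-time timer (ITIMER_PROF) fires a signal periodically, and the handler records which function was executing. The `hot` and `cold` functions are made-up workloads for illustration.

```python
import collections
import signal

samples = collections.Counter()

def on_sample(signum, frame):
    # Record the function that was executing when the timer fired.
    samples[frame.f_code.co_name] += 1

# ITIMER_PROF counts CPU time, so sleeping code is never sampled --
# the same bias a real sampling CPU profiler has.
signal.signal(signal.SIGPROF, on_sample)
signal.setitimer(signal.ITIMER_PROF, 0.001, 0.001)  # ~1,000 samples/sec

def hot():
    # Long-running: will collect many samples.
    total = 0
    for i in range(2_000_000):
        total += i * i
    return total

def cold():
    # Finishes in well under one sampling interval: may get zero samples.
    return sum(range(1000))

hot()
cold()
signal.setitimer(signal.ITIMER_PROF, 0)  # stop sampling

print(samples.most_common(3))
```

Running this shows `hot` dominating the counter while `cold` often does not appear at all, which is exactly the "sampling profilers lie" effect described above.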
Flame graphs were rejected by academic visualization researchers¶
When Brendan Gregg presented flame graphs at conferences, visualization researchers criticized the use of the x-axis for alphabetical sorting (not time) and the lack of a y-axis label. Gregg argued that flame graphs optimized for practitioner usability, not visualization theory. The technique's explosive adoption proved his point — flame graphs are now the most widely used performance visualization in the industry.
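The alphabetical-sort criticism is easiest to see in the "folded" stack format that flame graph generators consume: identical call paths are merged and counted, then sorted by name rather than by time. The sample stacks below are invented for illustration.

```python
from collections import Counter

# Raw stack samples, root-first, as a sampling profiler might emit them.
stacks = [
    ("main", "parse", "read_file"),
    ("main", "parse", "read_file"),
    ("main", "render", "draw"),
    ("main", "parse", "tokenize"),
    ("main", "render", "draw"),
    ("main", "render", "draw"),
]

# Fold identical stacks into "frame;frame;frame count" lines -- the
# collapsed-stack input format used by flame graph tooling.
folded = Counter(";".join(s) for s in stacks)

# Sorting alphabetically (not chronologically) is what lets identical
# call paths from different moments merge into a single wide box.
for stack, count in sorted(folded.items()):
    print(stack, count)
```

Box width in the rendered SVG is proportional to the count, which is why the x-axis carries no time information at all.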
DTrace was so good that Linux spent 15 years trying to replicate it¶
DTrace, created by Bryan Cantrill, Mike Shapiro, and Adam Leventhal at Sun Microsystems in 2003, provided safe, production-grade tracing for Solaris. Linux lacked an equivalent until eBPF matured around 2016-2018. The 15-year gap forced Linux developers to use unsafe kernel modules, limited ftrace, or SystemTap (which required kernel debug symbols). DTrace's influence on eBPF's design is explicitly acknowledged.
Off-CPU profiling reveals the other half of the story¶
Traditional CPU profiling only shows time spent executing on-CPU. Off-CPU profiling (measuring time a thread spends blocked — sleeping, waiting for locks, waiting for I/O) reveals the other half. Brendan Gregg's off-CPU flame graphs (implemented via eBPF) showed that many "slow" applications spent more time blocked than executing, and CPU profiles alone would never explain the latency.
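A crude way to see the on-CPU/off-CPU split without any tracing infrastructure is to compare wall-clock time against CPU time: the gap is time the thread spent blocked. This is a simplification of what off-CPU profilers measure (they attribute blocked time to stack traces), and the `blocked` workload here is invented for illustration.

```python
import time

def measure(fn):
    # Wall-clock vs CPU-clock: the difference is time spent off-CPU.
    wall0, cpu0 = time.monotonic(), time.process_time()
    fn()
    wall = time.monotonic() - wall0
    cpu = time.process_time() - cpu0
    return wall, cpu, wall - cpu

def blocked():
    time.sleep(0.2)                      # off-CPU: invisible to a CPU profiler
    sum(i * i for i in range(200_000))   # on-CPU work

wall, cpu, off_cpu = measure(blocked)
print(f"wall={wall:.2f}s cpu={cpu:.2f}s off-cpu={off_cpu:.2f}s")
```

For this workload the off-CPU portion dwarfs the on-CPU portion, yet a CPU profile would show only the arithmetic loop.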
Valgrind runs programs 10-50x slower — and this is by design¶
Valgrind (by Julian Seward, 2002) works by translating every machine instruction through a synthetic CPU. This instrumentation approach allows Memcheck (memory error detection) and Callgrind (call-graph profiling) to catch every allocation, deallocation, and memory access. The slowdown makes Valgrind unsuitable for production but invaluable for development. It has found bugs in nearly every major open-source project.
gprof was the first widely-used profiler — from 1982¶
gprof (GNU profiler), based on research by Susan Graham, Peter Kessler, and Marshall McKusick at UC Berkeley, was one of the first practical profiling tools. It combines instrumentation (counting function calls) with sampling (measuring CPU time). Despite being over 40 years old, gcc -pg + gprof is still taught in university courses. Its limitations (no per-line profiling, no shared library support) drove the development of modern alternatives.
Intel VTune is free now — and most people do not know¶
Intel VTune Profiler, which cost thousands of dollars until 2020, is now free as part of Intel oneAPI. It provides the most detailed hardware event profiling available — down to individual pipeline stages, cache line contention, and memory bandwidth per function. For Intel CPUs, no other tool provides this level of microarchitectural insight.
Continuous profiling is replacing point-in-time profiling¶
Tools like Pyroscope, Parca, and Google Cloud Profiler continuously collect profiling data from production services and store it as a time series. This lets you compare "last Tuesday at 3 PM" to "today at 3 PM" and see exactly which functions got slower. Google has been doing this internally since at least 2010 (Google-Wide Profiling), but the approach only became available to the broader industry around 2020.
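The core operation behind that comparison is a diff between two stored profiles. A minimal sketch, with made-up per-function sample counts standing in for the time-series data a continuous profiler would store:

```python
from collections import Counter

# Hypothetical per-function CPU sample counts from two points in time.
last_tuesday = Counter({"parse_request": 120, "render": 300, "compress": 80})
today        = Counter({"parse_request": 115, "render": 610, "compress": 85})

def regressions(before, after, threshold=1.5):
    # Report functions whose sample count grew past the threshold.
    out = {}
    for fn, count in after.items():
        old = before.get(fn, 1)
        if count / old >= threshold:
            out[fn] = (old, count)
    return out

print(regressions(last_tuesday, today))
```

Here only `render` roughly doubled its samples, so it is flagged; in practice tools normalize by total samples per snapshot so that overall traffic growth is not mistaken for a regression.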
The compiler is the best profiling tool most people ignore¶
Profile-Guided Optimization (PGO) instruments a program, runs it with representative workloads, and feeds the profile back to the compiler. The compiler then optimizes hot paths, aligns branches, and inlines aggressively based on real data. PGO typically improves performance by 10-20%. Firefox, Chrome, and the Linux kernel itself use PGO for release builds.