
How to Read a Flame Graph


Topics: CPU profiling, perf, flame graphs, stack traces, bottleneck identification
Level: L2 (Operations)
Time: 45–60 minutes
Prerequisites: Basic Linux command line


The Mission

Someone sends you a flame graph. It's a colorful rectangle with hundreds of little boxes stacked on top of each other. They say "the bottleneck is obvious." You stare at it. You have no idea what you're looking at.

Flame graphs are the most powerful performance visualization tool in existence — but only if you know how to read them. This lesson teaches you to read, generate, and act on flame graphs in 15 minutes of practice.


What a Flame Graph Shows

A flame graph visualizes where your program spends CPU time by sampling stack traces.

┌─────────────────────────────────────────────────────────┐
│                         main()                          │  ← bottom: entry point
├───────────────────────┬─────────────────────────────────┤
│    handle_request()   │         process_batch()         │
├──────────┬────────────┼─────────────────────────────────┤
│ parse()  │ db_query() │         sort_results()          │
├──────────┼────────────┼──────────┬──────────────────────┤
│          │  execute() │          │      compare()       │
│          │            │          │                      │
└──────────┴────────────┴──────────┴──────────────────────┘

How to read it:

  • Y-axis (height): Stack depth. Bottom = entry point. Top = leaf functions (where CPU is actually spent).
  • X-axis (width): NOT time. Width = percentage of total samples. Wider = more CPU time.
  • Color: Random (or language-based). Color doesn't mean anything by default.
  • Order: X-axis is alphabetically sorted, NOT chronological. Don't read left-to-right as "first this, then that."
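The width rule can be made concrete. Flame graph tools consume "folded" stacks (semicolon-joined frames plus a sample count), and a frame's width is simply its share of total samples, counting every stack in which it appears. A minimal sketch; the folded lines below are hypothetical stand-ins for real stackcollapse-perf.pl output:

```python
from collections import Counter

# Folded stack format: root-first frames joined by ";", then a sample count.
# These lines are hypothetical, standing in for stackcollapse-perf.pl output.
folded = [
    ("main;handle_request;parse", 10),
    ("main;handle_request;db_query;execute", 25),
    ("main;process_batch;sort_results;compare", 65),
]

total = sum(count for _, count in folded)

# A frame's width = samples from every stack it appears in, over total samples.
width = Counter()
for stack, count in folded:
    for frame in stack.split(";"):
        width[frame] += count

for frame, samples in width.most_common():
    print(f"{frame:15s} {100 * samples / total:5.1f}%")
```

With these samples, main() spans 100% of the x-axis (it is on every stack), while the sort_results() → compare() chain accounts for 65% — exactly the wide tower you would see in the diagram above.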

Name Origin: Flame graphs were invented by Brendan Gregg in 2011, while he was debugging a MySQL performance problem at Joyent. He needed a way to visualize thousands of stack trace samples at once. The "flame" name comes from the visual shape: the top edge looks like flickering flames. Visualization researchers initially criticized the design (the x-axis ordering is meaningless, and there is no y-axis label), but adoption proved its usability.


The 60-Second Reading Method

  1. Look at the widest bars at the TOP — these are where CPU is actually spent (leaf functions). A wide malloc() at the top = lots of memory allocation. A wide write() = lots of I/O.

  2. Trace wide bars DOWN — follow the wide top bar down through its parents to understand the call chain. "Oh, malloc() is wide because parse_json() calls it heavily, which is called from handle_request()."

  3. Ignore narrow bars — if a function is 1% of samples, it's not your bottleneck. Focus on the fat stacks.

  4. Compare before and after — generate a flame graph before your optimization, make the change, generate another. The visual diff makes improvements obvious.

BEFORE: sort_results() is 40% of CPU    ← wide bar
AFTER:  sort_results() is 5% of CPU     ← narrow bar
The optimization worked. Sort went from O(n²) to O(n log n).
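Step 1 of the method ("widest bars at the TOP") is easy to automate as a sanity check: aggregate only the leaf frame of each folded stack. A sketch, again using hypothetical folded data in place of real profiler output:

```python
from collections import Counter

# Hypothetical folded stacks ("frame;frame;..." plus sample count), leaf last.
folded = [
    ("main;handle_request;parse_json;malloc", 40),
    ("main;handle_request;db_query;execute", 15),
    ("main;process_batch;sort_results;compare", 45),
]

# A LEAF frame's width = samples where it was on top of the stack,
# i.e. where the CPU was actually executing that function.
leaf_samples = Counter()
for stack, count in folded:
    leaf_samples[stack.split(";")[-1]] += count

total = sum(count for _, count in folded)
widest, samples = leaf_samples.most_common(1)[0]
print(f"widest leaf: {widest} ({100 * samples / total:.0f}% of samples)")
# → widest leaf: compare (45% of samples)
```

This is the programmatic version of "look at the widest bars at the top, then trace them down": here compare() wins, so the next question is which callers put it there.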

Generating Flame Graphs

Step 1: Record samples

# Record CPU samples for 30 seconds (all processes)
perf record -F 99 -a -g -- sleep 30
# -F 99 = 99 Hz sampling (not 100, to avoid synchronization artifacts)
# -a = all CPUs
# -g = capture call graphs (stack traces)

# Record a specific process
perf record -F 99 -g -p $(pgrep myapp) -- sleep 30

Step 2: Generate the flame graph

# Convert perf data to a flame graph
perf script | stackcollapse-perf.pl | flamegraph.pl > flame.svg

# Open in browser
firefox flame.svg
# (SVG is interactive — you can click to zoom into subtrees)

Gotcha: perf record -g captures kernel + user stacks. If you see [unknown] entries, your binary was compiled without frame pointers or debug symbols. Add -fno-omit-frame-pointer to your compiler flags, or use --call-graph dwarf with perf.

For other languages

# Java (use async-profiler, not perf)
./asprof -d 30 -f flame.html <pid>
# (asprof infers the output format from the file extension;
#  older async-profiler releases ship profiler.sh instead of asprof)

# Go (built-in pprof)
go tool pprof -http=:8080 'http://localhost:6060/debug/pprof/profile?seconds=30'

# Python (py-spy)
py-spy record -o flame.svg --pid <pid>

# Node.js (0x or clinic.js)
npx 0x app.js

What Common Bottlenecks Look Like

CPU-bound: wide tower in user code

┌──────────────────────────────────────────────┐
│                  encrypt()                   │  ← 60% of CPU
├──────────────────────────────────────────────┤
│              process_request()               │
├──────────────────────────────────────────────┤
│                 main_loop()                  │
└──────────────────────────────────────────────┘

One function dominates. Fix: optimize that function, or do less of it.

I/O wait: wide read/write/poll in kernel

┌──────────────────────────────────────────────┐
│             do_sys_read [kernel]             │  ← 50% in kernel I/O
├──────────────────────────────────────────────┤
│                 read_file()                  │
├──────────────────────────────────────────────┤
│                process_data()                │
└──────────────────────────────────────────────┘

Half the time is in kernel I/O. Fix: cache reads, use async I/O, or fix the slow disk.
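The "cache reads" fix can be as small as memoizing the read so repeat requests never reach the kernel. A minimal sketch (the function name is hypothetical; the demo writes its own throwaway file):

```python
import os
import tempfile
from functools import lru_cache

@lru_cache(maxsize=128)
def read_file_cached(path):
    # Hits the disk only on the first call per path; repeat calls are served
    # from memory, shrinking the do_sys_read tower in the flame graph.
    with open(path, "rb") as f:
        return f.read()

# Demo against a throwaway file.
fd, path = tempfile.mkstemp()
os.write(fd, b"hello")
os.close(fd)

first = read_file_cached(path)
second = read_file_cached(path)       # no read() syscall: served from cache
print(read_file_cached.cache_info())  # hits=1, misses=1
os.unlink(path)
```

The usual caveat applies: a cache like this trades memory for I/O and never sees external changes to the file, so it only fits data that is read-mostly and fits in RAM.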

Lock contention: wide futex / pthread_mutex_lock

┌──────────────────────────────────────────────┐
│           __lll_lock_wait [glibc]            │  ← 30% waiting on locks
├──────────────────────────────────────────────┤
│              pthread_mutex_lock              │
├──────────────────────────────────────────────┤
│                update_cache()                │
└──────────────────────────────────────────────┘

Threads are waiting for locks. Fix: reduce lock scope, use lock-free data structures, or reduce contention.
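"Reduce lock scope" means holding the lock only around the shared-state update, not around the expensive work. A minimal Python sketch (the function names and workload are hypothetical):

```python
import threading

lock = threading.Lock()
cache = {}

def update_cache_bad(key, data):
    with lock:             # lock held during the expensive work: heavy contention
        value = sum(data)  # stand-in for an expensive computation
        cache[key] = value

def update_cache_good(key, data):
    value = sum(data)      # compute OUTSIDE the lock
    with lock:             # lock held only for the brief dict update
        cache[key] = value

threads = [threading.Thread(target=update_cache_good, args=(i, range(100_000)))
           for i in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()
print(len(cache))  # → 4
```

In a flame graph of the "bad" version, __lll_lock_wait grows with thread count; with the "good" version, the lock is held so briefly that the wait bars shrink to noise.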

GC pressure: wide gc_collect / GC_Main

┌──────────────────────────────────────────────┐
│             gc_compact [runtime]             │  ← 15% in GC
├──────────────────────────────────────────────┤
│               runtime.mallocgc               │
├──────────────────────────────────────────────┤
│              allocate_objects()              │
└──────────────────────────────────────────────┘

Too much allocation → GC works hard. Fix: reduce allocations, reuse objects, tune GC.
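A classic "reduce allocations" fix in Python is replacing repeated string concatenation with a single "".join(). A sketch (hypothetical function names):

```python
def build_report_concat(lines):
    out = ""
    for line in lines:       # each += may allocate a fresh string: object churn
        out += line + "\n"
    return out

def build_report_join(lines):
    # one pass and one final allocation: far fewer objects for the GC to chase
    return "".join(line + "\n" for line in lines)

lines = [f"row {i}" for i in range(1000)]
assert build_report_concat(lines) == build_report_join(lines)
```

Same output, far less allocator traffic. In a flame graph this shows up as the allocation/GC frames narrowing while the useful work stays the same width.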


Off-CPU Flame Graphs: The Other Half

Standard flame graphs show ON-CPU time (where the process is running). But if your process is slow because it's WAITING (I/O, locks, sleep), those waits are invisible in a CPU flame graph.

Off-CPU flame graphs show where the process was blocked:

# Requires eBPF (bcc tools)
offcputime-bpfcc -f -p $(pgrep myapp) 30 > offcpu.stacks
# -f = folded output (one stack per line); 30 = duration in seconds
flamegraph.pl --color=io --countname=us < offcpu.stacks > offcpu.svg

Off-CPU flame graphs reveal:

  • Blocked on disk I/O (wide io_schedule)
  • Blocked on network (wide tcp_recvmsg)
  • Blocked on locks (wide futex)
  • Sleeping (wide nanosleep)

Mental Model: CPU flame graph = "where is the process running?" Off-CPU flame graph = "where is the process waiting?" Together they account for 100% of wall clock time.
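You can see this split without any profiler by comparing wall-clock time to CPU time for a process that mostly waits. A sketch: the sleep below stands in for blocking I/O and would be invisible to a CPU flame graph but dominate an off-CPU one.

```python
import time

wall_start = time.perf_counter()
cpu_start = time.process_time()

time.sleep(0.3)                          # off-CPU: waiting, not running
_ = sum(i * i for i in range(200_000))   # on-CPU: what perf would sample

wall = time.perf_counter() - wall_start  # total elapsed (wall clock) time
cpu = time.process_time() - cpu_start    # time actually spent on a CPU

print(f"wall={wall:.2f}s cpu={cpu:.2f}s off-cpu={wall - cpu:.2f}s")
```

Here wall ≈ cpu + 0.3s: the gap is exactly the off-CPU time the two flame graph types split between them.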


Flashcard Check

Q1: In a flame graph, what does the width of a bar represent?

Percentage of total CPU samples. Wider = more time spent. NOT duration or chronological order. The x-axis is alphabetically sorted.

Q2: Where should you look first in a flame graph?

The widest bars at the TOP. These are leaf functions where CPU is actually consumed. Trace them down to understand the call chain.

Q3: You see [unknown] entries. What's wrong?

The binary lacks frame pointers or debug info. Compile with -fno-omit-frame-pointer or use perf record --call-graph dwarf.

Q4: CPU flame graph shows everything is fast but the app is slow. What's missing?

Off-CPU time. The process is waiting (I/O, locks, network). Use off-CPU flame graphs to see where it's blocked.


Cheat Sheet

Quick Flame Graph (3 commands)

perf record -F 99 -a -g -- sleep 30
perf script | stackcollapse-perf.pl | flamegraph.pl > flame.svg
firefox flame.svg

Language-Specific Profilers

Language     Tool             Command
C/C++/Rust   perf             perf record -F 99 -g -p PID -- sleep 30
Java         async-profiler   ./asprof -d 30 -f flame.html PID
Go           pprof            go tool pprof -http=:8080 'http://localhost:6060/debug/pprof/profile?seconds=30'
Python       py-spy           py-spy record -o flame.svg --pid PID
Node.js      0x               npx 0x app.js

Bottleneck Patterns

Wide bar at top         Likely cause          Fix
User function           CPU-bound             Optimize algorithm
malloc / gc_collect     Allocation pressure   Reduce allocations, tune GC
futex / mutex_lock      Lock contention       Reduce lock scope
read / write (kernel)   I/O bound             Cache, async I/O
poll / epoll_wait       Idle/waiting          Normal for event loops

Takeaways

  1. Width is everything. Wide bars = CPU time. Narrow bars = ignore. Look at the widest bars at the top first.

  2. X-axis is NOT time. It's alphabetical. Don't read left-to-right as "first → then."

  3. On-CPU + Off-CPU = full picture. If the CPU flame graph looks fine but the app is slow, the bottleneck is off-CPU (I/O, locks, network).

  4. Generate, don't guess. 30 seconds of perf record gives you hard data. Don't speculate about "I think it's the database" — profile and prove it.

  5. Flame graphs date from 2011. Brendan Gregg invented them, and they remain the best performance visualization tool we have. Learn to read them; it's a career skill.


Exercises

  1. Generate a CPU flame graph. Install Brendan Gregg's FlameGraph tools: git clone https://github.com/brendangregg/FlameGraph.git. Run a CPU-intensive workload (e.g., find / -type f 2>/dev/null or gzip < /dev/urandom | head -c 100M > /dev/null). Record with perf record -F 99 -a -g -- sleep 10, then generate: perf script | ./FlameGraph/stackcollapse-perf.pl | ./FlameGraph/flamegraph.pl > flame.svg. Open the SVG in a browser and identify the widest bar at the top of the graph.

  2. Read a flame graph for bottlenecks. Using the flame graph from exercise 1 (or a sample from brendangregg.com/flamegraphs.html), answer: (a) which leaf function consumed the most CPU? (b) what is its parent call chain? (c) are there any [unknown] frames, and what would cause them? Write your answers down, then verify by clicking the SVG to zoom into the relevant subtree.

  3. Profile a Python script with py-spy. Install py-spy (pip install py-spy). Create a Python script that does something measurable (e.g., sort a large list: import random; x = [random.random() for _ in range(5_000_000)]; x.sort()). Profile it: py-spy record -o py-flame.svg -- python3 your_script.py. Open the SVG and identify whether most time is in Python code, C library functions, or the sort implementation.

  4. Compare before and after. Write a Python script with an intentionally slow operation (e.g., nested loops doing string concatenation). Profile it with py-spy. Optimize the code (e.g., use "".join()). Profile again. Place both SVGs side by side and identify the visual difference in the width of the bottleneck function.


  • The Mysterious Latency Spike — flame graphs as one tool in the diagnosis toolkit
  • strace: Reading the Matrix — syscall-level debugging
  • eBPF: The Linux Superpower — off-CPU flame graphs with eBPF