
How to Read a Flame Graph


Topics: CPU profiling, perf, flame graphs, stack traces, bottleneck identification
Level: L2 (Operations)
Time: 45–60 minutes
Prerequisites: Basic Linux command line


The Mission

Someone sends you a flame graph. It's a colorful rectangle with hundreds of little boxes stacked on top of each other. They say "the bottleneck is obvious." You stare at it. You have no idea what you're looking at.

Flame graphs are the most powerful performance visualization tool in existence — but only if you know how to read them. This lesson teaches you to read, generate, and act on flame graphs in 15 minutes of practice.


What a Flame Graph Shows

A flame graph visualizes where your program spends CPU time by sampling stack traces.

┌─────────────────────────────────────────────────────────┐
│                         main()                          │  ← bottom: entry point
├───────────────────────┬─────────────────────────────────┤
│    handle_request()   │         process_batch()         │
├──────────┬────────────┼─────────────────────────────────┤
│ parse()  │ db_query() │         sort_results()          │
├──────────┼────────────┼──────────┬──────────────────────┤
│          │  execute() │          │      compare()       │
│          │            │          │                      │
└──────────┴────────────┴──────────┴──────────────────────┘

How to read it:

  • Y-axis (height): Stack depth. Bottom = entry point. Top = leaf functions (where CPU is actually spent).
  • X-axis (width): NOT time. Width = percentage of total samples. Wider = more CPU time.
  • Color: Random (or language-based). Color doesn't mean anything by default.
  • Order: X-axis is alphabetically sorted, NOT chronological. Don't read left-to-right as "first this, then that."
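The width rule can be made concrete. Flame graph tools consume "folded" stacks (semicolon-joined frames plus a sample count), and a frame's width is simply its share of total samples, counting every stack in which it appears. A minimal sketch; the folded lines below are hypothetical stand-ins for real stackcollapse-perf.pl output:

```python
from collections import Counter

# Folded stack format: root-first frames joined by ";", then a sample count.
# These lines are hypothetical, standing in for stackcollapse-perf.pl output.
folded = [
    ("main;handle_request;parse", 10),
    ("main;handle_request;db_query;execute", 25),
    ("main;process_batch;sort_results;compare", 65),
]

total = sum(count for _, count in folded)

# A frame's width = samples from every stack it appears in, over total samples.
width = Counter()
for stack, count in folded:
    for frame in stack.split(";"):
        width[frame] += count

for frame, samples in width.most_common():
    print(f"{frame:15s} {100 * samples / total:5.1f}%")
```

With these samples, main() spans 100% of the x-axis (it is on every stack), while the sort_results() → compare() chain accounts for 65% — exactly the wide tower you would see in the diagram above.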

Name Origin: Flame graphs were invented by Brendan Gregg in 2011, while he was debugging a MySQL performance problem at Joyent. He needed a way to visualize thousands of stack trace samples at once. The "flame" name comes from the visual shape: the top edge looks like flickering flames. Visualization researchers initially criticized the design (the x-axis ordering is meaningless, and there is no y-axis label), but adoption proved its usability.


The 60-Second Reading Method

  1. Look at the widest bars at the TOP — these are where CPU is actually spent (leaf functions). A wide malloc() at the top = lots of memory allocation. A wide write() = lots of I/O.

  2. Trace wide bars DOWN — follow the wide top bar down through its parents to understand the call chain. "Oh, malloc() is wide because parse_json() calls it heavily, which is called from handle_request()."

  3. Ignore narrow bars — if a function is 1% of samples, it's not your bottleneck. Focus on the fat stacks.

  4. Compare before and after — generate a flame graph before your optimization, make the change, generate another. The visual diff makes improvements obvious.

BEFORE: sort_results() is 40% of CPU    ← wide bar
AFTER:  sort_results() is 5% of CPU     ← narrow bar
The optimization worked. Sort went from O(n²) to O(n log n).
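Step 1 of the method ("widest bars at the TOP") is easy to automate as a sanity check: aggregate only the leaf frame of each folded stack. A sketch, again using hypothetical folded data in place of real profiler output:

```python
from collections import Counter

# Hypothetical folded stacks ("frame;frame;..." plus sample count), leaf last.
folded = [
    ("main;handle_request;parse_json;malloc", 40),
    ("main;handle_request;db_query;execute", 15),
    ("main;process_batch;sort_results;compare", 45),
]

# A LEAF frame's width = samples where it was on top of the stack,
# i.e. where the CPU was actually executing that function.
leaf_samples = Counter()
for stack, count in folded:
    leaf_samples[stack.split(";")[-1]] += count

total = sum(count for _, count in folded)
widest, samples = leaf_samples.most_common(1)[0]
print(f"widest leaf: {widest} ({100 * samples / total:.0f}% of samples)")
# → widest leaf: compare (45% of samples)
```

This is the programmatic version of "look at the widest bars at the top, then trace them down": here compare() wins, so the next question is which callers put it there.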

Generating Flame Graphs

Step 1: Record samples

# Record CPU samples for 30 seconds (all processes)
perf record -F 99 -a -g -- sleep 30
# -F 99 = 99 Hz sampling (not 100, to avoid synchronization artifacts)
# -a = all CPUs
# -g = capture call graphs (stack traces)

# Record a specific process
perf record -F 99 -g -p $(pgrep myapp) -- sleep 30

Step 2: Generate the flame graph

# Convert perf data to a flame graph
perf script | stackcollapse-perf.pl | flamegraph.pl > flame.svg

# Open in browser
firefox flame.svg
# (SVG is interactive — you can click to zoom into subtrees)

Gotcha: perf record -g captures kernel + user stacks. If you see [unknown] entries, your binary was compiled without frame pointers or debug symbols. Add -fno-omit-frame-pointer to your compiler flags, or use --call-graph dwarf with perf.

For other languages

# Java (use async-profiler, not perf)
./asprof -d 30 -f flame.html <pid>
# (asprof infers the output format from the file extension;
#  older async-profiler releases ship profiler.sh instead of asprof)

# Go (built-in pprof)
go tool pprof -http=:8080 'http://localhost:6060/debug/pprof/profile?seconds=30'

# Python (py-spy)
py-spy record -o flame.svg --pid <pid>

# Node.js (0x or clinic.js)
npx 0x app.js

What Common Bottlenecks Look Like

CPU-bound: wide tower in user code

┌──────────────────────────────────────────────┐
│                  encrypt()                   │  ← 60% of CPU
├──────────────────────────────────────────────┤
│              process_request()               │
├──────────────────────────────────────────────┤
│                 main_loop()                  │
└──────────────────────────────────────────────┘

One function dominates. Fix: optimize that function, or do less of it.

I/O wait: wide read/write/poll in kernel

┌──────────────────────────────────────────────┐
│             do_sys_read [kernel]             │  ← 50% in kernel I/O
├──────────────────────────────────────────────┤
│                 read_file()                  │
├──────────────────────────────────────────────┤
│                process_data()                │
└──────────────────────────────────────────────┘

Half the time is in kernel I/O. Fix: cache reads, use async I/O, or fix the slow disk.
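The "cache reads" fix can be as small as memoizing the read so repeat requests never reach the kernel. A minimal sketch (the function name is hypothetical; the demo writes its own throwaway file):

```python
import os
import tempfile
from functools import lru_cache

@lru_cache(maxsize=128)
def read_file_cached(path):
    # Hits the disk only on the first call per path; repeat calls are served
    # from memory, shrinking the do_sys_read tower in the flame graph.
    with open(path, "rb") as f:
        return f.read()

# Demo against a throwaway file.
fd, path = tempfile.mkstemp()
os.write(fd, b"hello")
os.close(fd)

first = read_file_cached(path)
second = read_file_cached(path)       # no read() syscall: served from cache
print(read_file_cached.cache_info())  # hits=1, misses=1
os.unlink(path)
```

The usual caveat applies: a cache like this trades memory for I/O and never sees external changes to the file, so it only fits data that is read-mostly and fits in RAM.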

Lock contention: wide futex / pthread_mutex_lock

┌──────────────────────────────────────────────┐
│           __lll_lock_wait [glibc]            │  ← 30% waiting on locks
├──────────────────────────────────────────────┤
│              pthread_mutex_lock              │
├──────────────────────────────────────────────┤
│                update_cache()                │
└──────────────────────────────────────────────┘

Threads are waiting for locks. Fix: reduce lock scope, use lock-free data structures, or reduce contention.
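"Reduce lock scope" means holding the lock only around the shared-state update, not around the expensive work. A minimal Python sketch (the function names and workload are hypothetical):

```python
import threading

lock = threading.Lock()
cache = {}

def update_cache_bad(key, data):
    with lock:             # lock held during the expensive work: heavy contention
        value = sum(data)  # stand-in for an expensive computation
        cache[key] = value

def update_cache_good(key, data):
    value = sum(data)      # compute OUTSIDE the lock
    with lock:             # lock held only for the brief dict update
        cache[key] = value

threads = [threading.Thread(target=update_cache_good, args=(i, range(100_000)))
           for i in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()
print(len(cache))  # → 4
```

In a flame graph of the "bad" version, __lll_lock_wait grows with thread count; with the "good" version, the lock is held so briefly that the wait bars shrink to noise.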

GC pressure: wide gc_collect / GC_Main

┌──────────────────────────────────────────────┐
│             gc_compact [runtime]             │  ← 15% in GC
├──────────────────────────────────────────────┤
│               runtime.mallocgc               │
├──────────────────────────────────────────────┤
│              allocate_objects()              │
└──────────────────────────────────────────────┘

Too much allocation → GC works hard. Fix: reduce allocations, reuse objects, tune GC.
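A classic "reduce allocations" fix in Python is replacing repeated string concatenation with a single "".join(). A sketch (hypothetical function names):

```python
def build_report_concat(lines):
    out = ""
    for line in lines:       # each += may allocate a fresh string: object churn
        out += line + "\n"
    return out

def build_report_join(lines):
    # one pass and one final allocation: far fewer objects for the GC to chase
    return "".join(line + "\n" for line in lines)

lines = [f"row {i}" for i in range(1000)]
assert build_report_concat(lines) == build_report_join(lines)
```

Same output, far less allocator traffic. In a flame graph this shows up as the allocation/GC frames narrowing while the useful work stays the same width.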


Off-CPU Flame Graphs: The Other Half

Standard flame graphs show ON-CPU time (where the process is running). But if your process is slow because it's WAITING (I/O, locks, sleep), those waits are invisible in a CPU flame graph.

Off-CPU flame graphs show where the process was blocked:

# Requires eBPF (bcc tools)
offcputime-bpfcc -f -p $(pgrep myapp) 30 > offcpu.stacks
# -f = folded output (one stack per line); 30 = duration in seconds
flamegraph.pl --color=io --countname=us < offcpu.stacks > offcpu.svg

Off-CPU flame graphs reveal:

  • Blocked on disk I/O (wide io_schedule)
  • Blocked on network (wide tcp_recvmsg)
  • Blocked on locks (wide futex)
  • Sleeping (wide nanosleep)

Mental Model: CPU flame graph = "where is the process running?" Off-CPU flame graph = "where is the process waiting?" Together they account for 100% of wall clock time.
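You can see this split without any profiler by comparing wall-clock time to CPU time for a process that mostly waits. A sketch: the sleep below stands in for blocking I/O and would be invisible to a CPU flame graph but dominate an off-CPU one.

```python
import time

wall_start = time.perf_counter()
cpu_start = time.process_time()

time.sleep(0.3)                          # off-CPU: waiting, not running
_ = sum(i * i for i in range(200_000))   # on-CPU: what perf would sample

wall = time.perf_counter() - wall_start  # total elapsed (wall clock) time
cpu = time.process_time() - cpu_start    # time actually spent on a CPU

print(f"wall={wall:.2f}s cpu={cpu:.2f}s off-cpu={wall - cpu:.2f}s")
```

Here wall ≈ cpu + 0.3s: the gap is exactly the off-CPU time the two flame graph types split between them.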


Flashcard Check

Q1: In a flame graph, what does the width of a bar represent?

Percentage of total CPU samples. Wider = more time spent. NOT duration or chronological order. The x-axis is alphabetically sorted.

Q2: Where should you look first in a flame graph?

The widest bars at the TOP. These are leaf functions where CPU is actually consumed. Trace them down to understand the call chain.

Q3: You see [unknown] entries. What's wrong?

The binary lacks frame pointers or debug info. Compile with -fno-omit-frame-pointer or use perf record --call-graph dwarf.

Q4: CPU flame graph shows everything is fast but the app is slow. What's missing?

Off-CPU time. The process is waiting (I/O, locks, network). Use off-CPU flame graphs to see where it's blocked.


Cheat Sheet

Quick Flame Graph (3 commands)

perf record -F 99 -a -g -- sleep 30
perf script | stackcollapse-perf.pl | flamegraph.pl > flame.svg
firefox flame.svg

Language-Specific Profilers

Language     Tool             Command
C/C++/Rust   perf             perf record -F 99 -g -p PID -- sleep 30
Java         async-profiler   ./asprof -d 30 -f flame.html PID
Go           pprof            go tool pprof -http=:8080 'http://localhost:6060/debug/pprof/profile?seconds=30'
Python       py-spy           py-spy record -o flame.svg --pid PID
Node.js      0x               npx 0x app.js

Bottleneck Patterns

Wide bar at top         Likely cause          Fix
User function           CPU-bound             Optimize algorithm
malloc / gc_collect     Allocation pressure   Reduce allocations, tune GC
futex / mutex_lock      Lock contention       Reduce lock scope
read / write (kernel)   I/O bound             Cache, async I/O
poll / epoll_wait       Idle/waiting          Normal for event loops

Takeaways

  1. Width is everything. Wide bars = CPU time. Narrow bars = ignore. Look at the widest bars at the top first.

  2. X-axis is NOT time. It's alphabetical. Don't read left-to-right as "first → then."

  3. On-CPU + Off-CPU = full picture. If the CPU flame graph looks fine but the app is slow, the bottleneck is off-CPU (I/O, locks, network).

  4. Generate, don't guess. 30 seconds of perf record gives you hard data. Don't speculate about "I think it's the database" — profile and prove it.

  5. Flame graphs date from 2011. Brendan Gregg invented them, and they remain the best performance visualization tool we have. Learn to read them; it's a career skill.


Exercises

  1. Generate a CPU flame graph. Install Brendan Gregg's FlameGraph tools: git clone https://github.com/brendangregg/FlameGraph.git. Run a CPU-intensive workload (e.g., find / -type f 2>/dev/null or gzip < /dev/urandom | head -c 100M > /dev/null). Record with perf record -F 99 -a -g -- sleep 10, then generate: perf script | ./FlameGraph/stackcollapse-perf.pl | ./FlameGraph/flamegraph.pl > flame.svg. Open the SVG in a browser and identify the widest bar at the top of the graph.

  2. Read a flame graph for bottlenecks. Using the flame graph from exercise 1 (or a sample from brendangregg.com/flamegraphs.html), answer: (a) which leaf function consumed the most CPU? (b) what is its parent call chain? (c) are there any [unknown] frames, and what would cause them? Write your answers down, then verify by clicking the SVG to zoom into the relevant subtree.

  3. Profile a Python script with py-spy. Install py-spy (pip install py-spy). Create a Python script that does something measurable (e.g., sort a large list: import random; x = [random.random() for _ in range(5_000_000)]; x.sort()). Profile it: py-spy record -o py-flame.svg -- python3 your_script.py. Open the SVG and identify whether most time is in Python code, C library functions, or the sort implementation.

  4. Compare before and after. Write a Python script with an intentionally slow operation (e.g., nested loops doing string concatenation). Profile it with py-spy. Optimize the code (e.g., use "".join()). Profile again. Place both SVGs side by side and identify the visual difference in the width of the bottleneck function.


  • The Mysterious Latency Spike — flame graphs as one tool in the diagnosis toolkit
  • strace: Reading the Matrix — syscall-level debugging
  • eBPF: The Linux Superpower — off-CPU flame graphs with eBPF