
Continuous Profiling — Primer

Why This Matters

Metrics tell you that latency is high. Traces tell you which service is slow. Profiling tells you why — which function, which line, which allocation. Without continuous profiling, you are guessing at the root cause of every performance regression.

One-shot profiling (attaching pprof to a container during an incident) gives you a single snapshot, captured under incident pressure. Continuous profiling keeps CPU, memory, and goroutine profiles always on at low overhead (typically 1–5% CPU). When a regression appears in your metrics, you can rewind the profiler's timeline and see exactly how the call stacks changed between before and after.

Timeline: Continuous profiling as a concept was popularized by Google's 2010 paper "Google-Wide Profiling: A Continuous Profiling Infrastructure for Data Centers." Pyroscope was co-founded by Dmitry Filimonov and open-sourced in 2021; Grafana Labs acquired it in 2023. Parca was open-sourced by Polar Signals in 2021.

Pyroscope and Parca are the two dominant open-source platforms. Both integrate with the Grafana/Prometheus ecosystem, so you can jump from a slow trace span directly into the matching flame graph. Profiling is often called the fourth signal after logs, metrics, and traces — and it is the one most teams add last.


Core Concepts

1. Always-On vs One-Shot Profiling

| Dimension | One-Shot | Continuous |
| --- | --- | --- |
| When collected | On-demand, during an incident | Always running |
| Overhead | High (can exceed 10% during collection) | Low (1–5% steady state) |
| Usefulness for regressions | Captures current state only | Can compare any two time windows |
| Production safe | No — changes observed behavior | Yes — designed for production |
| Tooling | pprof, async-profiler, perf | Pyroscope, Parca, Grafana Pyroscope |

One-shot profiling is still useful for local development and targeted debugging. Continuous profiling is what you wire into production infrastructure.

2. Profiling Types

CPU profiling — samples the call stack at a fixed rate (typically 100Hz). Shows where the CPU is spending time. Identifies hot functions.

Heap / memory profiling — tracks allocations over time. Identifies which code paths allocate the most memory, pointing at memory leaks and GC pressure.

Goroutine profiling (Go) — shows the current state of all goroutines: running, blocked, waiting on channel, waiting on mutex. Detects goroutine leaks.

Mutex profiling — measures time spent waiting to acquire mutexes. Identifies lock contention bottlenecks.

Block profiling — measures time spent blocked on synchronization primitives (channels, mutexes, syscalls). Broader than mutex-only.

Allocation profiling — samples allocations by call stack. Distinct from heap in that it shows where allocations originate, not where they live.

3. Flame Graphs — Reading Them

A flame graph visualizes a call stack sampled many times. Each rectangle is a function. Width represents how much time (or memory, or allocations) that function and its callees consumed:

┌──────────────────────┐
│  runtime.cgocall()   │
├──────────────────────┼───────────────────────────────────┐
│ sql.(*Rows).Next()   │     json.Marshal()                │  ← leaves: where cycles are spent
├──────────────────────┴──────────────┬────────────────────┤
│         queryDB()                   │  renderTemplate()  │
├──────────────────┬──────────────────┴────────────────────┤
│  handleRequest() │              serveHTTP()              │
├──────────────────┴───────────────────────────────────────┤
│                          main()                          │  ← root: always full width
└──────────────────────────────────────────────────────────┘

Name origin: Flame graphs were invented by Brendan Gregg in 2011, while he was at Joyent. The name comes from the visual appearance — the jagged stacks rendered in warm colors look like flames. Gregg created them because existing profiling visualizations (call trees, flat profiles) could not effectively show the full picture of CPU usage across deep call stacks.

Reading rules:

- The bottom is the entry point (main, the HTTP handler)
- The top of each stack is the leaf — where the CPU actually spent cycles
- Wide plateaus near the top are hot paths worth optimizing
- A tall narrow spike means deep call chains but little time — not a problem
- Functions that appear or grow wider between two time windows point at the regression

Differential flame graphs show the difference between two time ranges. Red = more time, blue = less time. Use them to compare before and after a deploy.

4. Pyroscope Architecture

Pyroscope has two modes: push (SDK-based) and pull (agent-based).

Push mode:
┌─────────────────┐        ┌────────────────────┐
│  Application    │──SDK──▶│  Pyroscope Server  │
│  (Go/Python/JVM)│        │  (storage + UI)    │
└─────────────────┘        └────────────────────┘

Pull mode:
┌─────────────────┐          ┌───────────────────┐       ┌────────────────────┐
│  Application    │◀─scrape──│  Pyroscope Agent  │──────▶│  Pyroscope Server  │
│  (any language) │          │  (eBPF profiler)  │       │  (storage + UI)    │
└─────────────────┘          └───────────────────┘       └────────────────────┘

Push mode with Go SDK:

import (
    "log"
    "runtime"

    "github.com/grafana/pyroscope-go"
)

// Mutex and block profiling are off by default in Go — enable them first,
// otherwise those profile types will be empty.
runtime.SetMutexProfileFraction(5)
runtime.SetBlockProfileRate(5)

_, err := pyroscope.Start(pyroscope.Config{
    ApplicationName: "order-service",
    ServerAddress:   "http://pyroscope:4040",
    Logger:          pyroscope.StandardLogger,
    Tags: map[string]string{
        "region":      "us-east-1",
        "environment": "production",
        "version":     "2.4.1",
    },
    ProfileTypes: []pyroscope.ProfileType{
        pyroscope.ProfileCPU,
        pyroscope.ProfileAllocObjects,
        pyroscope.ProfileAllocSpace,
        pyroscope.ProfileInuseObjects,
        pyroscope.ProfileInuseSpace,
        pyroscope.ProfileGoroutines,
        pyroscope.ProfileMutexCount,
        pyroscope.ProfileMutexDuration,
        pyroscope.ProfileBlockCount,
        pyroscope.ProfileBlockDuration,
    },
})
if err != nil {
    log.Fatalf("pyroscope: %v", err)
}

Push mode with Python SDK:

import os

import pyroscope

pyroscope.configure(
    application_name="billing-service",
    server_address="http://pyroscope:4040",
    tags={
        "environment": "production",
        "version": os.environ.get("APP_VERSION", "unknown"),
    },
)

Pyroscope Server — Docker Compose:

services:
  pyroscope:
    image: grafana/pyroscope:latest
    ports:
      - "4040:4040"
    volumes:
      - pyroscope-data:/data

volumes:
  pyroscope-data:

Pyroscope in Kubernetes (Helm):

helm repo add grafana https://grafana.github.io/helm-charts
helm repo update
helm upgrade --install pyroscope grafana/pyroscope \
  --namespace monitoring \
  --create-namespace \
  --set pyroscope.replicationFactor=2 \
  --set minio.enabled=true \
  --set minio.persistence.size=50Gi

5. eBPF Profiler — Zero-Instrumentation Profiling

The eBPF profiler is the most operationally powerful mode. It profiles every process on a node without any application changes:

# Pyroscope eBPF agent as DaemonSet
apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: pyroscope-agent
  namespace: monitoring
spec:
  selector:
    matchLabels:
      app: pyroscope-agent
  template:
    metadata:
      labels:
        app: pyroscope-agent
    spec:
      hostPID: true
      hostNetwork: true
      containers:
        - name: agent
          image: grafana/pyroscope:latest
          args:
            - agent
          securityContext:
            privileged: true
          env:
            - name: PYROSCOPE_SERVER_ADDRESS
              value: http://pyroscope.monitoring:4040
            - name: PYROSCOPE_SPY_NAME
              value: ebpfspy
          volumeMounts:
            - name: host-proc
              mountPath: /proc
            - name: host-sys
              mountPath: /sys
      volumes:
        - name: host-proc
          hostPath:
            path: /proc
        - name: host-sys
          hostPath:
            path: /sys

Under the hood: eBPF profiling works by attaching a BPF program to the perf_event subsystem. On each CPU timer tick (typically 97-100Hz to avoid aliasing with other system timers), the eBPF program reads the current stack trace and increments a counter in a BPF map. The userspace agent periodically reads these maps and ships the aggregated profiles to the server. The overhead is low because the stack walking happens in kernel context with no context switches.

eBPF profiling requires Linux kernel 4.14+ and works for Go, C, C++, Rust, Node.js, and Python. It captures stack traces without modifying any code.

6. Parca Architecture

Parca is an open-source project from Polar Signals with a different design philosophy: it stores profiles in a columnar format optimized for long-term storage and complex queries.

┌──────────────────────────────────────────────┐
│                 Parca Agent                  │
│       (DaemonSet — eBPF sampling)            │
│                                              │
│  - Discovers targets via the Kubernetes API  │
│  - Samples stacks of all processes via eBPF  │
│  - Relabels and forwards to the Parca server │
└──────────────────────┬───────────────────────┘
                       │ gRPC
┌──────────────────────┴───────────────────────┐
│                 Parca Server                 │
│  - Stores profiles in a columnar store       │
│  - Serves ParcaQL queries                    │
│  - Web UI with flame graph visualization     │
└──────────────────────────────────────────────┘

(The agent profiles via eBPF; the Parca server can additionally scrape /debug/pprof endpoints directly via its own scrape configs.)

Parca Agent config:

# Parca Agent is configured via command-line flags rather than a config
# file; a representative invocation on each node (verify flag names
# against `parca-agent --help` for your version):
parca-agent \
  --node="${NODE_NAME}" \
  --log-level=info \
  --http-address=:7071 \
  --remote-store-address=parca.monitoring:7070 \
  --remote-store-bearer-token-file=/var/run/secrets/parca/token \
  --profiling-duration=10s

ParcaQL — query language:

# All CPU profiles for a service
profiles:type:cpu:nanoseconds:delta{service_name="order-service"}

# Narrow to a version range (useful when diffing releases in the UI)
profiles:type:cpu:nanoseconds:delta{
  service_name="order-service",
  version=~"2.4.*"
}

7. Kubernetes Profiling with Annotations

Both Pyroscope and Parca support auto-discovery via pod annotations:

# Pod template annotations — new-style keys read by Grafana Alloy / Agent:
metadata:
  annotations:
    profiles.grafana.com/cpu.scrape: "true"
    profiles.grafana.com/cpu.port: "6060"
    profiles.grafana.com/cpu.path: "/debug/pprof/profile"
    profiles.grafana.com/memory.scrape: "true"
    profiles.grafana.com/memory.port: "6060"
    profiles.grafana.com/memory.path: "/debug/pprof/heap"
    # Legacy keys used by the original Pyroscope agent:
    pyroscope.io/scrape: "true"
    pyroscope.io/application-name: "order-service"
    pyroscope.io/profile-cpu-enabled: "true"
    pyroscope.io/profile-mem-enabled: "true"
    pyroscope.io/port: "6060"

Your Go service must expose pprof:

import (
    "log"
    "net/http"
    _ "net/http/pprof" // registers /debug/pprof/* on http.DefaultServeMux
)

// In main():
go func() {
    log.Println(http.ListenAndServe("localhost:6060", nil))
}()

8. Correlating Profiles with Traces via Exemplars

The most powerful use of continuous profiling is correlation with tracing. Pyroscope can link from a trace span to the profile captured during that span's execution by tagging profiles with the trace ID:

// In your trace handler (tracer, order, result, and err are in scope):
ctx, span := tracer.Start(ctx, "processOrder")
defer span.End()

// Tag the profile with the trace ID so the profiler UI can link them
pyroscope.TagWrapper(ctx, pyroscope.Labels(
    "traceID", span.SpanContext().TraceID().String(),
), func(c context.Context) {
    // Your actual work here
    result, err = processOrderInner(c, order)
})

In Grafana, configure the Pyroscope datasource to link from Tempo traces to flame graphs:

# Grafana datasource config
apiVersion: 1
datasources:
  - name: Pyroscope
    type: grafana-pyroscope-datasource
    url: http://pyroscope.monitoring:4040
    jsonData:
      backendType: pyroscope
  - name: Tempo
    type: tempo
    jsonData:
      tracesToProfiles:
        datasourceUid: pyroscope
        profileTypeId: "process_cpu:cpu:nanoseconds:cpu:nanoseconds"
        customQuery: false

9. Finding Memory Leaks

Memory leaks in long-running services are one of the best use cases for continuous profiling:

# Using Go's built-in pprof — get heap profile
go tool pprof -http=:8080 http://localhost:6060/debug/pprof/heap

# Capture allocation profile over 30 seconds
go tool pprof http://localhost:6060/debug/pprof/allocs?seconds=30

# Compare two heap profiles to find what grew
go tool pprof -base heap_before.pb.gz heap_after.pb.gz

# In pprof interactive mode:
# top20 — show top 20 functions by allocation
# list <funcname> — show line-level allocation for a function
# web — open the call graph in a browser (use -http=:8080 for the flame graph view)

With Pyroscope, you compare the memory profile before and after a deploy directly in the UI using the "diff" view. Red regions show where allocations increased.

10. Identifying Hot Paths

Hot path analysis workflow:

  1. Open the flame graph for the service over the past hour
  2. Look for wide plateaus at the top of the stacks — leaf functions where samples concentrate
  3. Check whether the hot path is in your code or in a library (read the package names in the frames)
  4. Drill into the specific function — Pyroscope shows a line-level breakdown
  5. Correlate with a deployment: did this hot path appear after a specific commit?

Common hot paths that are easy to miss:

- JSON marshaling in tight loops (use a streaming encoder or cache results)
- String concatenation with + instead of strings.Builder
- Regular expression compilation on every request (compile once, reuse)
- Unnecessary allocations in hot code paths (use sync.Pool)

Debug clue: If your Go service's flame graph shows a wide bar at runtime.mallocgc, you have an allocation-heavy hot path. Switch to the heap profile to find exactly which function is allocating. Common fixes: use sync.Pool for frequently allocated objects, pre-allocate slices with make([]T, 0, capacity), and avoid fmt.Sprintf in hot loops (use strconv instead).

11. Low-Overhead Profiling in Production

Production profiling overhead guidelines:

| Profile Type | Sampling Rate | Expected CPU Overhead |
| --- | --- | --- |
| CPU (pprof) | 100Hz | 1–3% |
| Heap | On GC | <1% |
| Allocs | Every 512KB allocated | 1–2% |
| Goroutine | Snapshot on scrape | <0.5% |
| Mutex | 1 in 10 contended ops | <1% |
| eBPF (Pyroscope) | 97Hz | 1–3% per node |

To reduce overhead further:

// In Go — set sampling rates explicitly
runtime.SetMutexProfileFraction(10) // sample ~1 in 10 mutex contention events
runtime.SetBlockProfileRate(1000)   // sample ~1 blocking event per 1000ns spent blocked

For Python, the overhead is higher (5–10%). Use a sampling interval of 10ms or longer in production:

pyroscope.configure(
    application_name="billing-service",
    server_address="http://pyroscope:4040",
    sample_rate=100,   # Hz — default 100, lower for less overhead
)

Quick Reference

| Task | Command / Tool |
| --- | --- |
| Start Pyroscope server | docker run -p 4040:4040 grafana/pyroscope |
| View Go pprof locally | go tool pprof -http=:8080 http://localhost:6060/debug/pprof/profile |
| Heap profile | go tool pprof http://localhost:6060/debug/pprof/heap |
| Goroutine profile | curl http://localhost:6060/debug/pprof/goroutine?debug=2 |
| All pprof endpoints | curl http://localhost:6060/debug/pprof/ |
| Diff two profiles | go tool pprof -base before.pb.gz after.pb.gz |
| CPU profile for 30s | go tool pprof "http://localhost:6060/debug/pprof/profile?seconds=30" |
| ParcaQL CPU query | profiles:type:cpu:nanoseconds:delta{service_name="svc"} |
| Install Pyroscope Helm | helm install pyroscope grafana/pyroscope -n monitoring |
