
Continuous Profiling — Street-Level Ops

Quick Diagnosis Commands

# Check if pprof is exposed on a Go service
curl -s http://localhost:6060/debug/pprof/ | grep -E "(cpu|heap|goroutine|mutex|block|allocs)"

# Grab a 30-second CPU profile and open in browser
go tool pprof -http=:8081 'http://localhost:6060/debug/pprof/profile?seconds=30'

# Goroutine dump — see all goroutines with stacks (no profiling needed)
curl -s 'http://localhost:6060/debug/pprof/goroutine?debug=2' | head -200

# Count goroutines quickly (a leak shows thousands)
curl -s 'http://localhost:6060/debug/pprof/goroutine?debug=1' | head -5

# Heap summary without full profile
curl -s 'http://localhost:6060/debug/pprof/heap?debug=1' | head -30

# Check Pyroscope server health
curl -s http://pyroscope:4040/health

# Check Pyroscope ingestion rate (self-metrics)
curl -s http://pyroscope:4040/metrics | grep pyroscope_ingester

# List apps being profiled in Pyroscope
curl -s http://pyroscope:4040/api/apps | jq '.[].name'

# Check Parca server health
curl -s http://parca:7070/api/v1/health

# Check Parca agent is scraping
kubectl logs -n monitoring -l app=parca-agent --tail=50 | grep -E "(scrape|profile|error)"

Gotcha: pprof Endpoint Is Listening But Returns Empty Profiles

Rule: Importing net/http/pprof registers handlers on http.DefaultServeMux. If your service uses a custom mux (e.g., gorilla/mux, chi, gin), the handlers are registered but unreachable at your service's port.

Diagnosis:

# Check if the handlers are registered
curl -v http://localhost:8080/debug/pprof/
# 404 = custom mux, handlers not mounted
# 200 = working

// Fix A: Run pprof on a separate port using DefaultServeMux
import _ "net/http/pprof"  // side-effect import registers the handlers

go func() {
    log.Println(http.ListenAndServe(":6060", nil))  // nil = DefaultServeMux
}()

// Fix B: Mount pprof handlers on your custom mux (chi example)
import "net/http/pprof"

r.HandleFunc("/debug/pprof/*", pprof.Index)  // wildcard so /debug/pprof/heap etc. resolve
r.HandleFunc("/debug/pprof/cmdline", pprof.Cmdline)
r.HandleFunc("/debug/pprof/profile", pprof.Profile)
r.HandleFunc("/debug/pprof/symbol", pprof.Symbol)
r.HandleFunc("/debug/pprof/trace", pprof.Trace)


Gotcha: CPU Profile Shows 100% in runtime.gcBgMarkWorker

Rule: When your CPU flame graph is dominated by runtime.gcBgMarkWorker, runtime.mallocgc, or runtime.scanobject, the bottleneck is the garbage collector, not your application logic. The real problem is allocation rate — find where you are allocating, not where GC is running.

# Get allocation profile — find the actual source of allocations
go tool pprof -http=:8081 'http://localhost:6060/debug/pprof/allocs?seconds=30'
# ?seconds=30 returns a delta profile: only allocations made during the window
# In the UI: "alloc_space" = bytes allocated, "inuse_space" = bytes still live in the heap

# Quick top allocators in CLI
go tool pprof http://localhost:6060/debug/pprof/heap
(pprof) top20 -cum
(pprof) list processOrder    # line-level detail for a specific function

Common fixes:

  • Use sync.Pool for frequently allocated/freed objects
  • Preallocate slices with make([]T, 0, expectedCap)
  • Use strings.Builder instead of + for string concatenation in loops
  • Avoid converting []byte to string unnecessarily
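For example, the strings.Builder and preallocation fixes together — a sketch where joinIDs is a hypothetical helper, not from any real service:

```go
package main

import (
	"fmt"
	"strings"
)

// joinIDs grows a single buffer in place instead of allocating a brand-new
// string on every += iteration, which is what kills the allocator in loops.
func joinIDs(ids []string) string {
	var b strings.Builder
	b.Grow(len(ids) * 8) // preallocate a rough estimate, same idea as make([]T, 0, cap)
	for i, id := range ids {
		if i > 0 {
			b.WriteByte(',')
		}
		b.WriteString(id)
	}
	return b.String()
}

func main() {
	fmt.Println(joinIDs([]string{"a1", "b2", "c3"})) // prints "a1,b2,c3"
}
```

With += the loop is O(n²) in bytes copied; the builder version is a single growing buffer.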

Remember: CPU profile dominated by GC? Look at the allocs profile, not the CPU profile. The CPU profile shows where time is spent (GC), but the allocs profile shows the root cause (who is allocating). Fix the allocator and the GC pressure disappears.
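The sync.Pool fix from the list above, sketched with a hypothetical renderOrder hot path:

```go
package main

import (
	"bytes"
	"fmt"
	"sync"
)

// bufPool hands out reusable buffers instead of allocating a fresh one per
// request — the per-call allocations are exactly what shows up as GC pressure.
var bufPool = sync.Pool{
	New: func() any { return new(bytes.Buffer) },
}

// renderOrder is a hypothetical hot path: borrow a buffer, use it, return it.
func renderOrder(id string) string {
	buf := bufPool.Get().(*bytes.Buffer)
	defer func() {
		buf.Reset() // drop contents, keep capacity, then recycle
		bufPool.Put(buf)
	}()
	buf.WriteString("order:")
	buf.WriteString(id)
	return buf.String()
}

func main() {
	fmt.Println(renderOrder("42")) // prints "order:42"
}
```

Pooled objects must be reset before reuse; forgetting the Reset is the classic sync.Pool bug.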


Pattern: Diff Profiles to Catch Regressions

When a deploy causes a performance regression visible in metrics, use differential profiling to find the cause:

# Before the deploy — capture a baseline profile (quote the URL so ? survives the shell)
go tool pprof -proto 'http://prod-service-old:6060/debug/pprof/profile?seconds=60' > before.pb.gz

# After the deploy — capture a comparison profile
go tool pprof -proto 'http://prod-service-new:6060/debug/pprof/profile?seconds=60' > after.pb.gz

# Diff them — red = worse, blue = better
go tool pprof -http=:8081 -base before.pb.gz after.pb.gz

In Pyroscope UI:

  1. Select "Comparison" view
  2. Left panel: time range before deploy
  3. Right panel: time range after deploy
  4. Switch to "Diff" mode — red functions got worse, blue got better


Scenario: Goroutine Leak Investigation

A service's memory is growing slowly and never drops after traffic subsides. Goroutine count is climbing.

# Step 1: Confirm goroutine count is elevated
curl -s 'http://prod-svc:6060/debug/pprof/goroutine?debug=1' | head -3
# goroutine profile: total 4821   <-- way too many for this service

# Step 2: Get full goroutine dump
curl -s 'http://prod-svc:6060/debug/pprof/goroutine?debug=2' > goroutine_dump.txt

# Step 3: Find the most common stacks — the debug=1 profile already groups
# identical stacks, prefixed by how many goroutines share each one
curl -s 'http://prod-svc:6060/debug/pprof/goroutine?debug=1' | grep -E '^[0-9]+ @' | sort -rn | head -20
# Look for goroutines waiting on: channel receive, net/http, database/sql

# Step 4: With Pyroscope — look at goroutine profile over time
# Filter by service, select goroutine profile type
# Look for stacks that are growing in the timeline

# Step 5: Common culprits — count goroutines by wait state from the header lines
grep -oE '^goroutine [0-9]+ \[[^]]+\]' goroutine_dump.txt | sed -E 's/^goroutine [0-9]+ //' | sort | uniq -c | sort -rn
# Leaks pile up in [chan receive], [semacquire], or [select]

Typical goroutine leak patterns:

  • HTTP client requests without context timeouts (goroutine waits forever)
  • Goroutines writing to an unbuffered channel with no reader
  • time.Tick() in a short-lived function (the underlying ticker is never stopped and, before Go 1.23, never GC'd — use time.NewTicker and stop it)
  • Database connection pool exhausted — new goroutines queue waiting

Debug clue: A goroutine count that climbs linearly with traffic and never drops is almost always a leaked HTTP client or database connection. The goroutine dump will show hundreds of identical stacks blocked on chan receive or semacquire — the stack trace tells you exactly which resource pool is exhausted.

// Fix: always use context with deadline for HTTP clients
ctx, cancel := context.WithTimeout(ctx, 30*time.Second)
defer cancel()
resp, err := client.Do(req.WithContext(ctx))

// Fix: use time.NewTicker, not time.Tick, and stop it
ticker := time.NewTicker(5 * time.Second)
defer ticker.Stop()

Pattern: Pyroscope Push vs Pull Decision

Use push SDK when:          Use pull/eBPF when:
- Go, Python, JVM app       - Mixed or unknown languages
- Need line-level accuracy  - No code changes allowed
- Span-tagged profiles      - Profile all processes on node
- Dynamic tagging           - Sidecar/service mesh approach

Push SDK — dynamic labeling for multi-tenant profiles:

// Tag profiles by request type for filtering in UI
func handleRequest(w http.ResponseWriter, r *http.Request) {
    pyroscope.TagWrapper(r.Context(), pyroscope.Labels(
        "endpoint", r.URL.Path,
        "method", r.Method,
        "user_tier", getUserTier(r),
    ), func(ctx context.Context) {
        doWork(ctx, r)
    })
}

Pull mode — Pyroscope scraping pprof endpoints:

# pyroscope server scrape config
scrape-configs:
  - job-name: go-services
    enabled-profiles:
      - process_cpu
      - memory
      - mutex
      - block
      - goroutines
    static-configs:
      - targets:
          - order-service:6060
          - billing-service:6060
        labels:
          environment: production
    scrape-interval: 15s
    profile-path: /debug/pprof
    scheme: http

Gotcha: eBPF Profiler Requires Privileged Access

Rule: The Pyroscope eBPF agent and Parca agent require privileged: true and hostPID: true in Kubernetes. In restricted clusters with PodSecurityAdmission, you need a dedicated namespace with privileged policy, or a SecurityContext that grants CAP_SYS_ADMIN and CAP_PERFMON.

# Minimum security context for eBPF profiler
securityContext:
  privileged: true  # needed for BPF syscalls
  # OR if using capabilities (kernel >= 5.8):
  capabilities:
    add:
      - SYS_ADMIN
      - SYS_PTRACE
      - PERFMON
      - BPF

# Namespace must allow privileged pods
apiVersion: v1
kind: Namespace
metadata:
  name: monitoring
  labels:
    pod-security.kubernetes.io/enforce: privileged

If you cannot grant privileged access, use push SDKs or pull-mode pprof scraping instead.


Scenario: Finding a Memory Leak with Pyroscope

Production service's memory grows from 200MB to 2GB over 6 hours and triggers OOMKilled.

# Step 1: Check if the growth is in Go heap or OS memory
kubectl top pod order-service-xxx -n production
# If RSS >> heap, check for CGo allocations or mmap usage

# Step 2: In Pyroscope UI
# - Select service: order-service
# - Profile type: memory:inuse_space
# - Look at the last 6 hours — where does memory grow?

# Step 3: Identify the growing stack
# In Pyroscope diff view:
# Left = 30 min ago, Right = now
# Differential view shows what grew

# Step 4: Force a heap dump for detailed analysis
curl http://prod-svc:6060/debug/pprof/heap > heap_now.pb.gz
go tool pprof -http=:8081 heap_now.pb.gz
# View: inuse_space — what is currently allocated
# View: alloc_space — what has been allocated total (shows leaky code path)

# Step 5: Check for cache that never evicts
# Common: in-memory map used as cache with no TTL or size limit
grep -r "map\[string\]" internal/ | grep -v "_test.go"

Emergency: OOMKilled Service — Profiling Under Pressure

When a service keeps OOMKilling and you have minutes before the next crash:

# Get a heap profile NOW before next OOM
kubectl exec order-service-xxx -- curl -s localhost:6060/debug/pprof/heap > /tmp/heap_$(date +%s).pb.gz
# (no -it: a TTY mangles binary output)

# If exec is unavailable, use port-forward
kubectl port-forward pod/order-service-xxx 6060:6060 &
go tool pprof -proto http://localhost:6060/debug/pprof/heap > /tmp/heap_$(date +%s).pb.gz

# Get goroutine count (goroutine leaks also cause memory growth)
curl -s localhost:6060/debug/pprof/goroutine?debug=1 | head -3

# Temporarily increase memory limit to buy time for investigation
kubectl set resources deployment/order-service --limits=memory=4Gi -n production

# Analyze offline
go tool pprof -http=:8081 /tmp/heap_*.pb.gz

Pattern: Profile-Guided Optimization Workflow

  1. Enable continuous profiling in staging with same traffic pattern as production (replay logs or use shadow traffic)
  2. Run load test for 30 minutes — capture baseline profiles
  3. Identify the top 3 flame graph hot paths
  4. Optimize each path (use benchmark tests to validate locally)
  5. Re-run load test — compare profiles before/after
  6. Deploy to production — confirm improvement in Pyroscope timeline
# Run Go benchmark with CPU profile
go test -bench=BenchmarkProcessOrder -benchtime=30s \
    -cpuprofile=cpu.prof \
    -memprofile=mem.prof \
    ./internal/order/

# Analyze benchmark profiles
go tool pprof -http=:8081 cpu.prof
go tool pprof -http=:8081 mem.prof

Useful One-Liners

# Watch goroutine count in real time
watch -n2 'curl -s localhost:6060/debug/pprof/goroutine?debug=1 | head -3'

# Export all pprof profiles from a pod to local disk
for profile in heap goroutine threadcreate block mutex allocs; do
    curl -s "http://localhost:6060/debug/pprof/$profile" > "${profile}_$(date +%s).pb.gz"
done

# Quick memory summary without profiling tool
curl -s 'http://localhost:6060/debug/pprof/heap?debug=1' | grep -E "^# (HeapSys|HeapAlloc|HeapInuse|HeapIdle)"

# Top Pyroscope apps by ingestion rate
curl -s http://pyroscope:4040/metrics | grep pyroscope_ingester_chunks_created_total | sort -k2 -rn | head -10

# Port-forward Pyroscope UI
kubectl port-forward svc/pyroscope 4040:4040 -n monitoring

# Port-forward Parca UI
kubectl port-forward svc/parca 7070:7070 -n monitoring

# Force GC then take heap snapshot (for accurate inuse_space)
# There is no /debug/pprof/gc endpoint — the heap handler's gc=1 parameter runs a GC first
curl -s 'http://localhost:6060/debug/pprof/heap?gc=1' > heap_post_gc.pb.gz