Continuous Profiling Footguns¶
1. Exposing pprof on the Public Port¶
Your service exports pprof handlers on the same port as the production API (0.0.0.0:8080/debug/pprof/). Anyone who can reach your service can now extract a full CPU profile, heap dump with live memory contents, and goroutine stacks including credentials held in variables.
Fix: Run pprof on a separate internal-only port (localhost:6060). Never bind pprof to 0.0.0.0 in production unless you are behind a network policy or authentication middleware. In Kubernetes, never add the pprof port to the Service spec — use kubectl port-forward for access.
2. Profiling Without Frame Pointers — Empty Flame Graphs¶
You enable the Pyroscope eBPF agent and get flame graphs that are a flat list of [unknown] frames. Stack unwinding fails silently because the binaries were compiled without frame pointers.
Fix: Go enables frame pointers by default on amd64 (since Go 1.7) and arm64 (since Go 1.21), so recent Go binaries unwind cleanly without extra flags. For C/C++/Rust, compile with -fno-omit-frame-pointer (Rust: -C force-frame-pointers=yes). For the JVM, use -XX:+PreserveFramePointer. Verify with: objdump -d your-binary | grep -c "push.*rbp" — a nonzero count confirms frame pointers are present.
3. Using time.Tick Instead of time.NewTicker¶
time.Tick is a convenience wrapper that returns a ticker channel with no way to stop the underlying ticker. Before Go 1.23, tickers created this way are never garbage collected, so calling time.Tick inside a function that runs on every request leaks one runtime timer per call. These timers accumulate and show up as steadily growing heap and timer counts in profiles.
Fix: Always use time.NewTicker(d) and call ticker.Stop() when done, typically with defer. Reserve time.Tick only for program-lifetime tickers at package level.
4. Misreading Flat vs Cumulative in pprof¶
You open a CPU profile, see json.Marshal at the top of the flat view, and immediately optimize your JSON serialization. Flat view shows time spent in the function itself. But json.Marshal is there because it was called by dozens of other functions — your HTTP handler, your cache layer, your logging middleware. The real problem is calling json.Marshal in a hot loop inside your cache layer, which flat view buries.
Fix: Always look at both views. Flat: where the CPU was at sample time (leaf functions). Cumulative: total time including callees (the entire subtree). Sort by cumulative to find the entry points driving load, sort by flat to find the leaf functions where cycles burn. Use the flame graph to understand both dimensions at once.
5. Capturing Profiles During Low-Traffic Periods¶
You schedule profiling captures overnight when traffic is low. The profiles show almost nothing because the service is idle. You conclude there are no performance issues. Production runs 20x the traffic during business hours and has a clear hot path you never see.
Fix: Continuous profiling solves this automatically — it always captures. For one-shot profiling, always capture under representative load. Run your load test or replay production traffic while profiling. For Pyroscope, look at the time range that corresponds to peak traffic hours, not the current moment.
6. Not Setting GOMAXPROCS for Container Environments¶
Go uses the number of OS CPUs to set GOMAXPROCS. In a container limited to 0.5 CPU, GOMAXPROCS defaults to the host's 32-core count. Go spins up 32 OS threads competing for 0.5 CPU, causing excessive context switching. Your CPU profile shows runtime.schedule and runtime.lock dominating — not your application code. You optimize the wrong thing.
Fix: Use the automaxprocs library, which reads the cgroup CPU quota at startup and caps GOMAXPROCS to match:
import _ "go.uber.org/automaxprocs"
Alternatively, set the GOMAXPROCS environment variable explicitly to the container's CPU limit, rounded up to at least 1.
7. Treating Profiling Overhead as Negligible Without Measuring¶
You assume the Pyroscope SDK adds negligible overhead and enable all 8 profile types (CPU, alloc_objects, alloc_space, inuse_objects, inuse_space, goroutines, mutex, block) at default rates. In practice, mutex profiling at runtime.SetMutexProfileFraction(1) samples every contended mutex operation. Under high concurrency, this can add 5–15% CPU overhead.
Fix: Enable only the profile types you actively use. For a typical production Go service: CPU and heap are essential, goroutines are cheap. Mutex and block profiling should use fractional sampling: runtime.SetMutexProfileFraction(10) (1 in 10 contention events) and runtime.SetBlockProfileRate(10000) (one sample per 10,000 ns spent blocked). Measure overhead in staging before enabling in production.
8. Losing Profiles During Horizontal Pod Autoscaling¶
Your service HPA scales from 2 to 20 pods under load, then back to 2. The 18 pods that scaled down during the traffic spike are gone. If you were not running a centralized profiler (Pyroscope server or Parca), all those profiles are lost. The performance issue happened on a pod that no longer exists.
Fix: Continuous profiling only works for post-hoc analysis if profiles are shipped to a persistent store in real time. Use push-mode SDKs that write to a central server, or pull-mode agents that scrape and forward continuously. Never rely on pulling profiles from pods — pods are ephemeral. Validate that profiles are flowing to the server before relying on them for incident response.
9. Confusing PGO (Profile-Guided Optimization) with Runtime Profiling¶
A developer reads about Go's PGO feature, collects a CPU profile, feeds it into the build, and considers the performance problem solved. PGO improves compiled binary performance by 2–10% by optimizing inlining decisions. It does not fix algorithmic problems, memory leaks, or goroutine leaks. The profile used for PGO and the profile used for ops debugging serve different purposes.
Fix: Use continuous profiling for ongoing operational diagnosis. Use PGO profiles (representative production workload profiles) as a build-time input to improve compiler decisions. These are complementary: PGO → compile-time optimization; continuous profiling → runtime diagnosis. PGO alone will not fix a goroutine leak or a cache-busting memory allocation.
10. Running the Parca Agent Without Matching Kernel Version¶
Parca and Pyroscope eBPF agents use BPF CO-RE (Compile Once, Run Everywhere) but still have minimum kernel version requirements. Deploying on an old 4.x kernel (as found on older Amazon Linux AMIs) produces silent failures — the agent starts, reports healthy, but collects nothing. No error in logs, no profiles in the UI.
Fix: Check kernel version before deploying eBPF agents:
uname -r
# Pyroscope eBPF: requires >= 4.14 (basic), >= 5.8 for CO-RE features
# Parca Agent: requires >= 5.3 for BPF ringbuf, >= 5.8 for full feature set
11. Not Enabling Profiling in Staging Because "It Costs Too Much"¶
Teams disable continuous profiling in staging to cut costs, then only enable it in production. Performance regressions that would be caught in staging (before they affect customers) go undetected. When a regression ships to production and causes an incident, the team has no pre-regression baseline to compare against.
Fix: Run continuous profiling in all environments. Use a smaller retention window in staging (1–3 days vs 30 days in production) to keep costs manageable. The value of catching a regression before it hits production far exceeds the profiler hosting cost. Treat profiling infrastructure the same as monitoring infrastructure — never skip it in staging.
12. Forgetting That Flame Graphs Only Show Sampled Stacks¶
A function that causes problems only 0.1% of the time (a rare lock contention, a specific code path under unusual input) will not appear in a CPU flame graph with 100Hz sampling and a 30-second capture window. You declare the service "clean" because the flame graph looks healthy, missing a latency issue that affects a small but important subset of requests.
Fix: CPU flame graphs are statistical — they show what happens frequently, not what happens occasionally. For tail latency issues (p99, p999), combine profiling with tracing. Use Pyroscope + Tempo exemplar linking to find the specific trace that was slow, then examine the matching profile. For rare code paths, add explicit timing metrics rather than relying on sampling.