eBPF & Modern Linux Observability Footguns
Mistakes that turn your observability tools into performance problems, your tracing into noise, and your investigations into dead ends.
1. Running unfiltered traces on busy production servers
You run opensnoop without any filter on a server handling 10,000 requests per second. Every open() on the system is traced. The tracing overhead pushes CPU from 60% to 85%. You've just made the performance problem you were investigating worse.
Fix: Always filter by PID, process name, or specific syscall. opensnoop -p $(pgrep myapp) not opensnoop. Set a timeout: timeout 30 opensnoop -p PID. Capture to a file and analyze offline if the output is large.
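As a sketch (the tool path and the process name myapp are placeholders — BCC install locations and tool names vary by distro):

```shell
# Trace opens for a single process, with a hard 30-second stop.
# pgrep -n picks the newest matching PID; "myapp" is a placeholder name.
timeout 30 opensnoop -p "$(pgrep -n myapp)" > /tmp/opensnoop.out

# Analyze offline instead of watching the live firehose:
sort /tmp/opensnoop.out | uniq -c | sort -rn | head
```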
2. Misreading biolatency as "disk is fine"
biolatency shows a clean histogram with all I/O under 1ms. You declare the disk is not the problem. But your application uses NFS, and NFS latency doesn't show up in block I/O tracing. The application is waiting 200ms for network filesystem operations.
Fix: Match the tracing tool to the storage layer. Block device I/O: biolatency. Local filesystem: ext4slower/xfsslower. Network filesystem: nfsslower. Always ask: "what storage layer is my application actually using?"
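One way to match tool to layer, sketched here with BCC's default install path on Ubuntu and a placeholder data directory (/var/lib/myapp):

```shell
# First, find out what filesystem actually backs the app's data directory:
findmnt -T /var/lib/myapp

# Then pick the tracer for that layer:
sudo /usr/share/bcc/tools/biolatency 10 1   # block device I/O, one 10s summary
sudo /usr/share/bcc/tools/ext4slower 10     # ext4 ops slower than 10ms
sudo /usr/share/bcc/tools/nfsslower 10      # NFS ops slower than 10ms
```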
3. Installing kernel headers for the wrong kernel version
You install linux-headers-generic but your running kernel is a specific version that doesn't match. BCC fails to compile its modules. You waste 30 minutes of an active incident on tooling issues.
Fix: Always use linux-headers-$(uname -r) (the exact running kernel). If the exact version isn't available, you need to reboot into a kernel that has matching headers. Better yet: include headers in your base image so they're always available.
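On an apt-based distro that looks like the following (on RHEL-family systems the package is kernel-devel instead):

```shell
# Install headers for the exact running kernel, never a -generic metapackage:
sudo apt-get install -y "linux-headers-$(uname -r)"

# Verify BCC will find them before you need them mid-incident:
ls "/lib/modules/$(uname -r)/build" >/dev/null && echo "headers present"
```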
4. Forgetting eBPF needs capabilities in containers
You deploy a troubleshooting container with BCC tools. You exec into it and try to run execsnoop. "Operation not permitted." Your container lacks CAP_BPF and CAP_SYS_ADMIN. In a Kubernetes pod with a restricted security context, eBPF tools simply won't work.
Fix: Pre-build a privileged debug pod template for your cluster:
# debug-pod.yaml — ready for eBPF tracing
securityContext:
  capabilities:
    add: ["SYS_ADMIN", "BPF", "PERFMON"]
5. Tracing the symptom instead of the cause
TCP retransmissions are high. You spend an hour tracing the network stack with eBPF. Detailed packet-level analysis shows retransmissions on specific connections. You escalate to the network team. After two days, someone notices the receiving application has a full receive buffer because it's blocked on a database query. The network was fine the entire time.
Fix: Follow the triage sequence: CPU scheduling first, then disk, then network, then application. Most "network" problems are actually application problems manifesting as TCP backpressure. Check ss -tnp receive queue sizes before diving deep into network tracing.
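A quick receive-queue check before any packet-level tracing; if Recv-Q stays non-zero across repeated runs, the receiver isn't draining its socket:

```shell
# Column 2 is Recv-Q: bytes received by the kernel but not yet read by
# the application. Keep the header row, show only backed-up connections.
ss -tnp | awk 'NR==1 || $2 > 0'
```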
6. Leaving tracing probes attached in production
You attach a bpftrace probe to investigate an issue. The issue resolves. You close your SSH session. But the bpftrace process is still running in the background because you ran it with nohup or screen. It's been tracing every syscall for three weeks, consuming CPU and memory for a ring buffer nobody is reading.
Fix: Always run tracing tools with timeout. Never nohup or background them. After an investigation, verify nothing is left loaded: bpftool prog list shows every eBPF program in the kernel. Probes attached by bpftrace are removed automatically when the owning process exits, so the cleanup for an orphaned trace is to kill that process.
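A cleanup check, assuming bpftool is installed (shipped in the bpftool or linux-tools package on most distros):

```shell
# Every eBPF program currently loaded in the kernel:
sudo bpftool prog list

# Probes attached by bpftrace vanish when their owning process exits,
# so find and kill any forgotten tracer rather than detaching by hand:
pgrep -a bpftrace
sudo pkill bpftrace    # only after confirming the probes are orphaned
```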
7. Expecting eBPF to work on kernel 3.x
Your fleet has a mix of kernel versions. Some servers run RHEL 7 with kernel 3.10. You try to deploy BCC tools fleet-wide. Half the fleet fails because kernel 3.10 predates usable eBPF: programs can't attach to kprobes or tracepoints, and BPF maps don't exist.
Fix: Know your minimum kernel requirements. BCC tools: kernel 4.1+ minimum, 4.9+ recommended. bpftrace: kernel 4.9+ minimum. BPF CO-RE: kernel 5.8+. For older kernels, fall back to perf, ftrace, or SystemTap.
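To check a host before deploying, something like this works (bpftool feature requires a reasonably recent bpftool; treat the grep pattern as illustrative):

```shell
uname -r    # 4.1+ for basic BCC, 4.9+ for bpftrace, 5.8+ for CO-RE

# Ask the running kernel what it actually supports:
sudo bpftool feature probe kernel | grep -iE 'kprobe|tracepoint' | head
```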
8. Using eBPF when simpler tools suffice
Application is slow. You immediately reach for bpftrace to write custom kernel probes. After 45 minutes of tracing, you discover the application is doing a full table scan on every request. top and strace -p PID -c would have shown this in 30 seconds.
Fix: Start simple. Check top, iostat, vmstat, ss first. These are instant and require no setup. Escalate to eBPF only when the simpler tools don't explain the behavior. eBPF is for "I can see the symptom but not the cause" situations, not for initial triage.
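The 30-second version of "start simple" (PID is a placeholder for the slow process):

```shell
top -b -n 1 | head -15    # load, CPU hogs, memory pressure at a glance
iostat -x 1 2             # per-device utilization and await
vmstat 1 2                # run queue length, swapping, context switches
ss -s                     # socket counts and TCP state summary

# Per-syscall time summary for one process; prints totals when timeout fires:
sudo timeout 10 strace -c -p "$PID"
```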
9. Tracing in production without alerting the team
You SSH into a production server and start running bpftrace probes. Another engineer sees unfamiliar processes consuming resources and panics. They kill your tracing, or worse, assume the server is compromised and start incident response.
Fix: Before tracing in production, communicate. Post in the team channel: "Running eBPF tracing on prod-web-03 for the next 15 minutes to investigate latency issue. PID will be [X]. No action needed." When done, confirm: "Tracing complete on prod-web-03, probes detached."
10. Building dashboards on eBPF metrics without understanding overhead
You discover eBPF-based Prometheus exporters and decide to export everything: per-syscall latency histograms, per-connection metrics, per-file-open counters. Your exporter now generates 50,000 time series per host. A single Prometheus scrape takes 10 seconds. Your monitoring is now a performance problem.
Fix: eBPF exporters should export aggregated metrics, not raw events. Export histograms and counters, not individual traces. Start with 5-10 key metrics (TCP retransmit rate, disk I/O latency p99, run queue latency p99). Add more only when you have a specific question they answer.
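Before wiring an exporter into Prometheus, count what it exposes; the address below is a placeholder for your exporter:

```shell
# Number of exposed time series (non-comment lines on the metrics page):
curl -s http://localhost:9100/metrics | grep -vc '^#'
```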