Portal | Level: L3: Advanced | Topics: eBPF, Linux Fundamentals | Domain: Linux
eBPF & Modern Linux Observability - Primer¶
Why This Matters¶
For decades, deep Linux observability meant either reading /proc counters (limited) or loading kernel modules (dangerous). eBPF changes the game entirely. It lets you run sandboxed programs inside the kernel without writing kernel modules, without rebooting, and without risking a panic. You get function-level tracing, network packet inspection, and performance profiling — all in production, with negligible overhead.
If you've ever stared at top during a performance issue and wished you could see which exact system calls were slow, which TCP connections were retransmitting, or which files were being opened — eBPF tools give you that visibility. This is L3 advanced territory, but every senior ops person needs to know what's available.
Core Concepts¶
Name origin: eBPF stands for "extended Berkeley Packet Filter." The original BPF was created by Steven McCanne and Van Jacobson at Lawrence Berkeley National Laboratory in 1992 for efficient packet filtering (used by tcpdump). In 2014, Alexei Starovoitov and Daniel Borkmann reimplemented BPF as a general-purpose in-kernel virtual machine, adding maps, helper functions, and a verifier. The "e" (extended) stuck even though modern eBPF has virtually nothing in common with the original BPF packet filter. The community now considers "eBPF" a standalone term, not an acronym.
1. What eBPF Actually Is¶
Traditional Kernel Instrumentation:
User Space Kernel Space
┌──────────┐ ┌──────────────────┐
│ Your App │──────▶│ System Calls │
└──────────┘ │ │
│ ??? (black box) │
│ │
│ Hardware │
└──────────────────┘
To see inside: load a kernel module (risky)
or recompile the kernel (impractical)
With eBPF:
User Space Kernel Space
┌──────────┐ ┌──────────────────┐
│ Your App │──────▶│ System Calls │
└──────────┘ │ ↕ eBPF probe │
┌──────────┐ │ Scheduler │
│ BCC/bpf- │◀──────│ ↕ eBPF probe │
│ trace │ │ Network stack │
└──────────┘ │ ↕ eBPF probe │
│ Block I/O │
│ ↕ eBPF probe │
└──────────────────┘
eBPF programs:
- Run in a kernel sandbox (verified before loading)
- Cannot crash the kernel (verifier guarantees safety)
- Can attach to thousands of kernel and user-space functions
- Negligible performance overhead (< 1% typical)
2. The eBPF Safety Model¶
| Property | Guarantee |
|---|---|
| Memory safety | Verifier checks all memory accesses before loading |
| Termination | Programs must provably terminate (bounded loops only) |
| Privilege | Requires CAP_BPF or root (not arbitrary user access) |
| Stack size | Limited to 512 bytes (forces efficient code) |
| Helper functions | Can only call approved kernel helpers |

Analogy: eBPF is to the kernel what JavaScript is to the browser. Before JavaScript, you had to ask the browser vendor to add new features. With JavaScript, you run custom code inside the browser's sandbox. Similarly, before eBPF, you had to wait for kernel developers to add new tracing features (or write a dangerous kernel module). With eBPF, you inject sandboxed programs into the running kernel on demand. The verifier is the equivalent of the browser's sandbox — it guarantees your code cannot crash the host.
eBPF Program Lifecycle:
1. Write program (C, or via BCC/bpftrace)
2. Compile to eBPF bytecode
3. Verifier checks safety (rejects unsafe programs)
4. JIT compiles to native machine code
5. Attaches to kernel hook point (kprobe, tracepoint, etc.)
6. Runs on every event at the hook point
7. Sends data to user space via maps or ring buffers
8. Detach when done (clean removal, no reboot)
3. BCC Tools — The Essential Toolkit¶
BCC (BPF Compiler Collection) provides ready-to-use tools for common observability tasks. These are production-safe and well-tested.
# Install BCC tools
# RHEL/CentOS
dnf install -y bcc-tools
# Ubuntu/Debian
apt install -y bpfcc-tools linux-headers-$(uname -r)
# Tools are typically in /usr/share/bcc/tools/ or available as commands
Process and Execution Tracing¶
| Tool | What It Shows | Use Case |
|---|---|---|
| execsnoop | Every new process execution | "What's spawning all these processes?" |
| opensnoop | Every file open() call | "Who's reading my config file?" |
| exitsnoop | Process exits with return code | "Why do processes keep dying?" |
| runqlat | Scheduler run queue latency | "Are processes waiting too long to run?" |
| cpudist | On-CPU time distribution | "How long are tasks actually running?" |
# Watch every process that starts (like a system-wide strace for exec)
execsnoop-bpfcc
# Output:
# PCOMM PID PPID RET ARGS
# bash 14501 14500 0 /bin/bash
# curl 14502 14501 0 /usr/bin/curl -s https://api.example.com
# Watch every file being opened
opensnoop-bpfcc
# Output:
# PID COMM FD ERR PATH
# 14500 nginx 7 0 /etc/nginx/nginx.conf
# 14500 nginx 8 0 /var/log/nginx/access.log
# Show scheduler queue latency (are CPUs overloaded?)
runqlat-bpfcc
# Output: histogram of time (usec) processes waited in the run queue
# If you see significant time > 1ms, your CPUs are saturated
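The saturation check above can be scripted. A minimal awk sketch that totals runqlat-style histogram buckets at or above 1ms (1024 usecs); the bucket lines in the heredoc are fabricated sample data, not real tool output:

```shell
# sum runqlat-style histogram events that waited >= 1ms (1024 usecs);
# the bucket lines below are made-up sample data
waited=$(awk -F'[ :>-]+' '/->/ { if ($2 >= 1024) n += $4 } END { print n+0 }' <<'EOF'
         0 -> 1          : 233
         2 -> 3          : 57
      1024 -> 2047       : 12
      2048 -> 4095       : 5
EOF
)
echo "waited >= 1ms: $waited events"
```

A persistently non-zero total during normal load is the "CPUs are saturated" signal described above.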
Network Tracing¶
| Tool | What It Shows | Use Case |
|---|---|---|
| tcpconnect | Every outbound TCP connection | "What's this process connecting to?" |
| tcpaccept | Every inbound TCP connection | "Who's connecting to this server?" |
| tcpretrans | TCP retransmissions | "Why is my network slow?" |
| tcplife | TCP session lifecycle (connect → close) | "How long do connections live?" |
| tcptracer | TCP connect/accept/close events | Combined network event stream |
# Show all outbound TCP connections as they happen
tcpconnect-bpfcc
# Output:
# PID COMM IP SADDR DADDR DPORT
# 14501 curl 4 10.0.1.50 93.184.216.34 443
# 14502 python 4 10.0.1.50 10.0.1.100 5432
# Show TCP retransmissions (network quality indicator)
tcpretrans-bpfcc
# Output:
# TIME PID IP LADDR:LPORT RADDR:RPORT STATE
# 14:21:53 0 4 10.0.1.50:443 10.0.2.30:52134 ESTABLISHED
# High retransmit rate = network congestion or packet loss
# Show TCP session lifetimes
tcplife-bpfcc
# Output:
# PID COMM IP LADDR LPORT RADDR RPORT TX_KB RX_KB MS
# 14500 nginx 4 10.0.1.50 443 10.0.2.1 52134 50 2 1523
# Long-lived connections with low throughput? Possible connection leak.
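The connection-leak signature called out above (long-lived sessions with no payload) can be pulled out of tcplife-style output with awk. A sketch; the column layout follows the sample output above and the two data rows are invented:

```shell
# flag tcplife-style rows that lived > 300s (MS column) with zero payload
# (TX_KB + RX_KB == 0); data rows are invented examples
leaks=$(awk 'NR > 1 && $10 > 300000 && ($8 + $9) == 0 { print $1, $2 }' <<'EOF'
PID   COMM  IP LADDR     LPORT RADDR     RPORT TX_KB RX_KB MS
9001  myapp 4  10.0.1.50 41000 10.0.1.9  6379  0     0     412345
14500 nginx 4  10.0.1.50 443   10.0.2.1  52134 50    2     1523
EOF
)
echo "possible leaks: $leaks"
```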
Disk and I/O Tracing¶
| Tool | What It Shows | Use Case |
|---|---|---|
| biolatency | Block I/O latency histogram | "Is my disk slow?" |
| biosnoop | Every block I/O operation | "What's reading/writing to disk?" |
| ext4slower | Slow ext4 filesystem operations | "Which file ops are taking too long?" |
| filetop | Top files by I/O | "What's generating all this disk activity?" |
| cachestat | Page cache hit/miss ratio | "Is my application hitting disk or cache?" |
# Show block I/O latency distribution
biolatency-bpfcc
# Output: histogram of disk operation latency
# If you see operations > 10ms frequently, your storage is struggling
# Show every disk I/O operation with latency
biosnoop-bpfcc
# Output:
# TIME(s) COMM PID DISK T SECTOR BYTES LAT(ms)
# 0.000 postgres 1234 sda R 12345678 4096 0.45
# 0.001 postgres 1234 sda W 12345680 8192 1.23
# Show which files are generating the most I/O
filetop-bpfcc
# Refreshes like top, showing files ranked by read/write bytes
# Page cache hit ratio (are reads hitting RAM or disk?)
cachestat-bpfcc
# Output:
# HITS MISSES DIRTIES HITRATIO BUFFERS_MB
# 15843 234 45 98.54% 1024
# Hit ratio < 90%? You may need more RAM or your working set is too large
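The HITRATIO column is just hits / (hits + misses). The awk below recomputes it from the sample HITS and MISSES counters shown above:

```shell
# recompute cachestat's HITRATIO from the sample counters above
hits=15843 misses=234
ratio=$(awk -v h="$hits" -v m="$misses" 'BEGIN { printf "%.2f%%", h / (h + m) * 100 }')
echo "hit ratio: $ratio"
```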
Fun fact: BCC was created by Brenden Blanco at PLUMgrid (later IOVisor). The tools were popularized by Brendan Gregg, who wrote most of the 100+ BCC tools in the repository while at Netflix. Gregg's 2019 book "BPF Performance Tools" is the definitive reference. The BCC tools are deliberately named after the questions they answer: execsnoop snoops exec calls, biolatency shows block I/O latency, tcpretrans traces TCP retransmissions.
4. bpftrace — One-Liners for Deep Tracing¶
bpftrace is a high-level tracing language for eBPF. Think of it as awk for kernel tracing.
# Install bpftrace
dnf install -y bpftrace # RHEL
apt install -y bpftrace # Ubuntu
# Count system calls by process name
bpftrace -e 'tracepoint:raw_syscalls:sys_enter { @[comm] = count(); }'
# Show which files a specific process is reading
bpftrace -e 'tracepoint:syscalls:sys_enter_openat /comm == "nginx"/ {
printf("%s opened %s\n", comm, str(args->filename));
}'
# Histogram of read() sizes by process
bpftrace -e 'tracepoint:syscalls:sys_exit_read /args->ret > 0/ {
@bytes[comm] = hist(args->ret);
}'
# Trace DNS lookups (getaddrinfo)
bpftrace -e 'uprobe:/lib/x86_64-linux-gnu/libc.so.6:getaddrinfo {
printf("%s is resolving DNS: %s\n", comm, str(arg0));
}'
# Show latency of disk I/O by disk device
bpftrace -e 'tracepoint:block:block_rq_issue { @start[args->dev, args->sector] = nsecs; }
tracepoint:block:block_rq_complete /@start[args->dev, args->sector]/ {
@usecs = hist((nsecs - @start[args->dev, args->sector]) / 1000);
delete(@start[args->dev, args->sector]);
}'
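The disk-latency one-liner above uses a common bpftrace pattern: stash a timestamp in a map on the "issue" event, compute the delta on "complete", then delete the map entry. The same logic replayed in awk over two fabricated (event, device, nsecs) records, just to make the pattern concrete:

```shell
# the @start[key] / delta / delete pattern from the bpftrace one-liner,
# replayed in awk over fabricated (event, device, nsecs) records
lat=$(printf 'issue sda 100\ncomplete sda 4100100\n' | awk '
  $1 == "issue"                     { start[$2] = $3 }
  $1 == "complete" && ($2 in start) {
    printf "%d", ($3 - start[$2]) / 1000   # nsecs -> usecs
    delete start[$2]
  }')
echo "sda latency: $lat usecs"
```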
# Trace why a process is waking up (useful for unexpected CPU usage)
bpftrace -e 'tracepoint:sched:sched_wakeup /args->comm == "myapp"/ {
printf("woke by: %s (PID %d)\n", comm, pid);
}'
5. Real Production Debugging Scenarios¶
Scenario: Mystery Latency Spikes¶
Symptom: API latency spikes to 5s every few minutes. CPU, memory, disk look fine.
Investigation with eBPF:
Step 1: Check if it's network
tcpretrans-bpfcc
→ Result: No retransmissions. Network is clean.
Step 2: Check if it's disk I/O
biolatency-bpfcc
→ Result: All I/O under 1ms. Disk is fast.
Step 3: Check if it's scheduling (CPU contention)
runqlat-bpfcc
→ Result: Spikes of 50-200ms run queue latency every 2 minutes.
Something is hogging the CPU and displacing our app.
Step 4: Find what's consuming CPU at spike time
bpftrace -e 'profile:hz:99 { @[comm] = count(); }'
→ Result: "logrotate" consuming massive CPU every 2 minutes.
Root cause: logrotate is compressing a 10GB log file in-place.
The compression consumes 100% of one CPU core for several seconds,
causing scheduler delays for other processes.
Fix: Move log compression to off-peak hours or use zstd (faster).
Scenario: Connection Leak¶
Symptom: Server gradually runs out of file descriptors. Connections build up.
Investigation with eBPF:
Step 1: See what's connecting
tcplife-bpfcc
→ Result: Thousands of connections from "myapp" to Redis,
all lasting > 300 seconds with zero bytes transferred.
Step 2: Confirm connections are idle
tcptracer-bpfcc
→ Result: Connections are being opened but never closed.
Root cause: Connection pool not returning connections after timeout.
Fix: Configure connection pool idle timeout and max lifetime.
6. When to Use What¶
Decision Tree: Which eBPF Tool?
"Something is slow"
→ Is it CPU? ──────────▶ runqlat, cpudist, profile
→ Is it disk? ─────────▶ biolatency, biosnoop, ext4slower
→ Is it network? ──────▶ tcpretrans, tcplife, tcpconnect
→ Not sure? ───────────▶ runqlat first (most common hidden cause)
"Something is happening that shouldn't be"
→ Unexpected processes? ▶ execsnoop
→ Unexpected file access? ▶ opensnoop
→ Unexpected connections? ▶ tcpconnect, tcpaccept
"I need to understand a specific process"
→ What syscalls? ───────▶ bpftrace tracepoint:syscalls:*
→ What files? ──────────▶ opensnoop -p PID
→ What network? ────────▶ tcpconnect -p PID
→ Custom tracing? ─────▶ bpftrace (write your own)
Common Pitfalls¶
Under the hood: BPF CO-RE (Compile Once, Run Everywhere), introduced in kernel 5.8, solved eBPF's biggest deployment pain point. Before CO-RE, eBPF programs had to be compiled on the target machine because they contained kernel struct offsets that varied between kernel versions. CO-RE uses BTF (BPF Type Format) metadata embedded in the kernel to relocate struct accesses at load time. This means you can compile an eBPF program once and run it on any kernel version that has BTF enabled — no headers required.
- Forgetting to install kernel headers — eBPF tools need headers matching your running kernel: apt install linux-headers-$(uname -r) or dnf install kernel-devel-$(uname -r).
- Running eBPF tools as non-root without CAP_BPF — Most tools need root or specific capabilities. Don't chmod them to SUID — use capabilities properly.
- Leaving tracing running in production — BCC tools have low overhead but not zero. Run them during investigation, not permanently. For continuous monitoring, use purpose-built eBPF-based exporters.
- Interpreting histograms without understanding the baseline — A biolatency histogram means nothing if you don't know what normal looks like. Capture baselines during healthy operation.
- Expecting eBPF on old kernels — Full eBPF support requires kernel 4.9+. Many advanced features need 5.x+. BPF CO-RE (compile once, run everywhere) needs 5.8+. Check your kernel version first.
- bpftrace one-liners that trace too broadly — Tracing all syscalls on a busy server generates massive output. Always filter by process name, PID, or specific syscall.
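The kernel-version pitfall above is easy to script. A minimal sketch (the ebpf_support helper name is mine; the 4.9 and 5.8 thresholds come from this section) that classifies a uname -r string:

```shell
# classify a kernel version string against the thresholds named above:
# 4.9+ for full eBPF, 5.8+ for BPF CO-RE (helper name is hypothetical)
ebpf_support() {
  local major=${1%%.*} rest=${1#*.} minor
  minor=${rest%%.*}
  if [ "$major" -gt 5 ] || { [ "$major" -eq 5 ] && [ "$minor" -ge 8 ]; }; then
    echo "CO-RE capable"
  elif [ "$major" -ge 5 ] || { [ "$major" -eq 4 ] && [ "$minor" -ge 9 ]; }; then
    echo "basic eBPF"
  else
    echo "too old for eBPF tracing"
  fi
}
ebpf_support "$(uname -r)"
```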
Wiki Navigation¶
Prerequisites¶
- Linux Ops (Topic Pack, L0)
- Observability Deep Dive (Topic Pack, L2)
Next Steps¶
- Runtime Security with Falco (Topic Pack, L2)
Related Content¶
- Linux Performance Tuning (Topic Pack, L2) — eBPF, Linux Fundamentals
- /proc Filesystem (Topic Pack, L2) — Linux Fundamentals
- Advanced Bash for Ops (Topic Pack, L1) — Linux Fundamentals
- Adversarial Interview Gauntlet (30 sequences) (Scenario, L2) — Linux Fundamentals
- Bash Exercises (Quest Ladder) (CLI) (Exercise Set, L0) — Linux Fundamentals
- Case Study: CI Pipeline Fails — Docker Layer Cache Corruption (Case Study, L2) — Linux Fundamentals
- Case Study: Container Vuln Scanner False Positive Blocks Deploy (Case Study, L2) — Linux Fundamentals
- Case Study: Disk Full Root Services Down (Case Study, L1) — Linux Fundamentals
- Case Study: Disk Full — Runaway Logs, Fix Is Loki Retention (Case Study, L2) — Linux Fundamentals
- Case Study: HPA Flapping — Metrics Server Clock Skew, Fix Is NTP (Case Study, L2) — Linux Fundamentals