Portal | Level: L3: Advanced | Topics: eBPF, Linux Fundamentals | Domain: Linux

eBPF & Modern Linux Observability - Primer

Why This Matters

For decades, deep Linux observability meant either reading /proc counters (limited) or loading kernel modules (dangerous). eBPF changes the game entirely. It lets you run sandboxed programs inside the kernel without writing kernel modules, without rebooting, and without risking a panic. You get function-level tracing, network packet inspection, and performance profiling — all in production, with negligible overhead.

If you've ever stared at top during a performance issue and wished you could see which exact system calls were slow, which TCP connections were retransmitting, or which files were being opened — eBPF tools give you that visibility. This is L3 advanced territory, but every senior ops person needs to know what's available.

Core Concepts

Name origin: eBPF stands for "extended Berkeley Packet Filter." The original BPF was created by Steven McCanne and Van Jacobson at Lawrence Berkeley National Laboratory in 1992 for efficient packet filtering (used by tcpdump). In 2014, Alexei Starovoitov and Daniel Borkmann reimplemented BPF as a general-purpose in-kernel virtual machine, adding maps, helper functions, and a verifier. The "e" (extended) stuck even though modern eBPF has virtually nothing in common with the original BPF packet filter. The community now considers "eBPF" a standalone term, not an acronym.

1. What eBPF Actually Is

Traditional Kernel Instrumentation:

  User Space          Kernel Space
  ┌──────────┐        ┌──────────────────┐
  │ Your App │───────▶│ System Calls     │
  └──────────┘        │                  │
                      │ ??? (black box)  │
                      │                  │
                      │ Hardware         │
                      └──────────────────┘

  To see inside: load a kernel module (risky)
                 or recompile the kernel (impractical)

With eBPF:

  User Space          Kernel Space
  ┌──────────┐        ┌──────────────────┐
  │ Your App │───────▶│ System Calls     │
  └──────────┘        │   ↕ eBPF probe   │
  ┌──────────┐        │ Scheduler        │
  │ BCC/bpf- │◀───────│   ↕ eBPF probe   │
  │ trace    │        │ Network stack    │
  └──────────┘        │   ↕ eBPF probe   │
                      │ Block I/O        │
                      │   ↕ eBPF probe   │
                      └──────────────────┘

  eBPF programs:
  - Run in a kernel sandbox (verified before loading)
  - Cannot crash the kernel (verifier guarantees safety)
  - Can attach to thousands of kernel and user-space functions
  - Negligible performance overhead (< 1% typical)

2. The eBPF Safety Model

| Property | Guarantee |
|----------|-----------|
| Memory safety | Verifier checks all memory accesses before loading |
| Termination | Programs must provably terminate (bounded loops only) |
| Privilege | Requires CAP_BPF or root (not arbitrary user access) |
| Stack size | Limited to 512 bytes (forces efficient code) |
| Helper functions | Can only call approved kernel helpers |

Analogy: eBPF is to the kernel what JavaScript is to the browser. Before JavaScript, you had to ask the browser vendor to add new features. With JavaScript, you run custom code inside the browser's sandbox. Similarly, before eBPF, you had to wait for kernel developers to add new tracing features (or write a dangerous kernel module). With eBPF, you inject sandboxed programs into the running kernel on demand. The verifier is the equivalent of the browser's sandbox — it guarantees your code cannot crash the host.

eBPF Program Lifecycle:

  1. Write the program (in C, or via BCC/bpftrace)
  2. Compile it to eBPF bytecode
  3. The verifier checks safety (unsafe programs are rejected)
  4. The JIT compiles the bytecode to native machine code
  5. The program attaches to a kernel hook point (kprobe, tracepoint, etc.)
  6. It runs on every event at the hook point
  7. It sends data to user space via maps or ring buffers
  8. Detach when done (clean removal, no reboot)
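Once a program is loaded (steps 5-7), you can watch this lifecycle from user space with bpftool, which most distributions ship in a bpftool or linux-tools package. A guarded sketch — `ebpf_inspect` is a wrapper name invented here, and real output requires root:

```shell
# Observe loaded eBPF programs and their maps with bpftool.
# Guarded so the snippet degrades gracefully when bpftool is absent.
ebpf_inspect() {
    if command -v bpftool >/dev/null 2>&1; then
        bpftool prog list    # loaded programs: ID, type, name, attach info
        bpftool map list     # maps used to pass data to user space (step 7)
    else
        echo "bpftool not installed (usually in the bpftool or linux-tools package)"
    fi
    return 0
}

ebpf_inspect
```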

3. BCC Tools — The Essential Toolkit

BCC (BPF Compiler Collection) provides ready-to-use tools for common observability tasks. These are production-safe and well-tested.

# Install BCC tools
# RHEL/CentOS
dnf install -y bcc-tools

# Ubuntu/Debian
apt install -y bpfcc-tools linux-headers-$(uname -r)

# Tools are typically in /usr/share/bcc/tools/ or available as commands
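Because the same tool can appear under three names depending on distro (plain on RHEL, with a -bpfcc suffix on Debian/Ubuntu, or as a script under /usr/share/bcc/tools/), a small resolver helps scripts stay portable. A sketch — `bcc_tool` is a helper name invented here:

```shell
# Resolve whichever form of a BCC tool is installed on this system.
bcc_tool() {
    for cand in "$1" "$1-bpfcc" "/usr/share/bcc/tools/$1"; do
        if command -v "$cand" >/dev/null 2>&1 || [ -x "$cand" ]; then
            echo "$cand"
            return 0
        fi
    done
    echo "BCC tool '$1' not found; install bcc-tools or bpfcc-tools" >&2
    return 1
}

# Usage (as root): "$(bcc_tool execsnoop)"
```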

Process and Execution Tracing

| Tool | What It Shows | Use Case |
|------|---------------|----------|
| execsnoop | Every new process execution | "What's spawning all these processes?" |
| opensnoop | Every file open() call | "Who's reading my config file?" |
| exitsnoop | Process exits with return code | "Why do processes keep dying?" |
| runqlat | Scheduler run queue latency | "Are processes waiting too long to run?" |
| cpudist | On-CPU time distribution | "How long are tasks actually running?" |

# Watch every process that starts (like a system-wide strace for exec)
execsnoop-bpfcc
# Output:
# PCOMM  PID    PPID   RET ARGS
# bash   14501  14500    0 /bin/bash
# curl   14502  14501    0 /usr/bin/curl -s https://api.example.com

# Watch every file being opened
opensnoop-bpfcc
# Output:
# PID    COMM     FD ERR PATH
# 14500  nginx     7   0 /etc/nginx/nginx.conf
# 14500  nginx     8   0 /var/log/nginx/access.log

# Show scheduler queue latency (are CPUs overloaded?)
runqlat-bpfcc
# Output: histogram of time (usec) processes waited in the run queue
# If you see significant time > 1ms, your CPUs are saturated

Network Tracing

| Tool | What It Shows | Use Case |
|------|---------------|----------|
| tcpconnect | Every outbound TCP connection | "What's this process connecting to?" |
| tcpaccept | Every inbound TCP connection | "Who's connecting to this server?" |
| tcpretrans | TCP retransmissions | "Why is my network slow?" |
| tcplife | TCP session lifecycle (connect → close) | "How long do connections live?" |
| tcptracer | TCP connect/accept/close events | Combined network event stream |

# Show all outbound TCP connections as they happen
tcpconnect-bpfcc
# Output:
# PID    COMM     IP SADDR        DADDR        DPORT
# 14501  curl     4  10.0.1.50    93.184.216.34 443
# 14502  python   4  10.0.1.50    10.0.1.100    5432

# Show TCP retransmissions (network quality indicator)
tcpretrans-bpfcc
# Output:
# TIME     PID  IP LADDR:LPORT   RADDR:RPORT   STATE
# 14:21:53 0    4  10.0.1.50:443 10.0.2.30:52134 ESTABLISHED
# High retransmit rate = network congestion or packet loss

# Show TCP session lifetimes
tcplife-bpfcc
# Output:
# PID   COMM     IP LADDR     LPORT RADDR     RPORT TX_KB RX_KB MS
# 14500 nginx    4  10.0.1.50 443   10.0.2.1  52134 50    2     1523
# Long-lived connections with low throughput? Possible connection leak.
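The leak pattern called out in that last comment (long-lived sessions moving almost no data) can be flagged mechanically from captured tcplife output. A hypothetical awk post-processing sketch — `flag_leaks` is a helper invented here, and the column layout is assumed from the sample above:

```shell
# Flag sessions alive for more than 60s that moved almost no data.
# Columns assumed: PID COMM IP LADDR LPORT RADDR RPORT TX_KB RX_KB MS
flag_leaks() {
    awk '$10 ~ /^[0-9.]+$/ && $10 > 60000 && ($8 + $9) < 4 {
        printf "possible leak: %s (PID %s) -> %s:%s idle %d ms\n",
               $2, $1, $6, $7, $10
    }'
}

# Example on one captured line:
printf '14500 myapp 4 10.0.1.50 34512 10.0.1.100 6379 0 0 312000\n' | flag_leaks
# -> possible leak: myapp (PID 14500) -> 10.0.1.100:6379 idle 312000 ms
```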

Disk and I/O Tracing

| Tool | What It Shows | Use Case |
|------|---------------|----------|
| biolatency | Block I/O latency histogram | "Is my disk slow?" |
| biosnoop | Every block I/O operation | "What's reading/writing to disk?" |
| ext4slower | Slow ext4 filesystem operations | "Which file ops are taking too long?" |
| filetop | Top files by I/O | "What's generating all this disk activity?" |
| cachestat | Page cache hit/miss ratio | "Is my application hitting disk or cache?" |

# Show block I/O latency distribution
biolatency-bpfcc
# Output: histogram of disk operation latency
# If you see operations > 10ms frequently, your storage is struggling

# Show every disk I/O operation with latency
biosnoop-bpfcc
# Output:
# TIME(s)  COMM      PID  DISK  T SECTOR   BYTES  LAT(ms)
# 0.000    postgres  1234 sda   R 12345678 4096   0.45
# 0.001    postgres  1234 sda   W 12345680 8192   1.23

# Show which files are generating the most I/O
filetop-bpfcc
# Refreshes like top, showing files ranked by read/write bytes

# Page cache hit ratio (are reads hitting RAM or disk?)
cachestat-bpfcc
# Output:
# HITS   MISSES  DIRTIES  HITRATIO  BUFFERS_MB
# 15843  234     45       98.54%    1024
# Hit ratio < 90%? You may need more RAM or your working set is too large
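The HITRATIO column is simply hits / (hits + misses). A quick sketch for recomputing it from captured cachestat numbers — `hit_ratio` is a helper invented here:

```shell
# Recompute the page cache hit ratio from the HITS and MISSES columns.
hit_ratio() {
    # $1 = hits, $2 = misses; prints a percentage with two decimals
    awk -v h="$1" -v m="$2" 'BEGIN { printf "%.2f\n", 100 * h / (h + m) }'
}

hit_ratio 15843 234    # matches the 98.54% HITRATIO in the sample output
```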

Fun fact: BCC was created by Brenden Blanco at PLUMgrid (later part of the IOVisor project). The tools were popularized by Brendan Gregg, who wrote many of the 100+ BCC tools in the repository while at Netflix. Gregg's 2019 book "BPF Performance Tools" is the definitive reference. The BCC tools are deliberately named after the questions they answer: execsnoop snoops exec calls, biolatency shows block I/O latency, tcpretrans traces TCP retransmissions.

4. bpftrace — One-Liners for Deep Tracing

bpftrace is a high-level tracing language for eBPF. Think of it as awk for kernel tracing.

# Install bpftrace
dnf install -y bpftrace    # RHEL/CentOS
apt install -y bpftrace    # Ubuntu/Debian

# Count system calls by process name
bpftrace -e 'tracepoint:raw_syscalls:sys_enter { @[comm] = count(); }'

# Show which files a specific process is reading
bpftrace -e 'tracepoint:syscalls:sys_enter_openat /comm == "nginx"/ {
  printf("%s opened %s\n", comm, str(args->filename));
}'

# Histogram of read() sizes by process
bpftrace -e 'tracepoint:syscalls:sys_exit_read /args->ret > 0/ {
  @bytes[comm] = hist(args->ret);
}'

# Trace DNS lookups (getaddrinfo)
bpftrace -e 'uprobe:/lib/x86_64-linux-gnu/libc.so.6:getaddrinfo {
  printf("%s is resolving DNS: %s\n", comm, str(arg0));
}'

# Show latency of disk I/O by disk device
bpftrace -e 'tracepoint:block:block_rq_issue { @start[args->dev, args->sector] = nsecs; }
  tracepoint:block:block_rq_complete /@start[args->dev, args->sector]/ {
  @usecs = hist((nsecs - @start[args->dev, args->sector]) / 1000);
  delete(@start[args->dev, args->sector]);
}'

# Trace why a process is waking up (useful for unexpected CPU usage)
bpftrace -e 'tracepoint:sched:sched_wakeup /args->comm == "myapp"/ {
  printf("woke by: %s (PID %d)\n", comm, pid);
}'

5. Real Production Debugging Scenarios

Scenario: Mystery Latency Spikes

Symptom: API latency spikes to 5s every few minutes. CPU, memory, disk look fine.

Investigation with eBPF:

  Step 1: Check if it's network
    tcpretrans-bpfcc
    → Result: No retransmissions. Network is clean.

  Step 2: Check if it's disk I/O
    biolatency-bpfcc
    → Result: All I/O under 1ms. Disk is fast.

  Step 3: Check if it's scheduling (CPU contention)
    runqlat-bpfcc
    → Result: Spikes of 50-200ms run queue latency every 2 minutes.
    Something is hogging the CPU and displacing our app.

  Step 4: Find what's consuming CPU at spike time
    bpftrace -e 'profile:hz:99 { @[comm] = count(); }'
    → Result: "logrotate" consuming massive CPU every 2 minutes.

  Root cause: logrotate is compressing a 10GB log file in-place.
  The compression consumes 100% of one CPU core for several seconds,
  causing scheduler delays for other processes.

  Fix: Move log compression to off-peak hours or use zstd (faster).

Scenario: Connection Leak

Symptom: Server gradually runs out of file descriptors. Connections build up.

Investigation with eBPF:

  Step 1: See what's connecting
    tcplife-bpfcc
    → Result: Thousands of connections from "myapp" to Redis,
    all lasting > 300 seconds with zero bytes transferred.

  Step 2: Confirm connections are idle
    tcptracer-bpfcc
    → Result: Connections are being opened but never closed.

  Root cause: Connection pool not returning connections after timeout.
  Fix: Configure connection pool idle timeout and max lifetime.

6. When to Use What

Decision Tree: Which eBPF Tool?

  "Something is slow"
    → Is it CPU? ──────────▶ runqlat, cpudist, profile
    → Is it disk? ─────────▶ biolatency, biosnoop, ext4slower
    → Is it network? ──────▶ tcpretrans, tcplife, tcpconnect
    → Not sure? ───────────▶ runqlat first (most common hidden cause)

  "Something is happening that shouldn't be"
    → Unexpected processes? ▶ execsnoop
    → Unexpected file access? ▶ opensnoop
    → Unexpected connections? ▶ tcpconnect, tcpaccept

  "I need to understand a specific process"
    → What syscalls? ───────▶ bpftrace tracepoint:syscalls:*
    → What files? ──────────▶ opensnoop -p PID
    → What network? ────────▶ tcpconnect -p PID
    → Custom tracing? ──────▶ bpftrace (write your own)

Common Pitfalls

Under the hood: BPF CO-RE (Compile Once, Run Everywhere) solved eBPF's biggest deployment pain point. Before CO-RE, eBPF programs had to be compiled on the target machine because they contained kernel struct offsets that varied between kernel versions. CO-RE uses BTF (BPF Type Format) metadata embedded in the kernel to relocate struct accesses at load time. This means you can compile an eBPF program once and run it on any kernel built with BTF enabled (CONFIG_DEBUG_INFO_BTF, available since kernel 5.2 and on by default in most distribution kernels from around 5.4) — no headers required.

  1. Forgetting to install kernel headers — eBPF tools need headers matching your running kernel. apt install linux-headers-$(uname -r) or dnf install kernel-devel-$(uname -r).
  2. Running eBPF tools as non-root without CAP_BPF — Most tools need root or specific capabilities. Don't chmod them to SUID — use capabilities properly.
  3. Leaving tracing running in production — BCC tools have low overhead but not zero. Run them during investigation, not permanently. For continuous monitoring, use purpose-built eBPF-based exporters.
  4. Interpreting histograms without understanding the baseline — A biolatency histogram means nothing if you don't know what normal looks like. Capture baselines during healthy operation.
  5. Expecting eBPF on old kernels — Full eBPF tracing support requires roughly kernel 4.9+. Many advanced features need 5.x+. BPF CO-RE (compile once, run everywhere) needs a kernel with BTF enabled (typically 5.4+ in distro kernels). Check your kernel version first.
  6. bpftrace one-liners that trace too broadly — Tracing all syscalls on a busy server generates massive output. Always filter by process name, PID, or specific syscall.
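The kernel checks in pitfall 5 can be scripted as a preflight before an investigation. A minimal sketch, assuming only POSIX sh and coreutils — `kver_ge` is a helper invented here:

```shell
# Preflight: does this kernel plausibly support the eBPF features we need?
kver_ge() {
    # true if version string $1 >= version string $2 (sort -V orders versions)
    [ "$(printf '%s\n' "$2" "$1" | sort -V | head -n1)" = "$2" ]
}

KERNEL=$(uname -r | cut -d- -f1)    # "5.15.0-91-generic" -> "5.15.0"

kver_ge "$KERNEL" 4.9 && echo "basic eBPF tracing: likely supported"
kver_ge "$KERNEL" 5.8 && echo "CAP_BPF: available"
if [ -r /sys/kernel/btf/vmlinux ]; then
    echo "kernel BTF present: CO-RE tools run without matching headers"
else
    echo "no kernel BTF: tools may need headers for this exact kernel"
fi
```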
