
Interview Gauntlet: eBPF for Observability

Category: Architecture Trade-offs · Difficulty: L3 · Duration: 15-20 minutes · Domains: eBPF, Observability


Round 1: The Opening

Interviewer: "Your team is evaluating eBPF-based observability tools — things like Pixie, Cilium Hubble, or Tetragon. What would eBPF give you that your current Prometheus + OpenTelemetry stack doesn't?"

Strong Answer:

"eBPF-based observability provides three things that traditional instrumentation doesn't. First, zero-instrumentation visibility. eBPF programs attach to kernel functions and syscalls, so you can capture every HTTP request, DNS lookup, TCP connection, and file access without modifying application code or adding SDKs. For a polyglot environment with services in Go, Python, Java, and Node.js, getting consistent observability without instrumenting each language separately is a huge win. Second, kernel-level metrics that are impossible to get from application instrumentation: syscall latency, TCP retransmissions, page faults, scheduler latency, file I/O patterns. These are invisible to Prometheus scrapers but critical for debugging performance issues. Third, security observability. Tools like Tetragon can trace process execution, file access, and network connections at the kernel level and generate security events (like 'this process opened /etc/shadow' or 'this container made an unexpected outbound connection'). Traditional observability tools don't cover the security dimension. That said, eBPF doesn't replace Prometheus and OpenTelemetry — it complements them. Application-level business metrics (orders per second, payment success rate, user signups) still need application instrumentation because eBPF can't understand business logic."

Common Weak Answers:

  • "eBPF is faster than Prometheus." — This conflates different measurement types. Prometheus collects application-level metrics; eBPF collects kernel-level data. They're not competing tools.
  • "eBPF replaces the need for OpenTelemetry." — eBPF can auto-generate request traces without instrumentation, but it can't capture application-specific context like user IDs, feature flags, or business events.
  • "We should adopt it because it's cutting-edge." — Technology adoption driven by novelty, not need.

Round 2: The Probe

Interviewer: "What are the kernel version requirements for eBPF observability tools, and what happens if some of your nodes are running older kernels?"

What the interviewer is testing: Practical knowledge of eBPF deployment constraints. Many candidates know what eBPF does but not what it requires.

Strong Answer:

"eBPF capabilities are kernel-version dependent, and the requirements vary by feature. Basic eBPF tracing (kprobes, uprobes) requires kernel 4.4+, which is widely available. But modern eBPF observability tools use features that need newer kernels: BPF Type Format (BTF), which enables portable eBPF programs (CO-RE — Compile Once, Run Everywhere), requires kernel 5.2+. Ring buffer for efficient event streaming requires 5.8+. Newer map types and helper functions are added in each kernel release. In practice, most eBPF observability tools (Pixie, Cilium Hubble, Tetragon) require kernel 4.14+ at minimum, and recommend 5.4+ or 5.10+ for full feature support. For a Kubernetes cluster: if you're on a recent distribution (Ubuntu 22.04 has 5.15, Amazon Linux 2023 has 6.1, Bottlerocket has 5.10+), you're fine. If you have older nodes (CentOS 7 with kernel 3.10, for example), eBPF tools either won't install or will run in degraded mode. The mitigation for mixed kernel environments is: label nodes with their kernel version, deploy eBPF-based tools only to compatible nodes using a DaemonSet with a nodeSelector, and use traditional instrumentation for the older nodes. Or — the better option — upgrade the kernel, which is usually a node group rotation in a managed Kubernetes cluster."
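
The kernel-version gating described above can be sketched as a small script for labeling nodes before rolling out a DaemonSet. The tiers and the function name are illustrative, not from any specific tool; the thresholds (4.14+ basic, 5.2+ BTF/CO-RE, 5.8+ ring buffer) follow the answer:

```shell
#!/usr/bin/env bash
# Sketch: classify a node's eBPF feature tier from its kernel release
# string (as reported by `uname -r`). Tier names are our own.
ebpf_tier() {
  local release="$1"
  local major minor
  major="${release%%.*}"
  minor="${release#*.}"; minor="${minor%%.*}"
  minor="${minor%%[^0-9]*}"   # strip suffixes like "-generic"
  if (( major > 5 || (major == 5 && minor >= 8) )); then
    echo "full"          # BTF/CO-RE plus ring buffer streaming
  elif (( major == 5 && minor >= 2 )); then
    echo "btf-only"      # CO-RE works; older perf-buffer streaming
  elif (( major > 4 || (major == 4 && minor >= 14) )); then
    echo "basic"         # kprobes/uprobes; tool support varies
  else
    echo "unsupported"   # e.g. CentOS 7's 3.10 kernel
  fi
}

ebpf_tier "$(uname -r)"
```

The output could feed a node label (e.g. `kubectl label node $NODE ebpf-tier=full`, with a matching nodeSelector on the DaemonSet), so eBPF agents only schedule onto compatible nodes.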

Trap Alert:

If the candidate bluffs here: The interviewer will ask "What is BTF and why does it matter?" BTF (BPF Type Format) is a metadata format that describes kernel data structures. Without BTF, eBPF programs need to be compiled against the exact kernel headers of the target machine. With BTF, the eBPF program is compiled once and the BPF loader relocates field offsets at load time using BTF information from the running kernel. This is CO-RE and it's what makes eBPF tools portable across kernel versions. It's perfectly fine to say "I know BTF enables portability across kernel versions but I'd need to look up the exact mechanism."
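
One concrete way to check for CO-RE support on a node: kernels built with CONFIG_DEBUG_INFO_BTF publish their type information at /sys/kernel/btf/vmlinux, which CO-RE loaders read at load time to relocate field offsets. A minimal sketch (the helper name is ours; the path is the standard one):

```shell
# Sketch: does this node expose kernel BTF for CO-RE eBPF loaders?
has_btf() {
  local btf_path="${1:-/sys/kernel/btf/vmlinux}"
  if [ -r "$btf_path" ]; then
    echo "yes"   # one portable eBPF binary can run here
  else
    echo "no"    # tool needs per-kernel builds or an external BTF bundle
  fi
}

has_btf
```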


Round 3: The Constraint

Interviewer: "Your team has 10 engineers. None of them have eBPF experience. The learning curve is steep. How do you adopt eBPF-based observability without it becoming a black box that nobody on the team can debug?"

Strong Answer:

"I'd differentiate between using eBPF tools and writing eBPF programs — most teams need the former but not the latter. Using a tool like Cilium Hubble is similar to using any other observability tool: install it, configure it, read the dashboards. The eBPF part is an implementation detail, like how Prometheus internally uses memory-mapped files. The team doesn't need to understand BPF bytecode to use Hubble's network flow visualization. That said, when things go wrong (and they will), someone needs to understand enough to debug issues like: 'Why is Hubble not showing flows for this namespace?' or 'Why did the eBPF program fail to load on this node?' I'd invest in training progressively. Phase one: one engineer takes Cilium's certification course and becomes the eBPF champion. They handle deployment, configuration, and escalations. Phase two: create internal runbooks for common issues (eBPF program load failures, kernel version mismatches, Hubble connectivity issues). These runbooks let on-call engineers resolve common problems without deep eBPF knowledge. Phase three: once the team is comfortable using the tools, optionally invest in deeper knowledge — a workshop on writing custom eBPF programs using bpftrace for ad-hoc kernel tracing during incidents. bpftrace is a good entry point because its syntax is similar to awk and it doesn't require C programming. The risk to avoid: deploying eBPF tools and then being unable to debug them when they fail, which turns the observability layer itself into a blind spot."
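
As a flavor of that awk-like syntax, here is a classic bpftrace one-liner that traces every openat() syscall system-wide with no application changes. Shown for illustration only: it requires root and a bpftrace install, and each `probe { action }` clause mirrors awk's `pattern { action }` structure:

```shell
# Print which process opens which file, system-wide, until Ctrl-C.
sudo bpftrace -e 'tracepoint:syscalls:sys_enter_openat {
  printf("%s -> %s\n", comm, str(args->filename));
}'
```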

The Senior Signal:

What separates a senior answer: Separating the "use" skill from the "build" skill. Most teams don't need to write eBPF programs — they need to use tools built on eBPF. This distinction is like using Prometheus vs writing a Prometheus exporter in Go. Also: the progressive training approach with a champion model is practical and proven for adopting complex infrastructure technology.


Round 4: The Curveball

Interviewer: "A senior engineer argues: 'We should skip eBPF entirely and just improve our OpenTelemetry instrumentation. Auto-instrumentation agents give us traces, metrics, and logs without modifying code, and they work on any kernel.' Is that a valid position?"

Strong Answer:

"It's a strong argument and might be the right call depending on priorities. OpenTelemetry auto-instrumentation (Java agent, Python auto-instrumentation, Node.js auto-instrumentation) provides request traces, HTTP metrics, database query timing, and error tracking without code changes. OTel auto-instrumentation covers 80-90% of most teams' application-level observability needs. The key advantages over eBPF: wider language support with deeper semantic understanding (OTel agents understand framework-level concepts like Spring Boot controllers, Django views, Express routes), richer context propagation (trace IDs, baggage, span attributes with business context), and no kernel version dependency. Where eBPF still wins: infrastructure-level visibility that OTel agents can't provide. TCP connection details, DNS resolution, syscall-level latency, container-to-container network flows, and security events are invisible to application-level instrumentation. My recommendation depends on what the team's observability gaps are. If the biggest gap is 'we can't trace requests across services,' OTel auto-instrumentation is the right investment. If the biggest gap is 'we can't see what's happening at the network and kernel level,' eBPF is the right investment. For most teams, the application-level observability gap is larger and more impactful, which means the senior engineer's position is correct for their situation. The hybrid approach — OTel for application-level, eBPF for infrastructure-level — is the long-term answer, but the sequencing matters."

Trap Question Variant:

The right answer is "Both have merit; it depends on the gap." Candidates who dismiss the senior engineer's position are showing eBPF bias. Candidates who completely agree without acknowledging eBPF's unique capabilities are missing the infrastructure observability dimension. The senior answer evaluates both positions fairly, identifies the deciding factor (what's the biggest gap?), and suggests sequencing.


Round 5: The Synthesis

Interviewer: "Where do you see the observability landscape heading in the next 2-3 years? What should teams invest in now to be well-positioned?"

Strong Answer:

"Three trends are converging. First, auto-instrumentation is becoming the default. Both OpenTelemetry auto-instrumentation and eBPF-based auto-discovery are reducing the need for manual instrumentation. In 2-3 years, getting basic request traces, network flows, and performance metrics will require zero code changes for most workloads. The investment now is adopting OpenTelemetry as the standard — it's vendor-neutral, and even if you switch backends (Datadog to Grafana Cloud, or vice versa), the instrumentation stays. Second, observability is merging with security. Tools like Tetragon, Falco, and cloud-native runtime security are using the same data sources (kernel events, network flows, process execution) as observability tools. The 'SIEM' and the 'monitoring dashboard' are converging into unified platforms. Teams should invest in a single telemetry pipeline (OpenTelemetry Collector) that feeds both observability and security backends. Third, AI-assisted analysis. The volume of observability data is growing faster than teams can review it. ML-based anomaly detection, automated root cause analysis, and intelligent alert correlation will move from 'nice to have' to 'necessary.' The investment now is clean, structured, correlated data — consistent labels, trace IDs propagated everywhere, events linked to metrics — because AI tools need good data to produce good results. The practical advice: standardize on OpenTelemetry for application telemetry, evaluate eBPF tools for infrastructure telemetry, and ensure your data pipeline can feed multiple backends without lock-in."

What This Sequence Tested:

Round 1: Understanding of eBPF observability capabilities and limitations
Round 2: Practical deployment constraints (kernel versions, BTF)
Round 3: Team adoption strategy for complex infrastructure technology
Round 4: Fair comparison of competing approaches (eBPF vs OTel)
Round 5: Strategic thinking about observability evolution
