Tracing — Trivia & Interesting Facts

Surprising, historical, and little-known facts about distributed tracing.


Google's Dapper paper (2010) created the modern distributed tracing discipline

Google published "Dapper, a Large-Scale Distributed Systems Tracing Infrastructure" in 2010, describing their internal tracing system that had been running since 2005. The paper introduced concepts that became industry standard: trace context propagation, span hierarchies, and sampling strategies. Virtually every distributed tracing system built since — Zipkin, Jaeger, Datadog APM — traces its conceptual lineage to Dapper.


Twitter's Zipkin was the first open-source implementation of Dapper-style tracing

Twitter released Zipkin in June 2012 as an open-source distributed tracing system directly inspired by Google's Dapper paper. Written in Scala, Zipkin established the reference architecture: instrumented services send spans to a collector, which stores them in a backend (originally Cassandra), and a web UI displays trace timelines. Zipkin moved to the OpenZipkin community project and is still actively maintained.


A single trace can contain thousands of spans in a microservices architecture

In a complex microservices system, a single user request might traverse 20-50 services, each generating multiple spans (HTTP calls, database queries, cache lookups). Netflix reported that some of their traces contain over 1,000 spans. Storing and querying these traces at scale is a significant engineering challenge — Uber's Jaeger processes billions of spans per day across their microservices fleet.


OpenTelemetry merged two competing standards to end "the tracing wars"

OpenCensus (Google) and OpenTracing (CNCF) were competing standards for instrumentation APIs, creating confusion and fragmentation. In 2019, they merged into OpenTelemetry (OTel), which joined the CNCF and reached incubating status in 2021. OTel provides a single, vendor-neutral API for traces, metrics, and logs. By 2024, it was the second most active CNCF project after Kubernetes, with contributions from every major observability vendor.


W3C Trace Context became a web standard in 2020

The W3C Trace Context specification, published as a Recommendation in February 2020, standardized how trace context is propagated across services via HTTP headers (traceparent and tracestate). Before this standard, every tracing system used its own propagation format (Zipkin's B3, Jaeger's uber-trace-id, AWS X-Ray's X-Amzn-Trace-Id), making cross-system tracing nearly impossible. W3C Trace Context unified propagation across the industry.


Sampling is necessary because tracing everything is prohibitively expensive

At high-traffic scales, capturing every trace is economically impossible — storing billions of spans per day requires petabytes of storage. Most production tracing systems sample 1-10% of traces. Head-based sampling (decide at trace start) is simple but can miss rare errors. Tail-based sampling (decide after the trace completes) captures all errors but requires buffering complete traces before the sampling decision, which is architecturally complex.
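One common way to implement head-based sampling is to make the decision a deterministic function of the trace ID, so every service independently reaches the same keep/drop answer without coordination. A minimal sketch (the hash-into-a-bucket scheme here is illustrative, not any specific vendor's algorithm):

```python
# Deterministic head-based sampling sketch: hash the trace ID into
# [0, 1) and keep the trace if it falls below the sampling rate.
# Every service computes the same answer for the same trace ID, so a
# trace is either kept everywhere or dropped everywhere.
import hashlib

def should_sample(trace_id: str, rate: float = 0.01) -> bool:
    digest = hashlib.sha256(trace_id.encode()).digest()
    # Interpret the first 8 bytes as a uniform value in [0, 1).
    bucket = int.from_bytes(digest[:8], "big") / 2**64
    return bucket < rate

# Boundary cases: rate=1.0 keeps everything, rate=0.0 keeps nothing.
assert should_sample("4bf92f3577b34da6a3ce929d0e0e4736", rate=1.0)
assert not should_sample("4bf92f3577b34da6a3ce929d0e0e4736", rate=0.0)
```

Tail-based sampling cannot be written as a pure function like this: the decision depends on the finished trace (errors, total latency), which is exactly why it requires buffering spans until the trace completes.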


The "context propagation" problem is the hardest part of distributed tracing

Propagating trace context across service boundaries sounds simple (pass a header) but becomes complex with message queues (the consumer might run hours after the producer), batch jobs (one trace spawning thousands of child operations), and serverless functions (new execution contexts per invocation). Context propagation across asynchronous boundaries remains the most common cause of "broken traces" in production.


eBPF-based tracing can instrument applications without any code changes

Traditional tracing requires adding instrumentation libraries to application code. eBPF-based approaches (like Pixie, acquired by New Relic in 2020, and Grafana Beyla) can capture trace data from the kernel level, observing network calls and system calls without modifying the application. This "zero-instrumentation" approach is revolutionary for legacy applications and polyglot environments where adding libraries to every service is impractical.


Jaeger takes its name from the German word for "hunter" (of problems in distributed systems)

Uber created Jaeger (German for "hunter" or "huntsman") in 2015 to trace requests across their thousands of microservices. It was open-sourced in 2017 and graduated as a CNCF project in 2019. Jaeger's adaptive sampling feature — automatically adjusting sampling rates per service based on traffic volume — solved a practical problem that Zipkin's fixed-rate sampling couldn't address at Uber's scale.


Trace-based testing uses real traces to generate integration tests automatically

An emerging practice uses production traces to automatically generate integration tests. By replaying recorded traces against new service versions, teams can detect behavioral changes (different response codes, missing spans, latency regressions) without writing tests manually. Tracetest and Malabi are tools in this space. The concept inverts the traditional relationship: instead of tests validating traces, traces generate tests.


The "three pillars of observability" framing is increasingly seen as incomplete

The popular framing of observability as "three pillars" (logs, metrics, traces) has been criticized by observability leaders including Charity Majors (Honeycomb). The argument is that the three pillars are not equal or independent — traces provide the richest context, metrics are the cheapest to store, and logs are the most familiar but least structured. Modern observability increasingly focuses on correlating all three rather than treating them as separate concerns.