Observability Deep Dive: Trivia & Interesting Facts

Surprising, historical, and little-known facts about observability.

The term "observability" was borrowed from control theory, not invented by tech companies

Observability was defined by Rudolf Kalman in 1960 as a property of dynamic systems: a system is observable if its internal state can be determined from its external outputs. Charity Majors and the Honeycomb team popularized the term in the tech industry around 2017, arguing that traditional monitoring (known unknowns) was insufficient for complex distributed systems (unknown unknowns).
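For linear time-invariant systems, Kalman's definition has a concrete test. It is sketched below in standard state-space notation; the symbols A, B, C, x, u, y, and n are assumed for illustration and do not appear in the text above.

```latex
\dot{x}(t) = A\,x(t) + B\,u(t), \qquad y(t) = C\,x(t), \qquad x(t) \in \mathbb{R}^{n}

\mathcal{O} =
\begin{bmatrix} C \\ CA \\ \vdots \\ CA^{n-1} \end{bmatrix},
\qquad
\text{the system is observable} \iff \operatorname{rank}(\mathcal{O}) = n
```

When the observability matrix has full rank n, the internal state x can be reconstructed from the outputs y (given the known inputs u). That is the formal counterpart of inferring a service's internal state from its telemetry.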

The "three pillars" model is widely taught, but prominent practitioners consider it misleading

The three pillars of observability — metrics, logs, and traces — became the standard framework around 2018. However, practitioners like Charity Majors and Ben Sigelman (co-creator of OpenTelemetry) have argued that the pillars model is misleading because it suggests three separate tools rather than a unified approach. The real goal is understanding system behavior, not collecting three types of telemetry.

Distributed tracing was invented at Google in 2004 but not published until 2010

Google's Dapper distributed tracing system was built in 2004 by Benjamin Sigelman and used internally for years before the 2010 paper was published. Dapper inspired virtually every distributed tracing system that followed: Zipkin (Twitter, 2012), Jaeger (Uber, 2015), and eventually OpenTelemetry. Sigelman later co-founded LightStep and helped create OpenTelemetry.

Honeycomb's core insight was that debugging requires high-cardinality data

Honeycomb, founded by Charity Majors and Christine Yen in 2016, built its product around the principle that effective debugging requires querying on high-cardinality fields (user IDs, request IDs, build numbers). This is exactly the kind of data that traditional metrics systems handle poorly, because every distinct label value creates a new time series. This insight, drawn from Majors' experience managing infrastructure at Parse and, after its acquisition, at Facebook, helped define modern observability as distinct from traditional monitoring.
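The cardinality problem can be illustrated with a little arithmetic. This is a minimal sketch: the label names and their distinct-value counts are made-up numbers, and the product is an upper bound on the series a label-based metrics store might need to track for one metric.

```python
from math import prod


def max_series_count(label_cardinalities: dict[str, int]) -> int:
    """Upper bound on time series for one metric: the product of the
    number of distinct values each label can take."""
    return prod(label_cardinalities.values())


# Hypothetical low-cardinality labels, typical of traditional metrics.
low = {"method": 5, "status": 6, "region": 4}

# Adding one high-cardinality field multiplies the bound by its cardinality.
high = {**low, "user_id": 1_000_000}

print(max_series_count(low))   # 120
print(max_series_count(high))  # 120000000
```

Event-oriented stores sidestep this multiplication by keeping the raw fields on each event and aggregating at query time.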

OpenTelemetry is the second most active CNCF project after Kubernetes

OpenTelemetry, formed in 2019 from the merger of OpenTracing and OpenCensus, rapidly became the second most active project in the CNCF (after Kubernetes) measured by number of contributors and commits. By 2024, it had contributions from engineers at over 100 companies and provided instrumentation libraries for 11 programming languages.

Tail-based sampling can reduce tracing costs by 90% while keeping interesting traces

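Most tracing systems default to head-based sampling, which decides at the start of a request whether to trace it. Because that decision is made before the outcome is known, interesting traces are discarded at the same rate as routine ones. Tail-based sampling waits until a trace is complete, then decides whether to keep it based on whether it contains errors, high latency, or other interesting properties. This approach can keep every trace that matches the policy while sampling out 90% or more of the routine ones. The trade-off is that all spans must be buffered until the trace finishes.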
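The tail-based decision described above can be sketched in a few lines. This is an illustrative sketch, not any vendor's implementation: the `Span` and `Trace` types, the 500 ms threshold, and the 10% baseline rate are all assumptions chosen for the example.

```python
import random
from dataclasses import dataclass, field
from typing import Callable


@dataclass
class Span:
    name: str
    duration_ms: float
    is_error: bool = False


@dataclass
class Trace:
    trace_id: str
    spans: list[Span] = field(default_factory=list)


def keep_trace(
    trace: Trace,
    latency_threshold_ms: float = 500.0,   # assumed threshold
    baseline_rate: float = 0.10,           # assumed rate for routine traces
    rng: Callable[[], float] = random.random,
) -> bool:
    """Decide whether to keep a trace once all of its spans have arrived."""
    # Always keep traces that contain at least one failed span.
    if any(span.is_error for span in trace.spans):
        return True
    # Always keep traces that contain at least one slow span.
    if any(span.duration_ms >= latency_threshold_ms for span in trace.spans):
        return True
    # Otherwise keep only a random fraction of the routine traces.
    return rng() < baseline_rate
```

Passing `rng` as a parameter keeps the random fallback deterministic in tests. A real collector would also need a timeout for deciding when a trace counts as complete.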

Context propagation across service boundaries was the hardest problem in distributed tracing

Getting trace context (trace ID, span ID, sampling decision) to propagate correctly across every service boundary, including HTTP headers, message queues, database calls, and gRPC, took years of standardization. The W3C Trace Context specification, published as a W3C Recommendation in February 2020, defined two standard HTTP headers (traceparent and tracestate), finally enabling cross-vendor trace correlation after years of incompatible proprietary formats.
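To make the header concrete, here is a minimal sketch of parsing a `traceparent` value and building the one a service would send downstream. It handles only the four-field layout (version, trace ID, parent span ID, flags). The helper names `parse_traceparent` and `child_traceparent` are hypothetical, and production code would normally rely on an OpenTelemetry propagator instead.

```python
import re
from typing import NamedTuple, Optional

# version (2 hex) - trace-id (32 hex) - parent-id (16 hex) - flags (2 hex)
_TRACEPARENT_RE = re.compile(
    r"([0-9a-f]{2})-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})"
)


class TraceContext(NamedTuple):
    version: str
    trace_id: str
    parent_id: str
    flags: int

    @property
    def sampled(self) -> bool:
        # The lowest bit of the flags byte carries the sampling decision.
        return bool(self.flags & 0x01)


def parse_traceparent(header: str) -> Optional[TraceContext]:
    """Return the parsed context, or None if the header is malformed."""
    match = _TRACEPARENT_RE.fullmatch(header.strip())
    if match is None:
        return None
    version, trace_id, parent_id, flags = match.groups()
    # Version "ff" and all-zero IDs are invalid.
    if version == "ff" or trace_id == "0" * 32 or parent_id == "0" * 16:
        return None
    return TraceContext(version, trace_id, parent_id, int(flags, 16))


def child_traceparent(ctx: TraceContext, new_span_id: str) -> str:
    """Header for an outgoing call: same trace ID, new parent span ID."""
    return f"{ctx.version}-{ctx.trace_id}-{new_span_id}-{ctx.flags:02x}"


ctx = parse_traceparent("00-4bf92f3577b34da6a3ce929d0e0e4736-00f067aa0ba902b7-01")
```

The trace ID stays constant across every hop while the parent ID changes at each one, which is what lets a backend stitch spans from different services into a single trace.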

Exemplars bridge the gap between metrics and traces

Exemplars, standardized in OpenMetrics and supported by Prometheus since 2021, attach a trace ID to individual metric observations. When you see a latency spike in a histogram, you can click through to a specific trace recorded in the affected bucket. This simple concept of linking a metric data point to a specific trace took the observability industry years to standardize, but it dramatically accelerates root cause analysis.
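In the OpenMetrics text format, an exemplar is appended to a sample line after a `#`. The sketch below formats one histogram bucket line by hand to show the shape. The function name and the metric name are hypothetical, and real services would let a client library emit this rather than build strings.

```python
def bucket_line_with_exemplar(
    metric: str, le: str, count: int, trace_id: str, observed: float
) -> str:
    """Render a histogram bucket sample followed by its exemplar.

    Layout: <metric>_bucket{le="..."} <count> # {trace_id="..."} <observed>
    """
    return (
        f'{metric}_bucket{{le="{le}"}} {count} '
        f'# {{trace_id="{trace_id}"}} {observed}'
    )


line = bucket_line_with_exemplar(
    metric="http_request_duration_seconds",  # hypothetical metric name
    le="0.5",
    count=12,
    trace_id="4bf92f3577b34da6a3ce929d0e0e4736",
    observed=0.43,
)
print(line)
```

The part before the `#` is an ordinary bucket sample. The part after it names one concrete observation (0.43 seconds here) and the trace that produced it, which is what a dashboard uses to link to the tracing backend.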

The observability market exceeded $20 billion in 2023

The combined observability market — including monitoring, APM, log management, and tracing — exceeded $20 billion in annual revenue by 2023. Datadog, Splunk (Cisco), Dynatrace, New Relic, and Elastic are the major commercial players, while open-source alternatives (Prometheus, Grafana, Jaeger, OpenTelemetry) provide viable self-hosted options for cost-conscious organizations.