---
tags:
  - observability
  - l1
  - topic-pack
  - tracing
---

Portal | Level: L1: Foundations | Topics: Tracing | Domain: Observability
Distributed Tracing Primer¶
Why This Matters¶
In a monolith, a stack trace tells you what happened. In microservices, a single user request fans out across dozens of services, queues, and databases. Without distributed tracing, debugging latency or failures means grepping logs across 15 services and correlating by timestamp. Tracing gives you a single view of the entire request lifecycle, showing where time is spent and where failures originate.
Fun fact: Google's Dapper paper (2010) pioneered distributed tracing at scale, inspiring open-source implementations like Zipkin (Twitter, 2012), Jaeger (Uber, 2015), and the OpenTelemetry standard. The W3C Trace Context header (`traceparent`) was standardized in 2020, making trace propagation interoperable across vendors.

Under the hood: Trace context propagation works by injecting a `traceparent` header (e.g., `00-<trace-id>-<span-id>-01`) into every outgoing HTTP request or message. Each service extracts the trace/span IDs, creates a child span, and injects the updated header downstream. The overhead is typically less than 1ms per span.
Core Concepts¶
Traces and Spans¶
A trace represents the full journey of a request through the system. It is a directed acyclic graph of spans.
A span represents a single unit of work:
- Has a start time and duration
- Has a name (e.g., `HTTP GET /api/users`)
- Belongs to a service
- May have a parent span (forming the trace tree)
- Carries key-value attributes (tags)
- Can log events (timestamped annotations)
```
Trace: user-request-abc123
├── [frontend] HTTP GET /checkout (200ms)
│   ├── [cart-svc] getCart (45ms)
│   ├── [payment-svc] charge (120ms)
│   │   └── [stripe-api] POST (95ms)
│   └── [inventory-svc] reserve (30ms)
```
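The tree above can be modeled directly. A minimal sketch (an illustrative model, not a real SDK type; OpenTelemetry spans additionally carry IDs, attributes, events, and status):

```python
from dataclasses import dataclass, field
from typing import List, Optional

@dataclass
class Span:
    """Toy span: a named, timed unit of work with parent/child links."""
    name: str
    service: str
    duration_ms: float
    parent: Optional["Span"] = None
    children: List["Span"] = field(default_factory=list)

    def child(self, name: str, service: str, duration_ms: float) -> "Span":
        c = Span(name, service, duration_ms, parent=self)
        self.children.append(c)
        return c

    def self_time_ms(self) -> float:
        # Time spent in this span itself, excluding its direct children
        return self.duration_ms - sum(c.duration_ms for c in self.children)

# Rebuild the checkout trace from the diagram
root = Span("HTTP GET /checkout", "frontend", 200)
root.child("getCart", "cart-svc", 45)
charge = root.child("charge", "payment-svc", 120)
charge.child("POST", "stripe-api", 95)
root.child("reserve", "inventory-svc", 30)

print(root.self_time_ms())  # 200 - (45 + 120 + 30) = 5
```

Walking parent links upward from any span reconstructs the request path; summing self-time per service shows where latency actually accrues.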
Span Attributes¶
Attributes provide context for analysis:
| Attribute | Example |
|---|---|
| `http.method` | `GET` |
| `http.status_code` | `500` |
| `db.system` | `postgresql` |
| `db.statement` | `SELECT * FROM users` |
| `error` | `true` |
| `user.id` | `usr_12345` |
Span Status¶
- Unset: default, no explicit status
- OK: operation completed successfully
- Error: operation failed (should include error message)
Remember: The three pillars of observability are Metrics (what is happening), Logs (why it happened), and Traces (where it happened in the request path). Traces are the only pillar that shows causality across service boundaries. Mnemonic: "MLT" (Measure, Log, Trace).
Trace Context Propagation¶
For tracing to work across services, context must be propagated in request headers.
W3C Trace Context (Standard)¶
Example:

```
traceparent: 00-<trace-id>-<span-id>-<flags>
```

- `trace-id`: 32-hex-char unique trace identifier
- `span-id`: 16-hex-char current span identifier
- `flags`: `01` = sampled, `00` = not sampled
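As a sketch of the format, the fields can be pulled apart with a few lines of code (the parser here is illustrative, not a library API; the example value is the canonical one from the W3C spec):

```python
def parse_traceparent(value: str) -> dict:
    """Illustrative parser for a W3C traceparent header value.

    Format: version "-" trace-id "-" parent-id "-" trace-flags
    """
    version, trace_id, span_id, flags = value.split("-")
    # Field widths per the spec: 32 and 16 lowercase hex characters
    assert len(trace_id) == 32 and len(span_id) == 16
    return {
        "version": version,
        "trace_id": trace_id,
        "span_id": span_id,
        "sampled": int(flags, 16) & 0x01 == 1,  # low bit = sampled flag
    }

ctx = parse_traceparent("00-4bf92f3577b34da6a3ce929d0e0e4736-00f067aa0ba902b7-01")
print(ctx["sampled"])  # True
```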
Propagation in Practice¶
Every outgoing HTTP call, gRPC call, or message queue publish must forward the trace context. Libraries handle this automatically when instrumented:
```python
# OpenTelemetry auto-instrumentation handles propagation
from opentelemetry.instrumentation.requests import RequestsInstrumentor

RequestsInstrumentor().instrument()
# Now every requests.get() call propagates trace context
```
Sampling¶
In production, tracing every request creates too much data. Sampling strategies control what gets recorded:
| Strategy | Description |
|---|---|
| Head-based | Decision made at trace start (e.g., sample 10% of requests) |
| Tail-based | Decision made after trace completes (keep errors, slow requests) |
| Rate-limiting | Fixed N traces per second |
| Priority | Always trace certain paths (e.g., payment flows) |
Head-based is simpler but misses interesting traces. Tail-based catches anomalies but requires a collector that buffers complete traces before deciding.
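A head-based ratio sampler can be sketched by deriving the decision deterministically from the trace ID, so every service in the trace reaches the same verdict without coordination (illustrative; OpenTelemetry's `TraceIdRatioBased` sampler is built on a similar idea):

```python
# Illustrative head-based sampler: keep ~10% of traces, decided once
# at trace start. Hashing the trace ID (rather than rolling a die per
# service) keeps the decision consistent across the whole request path.
SAMPLE_RATIO = 0.10

def should_sample(trace_id_hex: str, ratio: float = SAMPLE_RATIO) -> bool:
    # Map the low 64 bits of the 128-bit trace ID onto [0, 2^64)
    # and keep the trace if it lands under the ratio's cutoff.
    low64 = int(trace_id_hex[-16:], 16)
    return low64 < ratio * 2**64

print(should_sample("0" * 31 + "1"))  # True: tiny low bits fall under the cutoff
```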
Gotcha: At 1% head-based sampling, you need 100 occurrences of a bug to have a reasonable chance of capturing one trace showing it. Rare errors in high-traffic services may never be sampled. Tail-based sampling solves this by keeping all error traces, but it requires the OTel Collector to buffer complete traces in memory — sizing the Collector's memory limit is critical.
Under the hood: A trace ID is a 128-bit random number (16 bytes, 32 hex chars). The probability of a collision is astronomically low (birthday paradox: you would need ~2^64 traces for a 50% collision chance). Span IDs are 64-bit (8 bytes, 16 hex chars). These IDs are the glue that connects spans across services — if any service drops the header, the trace breaks into disconnected fragments.
Tracing Backends¶
Jaeger¶
Name origin: Jaeger is the German word for "hunter" — fitting for a tool that helps you hunt down latency and errors across distributed services. It was built at Uber in 2015 to handle tracing at massive scale (thousands of microservices). Uber donated it to the CNCF in 2017, and it graduated in 2019.
Open-source, CNCF graduated. Good for Kubernetes-native deployments:
```yaml
# Jaeger all-in-one for dev (single pod, in-memory storage; not for production)
apiVersion: apps/v1
kind: Deployment
metadata:
  name: jaeger
spec:
  selector:
    matchLabels:
      app: jaeger
  template:
    metadata:
      labels:
        app: jaeger
    spec:
      containers:
        - name: jaeger
          image: jaegertracing/all-in-one:latest
          ports:
            - containerPort: 16686 # UI
            - containerPort: 4317  # OTLP gRPC
            - containerPort: 4318  # OTLP HTTP
```
Zipkin¶
Simpler and older than Jaeger; good for getting started quickly.
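For local evaluation, the official image runs with a single command (UI and API on port 9411):

```shell
# Run Zipkin locally for evaluation; browse the UI at http://localhost:9411
docker run -d -p 9411:9411 openzipkin/zipkin
```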
Grafana Tempo¶
Designed for massive scale, with an object-storage backend (S3, GCS) and Grafana integration for visualization. No full-text indexing: traces are looked up by ID directly, or via trace IDs surfaced from logs and metrics.
OpenTelemetry¶
Timeline: OpenTelemetry (OTel) was formed in 2019 by merging two competing projects: OpenTracing (CNCF, tracing-focused) and OpenCensus (Google, metrics + tracing). The merge ended a confusing "which standard do I use?" era. OTel is now the second-most-active CNCF project after Kubernetes. It is the de facto standard for all new instrumentation — if you are choosing a tracing library in 2024+, OTel is the answer.
OpenTelemetry (OTel) is the vendor-neutral standard for instrumentation. It provides:

- APIs: for creating spans and metrics
- SDKs: language-specific implementations
- Collector: receives, processes, and exports telemetry data
The Collector can fan-out to multiple backends simultaneously.
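A minimal Collector config sketch showing that fan-out; the backend hostnames are placeholders for your environment:

```yaml
# Sketch: receive OTLP, batch, and export the same traces to two backends.
receivers:
  otlp:
    protocols:
      grpc:
        endpoint: 0.0.0.0:4317
processors:
  batch: {}
exporters:
  otlp/jaeger:
    endpoint: jaeger:4317   # placeholder endpoint
    tls:
      insecure: true
  otlp/tempo:
    endpoint: tempo:4317    # placeholder endpoint
    tls:
      insecure: true
service:
  pipelines:
    traces:
      receivers: [otlp]
      processors: [batch]
      exporters: [otlp/jaeger, otlp/tempo]
```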
Practical Patterns¶
Correlating Traces with Logs¶
Inject trace IDs into log lines so you can jump from a log entry to the full trace.
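A sketch of the pattern using only the standard library; a real setup would pull the ID from the active span (e.g., via the OpenTelemetry API) rather than the illustrative contextvar used here:

```python
import contextvars
import logging

# Illustrative stand-in for "the current span's trace ID"
current_trace_id = contextvars.ContextVar("trace_id", default="none")

class TraceIdFilter(logging.Filter):
    """Stamp the current trace ID onto every log record."""
    def filter(self, record: logging.LogRecord) -> bool:
        record.trace_id = current_trace_id.get()
        return True

handler = logging.StreamHandler()
handler.setFormatter(logging.Formatter("trace_id=%(trace_id)s %(message)s"))
logger = logging.getLogger("app")
logger.addHandler(handler)
logger.addFilter(TraceIdFilter())
logger.setLevel(logging.INFO)

current_trace_id.set("4bf92f3577b34da6a3ce929d0e0e4736")
logger.info("charging card")  # log line now carries trace_id=4bf9...
```

With the trace ID in every line, a log search for `trace_id=<id>` returns exactly the entries belonging to one request, and the same ID opens the full trace in the tracing UI.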
Finding Root Cause¶
- Start from an alert or error log
- Extract the trace ID
- View the full trace in the UI
- Identify the span with the error or highest latency
- Check span attributes and events for details
Service Dependency Maps¶
One-liner: A service dependency map built from traces is worth more than any hand-drawn architecture diagram — it shows what actually calls what in production, not what someone thinks calls what. Review it after every major deployment.
Tracing backends automatically build service dependency graphs from span data, showing which services call which and where bottlenecks exist.
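The derivation can be sketched over hypothetical flattened span records: emit one edge for every parent-child hop that crosses a service boundary, and count repeats to weight the edges:

```python
from collections import Counter
from typing import Iterable, Optional, Tuple

# Hypothetical flattened span records: (span_id, parent_span_id, service)
spans = [
    ("a1", None, "frontend"),
    ("b2", "a1", "cart-svc"),
    ("c3", "a1", "payment-svc"),
    ("d4", "c3", "stripe-api"),
]

def dependency_edges(
    spans: Iterable[Tuple[str, Optional[str], str]],
) -> Counter:
    """Count caller->callee edges between distinct services."""
    service_of = {span_id: service for span_id, _, service in spans}
    edges: Counter = Counter()
    for _, parent_id, service in spans:
        # Only hops that cross a service boundary become graph edges
        if parent_id is not None and service_of[parent_id] != service:
            edges[(service_of[parent_id], service)] += 1
    return edges

print(dependency_edges(spans))
```

Run over millions of sampled traces, the edge counts become call volumes, which is how backends render the weighted service map.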
Common Pitfalls¶
- Missing propagation: One uninstrumented service breaks the trace chain
- Over-sampling: Tracing 100% of traffic in production overwhelms storage
- Clock skew: Spans appear out of order when host clocks diverge — use NTP
- Sensitive data in attributes: Never put PII or secrets in span attributes
- Ignoring trace context in async: Message queues and background jobs must carry trace context explicitly
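For the last pitfall, the fix is to carry the context inside the message envelope itself; a sketch with an in-process queue and an illustrative traceparent string (real code would use its tracing library's inject/extract helpers on the producer and consumer):

```python
import queue

q: "queue.Queue[dict]" = queue.Queue()

def publish(body: str, traceparent: str) -> None:
    # Producer: attach the current trace context to the message headers
    q.put({"headers": {"traceparent": traceparent}, "body": body})

def consume() -> tuple:
    # Consumer: restore the context before doing any traced work
    msg = q.get()
    traceparent = msg["headers"].get("traceparent")
    # ...a real consumer would start a child span from this context here...
    return msg["body"], traceparent

publish("resize-image", "00-" + "ab" * 16 + "-" + "cd" * 8 + "-01")
body, tp = consume()
print(tp.split("-")[1])  # the 32-hex-char trace ID survives the queue hop
```

Without the explicit header on the message, the consumer starts a brand-new trace and the async leg of the request becomes invisible.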
Wiki Navigation¶
Related Content¶
- OpenTelemetry (Topic Pack, L2) — Tracing
- Tracing Flashcards (CLI) (flashcard_deck, L1) — Tracing
- perf Profiling (Topic Pack, L2) — Tracing
- strace (Topic Pack, L1) — Tracing