---
tags:
  - observability
  - l1
  - topic-pack
  - tracing
---

Portal | Level: L1: Foundations | Topics: Tracing | Domain: Observability
Distributed Tracing Primer¶
Why This Matters¶
In a monolith, a stack trace tells you what happened. In microservices, a single user request fans out across dozens of services, queues, and databases. Without distributed tracing, debugging latency or failures means grepping logs across 15 services and correlating by timestamp. Tracing gives you a single view of the entire request lifecycle, showing where time is spent and where failures originate.
Fun fact: Google's Dapper paper (2010) pioneered distributed tracing at scale, inspiring open-source implementations like Zipkin (Twitter, 2012), Jaeger (Uber, 2015), and the OpenTelemetry standard. The W3C Trace Context header (`traceparent`) was standardized in 2020, making trace propagation interoperable across vendors.

Under the hood: Trace context propagation works by injecting a `traceparent` header (e.g., `00-<trace-id>-<span-id>-01`) into every outgoing HTTP request or message. Each service extracts the trace/span IDs, creates a child span, and injects the updated header downstream. The overhead is typically less than 1ms per span.
Core Concepts¶
Traces and Spans¶
A trace represents the full journey of a request through the system. It is a directed acyclic graph of spans.
A span represents a single unit of work:
- Has a start time and duration
- Has a name (e.g., `HTTP GET /api/users`)
- Belongs to a service
- May have a parent span (forming the trace tree)
- Carries key-value attributes (tags)
- Can log events (timestamped annotations)
```
Trace: user-request-abc123
├── [frontend] HTTP GET /checkout (200ms)
│   ├── [cart-svc] getCart (45ms)
│   ├── [payment-svc] charge (120ms)
│   │   └── [stripe-api] POST (95ms)
│   └── [inventory-svc] reserve (30ms)
```
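The tree above can be modeled directly. A minimal sketch (an illustrative model, not a real SDK type; OpenTelemetry spans additionally carry IDs, attributes, events, and status):

```python
from dataclasses import dataclass, field
from typing import List, Optional

@dataclass
class Span:
    """Toy span: a named, timed unit of work with parent/child links."""
    name: str
    service: str
    duration_ms: float
    parent: Optional["Span"] = None
    children: List["Span"] = field(default_factory=list)

    def child(self, name: str, service: str, duration_ms: float) -> "Span":
        c = Span(name, service, duration_ms, parent=self)
        self.children.append(c)
        return c

    def self_time_ms(self) -> float:
        # Time spent in this span itself, excluding its direct children
        return self.duration_ms - sum(c.duration_ms for c in self.children)

# Rebuild the checkout trace from the diagram
root = Span("HTTP GET /checkout", "frontend", 200)
root.child("getCart", "cart-svc", 45)
charge = root.child("charge", "payment-svc", 120)
charge.child("POST", "stripe-api", 95)
root.child("reserve", "inventory-svc", 30)

print(root.self_time_ms())  # 200 - (45 + 120 + 30) = 5
```

Walking parent links upward from any span reconstructs the request path; summing self-time per service shows where latency actually accrues.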
Span Attributes¶
Attributes provide context for analysis:
| Attribute | Example |
|---|---|
| `http.method` | `GET` |
| `http.status_code` | `500` |
| `db.system` | `postgresql` |
| `db.statement` | `SELECT * FROM users` |
| `error` | `true` |
| `user.id` | `usr_12345` |
Span Status¶
- Unset: default, no explicit status
- OK: operation completed successfully
- Error: operation failed (should include error message)
Remember: The three pillars of observability are Metrics (what is happening), Logs (why it happened), and Traces (where it happened in the request path). Traces are the only pillar that shows causality across service boundaries. Mnemonic: "MLT" (Measure, Log, Trace).
Trace Context Propagation¶
For tracing to work across services, context must be propagated in request headers.
W3C Trace Context (Standard)¶
Example:

```
traceparent: 00-<trace-id>-<span-id>-<flags>
```

- `trace-id`: 32-hex-char unique trace identifier
- `span-id`: 16-hex-char current span identifier
- `flags`: `01` = sampled, `00` = not sampled
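As a sketch of the format, the fields can be pulled apart with a few lines of code (the parser here is illustrative, not a library API; the example value is the canonical one from the W3C spec):

```python
def parse_traceparent(value: str) -> dict:
    """Illustrative parser for a W3C traceparent header value.

    Format: version "-" trace-id "-" parent-id "-" trace-flags
    """
    version, trace_id, span_id, flags = value.split("-")
    # Field widths per the spec: 32 and 16 lowercase hex characters
    assert len(trace_id) == 32 and len(span_id) == 16
    return {
        "version": version,
        "trace_id": trace_id,
        "span_id": span_id,
        "sampled": int(flags, 16) & 0x01 == 1,  # low bit = sampled flag
    }

ctx = parse_traceparent("00-4bf92f3577b34da6a3ce929d0e0e4736-00f067aa0ba902b7-01")
print(ctx["sampled"])  # True
```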
Propagation in Practice¶
Every outgoing HTTP call, gRPC call, or message queue publish must forward the trace context. Libraries handle this automatically when instrumented:
```python
# OpenTelemetry auto-instrumentation handles propagation
from opentelemetry.instrumentation.requests import RequestsInstrumentor

RequestsInstrumentor().instrument()
# Now every requests.get() call propagates trace context
```
Sampling¶
In production, tracing every request creates too much data. Sampling strategies control what gets recorded:
| Strategy | Description |
|---|---|
| Head-based | Decision made at trace start (e.g., sample 10% of requests) |
| Tail-based | Decision made after trace completes (keep errors, slow requests) |
| Rate-limiting | Fixed N traces per second |
| Priority | Always trace certain paths (e.g., payment flows) |
Head-based is simpler but misses interesting traces. Tail-based catches anomalies but requires a collector that buffers complete traces before deciding.
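A head-based ratio sampler can be sketched by deriving the decision deterministically from the trace ID, so every service in the trace reaches the same verdict without coordination (illustrative; OpenTelemetry's `TraceIdRatioBased` sampler is built on a similar idea):

```python
# Illustrative head-based sampler: keep ~10% of traces, decided once
# at trace start. Hashing the trace ID (rather than rolling a die per
# service) keeps the decision consistent across the whole request path.
SAMPLE_RATIO = 0.10

def should_sample(trace_id_hex: str, ratio: float = SAMPLE_RATIO) -> bool:
    # Map the low 64 bits of the 128-bit trace ID onto [0, 2^64)
    # and keep the trace if it lands under the ratio's cutoff.
    low64 = int(trace_id_hex[-16:], 16)
    return low64 < ratio * 2**64

print(should_sample("0" * 31 + "1"))  # True: tiny low bits fall under the cutoff
```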
Gotcha: At 1% head-based sampling, you need 100 occurrences of a bug to have a reasonable chance of capturing one trace showing it. Rare errors in high-traffic services may never be sampled. Tail-based sampling solves this by keeping all error traces, but it requires the OTel Collector to buffer complete traces in memory — sizing the Collector's memory limit is critical.
Under the hood: A trace ID is a 128-bit random number (16 bytes, 32 hex chars). The probability of a collision is astronomically low (birthday paradox: you would need ~2^64 traces for a 50% collision chance). Span IDs are 64-bit (8 bytes, 16 hex chars). These IDs are the glue that connects spans across services — if any service drops the header, the trace breaks into disconnected fragments.
Tracing Backends¶
Jaeger¶
Name origin: Jaeger is the German word for "hunter" — fitting for a tool that helps you hunt down latency and errors across distributed services. It was built at Uber in 2015 to handle tracing at massive scale (thousands of microservices). Uber donated it to the CNCF in 2017, and it graduated in 2019.
Open-source, CNCF graduated. Good for Kubernetes-native deployments:
```yaml
# Jaeger all-in-one for dev (single pod, in-memory storage; not for production)
apiVersion: apps/v1
kind: Deployment
metadata:
  name: jaeger
spec:
  selector:
    matchLabels:
      app: jaeger
  template:
    metadata:
      labels:
        app: jaeger
    spec:
      containers:
        - name: jaeger
          image: jaegertracing/all-in-one:latest
          ports:
            - containerPort: 16686 # UI
            - containerPort: 4317  # OTLP gRPC
            - containerPort: 4318  # OTLP HTTP
```
Zipkin¶
Simpler and older than Jaeger; good for getting started quickly.
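For local evaluation, the official image runs with a single command (UI and API on port 9411):

```shell
# Run Zipkin locally for evaluation; browse the UI at http://localhost:9411
docker run -d -p 9411:9411 openzipkin/zipkin
```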
Grafana Tempo¶
Designed for massive scale, with an object-storage backend (S3, GCS) and Grafana integration for visualization. No full-text indexing: traces are looked up by ID directly, or via trace IDs surfaced from logs and metrics.
OpenTelemetry¶
Timeline: OpenTelemetry (OTel) was formed in 2019 by merging two competing projects: OpenTracing (CNCF, tracing-focused) and OpenCensus (Google, metrics + tracing). The merge ended a confusing "which standard do I use?" era. OTel is now the second-most-active CNCF project after Kubernetes. It is the de facto standard for all new instrumentation — if you are choosing a tracing library in 2024+, OTel is the answer.
OpenTelemetry (OTel) is the vendor-neutral standard for instrumentation. It provides:

- APIs: for creating spans and metrics
- SDKs: language-specific implementations
- Collector: receives, processes, and exports telemetry data
The Collector can fan-out to multiple backends simultaneously.
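A minimal Collector config sketch showing that fan-out; the backend hostnames are placeholders for your environment:

```yaml
# Sketch: receive OTLP, batch, and export the same traces to two backends.
receivers:
  otlp:
    protocols:
      grpc:
        endpoint: 0.0.0.0:4317
processors:
  batch: {}
exporters:
  otlp/jaeger:
    endpoint: jaeger:4317   # placeholder endpoint
    tls:
      insecure: true
  otlp/tempo:
    endpoint: tempo:4317    # placeholder endpoint
    tls:
      insecure: true
service:
  pipelines:
    traces:
      receivers: [otlp]
      processors: [batch]
      exporters: [otlp/jaeger, otlp/tempo]
```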
Practical Patterns¶
Correlating Traces with Logs¶
Inject trace IDs into log lines so you can jump from a log entry to the full trace.
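A sketch of the pattern using only the standard library; a real setup would pull the ID from the active span (e.g., via the OpenTelemetry API) rather than the illustrative contextvar used here:

```python
import contextvars
import logging

# Illustrative stand-in for "the current span's trace ID"
current_trace_id = contextvars.ContextVar("trace_id", default="none")

class TraceIdFilter(logging.Filter):
    """Stamp the current trace ID onto every log record."""
    def filter(self, record: logging.LogRecord) -> bool:
        record.trace_id = current_trace_id.get()
        return True

handler = logging.StreamHandler()
handler.setFormatter(logging.Formatter("trace_id=%(trace_id)s %(message)s"))
logger = logging.getLogger("app")
logger.addHandler(handler)
logger.addFilter(TraceIdFilter())
logger.setLevel(logging.INFO)

current_trace_id.set("4bf92f3577b34da6a3ce929d0e0e4736")
logger.info("charging card")  # log line now carries trace_id=4bf9...
```

With the trace ID in every line, a log search for `trace_id=<id>` returns exactly the entries belonging to one request, and the same ID opens the full trace in the tracing UI.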
Finding Root Cause¶
- Start from an alert or error log
- Extract the trace ID
- View the full trace in the UI
- Identify the span with the error or highest latency
- Check span attributes and events for details
Service Dependency Maps¶
One-liner: A service dependency map built from traces is worth more than any hand-drawn architecture diagram — it shows what actually calls what in production, not what someone thinks calls what. Review it after every major deployment.
Tracing backends automatically build service dependency graphs from span data, showing which services call which and where bottlenecks exist.
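The derivation can be sketched over hypothetical flattened span records: emit one edge for every parent-child hop that crosses a service boundary, and count repeats to weight the edges:

```python
from collections import Counter
from typing import Iterable, Optional, Tuple

# Hypothetical flattened span records: (span_id, parent_span_id, service)
spans = [
    ("a1", None, "frontend"),
    ("b2", "a1", "cart-svc"),
    ("c3", "a1", "payment-svc"),
    ("d4", "c3", "stripe-api"),
]

def dependency_edges(
    spans: Iterable[Tuple[str, Optional[str], str]],
) -> Counter:
    """Count caller->callee edges between distinct services."""
    service_of = {span_id: service for span_id, _, service in spans}
    edges: Counter = Counter()
    for _, parent_id, service in spans:
        # Only hops that cross a service boundary become graph edges
        if parent_id is not None and service_of[parent_id] != service:
            edges[(service_of[parent_id], service)] += 1
    return edges

print(dependency_edges(spans))
```

Run over millions of sampled traces, the edge counts become call volumes, which is how backends render the weighted service map.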
Common Pitfalls¶
- Missing propagation: One uninstrumented service breaks the trace chain
- Over-sampling: Tracing 100% of traffic in production overwhelms storage
- Clock skew: Spans appear out of order when host clocks diverge — use NTP
- Sensitive data in attributes: Never put PII or secrets in span attributes
- Ignoring trace context in async: Message queues and background jobs must carry trace context explicitly
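For the last pitfall, the fix is to carry the context inside the message envelope itself; a sketch with an in-process queue and an illustrative traceparent string (real code would use its tracing library's inject/extract helpers on the producer and consumer):

```python
import queue

q: "queue.Queue[dict]" = queue.Queue()

def publish(body: str, traceparent: str) -> None:
    # Producer: attach the current trace context to the message headers
    q.put({"headers": {"traceparent": traceparent}, "body": body})

def consume() -> tuple:
    # Consumer: restore the context before doing any traced work
    msg = q.get()
    traceparent = msg["headers"].get("traceparent")
    # ...a real consumer would start a child span from this context here...
    return msg["body"], traceparent

publish("resize-image", "00-" + "ab" * 16 + "-" + "cd" * 8 + "-01")
body, tp = consume()
print(tp.split("-")[1])  # the 32-hex-char trace ID survives the queue hop
```

Without the explicit header on the message, the consumer starts a brand-new trace and the async leg of the request becomes invisible.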
Wiki Navigation¶
Related Content¶
- OpenTelemetry (Topic Pack, L2) — Tracing
- Tracing Flashcards (CLI) (flashcard_deck, L1) — Tracing
- perf Profiling (Topic Pack, L2) — Tracing
- strace (Topic Pack, L1) — Tracing