
OpenTelemetry: Following a Request Across Services


Topics: distributed tracing, OpenTelemetry, context propagation, collector pipelines, sampling, Kubernetes networking, service mesh, application instrumentation
Level: L2 (Operations)
Time: 70–90 minutes
Strategy: End-to-end trace (meta: tracing the trace)


The Mission

It's 3:15 PM. A Slack message from the payments team: "Checkout is slow. Users are abandoning carts. We don't know which service is the bottleneck."

Your checkout flow involves five microservices:

User Browser
  └── api-gateway (Go)
        ├── cart-service (Python)
        ├── inventory-service (Java)
        ├── payment-service (Go)
        │     └── stripe-api (external)
        └── notification-service (Node.js)
              └── Kafka → email-worker (Python)

Five services. Four languages. One message queue. An external API call. And somewhere in that chain, something is eating 4 seconds that should take 400 milliseconds.

You have OpenTelemetry instrumented across these services. Your job: follow a single request from the browser click to the confirmation email, find the bottleneck, and fix it.

But first, you need to understand how tracing works — because if you don't understand the plumbing, you can't trust the data.


The Three Pillars: What Each One Can (and Can't) Tell You

Before diving into the trace, a 60-second orientation.

| Pillar | What it answers | What it can't tell you |
|---|---|---|
| Metrics | "Error rate is 2%. P99 latency is 4.2s." | Which requests are slow, or why |
| Logs | "payment-service: timeout connecting to Stripe" | Was this the cause or a symptom? What called it? |
| Traces | "This request spent 3.8s in inventory-service's DB query" | Long-term trends, aggregate patterns |

Metrics detect. Logs explain. Traces locate. You need all three, but traces are the only pillar that shows causality across service boundaries.

Mental Model: Think of metrics as a thermometer (tells you there's a fever), logs as a blood test (tells you what's wrong), and traces as an X-ray (shows you exactly where the problem is). You wouldn't diagnose a broken bone with a blood test.

Trivia: The "three pillars" framing was popularized by Peter Bourgon in his 2017 blog post "Metrics, Tracing, and Logging." Charity Majors (Honeycomb co-founder) has argued the framing is misleading because it implies the three are equal and independent — in practice, traces provide the richest debugging context, and the real power comes from correlating all three via shared trace IDs.


Traces and Spans: The Data Model

A trace is the full journey of one request through your system. A span is one unit of work within that trace — an HTTP handler, a database query, a cache lookup.

Here is a real trace from your slow checkout:

Trace ID: 4bf92f3577b34da6a3ce929d0e0e4736

├── [api-gateway]       POST /api/checkout          4210ms
│   ├── [cart-service]  getCart                        85ms
│   ├── [inventory-svc] checkAvailability            3802ms  ← HERE
│   │   ├──             cache.lookup                    4ms
│   │   └──             db.query                     3791ms  ← AND HERE
│   ├── [payment-svc]   chargeCard                    180ms
│   │   └──             stripe.api.call               142ms
│   └── [notif-svc]     sendConfirmation               38ms
│       └──             kafka.produce                   12ms

Every span carries structured data:

| Field | Example | Purpose |
|---|---|---|
| Trace ID | 4bf92f3577b34da6a3ce929d0e0e4736 | Links all spans in one request (32 hex chars = 128 bits) |
| Span ID | 00f067aa0ba902b7 | Unique ID for this span (16 hex chars = 64 bits) |
| Parent Span ID | a1b2c3d4e5f60718 | Which span created this one |
| Operation | POST /api/checkout | What this span represents |
| Duration | 4210ms | Wall-clock time |
| Attributes | http.status_code=200 | Key-value metadata |
| Status | OK, ERROR, or Unset | Did it succeed? |

Under the Hood: A trace ID is a 128-bit random number. The birthday paradox says you'd need roughly 2^64 (18 quintillion) traces before a 50% chance of collision. At 1 million traces per second, that's 584,942 years. You will not run out of trace IDs.
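The arithmetic checks out. A quick stdlib sketch (using 365-day years, matching the figure above):

```python
# Trace IDs are 128-bit; the birthday bound puts a 50% collision
# chance at roughly 2^64 traces generated.
TRACES_FOR_COLLISION = 2 ** 64          # ~1.8 * 10^19
TRACES_PER_SECOND = 1_000_000
SECONDS_PER_YEAR = 365 * 24 * 3600      # 31,536,000

years = TRACES_FOR_COLLISION / TRACES_PER_SECOND / SECONDS_PER_YEAR
print(f"{years:,.0f} years")            # → 584,942 years
```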

Found it already. The inventory-svc span took 3,802ms, and within it, db.query took 3,791ms. That is a database problem masquerading as a checkout problem. But how did the trace data get here? That is the interesting part.


Context Propagation: How Trace IDs Cross Service Boundaries

Here is the fundamental problem of distributed tracing: Service A starts a trace. Service A calls Service B. How does Service B know it is part of the same trace?

The answer: HTTP headers.

The W3C traceparent Header

When api-gateway calls cart-service, it includes this header:

traceparent: 00-4bf92f3577b34da6a3ce929d0e0e4736-00f067aa0ba902b7-01

Let's break that apart:

00-4bf92f3577b34da6a3ce929d0e0e4736-00f067aa0ba902b7-01
│   │                                 │                 │
│   │                                 │                 └── trace-flags (01 = sampled)
│   │                                 │
│   │                                 └── parent-id (16 hex = this span's ID)
│   │
│   └── trace-id (32 hex = shared across all spans in this trace)
└── version (always 00 for now)

When cart-service receives this header, it:

1. Extracts the trace ID (4bf92f...)
2. Extracts the parent span ID (00f067...)
3. Creates a new span with a new span ID, setting parent to 00f067...
4. Attaches the same trace ID
5. When calling downstream services, injects an updated traceparent with its own span ID as the new parent

This is the chain that connects every span in the trace. Break it at any point and the trace splits into disconnected fragments.
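To make the chain concrete, here is a minimal sketch (plain Python, no OTel SDK) that parses a traceparent header and builds the updated header a service would send downstream. The `new_span_id` argument is a hypothetical stand-in for a freshly generated 64-bit span ID:

```python
def parse_traceparent(header: str) -> dict:
    """Split a W3C traceparent header into its four fields."""
    version, trace_id, parent_id, flags = header.split("-")
    return {
        "version": version,
        "trace_id": trace_id,                    # 32 hex chars, shared by every span
        "parent_id": parent_id,                  # 16 hex chars, the caller's span ID
        "sampled": bool(int(flags, 16) & 0x01),  # flags is a bit field
    }

def updated_traceparent(header: str, new_span_id: str) -> str:
    """Keep the trace ID, swap our own span ID in as the new parent."""
    f = parse_traceparent(header)
    flags = "01" if f["sampled"] else "00"
    return f"{f['version']}-{f['trace_id']}-{new_span_id}-{flags}"

incoming = "00-4bf92f3577b34da6a3ce929d0e0e4736-00f067aa0ba902b7-01"
print(updated_traceparent(incoming, "a1b2c3d4e5f60718"))
# → 00-4bf92f3577b34da6a3ce929d0e0e4736-a1b2c3d4e5f60718-01
```

In real services the SDK's propagator does this for you; the sketch only shows what travels on the wire.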

Name Origin: The W3C Trace Context specification was ratified as a W3C Recommendation in February 2020. Before it existed, every tracing system had its own propagation format: Zipkin used X-B3-TraceId, Jaeger used uber-trace-id, AWS used X-Amzn-Trace-Id. Cross-vendor tracing was impossible. W3C Trace Context ended the format wars — today, traceparent is the lingua franca of distributed tracing.

There is also a tracestate header for vendor-specific data:

tracestate: rojo=00f067aa0ba902b7,congo=t61rcWkgMzE

Most of the time you can ignore tracestate. The traceparent header is what matters.


Propagation Across Protocols: HTTP, gRPC, and Message Queues

HTTP headers are the easy case. What about the rest of your stack?

HTTP

Auto-instrumentation handles this. The OTel SDK patches your HTTP client library (requests, net/http, axios) to inject traceparent on every outgoing call and extract it on every incoming call. You get this for free:

# This is all you need — auto-instrumentation does the rest
from opentelemetry.instrumentation.requests import RequestsInstrumentor
RequestsInstrumentor().instrument()

# Now every requests.get() / requests.post() propagates trace context

gRPC

gRPC uses metadata (essentially headers). OTel's gRPC instrumentation injects trace context into gRPC metadata automatically. This almost always works — unless you have a custom interceptor that strips unknown metadata keys.

// Go gRPC client with OTel interceptor
conn, err := grpc.Dial(
    "inventory-service:50051",
    grpc.WithUnaryInterceptor(otelgrpc.UnaryClientInterceptor()),
    grpc.WithStreamInterceptor(otelgrpc.StreamClientInterceptor()),
)

Message Queues — Where Traces Go to Die

This is where most teams' traces break. Your notification-service publishes to Kafka. The email-worker consumes from Kafka. Auto-instrumentation usually does NOT handle message queues automatically.

You must manually inject and extract:

# Producer side — inject trace context into Kafka headers
from opentelemetry import trace
from opentelemetry.propagate import inject

tracer = trace.get_tracer("notification-service")

with tracer.start_as_current_span("kafka.produce"):
    carrier = {}
    inject(carrier)  # writes traceparent (and tracestate) into the dict
    headers = [(k, v.encode()) for k, v in carrier.items()]
    producer.send("email-notifications", value=message, headers=headers)

# Consumer side — extract trace context from Kafka headers
from opentelemetry.propagate import extract

def process_message(msg):
    header_dict = {k: v.decode() for k, v in msg.headers()}
    ctx = extract(carrier=header_dict)

    with tracer.start_as_current_span("email.send", context=ctx) as span:
        send_email(msg.value())

Gotcha: On the consumer side, you must create the new span as a child of the extracted context — not as a root span. If you write tracer.start_as_current_span("email.send") without passing context=ctx, you get a disconnected trace. Two halves of the same request appear as unrelated traces in your backend. This is the #1 cause of "missing spans" in async pipelines.

Service Mesh: Free Spans, But With a Catch

If you run Istio or Linkerd, the Envoy sidecars automatically generate spans for every inbound and outbound request. But there is a critical requirement: your application must forward incoming trace headers on outbound calls. The sidecar can see inbound and outbound traffic, but it cannot correlate them unless the same traceparent header appears in both.

Without header forwarding:
  [Sidecar] → inbound span (trace A)
  [App]     → does work, makes outbound call WITHOUT forwarding headers
  [Sidecar] → outbound span (trace B)  ← NEW trace, disconnected

With header forwarding:
  [Sidecar] → inbound span (trace A)
  [App]     → forwards traceparent on outbound call
  [Sidecar] → outbound span (trace A)  ← same trace, connected

The mesh gives you observability for free — but only if your code cooperates.
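The cooperation required is small: copy the propagation headers from the inbound request onto every outbound call. A sketch of the pattern, with plain dicts standing in for real request objects (in practice OTel auto-instrumentation or a middleware does this; some Istio setups also expect the B3 headers and x-request-id to be forwarded):

```python
# W3C propagation headers a mesh-friendly app must forward verbatim.
TRACE_HEADERS = ("traceparent", "tracestate")

def forward_trace_headers(inbound_headers: dict, outbound_headers: dict) -> dict:
    """Copy trace context from the inbound request onto an outbound one."""
    for name in TRACE_HEADERS:
        if name in inbound_headers:
            outbound_headers[name] = inbound_headers[name]
    return outbound_headers

inbound = {"traceparent": "00-4bf92f3577b34da6a3ce929d0e0e4736-00f067aa0ba902b7-01"}
outbound = forward_trace_headers(inbound, {"content-type": "application/json"})
# The sidecar now sees the same trace ID on both sides and can join the spans.
```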


Flashcard Check #1

Cover the answers. Test yourself.

| Question | Answer |
|---|---|
| What are the four fields in a traceparent header? | Version, trace-id (32 hex), parent-id (16 hex), trace-flags (2 hex) |
| What happens if one service in a 12-service chain is not instrumented? | The trace splits into two disconnected fragments at that service |
| Why does auto-instrumentation usually fail for message queues? | The producer and consumer are decoupled — there is no HTTP/gRPC call for the SDK to patch. You must manually inject/extract context from message headers |
| What must an application do for service mesh tracing to work? | Forward incoming trace headers (traceparent) on all outbound calls |

The OTel Collector: Your Telemetry Pipeline

Your instrumented services produce spans. But they do not send spans directly to Jaeger or Tempo. They send them to the OpenTelemetry Collector — the central nervous system of your observability pipeline.

The collector has three stages:

┌─────────────────────────────────────────────────────────┐
│                    OTel Collector                         │
│                                                          │
│   Receivers ──────▶ Processors ──────▶ Exporters         │
│                                                          │
│   "Accept data       "Transform,         "Send to        │
│    in any format"     filter, batch"      backends"       │
└─────────────────────────────────────────────────────────┘

Think of it as a Unix pipeline for telemetry: data flows in, gets transformed, flows out. The power is in the composability — you can receive Jaeger-format traces, filter out health check spans, batch them for efficiency, and export to both Grafana Tempo and Datadog simultaneously.

A Production Collector Config

Here is the collector configuration for the checkout flow scenario. Read every line — this is what stands between your services and your dashboards:

receivers:
  otlp:
    protocols:
      grpc:
        endpoint: 0.0.0.0:4317    # Where services send traces (gRPC)
      http:
        endpoint: 0.0.0.0:4318    # Where services send traces (HTTP)

processors:
  # MUST be first — prevents the collector from OOM-ing
  memory_limiter:
    check_interval: 1s
    limit_mib: 512
    spike_limit_mib: 128

  # Enrich spans with Kubernetes metadata
  k8sattributes:
    auth_type: "serviceAccount"
    extract:
      metadata:
        - k8s.pod.name
        - k8s.namespace.name
        - k8s.deployment.name

  # Drop health check noise — no point tracing /healthz 50 times per minute
  filter:
    error_mode: ignore
    traces:
      span:
        - 'attributes["http.target"] == "/healthz"'
        - 'attributes["http.target"] == "/readyz"'

  # MUST be last — batch before sending to reduce network calls
  batch:
    send_batch_size: 1024
    timeout: 5s

exporters:
  otlp/tempo:
    endpoint: tempo.monitoring:4317
    tls:
      insecure: false
  debug:
    verbosity: basic              # Never use 'detailed' in prod

extensions:
  health_check:
    endpoint: 0.0.0.0:13133
  zpages:
    endpoint: 0.0.0.0:55679       # Live pipeline debugging

service:
  extensions: [health_check, zpages]
  pipelines:
    traces:
      receivers: [otlp]
      processors: [memory_limiter, k8sattributes, filter, batch]
      exporters: [otlp/tempo, debug]

Remember: Processor order matters. The canonical safe order is: memory_limiter -> k8sattributes/resource -> filter -> batch. Think: guard, enrich, filter, optimize. If you put batch before memory_limiter, a traffic spike fills the batch buffer and the collector OOMs before the limiter can act.

Gotcha: There are two collector distributions: core (otelcol) and contrib (otelcol-contrib). Core has only basic components. If your config references the k8sattributes processor or loki exporter and you deployed core, the collector exits immediately with "unknown component." Check your image:

kubectl get deploy otel-collector -n monitoring \
  -o jsonpath='{.spec.template.spec.containers[0].image}'
# Core:    otel/opentelemetry-collector:0.96.0
# Contrib: otel/opentelemetry-collector-contrib:0.96.0


OTLP: The Wire Protocol

Your services talk to the collector using OTLP (OpenTelemetry Protocol). It is a binary protocol (protobuf over gRPC on port 4317, or protobuf/JSON over HTTP on port 4318) designed specifically for telemetry data.

Why does this matter? Because OTLP is becoming the universal language of observability. Datadog, New Relic, Grafana Cloud, and AWS CloudWatch all accept OTLP natively. Instrument once with OTel, send anywhere.

# Verify OTLP endpoints are listening
ss -tlnp | grep -E '(4317|4318)'

# Send a test span to check the pipeline
curl -X POST http://localhost:4318/v1/traces \
  -H "Content-Type: application/json" \
  -d '{"resourceSpans":[]}'
# 200 = pipeline is alive. 4xx/5xx = dig deeper.

Trivia: OTLP replaced a fragmented landscape where every vendor had its own ingestion format. The OTel Collector can still receive Jaeger Thrift, Zipkin JSON, and Prometheus scrape formats via different receivers — acting as a universal adapter during migrations. But for new instrumentation, OTLP is always the right choice.


Auto-Instrumentation: Tracing Without Touching Code

One of OTel's best features: you can add tracing to existing services without modifying a single line of application code.

Each language uses its own mechanism:

| Language | Mechanism | What it patches |
|---|---|---|
| Java | Java agent (-javaagent:) | HTTP clients, JDBC, gRPC, Kafka, Spring |
| Python | Monkey-patching (opentelemetry-bootstrap) | requests, Flask, Django, psycopg2 |
| .NET | CLR profiler hooks | HttpClient, ADO.NET, gRPC |
| Node.js | Module loader hooks | http, express, pg, ioredis |
| Go | Manual wrapping (no agent) | net/http, database/sql, gRPC |

Go is the outlier. There is no agent that patches Go binaries at runtime — you must wrap your HTTP handlers and clients explicitly. Auto-instrumentation for Go means importing OTel wrapper packages, not zero-code injection.

# Python: zero-code instrumentation in 3 commands
# pip install opentelemetry-distro opentelemetry-exporter-otlp
# opentelemetry-bootstrap -a install
# opentelemetry-instrument python app.py

# That's it. Every Flask route, every requests.get(), every psycopg2 query
# now produces spans — with trace context propagation included.

Auto-instrumentation gives you the skeleton of your traces. For the meat — business logic like order IDs, payment amounts, user tiers — you add manual spans:

tracer = trace.get_tracer("checkout-service")

with tracer.start_as_current_span("process_checkout") as span:
    span.set_attribute("order.id", "ord-98234")
    span.set_attribute("order.total_cents", 4599)
    span.set_attribute("order.item_count", 3)
    span.set_attribute("customer.tier", "premium")

    try:
        result = charge_payment(order)
        span.set_attribute("payment.provider", "stripe")
        span.set_attribute("payment.status", "success")
    except PaymentError as e:
        span.set_status(StatusCode.ERROR, str(e))
        span.record_exception(e)
        raise

Gotcha: OTel semantic conventions define standardized attribute names like http.request.method and db.system. These conventions underwent a major rename in 2023 (e.g., http.method became http.request.method). If you upgrade your SDK and your dashboards break, the semantic convention migration is the first thing to check. Pin your SDK version across all services to avoid one team using old names and another using new.


Sampling: The Economics of Tracing

At 10,000 requests per second across 5 services, with an average of 8 spans per trace, you produce 80,000 spans per second. At roughly 1KB per span, that is about 6.9 TB per day. At typical vendor pricing ($0.30–$1.50 per GB ingested), that is $2,000–$10,000 per day.
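The back-of-the-envelope arithmetic, in decimal units:

```python
requests_per_sec = 10_000
spans_per_trace = 8
bytes_per_span = 1_000                       # ~1 KB per span, decimal

spans_per_sec = requests_per_sec * spans_per_trace           # 80,000
bytes_per_day = spans_per_sec * bytes_per_span * 86_400      # seconds per day
tb_per_day = bytes_per_day / 1e12
print(f"{tb_per_day:.1f} TB/day")            # → 6.9 TB/day

# At $0.30 to $1.50 per GB ingested:
gb_per_day = bytes_per_day / 1e9             # 6,912 GB
print(f"${gb_per_day * 0.30:,.0f} to ${gb_per_day * 1.50:,.0f} per day")
```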

You cannot afford to keep everything. Sampling is how you control costs while keeping the traces that matter.

Head Sampling: Cheap But Blind

Decide at the trace's birth whether to keep it. Simple. Fast. No buffering required.

from opentelemetry.sdk.trace.sampling import TraceIdRatioBased

sampler = TraceIdRatioBased(0.1)  # Keep 10%
provider = TracerProvider(sampler=sampler, resource=resource)

The problem: a 10% sampler drops 90% of traces at birth, sight unseen. If a rare payment error happens once per 500 requests, each occurrence has only a 10% chance of being kept, so on average you need about 10 occurrences (5,000 requests) before you can expect to capture a single trace showing it.
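You can quantify the blindness. With a bug at 1-in-500 requests and 10% head sampling, each occurrence survives with probability 0.1:

```python
p_keep = 0.10  # head sampling rate

# Expected occurrences until the first captured trace (geometric mean).
expected_occurrences = 1 / p_keep                 # 10 occurrences
requests_needed = expected_occurrences * 500      # 5,000 requests

def p_at_least_one(n: int, p: float = p_keep) -> float:
    """Probability of capturing at least one trace after n occurrences."""
    return 1 - (1 - p) ** n

print(p_at_least_one(10))   # ≈ 0.65: even 10 occurrences give only ~65% odds
print(p_at_least_one(44))   # ≈ 0.99: 99% confidence takes ~44 occurrences
```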

Tail Sampling: Smart But Hungry

Decide after the trace completes, based on what happened:

# Collector config — tail_sampling processor
processors:
  tail_sampling:
    decision_wait: 10s          # Buffer spans for 10s waiting for the full trace
    num_traces: 100000          # Max traces held in memory
    policies:
      - name: keep-errors
        type: status_code
        status_code:
          status_codes: [ERROR]      # Always keep error traces
      - name: keep-slow
        type: latency
        latency:
          threshold_ms: 2000         # Always keep traces > 2s
      - name: keep-payments
        type: string_attribute
        string_attribute:
          key: service.name
          values: [payment-service]  # Always keep payment traces
      - name: sample-the-rest
        type: probabilistic
        probabilistic:
          sampling_percentage: 5     # Keep 5% of normal traces

This is powerful: you keep 100% of errors, 100% of slow traces, 100% of payment traces, and 5% of everything else. Your storage costs drop by ~90% while your debugging capability barely changes.

Under the Hood: Tail sampling requires the collector to buffer complete traces in memory before making the decision. This means all spans for a single trace must arrive at the same collector instance. In a multi-replica gateway deployment, you need a load-balancing exporter on your agents that routes spans by trace ID:

# Agent collector config — route by trace ID
exporters:
  loadbalancing:
    protocol:
      otlp:
        endpoint: gateway-collector:4317
    resolver:
      dns:
        hostname: gateway-collector-headless
        port: 4317

Without this, the gateway sees incomplete traces and makes bad sampling decisions — keeping half a trace or dropping an error trace because the error span landed on a different instance.


War Story: The Missing Spans

War Story: A fintech company instrumented all their services with OTel and deployed head-based sampling at 10%. For months, everything seemed fine. Then a subtle bug appeared in their currency conversion service — it double-charged customers on cross-border transactions, roughly 1 in 2,000 requests. The team searched for traces showing the bug. At 10% sampling, each occurrence had only a 10% chance of being captured, so on average the bug had to occur 10 times before a single trace was kept. It took three weeks to find a single trace showing the double-charge. By then, they had overcharged 847 customers.

The fix was two-fold: they switched to tail-based sampling (keeping 100% of traces where the payment amount differed from the order amount), and they added a custom span attribute payment.amount_mismatch=true that the tail sampler could key on. The next occurrence was captured in the first trace.

The lesson: head sampling is a bet that interesting traces are common. For rare bugs, that bet loses. Tail sampling lets you define "interesting" after the fact.


Flashcard Check #2

| Question | Answer |
|---|---|
| What is the canonical safe order for OTel Collector processors? | memory_limiter -> enrich -> filter -> batch (guard, enrich, filter, optimize) |
| Why is tail sampling "hungry"? | It must buffer complete traces in memory before deciding, requiring all spans for one trace to arrive at the same collector instance |
| At 10% head sampling, how many occurrences of a 1-in-500 bug do you need before you can expect to capture one trace? | Roughly 10: each occurrence has only a 10% chance of being sampled, so you expect one capture per 10 occurrences (about 5,000 requests) |
| What are OTLP's two transport options? | gRPC on port 4317 and HTTP on port 4318 |

Tracing Backends: Where Spans Live

Once spans leave the collector, they need a home. Three open-source options dominate:

| Backend | Storage model | Best for | Trade-off |
|---|---|---|---|
| Jaeger | Elasticsearch, Cassandra, or in-memory | Teams already running ES/Cassandra; rich query UI | Storage cost grows with indexing |
| Grafana Tempo | Object storage (S3, GCS, Azure Blob) | Massive scale; cost-sensitive environments | No indexing — find traces via ID, or discover via logs/metrics |
| Zipkin | In-memory, MySQL, Cassandra, ES | Small teams; getting started fast | Simpler feature set |

Name Origin: Jaeger is the German word for "hunter" — fitting for a tool that helps you hunt down latency across distributed systems. Uber created it in 2015 and donated it to the CNCF; it graduated in 2019.

Tempo's architectural bet is radical: it stores traces without indexing. You find traces either by trace ID (from a log line or metric exemplar) or via Grafana's TraceQL query language. This is much cheaper at scale but requires you to correlate from other signals.

# Tempo config — object storage backend
storage:
  trace:
    backend: s3
    s3:
      bucket: company-traces
      endpoint: s3.us-east-1.amazonaws.com
    blocklist_poll: 5m

Mental Model: Jaeger is like a library with a card catalog — you can search by service, operation, tags, duration. Tempo is like a warehouse with aisle numbers — you need to know the trace ID (aisle number) to find what you're looking for, but it can store vastly more traces for less money.


Correlating Traces with Logs and Metrics

The real power of OTel is not traces or logs or metrics. It is the connections between them.

Traces in Logs

Inject trace ID and span ID into every log line. When you find an error in your logs, one click takes you to the full distributed trace:

{
  "timestamp": "2026-03-23T15:15:42Z",
  "severity": "ERROR",
  "body": "inventory check failed: connection refused to postgres:5432",
  "trace_id": "4bf92f3577b34da6a3ce929d0e0e4736",
  "span_id": "a1b2c3d4e5f60718",
  "resource": {
    "service.name": "inventory-service",
    "service.version": "3.2.1",
    "k8s.pod.name": "inventory-service-7d4b8f9c6-xk2pm",
    "deployment.environment": "production"
  }
}
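A sketch of how those IDs get into log lines. Here a stdlib logging.Filter reads the IDs from a contextvar that stands in for the active span context (a hypothetical stand-in: in real services, OTel's logging instrumentation injects the IDs from the live span for you):

```python
import contextvars
import logging

# Hypothetical stand-in for the active span context the OTel SDK tracks.
current_trace = contextvars.ContextVar("current_trace", default=("", ""))

class TraceContextFilter(logging.Filter):
    """Attach trace_id and span_id to every log record."""
    def filter(self, record: logging.LogRecord) -> bool:
        record.trace_id, record.span_id = current_trace.get()
        return True

handler = logging.StreamHandler()
handler.setFormatter(logging.Formatter(
    '{"severity": "%(levelname)s", "body": "%(message)s", '
    '"trace_id": "%(trace_id)s", "span_id": "%(span_id)s"}'))
logger = logging.getLogger("inventory-service")
logger.addHandler(handler)
logger.addFilter(TraceContextFilter())

current_trace.set(("4bf92f3577b34da6a3ce929d0e0e4736", "a1b2c3d4e5f60718"))
logger.error("inventory check failed: connection refused to postgres:5432")
```

The payoff is the one-click jump: grep the log for the error, copy trace_id, open the trace.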

Metrics with Exemplars

Prometheus supports exemplars — a trace ID attached to a specific metric observation. When you see a latency spike in Grafana, you can click through to an actual trace from that exact time window:

# A histogram observation with an exemplar
http_request_duration_seconds_bucket{le="0.5"} 2100
http_request_duration_seconds_bucket{le="1.0"} 2350 # {trace_id="4bf92f..."}

The workflow becomes: Alert fires (metrics) -> View the spike in Grafana (metrics) -> Click an exemplar to see a real trace (traces) -> Find the slow span -> Click the log icon to see the error details (logs). Three pillars, one investigation.


Deployment Architecture: Agent + Gateway

For production Kubernetes, the recommended pattern uses two tiers of collectors:

┌──────────────────────────────────────────────────────────┐
│  Node 1                   Node 2                         │
│  ┌─────────┐ ┌─────────┐ ┌─────────┐ ┌─────────┐       │
│  │ App Pod │ │ App Pod │ │ App Pod │ │ App Pod │       │
│  └────┬────┘ └────┬────┘ └────┬────┘ └────┬────┘       │
│       └──────┬────┘           └──────┬────┘             │
│         ┌────▼─────┐           ┌─────▼────┐             │
│         │  Agent   │           │  Agent   │  DaemonSet  │
│         │Collector │           │Collector │  (per-node) │
│         └────┬─────┘           └────┬─────┘             │
│              └──────────┬───────────┘                    │
│                    ┌────▼────────┐                       │
│                    │   Gateway   │  Deployment           │
│                    │  Collector  │  (2-3 replicas)       │
│                    │ (tail samp, │                       │
│                    │  enrichment)│                       │
│                    └────┬───────┘                        │
│                         │                                │
│                    ┌────▼───────┐                        │
│                    │  Backend   │  Tempo / Jaeger         │
│                    └────────────┘                        │
└──────────────────────────────────────────────────────────┘

Agents (DaemonSet): lightweight, handle basic processing (memory limiting, batching), forward to the gateway. If an agent crashes, only one node's telemetry is affected.

Gateway (Deployment): handles expensive operations — tail sampling, k8s attribute enrichment, fan-out to multiple backends. Scaled horizontally.

War Story: A fintech company ran a single OTel Collector pod. During a production incident, the collector pod was evicted due to node resource pressure — exactly when they needed traces to diagnose the root cause. They had zero telemetry for the 12-minute window that mattered most. The fix: DaemonSet deployment with priorityClassName: system-node-critical to prevent eviction, plus a multi-replica gateway behind a headless service.


Back to the Mission: Finding the Bottleneck

Let's solve the checkout slowness. You pull up Jaeger and search for slow traces:

# Find slow checkout traces (> 3 seconds)
curl -s "http://jaeger:16686/api/traces?service=api-gateway\
&operation=POST+%2Fapi%2Fcheckout&limit=10&minDuration=3000000" \
  | jq '.data[0].traceID'
# → "4bf92f3577b34da6a3ce929d0e0e4736"

You open the trace. The waterfall view shows:

api-gateway     ████████████████████████████████████████ 4210ms
 cart-service   ██                                        85ms
 inventory-svc  ██████████████████████████████████████   3802ms
   cache.lookup ▏                                          4ms
   db.query     █████████████████████████████████████    3791ms
 payment-svc    █████                                    180ms
   stripe.call  ████                                     142ms
 notif-svc      █                                         38ms
   kafka.produce▏                                         12ms

The culprit is inventory-svc -> db.query at 3,791ms. You check the span attributes:

{
  "db.system": "postgresql",
  "db.statement": "SELECT * FROM inventory WHERE product_id IN ($1,$2,...,$347)",
  "db.operation": "SELECT",
  "db.rows_affected": 347
}

A sequential scan on 347 product IDs. You check across 10 slow traces — same pattern: bulk orders with 200+ items always hit the slow path.

The fix: CREATE INDEX CONCURRENTLY ON inventory (product_id);. No downtime, no maintenance window. PostgreSQL builds the index while serving queries.

But you only found this in minutes because the trace showed you exactly where the time went. Without tracing, you'd be checking each service's metrics, guessing, and SSH-ing into pods.

Interview Bridge: "Walk me through how you'd debug a slow API response in a microservices environment" is a common interview question. The answer that separates juniors from seniors: juniors check each service's metrics individually. Seniors pull a distributed trace and look at the span waterfall. Traces give you the answer in seconds; per-service metric hunting takes hours.


The Environment Variable Trap

OTel SDKs read configuration from environment variables. But typos are silent — there is no error, just no data:

# Correct
OTEL_EXPORTER_OTLP_ENDPOINT=http://collector:4317
OTEL_SERVICE_NAME=inventory-service
OTEL_RESOURCE_ATTRIBUTES=deployment.environment=production,service.version=3.2.1

# Wrong — and you will get zero errors, zero warnings, just no data
OTEL_EXPORTER_ENDPOINT=http://collector:4317        # Missing 'OTLP'
OTEL_SERVICE=inventory-service                       # Missing '_NAME'
OTEL_RESOURCE_ATTRS=deployment.environment=production # Wrong suffix

If your service shows up in Jaeger as unknown_service, the env var is wrong.
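A quick diagnostic sketch: dump every OTEL_-prefixed variable the process actually sees. A misspelled name shows up as an unrecognized extra while the real one is visibly absent (the variable names below just mirror the examples above; the check itself is only a prefix scan):

```python
import os

def otel_env(environ=os.environ) -> dict:
    """Return all OTEL_* variables visible to this process."""
    return {k: v for k, v in sorted(environ.items()) if k.startswith("OTEL_")}

# Simulated environment with the typo from above: OTEL_SERVICE appears,
# but the SDK only reads OTEL_SERVICE_NAME, which is missing.
env = {"OTEL_SERVICE": "inventory-service",
       "OTEL_EXPORTER_OTLP_ENDPOINT": "http://collector:4317",
       "PATH": "/usr/bin"}
for key, value in otel_env(env).items():
    print(f"{key}={value}")
```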

Remember: Always validate with the debug exporter first. Add debug to your pipeline's exporters list and check the collector's stdout. If you see your spans with the right service name and attributes in the debug output, the instrumentation is correct and the problem is downstream (exporter, backend, network).


Semantic Conventions: The Shared Vocabulary

OTel defines standardized attribute names so that http.request.method means the same thing whether it comes from a Go service or a Python service:

| Namespace | Example Attributes | When You'll See Them |
|---|---|---|
| http. | http.request.method, http.response.status_code | Every HTTP span |
| db. | db.system, db.statement, db.operation | Database queries |
| rpc. | rpc.system, rpc.method, rpc.service | gRPC/Thrift calls |
| messaging. | messaging.system, messaging.operation | Kafka/RabbitMQ spans |
| k8s. | k8s.pod.name, k8s.namespace.name | After k8sattributes processor |
| service. | service.name, service.version | Resource attributes (required) |

If your team invents httpMethod instead of http.request.method, every cross-service dashboard query breaks because it expects the standard name. Semantic conventions are not optional — they are what make cross-service correlation work.


Flashcard Check #3

| Question | Answer |
|---|---|
| What is the difference between the OTel Collector core and contrib distributions? | Core has minimal components; contrib includes processors/exporters for Loki, Kafka, AWS, k8s attributes, and dozens more. Use contrib unless you have a specific reason not to. |
| What happens if you set OTEL_SERVICE instead of OTEL_SERVICE_NAME? | Nothing — the SDK ignores the unrecognized variable silently, and your service appears as unknown_service |
| How does Grafana Tempo find traces without indexing? | By trace ID (from a log line, metric exemplar, or TraceQL query). It trades searchability for dramatically lower storage cost. |
| What are the two tiers in the Agent + Gateway collector pattern? | Agents (DaemonSet, per-node, lightweight) and Gateway (Deployment, centralized, handles tail sampling and enrichment) |

Exercises

Exercise 1: Read a traceparent Header (2 minutes)

Given this header:

traceparent: 00-0af7651916cd43dd8448eb211c80319c-b7ad6b7169203331-01

  1. What is the trace ID?
  2. What is the parent span ID?
  3. Is this trace sampled?
Answer:

1. `0af7651916cd43dd8448eb211c80319c` (the 32-hex-char field)
2. `b7ad6b7169203331` (the 16-hex-char field)
3. Yes — the flags field is `01` (sampled)

Exercise 2: Fix the Collector Config (5 minutes)

This collector config has three bugs. Find them:

```yaml
processors:
  batch:
    send_batch_size: 1024
    timeout: 5s
  memory_limiter:
    check_interval: 1s
    limit_mib: 512

exporters:
  debug:
    verbosity: detailed

service:
  pipelines:
    traces:
      receivers: [otlp]
      processors: [batch, memory_limiter]
      exporters: [debug]
```
**Answer**

1. **Processor order is wrong.** `memory_limiter` must come before `batch`. Reverse them: `processors: [memory_limiter, batch]`
2. **Debug verbosity is `detailed` in what could be a production config.** In production, this generates gigabytes of log output per hour. Use `basic` or remove the debug exporter entirely.
3. **No OTLP receiver is defined.** The pipeline references `[otlp]` as a receiver, but the `receivers` section is missing. Add:

```yaml
receivers:
  otlp:
    protocols:
      grpc:
        endpoint: 0.0.0.0:4317
```
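Putting the three fixes together, a corrected version of the full config (same components, completed and reordered) would look like:

```yaml
receivers:
  otlp:
    protocols:
      grpc:
        endpoint: 0.0.0.0:4317

processors:
  memory_limiter:
    check_interval: 1s
    limit_mib: 512
  batch:
    send_batch_size: 1024
    timeout: 5s

exporters:
  debug:
    verbosity: basic

service:
  pipelines:
    traces:
      receivers: [otlp]
      processors: [memory_limiter, batch]
      exporters: [debug]
```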

Exercise 3: Design a Sampling Strategy (10 minutes)

Your company processes 50,000 requests/second across 30 microservices. Your tracing backend budget is $3,000/month. At current ingestion rates with 100% sampling, you'd spend $45,000/month.

Design a tail-sampling policy that:

  • Keeps all error traces
  • Keeps all traces over 3 seconds
  • Keeps all traces touching the payment service
  • Stays within budget

**Hint:** You need to reduce volume by roughly 93% (from $45k to $3k), i.e. keep about 7% of traces. Errors + slow traces + payment traces probably account for 2-5% of total traffic, so the baseline probabilistic rate on the remaining ~95% needs to be around 3-5% to bring total volume to ~7% of original.
**Solution**

```yaml
processors:
  tail_sampling:
    decision_wait: 15s
    num_traces: 200000
    policies:
      - name: keep-errors
        type: status_code
        status_code:
          status_codes: [ERROR]
      - name: keep-slow
        type: latency
        latency:
          threshold_ms: 3000
      - name: keep-payments
        type: string_attribute
        string_attribute:
          key: service.name
          values: [payment-service]
      - name: baseline
        type: probabilistic
        probabilistic:
          sampling_percentage: 3
```

At 3% probabilistic + errors + slow + payments, you're likely at 5-8% of total volume, in the neighborhood of the $3k budget. Monitor `otelcol_processor_tail_sampling_count_traces_sampled` and adjust the percentage quarterly.
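The hint's arithmetic is easy to sanity-check. The traffic-mix percentages below are the exercise's stated assumptions, not measurements:

```python
# Back-of-the-envelope check of the sampling budget from the exercise:
# $45k/month at 100% sampling, $3k/month available.
full_cost = 45_000                       # $/month at 100% sampling
budget = 3_000                           # $/month available
target = budget / full_cost              # fraction of traces we can afford
print(f"target keep rate: {target:.1%}")  # → target keep rate: 6.7%

baseline = 0.03                          # the solution's 3% probabilistic rate
for special in (0.02, 0.05):             # assumed errors + slow + payment share
    kept = special + (1 - special) * baseline
    print(f"special traffic {special:.0%} -> total kept {kept:.2%}")
```

At the pessimistic 5% bound the total keep rate (~7.9%) slightly exceeds the 6.7% target, which is why the solution recommends monitoring the sampled-trace count and adjusting the percentage over time.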

Cheat Sheet

| What | Command / Config | Notes |
|---|---|---|
| Check collector health | `curl http://localhost:13133/health` | Only confirms the process is running, not pipeline health |
| Check spans received | `curl localhost:8888/metrics \| grep otelcol_receiver_accepted_spans` | 0 = apps not sending |
| Check spans exported | `curl localhost:8888/metrics \| grep otelcol_exporter_sent_spans` | Compare with received — a gap = drops |
| Check for drops | `curl localhost:8888/metrics \| grep -E "(dropped\|refused\|failed)"` | Non-zero during incidents = you're losing data |
| Verify OTLP ports | `ss -tlnp \| grep -E '(4317\|4318)'` | 4317 = gRPC, 4318 = HTTP |
| Validate config | `otelcol validate --config=config.yaml` | Run before applying to the cluster |
| Live pipeline debug | `curl http://localhost:55679/debug/pipelinez` | Requires the zpages extension |
| Set service name | `OTEL_SERVICE_NAME=my-service` | Most common env var to get wrong |
| Set propagator | `OTEL_PROPAGATORS=tracecontext,baggage` | Default is W3C; set explicitly if mixing with B3/Zipkin |
| Core vs contrib | Image tag: `otel/opentelemetry-collector-contrib` | Use contrib unless you know you only need core |

traceparent Quick Reference

```
00-{trace-id-32hex}-{parent-id-16hex}-{flags-2hex}
    ▲                  ▲                 ▲
    128-bit            64-bit            01=sampled
    shared by          unique per        00=not sampled
    all spans          span
```

Processor Order

```
memory_limiter → k8sattributes → resource → filter → transform → tail_sampling → batch
     guard          enrich         enrich    reduce     reshape      select         optimize
```
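In a collector config, that ordering is simply the processor list in the pipeline definition (individual component configs omitted here for brevity):

```yaml
service:
  pipelines:
    traces:
      receivers: [otlp]
      processors: [memory_limiter, k8sattributes, resource, filter, transform, tail_sampling, batch]
      exporters: [otlp]
```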

Takeaways

  1. Traces show causality. Metrics tell you something is slow. Traces tell you which span in which service is slow — in seconds, not hours.

  2. Context propagation is the linchpin. One uninstrumented service, one missing header, one async boundary without manual injection — and your trace splits into useless fragments. Verify propagation at every boundary.

  3. Tail sampling beats head sampling for debugging. Head sampling is blind — it might drop the one trace you need. Tail sampling keeps errors and slow traces by design. The cost is memory in the collector.

  4. The collector is a pipeline, not a proxy. Receivers, processors, exporters — and the order of processors matters. Guard (memory limiter) first, optimize (batch) last.

  5. OTLP is the universal language. Instrument with OTel, export via OTLP, and you can switch backends without re-instrumenting. This is the real vendor-neutral promise.

  6. Correlate all three signals. A trace ID in your logs, an exemplar in your metrics, a span linked to a log entry. The three pillars are powerful alone; they are transformative together.
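Takeaway 2's async-boundary pitfall is worth one concrete sketch. Below, a "Kafka message" is faked with a plain dict, and the `traceparent` header is built by hand so the example is self-contained; in real code the producer and consumer would call the OTel propagation API (e.g. `opentelemetry.propagate.inject` / `extract`) rather than formatting the header themselves:

```python
# Minimal sketch of manual context propagation across an async boundary.
# Stdlib only: the message is a dict and the traceparent is hand-built.
def inject(headers: dict, trace_id: str, span_id: str, sampled: bool) -> None:
    """What the producer must do before publishing: carry the context."""
    flags = "01" if sampled else "00"
    headers["traceparent"] = f"00-{trace_id}-{span_id}-{flags}"

def extract(headers: dict):
    """What the consumer must do before starting its span."""
    raw = headers.get("traceparent")
    if raw is None:
        return None  # no header -> the trace fragments at this boundary
    _, trace_id, parent_id, flags = raw.split("-")
    return {"trace_id": trace_id, "parent_span_id": parent_id,
            "sampled": flags == "01"}

msg = {"headers": {}, "payload": b"order-confirmed"}
inject(msg["headers"], "0af7651916cd43dd8448eb211c80319c", "b7ad6b7169203331", True)
ctx = extract(msg["headers"])
print(ctx["trace_id"])  # → 0af7651916cd43dd8448eb211c80319c
```

If the producer skips the inject step, the consumer's `extract` returns `None` and the email-worker's spans start a brand-new trace, exactly the fragmentation described above.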


Related Scenarios

  • The Mysterious Latency Spike — when the latency problem is in the infrastructure layer (CPU throttling, GC, disk I/O, noisy neighbors) rather than the application layer
  • The Cascading Timeout — what happens when one slow service brings down the entire platform via thread pool exhaustion and retry storms
  • The Monitoring That Lied — when your observability stack itself is the problem: dashboards showing green while production burns