OpenTelemetry: Following a Request Across Services
- lesson
- distributed-tracing
- opentelemetry
- context-propagation
- collector-pipelines
- sampling
- kubernetes-networking
- service-mesh
- application-instrumentation
- l2

Topics: distributed tracing, OpenTelemetry, context propagation, collector pipelines, sampling, Kubernetes networking, service mesh, application instrumentation
Level: L2 (Operations)
Time: 70–90 minutes
Strategy: End-to-end trace (meta: tracing the trace)
The Mission¶
It's 3:15 PM. A Slack message from the payments team: "Checkout is slow. Users are abandoning carts. We don't know which service is the bottleneck."
Your checkout flow involves five microservices:
User Browser
└── api-gateway (Go)
├── cart-service (Python)
├── inventory-service (Java)
├── payment-service (Go)
│ └── stripe-api (external)
└── notification-service (Node.js)
└── Kafka → email-worker (Python)
Five services. Four languages. One message queue. An external API call. And somewhere in that chain, something is eating 4 seconds that should take 400 milliseconds.
You have OpenTelemetry instrumented across these services. Your job: follow a single request from the browser click to the confirmation email, find the bottleneck, and fix it.
But first, you need to understand how tracing works — because if you don't understand the plumbing, you can't trust the data.
The Three Pillars: What Each One Can (and Can't) Tell You¶
Before diving into the trace, a 60-second orientation.
| Pillar | What it answers | What it can't tell you |
|---|---|---|
| Metrics | "Error rate is 2%. P99 latency is 4.2s." | Which requests are slow, or why |
| Logs | "payment-service: timeout connecting to Stripe" | Was this the cause or a symptom? What called it? |
| Traces | "This request spent 3.8s in inventory-service's DB query" | Long-term trends, aggregate patterns |
Metrics detect. Logs explain. Traces locate. You need all three, but traces are the only pillar that shows causality across service boundaries.
Mental Model: Think of metrics as a thermometer (tells you there's a fever), logs as a blood test (tells you what's wrong), and traces as an X-ray (shows you exactly where the problem is). You wouldn't diagnose a broken bone with a blood test.
Trivia: The "three pillars" framing was popularized by Peter Bourgon in his 2017 blog post "Metrics, Tracing, and Logging." Charity Majors (Honeycomb co-founder) has argued the framing is misleading because it implies the three are equal and independent — in practice, traces provide the richest debugging context, and the real power comes from correlating all three via shared trace IDs.
Traces and Spans: The Data Model¶
A trace is the full journey of one request through your system. A span is one unit of work within that trace — an HTTP handler, a database query, a cache lookup.
Here is a real trace from your slow checkout:
Trace ID: 4bf92f3577b34da6a3ce929d0e0e4736
├── [api-gateway] POST /api/checkout 4210ms
│ ├── [cart-service] getCart 85ms
│ ├── [inventory-svc] checkAvailability 3802ms ← HERE
│ │ ├── cache.lookup 4ms
│ │ └── db.query 3791ms ← AND HERE
│ ├── [payment-svc] chargeCard 180ms
│ │ └── stripe.api.call 142ms
│ └── [notif-svc] sendConfirmation 38ms
│ └── kafka.produce 12ms
Every span carries structured data:
| Field | Example | Purpose |
|---|---|---|
| Trace ID | 4bf92f3577b34da6a3ce929d0e0e4736 | Links all spans in one request (32 hex chars = 128 bits) |
| Span ID | 00f067aa0ba902b7 | Unique ID for this span (16 hex chars = 64 bits) |
| Parent Span ID | a1b2c3d4e5f60718 | Which span created this one |
| Operation | POST /api/checkout | What this span represents |
| Duration | 4210ms | Wall-clock time |
| Attributes | http.status_code=200 | Key-value metadata |
| Status | OK, ERROR, or Unset | Did it succeed? |
Under the Hood: A trace ID is a 128-bit random number. The birthday paradox says you'd need roughly 2^64 (18 quintillion) traces before a 50% chance of collision. At 1 million traces per second, that's 584,942 years. You will not run out of trace IDs.
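The arithmetic behind that claim is easy to verify. A quick sketch, using the birthday-bound approximation and a 365-day year:

```python
# Birthday bound: with uniformly random 128-bit IDs, a ~50% collision
# chance arrives after roughly sqrt(2^128) = 2^64 generated IDs.
ID_BITS = 128
ids_for_collision = 2 ** (ID_BITS // 2)      # 2^64 ≈ 1.8e19
traces_per_second = 1_000_000

seconds = ids_for_collision / traces_per_second
years = seconds / (365 * 24 * 3600)          # 365-day year, as in the text
print(f"{ids_for_collision:.3e} traces, ~{years:,.0f} years at 1M traces/s")
```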
Found it already. The inventory-svc span took 3,802ms, and within it, db.query took
3,791ms. That is a database problem masquerading as a checkout problem. But how did the
trace data get here? That is the interesting part.
Context Propagation: How Trace IDs Cross Service Boundaries¶
Here is the fundamental problem of distributed tracing: Service A starts a trace. Service A calls Service B. How does Service B know it is part of the same trace?
The answer: HTTP headers.
The W3C traceparent Header¶
When api-gateway calls cart-service, it includes a traceparent header. Let's break it apart:
00-4bf92f3577b34da6a3ce929d0e0e4736-00f067aa0ba902b7-01
│ │ │ │
│ │ │ └── trace-flags (01 = sampled)
│ │ │
│ │ └── parent-id (16 hex = this span's ID)
│ │
│ └── trace-id (32 hex = shared across all spans in this trace)
│
└── version (always 00 for now)
When cart-service receives this header, it:
1. Extracts the trace ID (4bf92f...)
2. Extracts the parent span ID (00f067...)
3. Creates a new span with a new span ID, setting parent to 00f067...
4. Attaches the same trace ID
5. When calling downstream services, injects an updated traceparent with its own span ID as the new parent
This is the chain that connects every span in the trace. Break it at any point and the trace splits into disconnected fragments.
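The five-step dance above can be sketched in a few lines of Python. This is illustrative only — a real SDK's W3C TraceContext propagator also validates field lengths and handles tracestate; `parse_traceparent` and `child_traceparent` are hypothetical helpers:

```python
import secrets

def parse_traceparent(header: str):
    # version-traceid-parentid-flags, all hex, dash-separated
    version, trace_id, parent_id, flags = header.split("-")
    return trace_id, parent_id, flags

def child_traceparent(incoming: str) -> tuple[str, str]:
    """Return (new_span_id, outgoing header) for the next downstream call."""
    trace_id, parent_id, flags = parse_traceparent(incoming)
    new_span_id = secrets.token_hex(8)  # fresh 64-bit (16 hex char) span ID
    # Same trace ID; this service's span ID becomes the new parent-id
    return new_span_id, f"00-{trace_id}-{new_span_id}-{flags}"

incoming = "00-4bf92f3577b34da6a3ce929d0e0e4736-00f067aa0ba902b7-01"
span_id, outgoing = child_traceparent(incoming)
print(outgoing)  # same trace ID, new parent-id, flags preserved
```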
Name Origin: The W3C Trace Context specification was ratified as a W3C Recommendation in February 2020. Before it existed, every tracing system had its own propagation format: Zipkin used X-B3-TraceId, Jaeger used uber-trace-id, AWS used X-Amzn-Trace-Id. Cross-vendor tracing was impossible. W3C Trace Context ended the format wars — today, traceparent is the lingua franca of distributed tracing.
There is also a tracestate header, which carries vendor-specific key-value pairs alongside traceparent. Most of the time you can ignore tracestate. The traceparent header is what matters.
Propagation Across Protocols: HTTP, gRPC, and Message Queues¶
HTTP headers are the easy case. What about the rest of your stack?
HTTP¶
Auto-instrumentation handles this. The OTel SDK patches your HTTP client library
(requests, net/http, axios) to inject traceparent on every outgoing call and
extract it on every incoming call. You get this for free:
# This is all you need — auto-instrumentation does the rest
from opentelemetry.instrumentation.requests import RequestsInstrumentor
RequestsInstrumentor().instrument()
# Now every requests.get() / requests.post() propagates trace context
gRPC¶
gRPC uses metadata (essentially headers). OTel's gRPC instrumentation injects trace context into gRPC metadata automatically. This almost always works — unless you have a custom interceptor that strips unknown metadata keys.
// Go gRPC client with OTel interceptor
conn, err := grpc.Dial(
    "inventory-service:50051",
    grpc.WithUnaryInterceptor(otelgrpc.UnaryClientInterceptor()),
    grpc.WithStreamInterceptor(otelgrpc.StreamClientInterceptor()),
)
Message Queues — Where Traces Go to Die¶
This is where most teams' traces break. Your notification-service publishes to Kafka.
The email-worker consumes from Kafka. Auto-instrumentation usually does NOT handle
message queues automatically.
You must manually inject and extract:
# Producer side — inject trace context into Kafka headers
from opentelemetry import trace
from opentelemetry.propagate import inject
tracer = trace.get_tracer("notification-service")
with tracer.start_as_current_span("kafka.produce") as span:
    headers = {}
    inject(carrier=headers)  # default setter writes traceparent into the dict
    kafka_headers = [(k, v.encode()) for k, v in headers.items()]
    producer.send("email-notifications", value=message, headers=kafka_headers)
# Consumer side — extract trace context from Kafka headers
from opentelemetry.propagate import extract
def process_message(msg):
    header_dict = {k: v.decode() for k, v in msg.headers()}
    ctx = extract(carrier=header_dict)
    with tracer.start_as_current_span("email.send", context=ctx) as span:
        send_email(msg.value())
Gotcha: On the consumer side, you must create the new span as a child of the extracted context — not as a root span. If you write tracer.start_as_current_span("email.send") without passing context=ctx, you get a disconnected trace. Two halves of the same request appear as unrelated traces in your backend. This is the #1 cause of "missing spans" in async pipelines.
Service Mesh: Free Spans, But With a Catch¶
If you run Istio or Linkerd, the Envoy sidecars automatically generate spans for every
inbound and outbound request. But there is a critical requirement: your application must
forward incoming trace headers on outbound calls. The sidecar can see inbound and
outbound traffic, but it cannot correlate them unless the same traceparent header appears
in both.
Without header forwarding:
[Sidecar] → inbound span (trace A)
[App] → does work, makes outbound call WITHOUT forwarding headers
[Sidecar] → outbound span (trace B) ← NEW trace, disconnected
With header forwarding:
[Sidecar] → inbound span (trace A)
[App] → forwards traceparent on outbound call
[Sidecar] → outbound span (trace A) ← same trace, connected
The mesh gives you observability for free — but only if your code cooperates.
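In an uninstrumented app, the cooperation the mesh needs can be as small as copying the trace headers onto every outbound request. A minimal sketch — `handle_request` and the hard-coded header list are assumptions; a real app would do this inside its HTTP client wrapper:

```python
# Forward the inbound trace headers so the sidecar can correlate the
# inbound span and the outbound span into one trace.
TRACE_HEADERS = ("traceparent", "tracestate")

def handle_request(incoming_headers: dict, downstream_call):
    outbound_headers = {}
    for name in TRACE_HEADERS:
        if name in incoming_headers:
            outbound_headers[name] = incoming_headers[name]
    return downstream_call(outbound_headers)

inbound = {"traceparent": "00-4bf92f3577b34da6a3ce929d0e0e4736-00f067aa0ba902b7-01",
           "x-api-key": "secret"}
seen = handle_request(inbound, lambda h: h)
print(seen)  # only the trace headers are forwarded
```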
Flashcard Check #1¶
Cover the answers. Test yourself.
| Question | Answer |
|---|---|
| What are the four fields in a traceparent header? | Version, trace-id (32 hex), parent-id (16 hex), trace-flags (2 hex) |
| What happens if one service in a 12-service chain is not instrumented? | The trace splits into two disconnected fragments at that service |
| Why does auto-instrumentation usually fail for message queues? | The producer and consumer are decoupled — there is no HTTP/gRPC call for the SDK to patch. You must manually inject/extract context from message headers |
| What must an application do for service mesh tracing to work? | Forward incoming trace headers (traceparent) on all outbound calls |
The OTel Collector: Your Telemetry Pipeline¶
Your instrumented services produce spans. But they do not send spans directly to Jaeger or Tempo. They send them to the OpenTelemetry Collector — the central nervous system of your observability pipeline.
The collector has three stages:
┌─────────────────────────────────────────────────────────┐
│ OTel Collector │
│ │
│ Receivers ──────▶ Processors ──────▶ Exporters │
│ │
│ "Accept data "Transform, "Send to │
│ in any format" filter, batch" backends" │
└─────────────────────────────────────────────────────────┘
Think of it as a Unix pipeline for telemetry: data flows in, gets transformed, flows out. The power is in the composability — you can receive Jaeger-format traces, filter out health check spans, batch them for efficiency, and export to both Grafana Tempo and Datadog simultaneously.
A Production Collector Config¶
Here is the collector configuration for the checkout flow scenario. Read every line — this is what stands between your services and your dashboards:
receivers:
  otlp:
    protocols:
      grpc:
        endpoint: 0.0.0.0:4317  # Where services send traces (gRPC)
      http:
        endpoint: 0.0.0.0:4318  # Where services send traces (HTTP)

processors:
  # MUST be first — prevents the collector from OOM-ing
  memory_limiter:
    check_interval: 1s
    limit_mib: 512
    spike_limit_mib: 128
  # Enrich spans with Kubernetes metadata
  k8sattributes:
    auth_type: "serviceAccount"
    extract:
      metadata:
        - k8s.pod.name
        - k8s.namespace.name
        - k8s.deployment.name
  # Drop health check noise — no point tracing /healthz 50 times per minute
  filter:
    error_mode: ignore
    traces:
      span:
        - 'attributes["http.target"] == "/healthz"'
        - 'attributes["http.target"] == "/readyz"'
  # MUST be last — batch before sending to reduce network calls
  batch:
    send_batch_size: 1024
    timeout: 5s

exporters:
  otlp/tempo:
    endpoint: tempo.monitoring:4317
    tls:
      insecure: false
  debug:
    verbosity: basic  # Never use 'detailed' in prod

extensions:
  health_check:
    endpoint: 0.0.0.0:13133
  zpages:
    endpoint: 0.0.0.0:55679  # Live pipeline debugging

service:
  extensions: [health_check, zpages]
  pipelines:
    traces:
      receivers: [otlp]
      processors: [memory_limiter, k8sattributes, filter, batch]
      exporters: [otlp/tempo, debug]
Remember: Processor order matters. The canonical safe order is: memory_limiter -> k8sattributes/resource -> filter -> batch. Think: guard, enrich, filter, optimize. If you put batch before memory_limiter, a traffic spike fills the batch buffer and the collector OOMs before the limiter can act.

Gotcha: There are two collector distributions: core (otelcol) and contrib (otelcol-contrib). Core has only basic components. If your config references the k8sattributes processor or loki exporter and you deployed core, the collector exits immediately with "unknown component." Check which collector image your deployment actually uses before debugging the config itself.
OTLP: The Wire Protocol¶
Your services talk to the collector using OTLP (OpenTelemetry Protocol). It is a binary protocol (protobuf over gRPC on port 4317, or protobuf/JSON over HTTP on port 4318) designed specifically for telemetry data.
Why does this matter? Because OTLP is becoming the universal language of observability. Datadog, New Relic, Grafana Cloud, and AWS CloudWatch all accept OTLP natively. Instrument once with OTel, send anywhere.
# Verify OTLP endpoints are listening
ss -tlnp | grep -E '(4317|4318)'
# Send a test span to check the pipeline
curl -X POST http://localhost:4318/v1/traces \
-H "Content-Type: application/json" \
-d '{"resourceSpans":[]}'
# 200 = pipeline is alive. 4xx/5xx = dig deeper.
Trivia: OTLP replaced a fragmented landscape where every vendor had its own ingestion format. The OTel Collector can still receive Jaeger Thrift, Zipkin JSON, and Prometheus scrape formats via different receivers — acting as a universal adapter during migrations. But for new instrumentation, OTLP is always the right choice.
Auto-Instrumentation: Tracing Without Touching Code¶
One of OTel's best features: you can add tracing to existing services without modifying a single line of application code.
Each language uses its own mechanism:
| Language | Mechanism | What it patches |
|---|---|---|
| Java | Java agent (-javaagent:) |
HTTP clients, JDBC, gRPC, Kafka, Spring |
| Python | Monkey-patching (opentelemetry-bootstrap) |
requests, Flask, Django, psycopg2 |
| .NET | CLR profiler hooks | HttpClient, ADO.NET, gRPC |
| Node.js | Module loader hooks | http, express, pg, ioredis |
| Go | Manual wrapping (no agent) | net/http, database/sql, gRPC |
Go is the outlier. There is no agent that patches Go binaries at runtime — you must wrap your HTTP handlers and clients explicitly. Auto-instrumentation for Go means importing OTel wrapper packages, not zero-code injection.
# Python: zero-code instrumentation in 3 commands
# pip install opentelemetry-distro opentelemetry-exporter-otlp
# opentelemetry-bootstrap -a install
# opentelemetry-instrument python app.py
# That's it. Every Flask route, every requests.get(), every psycopg2 query
# now produces spans — with trace context propagation included.
Auto-instrumentation gives you the skeleton of your traces. For the meat — business logic like order IDs, payment amounts, user tiers — you add manual spans:
from opentelemetry import trace
from opentelemetry.trace import StatusCode

tracer = trace.get_tracer("checkout-service")
with tracer.start_as_current_span("process_checkout") as span:
    span.set_attribute("order.id", "ord-98234")
    span.set_attribute("order.total_cents", 4599)
    span.set_attribute("order.item_count", 3)
    span.set_attribute("customer.tier", "premium")
    try:
        result = charge_payment(order)
        span.set_attribute("payment.provider", "stripe")
        span.set_attribute("payment.status", "success")
    except PaymentError as e:
        span.set_status(StatusCode.ERROR, str(e))
        span.record_exception(e)
        raise
Gotcha: OTel semantic conventions define standardized attribute names like http.request.method and db.system. These conventions underwent a major rename in 2023 (e.g., http.method became http.request.method). If you upgrade your SDK and your dashboards break, the semantic convention migration is the first thing to check. Pin your SDK version across all services to avoid one team using old names and another using new.
Sampling: The Economics of Tracing¶
At 10,000 requests per second across 5 services, with an average of 8 spans per trace, you produce 80,000 spans per second. At roughly 1 KB per span, that is about 6.9 TB per day. At typical vendor pricing ($0.30–$1.50 per GB ingested), that is $2,000–$10,000 per day.
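Spelling out that arithmetic (assumptions: 1 KB per span, decimal GB/TB, and the $0.30–$1.50/GB price band from the text):

```python
# Telemetry volume and cost, back-of-envelope.
requests_per_sec = 10_000
spans_per_trace = 8
span_kb = 1

spans_per_sec = requests_per_sec * spans_per_trace           # 80,000 spans/s
gb_per_day = spans_per_sec * span_kb * 86_400 / 1_000_000    # KB/day → GB/day
print(f"{gb_per_day:,.0f} GB/day ≈ {gb_per_day / 1000:.1f} TB/day")
print(f"${gb_per_day * 0.30:,.0f} – ${gb_per_day * 1.50:,.0f} per day ingested")
```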
You cannot afford to keep everything. Sampling is how you control costs while keeping the traces that matter.
Head Sampling: Cheap But Blind¶
Decide at the trace's birth whether to keep it. Simple. Fast. No buffering required.
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.sampling import TraceIdRatioBased

sampler = TraceIdRatioBased(0.1)  # Keep 10%
provider = TracerProvider(sampler=sampler, resource=resource)  # resource defined elsewhere
The problem: a 10% sampler drops 90% of traces before seeing them. If a rare payment error happens once per 500 requests, each occurrence has only a 10% chance of being sampled — you need roughly 10 occurrences (5,000 requests) for one captured trace on average, and around 50 occurrences for near-certainty.
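The odds work out like this — a quick sketch of the capture probability under a 10% head sampler:

```python
# Each occurrence of the bug is kept independently with probability 0.1,
# so P(at least one sampled trace) = 1 - 0.9^occurrences.
sample_rate = 0.10

for occurrences in (1, 10, 50):
    p_miss_all = (1 - sample_rate) ** occurrences
    print(f"{occurrences:3d} occurrences → "
          f"{(1 - p_miss_all) * 100:.0f}% chance of at least one sampled trace")
```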
Tail Sampling: Smart But Hungry¶
Decide after the trace completes, based on what happened:
# Collector config — tail_sampling processor
processors:
  tail_sampling:
    decision_wait: 10s   # Buffer spans for 10s waiting for the full trace
    num_traces: 100000   # Max traces held in memory
    policies:
      - name: keep-errors
        type: status_code
        status_code:
          status_codes: [ERROR]        # Always keep error traces
      - name: keep-slow
        type: latency
        latency:
          threshold_ms: 2000           # Always keep traces > 2s
      - name: keep-payments
        type: string_attribute
        string_attribute:
          key: service.name
          values: [payment-service]    # Always keep payment traces
      - name: sample-the-rest
        type: probabilistic
        probabilistic:
          sampling_percentage: 5       # Keep 5% of normal traces
This is powerful: you keep 100% of errors, 100% of slow traces, 100% of payment traces, and 5% of everything else. Your storage costs drop by ~90% while your debugging capability barely changes.
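A rough volume model backs up that "~90%" figure. The traffic mix below (1% errors, 2% slow, 3% payment traces) is an assumption, and overlap between categories is ignored:

```python
# Fraction of traces kept under the tail-sampling policy above.
errors, slow, payments = 0.01, 0.02, 0.03
special = errors + slow + payments      # categories kept at 100%
kept = special + (1 - special) * 0.05   # remainder sampled at 5%
print(f"kept ≈ {kept:.1%} of traces")   # ≈ 10.7% → ~89% volume reduction
```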
Under the Hood: Tail sampling requires the collector to buffer complete traces in memory before making the decision. This means all spans for a single trace must arrive at the same collector instance. In a multi-replica gateway deployment, you need a load-balancing exporter on your agents that routes spans by trace ID:
# Agent collector config — route by trace ID
exporters:
  loadbalancing:
    protocol:
      otlp:
        endpoint: gateway-collector:4317
    resolver:
      dns:
        hostname: gateway-collector-headless
        port: 4317

Without this, the gateway sees incomplete traces and makes bad sampling decisions — keeping half a trace or dropping an error trace because the error span landed on a different instance.
War Story: The Missing Spans¶
War Story: A fintech company instrumented all their services with OTel and deployed head-based sampling at 10%. For months, everything seemed fine. Then a subtle bug appeared in their currency conversion service — it double-charged customers on cross-border transactions, roughly 1 in 2,000 requests. The team searched for traces showing the bug. At 10% sampling, each occurrence had only a 10% chance of being captured — roughly 10 occurrences, on average, for every sampled trace of the bug. It took three weeks to find a single trace showing the double-charge. By then, they had overcharged 847 customers.
The fix was two-fold: they switched to tail-based sampling (keeping 100% of traces where the payment amount differed from the order amount), and they added a custom span attribute payment.amount_mismatch=true that the tail sampler could key on. The next occurrence was captured in the first trace.

The lesson: head sampling is a bet that interesting traces are common. For rare bugs, that bet loses. Tail sampling lets you define "interesting" after the fact.
Flashcard Check #2¶
| Question | Answer |
|---|---|
| What is the canonical safe order for OTel Collector processors? | memory_limiter -> enrich -> filter -> batch (guard, enrich, filter, optimize) |
| Why is tail sampling "hungry"? | It must buffer complete traces in memory before deciding, requiring all spans for one trace to arrive at the same collector instance |
| At 10% head sampling, how many occurrences of a 1-in-500 bug do you need to likely capture one trace? | About 10 on average (each occurrence has a 10% chance of being sampled); around 50 occurrences for >99% confidence |
| What are OTLP's two transport options? | gRPC on port 4317 and HTTP on port 4318 |
Tracing Backends: Where Spans Live¶
Once spans leave the collector, they need a home. Three open-source options dominate:
| Backend | Storage model | Best for | Trade-off |
|---|---|---|---|
| Jaeger | Elasticsearch, Cassandra, or in-memory | Teams already running ES/Cassandra; rich query UI | Storage cost grows with indexing |
| Grafana Tempo | Object storage (S3, GCS, Azure Blob) | Massive scale; cost-sensitive environments | No indexing — find traces via ID, or discover via logs/metrics |
| Zipkin | In-memory, MySQL, Cassandra, ES | Small teams; getting started fast | Simpler feature set |
Name Origin: Jaeger is the German word for "hunter" — fitting for a tool that helps you hunt down latency across distributed systems. Uber created it in 2015 and donated it to the CNCF; it graduated in 2019.
Tempo's architectural bet is radical: it stores traces without indexing. You find traces either by trace ID (from a log line or metric exemplar) or via Grafana's TraceQL query language. This is much cheaper at scale but requires you to correlate from other signals.
# Tempo config — object storage backend
storage:
  trace:
    backend: s3
    s3:
      bucket: company-traces
      endpoint: s3.us-east-1.amazonaws.com
    blocklist_poll: 5m
Mental Model: Jaeger is like a library with a card catalog — you can search by service, operation, tags, duration. Tempo is like a warehouse with aisle numbers — you need to know the trace ID (aisle number) to find what you're looking for, but it can store vastly more traces for less money.
Correlating Traces with Logs and Metrics¶
The real power of OTel is not traces or logs or metrics. It is the connections between them.
Traces in Logs¶
Inject trace ID and span ID into every log line. When you find an error in your logs, one click takes you to the full distributed trace:
{
  "timestamp": "2026-03-23T15:15:42Z",
  "severity": "ERROR",
  "body": "inventory check failed: connection refused to postgres:5432",
  "trace_id": "4bf92f3577b34da6a3ce929d0e0e4736",
  "span_id": "a1b2c3d4e5f60718",
  "resource": {
    "service.name": "inventory-service",
    "service.version": "3.2.1",
    "k8s.pod.name": "inventory-service-7d4b8f9c6-xk2pm",
    "deployment.environment": "production"
  }
}
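A stdlib-only sketch of how that injection can work. Real deployments use OTel's logging instrumentation, which populates the IDs automatically; the contextvar here stands in for the SDK's context:

```python
import contextvars
import logging

# (trace_id, span_id) for the request currently being handled
current_trace = contextvars.ContextVar("current_trace", default=("", ""))

class TraceContextFilter(logging.Filter):
    def filter(self, record):
        # Stamp every log record with the active trace/span IDs
        record.trace_id, record.span_id = current_trace.get()
        return True

handler = logging.StreamHandler()
handler.setFormatter(logging.Formatter(
    '{"severity":"%(levelname)s","body":"%(message)s",'
    '"trace_id":"%(trace_id)s","span_id":"%(span_id)s"}'))
logger = logging.getLogger("inventory-service")
logger.addHandler(handler)
logger.addFilter(TraceContextFilter())

current_trace.set(("4bf92f3577b34da6a3ce929d0e0e4736", "a1b2c3d4e5f60718"))
logger.error("inventory check failed: connection refused to postgres:5432")
```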
Metrics with Exemplars¶
Prometheus supports exemplars — a trace ID attached to a specific metric observation. When you see a latency spike in Grafana, you can click through to an actual trace from that exact time window:
# A histogram observation with an exemplar
http_request_duration_seconds_bucket{le="0.5"} 2100
http_request_duration_seconds_bucket{le="1.0"} 2350 # {trace_id="4bf92f..."}
The workflow becomes: Alert fires (metrics) -> View the spike in Grafana (metrics) -> Click an exemplar to see a real trace (traces) -> Find the slow span -> Click the log icon to see the error details (logs). Three pillars, one investigation.
Deployment Architecture: Agent + Gateway¶
For production Kubernetes, the recommended pattern uses two tiers of collectors:
┌──────────────────────────────────────────────────────────┐
│ Node 1 Node 2 │
│ ┌─────────┐ ┌─────────┐ ┌─────────┐ ┌─────────┐ │
│ │ App Pod │ │ App Pod │ │ App Pod │ │ App Pod │ │
│ └────┬────┘ └────┬────┘ └────┬────┘ └────┬────┘ │
│ └──────┬────┘ └──────┬────┘ │
│ ┌────▼─────┐ ┌─────▼────┐ │
│ │ Agent │ │ Agent │ DaemonSet │
│ │Collector │ │Collector │ (per-node) │
│ └────┬─────┘ └────┬─────┘ │
│ └──────────┬───────────┘ │
│ ┌────▼────────┐ │
│ │ Gateway │ Deployment │
│ │ Collector │ (2-3 replicas) │
│ │ (tail samp, │ │
│ │ enrichment)│ │
│ └────┬───────┘ │
│ │ │
│ ┌────▼───────┐ │
│ │ Backend │ Tempo / Jaeger │
│ └────────────┘ │
└──────────────────────────────────────────────────────────┘
Agents (DaemonSet): lightweight, handle basic processing (memory limiting, batching), forward to the gateway. If an agent crashes, only one node's telemetry is affected.
Gateway (Deployment): handles expensive operations — tail sampling, k8s attribute enrichment, fan-out to multiple backends. Scaled horizontally.
War Story: A fintech company ran a single OTel Collector pod. During a production incident, the collector pod was evicted due to node resource pressure — exactly when they needed traces to diagnose the root cause. They had zero telemetry for the 12-minute window that mattered most. The fix: DaemonSet deployment with priorityClassName: system-node-critical to prevent eviction, plus a multi-replica gateway behind a headless service.
Back to the Mission: Finding the Bottleneck¶
Let's solve the checkout slowness. You pull up Jaeger and search for slow traces:
# Find slow checkout traces (> 3 seconds)
curl -s "http://jaeger:16686/api/traces?service=api-gateway\
&operation=POST+%2Fapi%2Fcheckout&limit=10&minDuration=3000000" \
| jq '.data[0].traceID'
# → "4bf92f3577b34da6a3ce929d0e0e4736"
You open the trace. The waterfall view shows:
api-gateway ████████████████████████████████████████ 4210ms
cart-service ██ 85ms
inventory-svc ██████████████████████████████████████ 3802ms
cache.lookup ▏ 4ms
db.query █████████████████████████████████████ 3791ms
payment-svc █████ 180ms
stripe.call ████ 142ms
notif-svc █ 38ms
kafka.produce▏ 12ms
The culprit is inventory-svc -> db.query at 3,791ms. You check the span attributes:
{
"db.system": "postgresql",
"db.statement": "SELECT * FROM inventory WHERE product_id IN ($1,$2,...,$347)",
"db.operation": "SELECT",
"db.rows_affected": 347
}
A SELECT with a 347-item IN list and no usable index, forcing a sequential scan. You check across 10 slow traces — same pattern: bulk orders with 200+ items always hit the slow path.
The fix: CREATE INDEX CONCURRENTLY ON inventory (product_id);. No downtime, no
maintenance window. PostgreSQL builds the index while serving queries.
But you only found this in minutes because the trace showed you exactly where the time went. Without tracing, you'd be checking each service's metrics, guessing, and SSH-ing into pods.
Interview Bridge: "Walk me through how you'd debug a slow API response in a microservices environment" is a common interview question. The answer that separates juniors from seniors: juniors check each service's metrics individually. Seniors pull a distributed trace and look at the span waterfall. Traces give you the answer in seconds; per-service metric hunting takes hours.
The Environment Variable Trap¶
OTel SDKs read configuration from environment variables. But typos are silent — there is no error, just no data:
# Correct
OTEL_EXPORTER_OTLP_ENDPOINT=http://collector:4317
OTEL_SERVICE_NAME=inventory-service
OTEL_RESOURCE_ATTRIBUTES=deployment.environment=production,service.version=3.2.1
# Wrong — and you will get zero errors, zero warnings, just no data
OTEL_EXPORTER_ENDPOINT=http://collector:4317 # Missing 'OTLP'
OTEL_SERVICE=inventory-service # Missing '_NAME'
OTEL_RESOURCE_ATTRS=deployment.environment=production # Wrong suffix
If your service shows up in Jaeger as unknown_service, the env var is wrong.
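A preflight check can catch these typos before deploy. A sketch — `suspicious_otel_vars` is a hypothetical helper, and the allowlist is a deliberately tiny subset of the spec'd variable names:

```python
import os

# Small subset of the OTel-spec'd environment variable names (assumption:
# extend this from the spec for real use).
KNOWN = {
    "OTEL_EXPORTER_OTLP_ENDPOINT",
    "OTEL_SERVICE_NAME",
    "OTEL_RESOURCE_ATTRIBUTES",
    "OTEL_TRACES_SAMPLER",
    "OTEL_PROPAGATORS",
}

def suspicious_otel_vars(environ=os.environ):
    """Return OTEL_* variables that the SDK would silently ignore."""
    return sorted(name for name in environ
                  if name.startswith("OTEL_") and name not in KNOWN)

bad = suspicious_otel_vars({"OTEL_SERVICE": "inventory-service",
                            "OTEL_SERVICE_NAME": "inventory-service"})
print(bad)  # → ['OTEL_SERVICE'] — the typo the SDK would silently ignore
```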
Remember: Always validate with the debug exporter first. Add debug to your pipeline's exporters list and check the collector's stdout. If you see your spans with the right service name and attributes in the debug output, the instrumentation is correct and the problem is downstream (exporter, backend, network).
Semantic Conventions: The Shared Vocabulary¶
OTel defines standardized attribute names so that http.request.method means the same
thing whether it comes from a Go service or a Python service:
| Namespace | Example Attributes | When You'll See Them |
|---|---|---|
| http. | http.request.method, http.response.status_code | Every HTTP span |
| db. | db.system, db.statement, db.operation | Database queries |
| rpc. | rpc.system, rpc.method, rpc.service | gRPC/Thrift calls |
| messaging. | messaging.system, messaging.operation | Kafka/RabbitMQ spans |
| k8s. | k8s.pod.name, k8s.namespace.name | After k8sattributes processor |
| service. | service.name, service.version | Resource attributes (required) |
If your team invents httpMethod instead of http.request.method, every cross-service
dashboard query breaks because it expects the standard name. Semantic conventions are not
optional — they are what make cross-service correlation work.
Flashcard Check #3¶
| Question | Answer |
|---|---|
| What is the difference between the OTel Collector core and contrib distributions? | Core has minimal components; contrib includes processors/exporters for Loki, Kafka, AWS, k8s attributes, and dozens more. Use contrib unless you have a specific reason not to. |
| What happens if you set OTEL_SERVICE instead of OTEL_SERVICE_NAME? | Nothing — the SDK ignores the unrecognized variable silently, and your service appears as unknown_service |
| How does Grafana Tempo find traces without indexing? | By trace ID (from a log line, metric exemplar, or TraceQL query). It trades searchability for dramatically lower storage cost. |
| What are the two tiers in the Agent + Gateway collector pattern? | Agents (DaemonSet, per-node, lightweight) and Gateway (Deployment, centralized, handles tail sampling and enrichment) |
Exercises¶
Exercise 1: Read a traceparent Header (2 minutes)¶
Given this header:

00-0af7651916cd43dd8448eb211c80319c-b7ad6b7169203331-01
- What is the trace ID?
- What is the parent span ID?
- Is this trace sampled?
Answer
1. `0af7651916cd43dd8448eb211c80319c` (the 32-hex-char field)
2. `b7ad6b7169203331` (the 16-hex-char field)
3. Yes — the flags field is `01` (sampled)

Exercise 2: Fix the Collector Config (5 minutes)¶
This collector config has three bugs. Find them:
processors:
  batch:
    send_batch_size: 1024
    timeout: 5s
  memory_limiter:
    check_interval: 1s
    limit_mib: 512

exporters:
  debug:
    verbosity: detailed

service:
  pipelines:
    traces:
      receivers: [otlp]
      processors: [batch, memory_limiter]
      exporters: [debug]
Answer
1. **Processor order is wrong.** `memory_limiter` must come before `batch`. Reverse them: `processors: [memory_limiter, batch]`
2. **Debug verbosity is `detailed` in what could be a production config.** In production, this generates gigabytes of log output per hour. Use `basic` or remove the debug exporter entirely.
3. **No OTLP receiver is defined.** The pipeline references `[otlp]` as a receiver, but the receivers section is missing. Add:

receivers:
  otlp:
    protocols:
      grpc:
        endpoint: 0.0.0.0:4317

Exercise 3: Design a Sampling Strategy (10 minutes)¶
Your company processes 50,000 requests/second across 30 microservices. Your tracing backend budget is $3,000/month. At current ingestion rates with 100% sampling, you'd spend $45,000/month.
Design a tail-sampling policy that: - Keeps all error traces - Keeps all traces over 3 seconds - Keeps all traces touching the payment service - Stays within budget
Hint
You need to reduce volume by roughly 93% (from $45k to $3k). That means your baseline probabilistic sampling should be around 5–7%. Calculate: errors + slow traces + payment traces probably account for 2–5% of total traffic, so your probabilistic sample of the remaining 95% needs to bring total volume to ~7% of original.

Solution
processors:
  tail_sampling:
    decision_wait: 15s
    num_traces: 200000
    policies:
      - name: keep-errors
        type: status_code
        status_code:
          status_codes: [ERROR]
      - name: keep-slow
        type: latency
        latency:
          threshold_ms: 3000
      - name: keep-payments
        type: string_attribute
        string_attribute:
          key: service.name
          values: [payment-service]
      - name: baseline
        type: probabilistic
        probabilistic:
          sampling_percentage: 3
Cheat Sheet¶
| What | Command / Config | Notes |
|---|---|---|
| Check collector health | curl http://localhost:13133/health | Only confirms process is running, not pipeline health |
| Check spans received | curl localhost:8888/metrics \| grep otelcol_receiver_accepted_spans | 0 = apps not sending |
| Check spans exported | curl localhost:8888/metrics \| grep otelcol_exporter_sent_spans | Compare with received — gap = drops |
| Check for drops | curl localhost:8888/metrics \| grep -E "(dropped\|refused\|failed)" | Non-zero during incidents = you're losing data |
| Verify OTLP ports | ss -tlnp \| grep -E '(4317\|4318)' | 4317=gRPC, 4318=HTTP |
| Validate config | otelcol validate --config=config.yaml | Run before applying to cluster |
| Live pipeline debug | curl http://localhost:55679/debug/pipelinez | Requires zpages extension |
| Set service name | OTEL_SERVICE_NAME=my-service | Most common env var to get wrong |
| Set propagator | OTEL_PROPAGATORS=tracecontext,baggage | Default is W3C; set explicitly if mixing with B3/Zipkin |
| Core vs contrib | Image tag: otel/opentelemetry-collector-contrib | Use contrib unless you know you only need core |
traceparent Quick Reference¶
00-{trace-id-32hex}-{parent-id-16hex}-{flags-2hex}
▲ ▲ ▲
128-bit 64-bit 01=sampled
shared by unique per 00=not sampled
all spans span
Processor Order¶
memory_limiter → k8sattributes → resource → filter → transform → tail_sampling → batch
guard enrich enrich reduce reshape select optimize
Takeaways¶
-
Traces show causality. Metrics tell you something is slow. Traces tell you which span in which service is slow — in seconds, not hours.
-
Context propagation is the linchpin. One uninstrumented service, one missing header, one async boundary without manual injection — and your trace splits into useless fragments. Verify propagation at every boundary.
-
Tail sampling beats head sampling for debugging. Head sampling is blind — it might drop the one trace you need. Tail sampling keeps errors and slow traces by design. The cost is memory in the collector.
-
The collector is a pipeline, not a proxy. Receivers, processors, exporters — and the order of processors matters. Guard (memory limiter) first, optimize (batch) last.
-
OTLP is the universal language. Instrument with OTel, export via OTLP, and you can switch backends without re-instrumenting. This is the real vendor-neutral promise.
-
Correlate all three signals. A trace ID in your logs, an exemplar in your metrics, a span linked to a log entry. The three pillars are powerful alone; they are transformative together.
Related Lessons¶
- The Mysterious Latency Spike — when the latency problem is in the infrastructure layer (CPU throttling, GC, disk I/O, noisy neighbors) rather than the application layer
- The Cascading Timeout — what happens when one slow service brings down the entire platform via thread pool exhaustion and retry storms
- The Monitoring That Lied — when your observability stack itself is the problem: dashboards showing green while production burns