Distributed Tracing - Street-Level Ops
Real-world workflows for debugging latency and failures using distributed tracing.
Find a trace from an error log
# Step 1: Extract trace ID from structured logs
kubectl logs -l app=api-server -n production --since=1h | \
jq -r 'select(.level == "error") | "\(.timestamp) \(.trace_id) \(.error)"' | head -5
# 2024-03-14T03:22:41Z abc123def456 connection refused: upstream db-primary:5432
# 2024-03-14T03:23:01Z 789ghi012jkl timeout waiting for payment-svc
# Step 2: Open trace in Jaeger UI
# http://jaeger.monitoring:16686/trace/abc123def456
# Step 3: Or query Jaeger API directly
curl -s "http://jaeger.monitoring:16686/api/traces/abc123def456" | \
jq '.data[0] as $t | $t.spans[] | {service: $t.processes[.processID].serviceName, operation: .operationName, duration: .duration, error: ([.tags[] | select(.key == "error") | .value] | first)}'
# .processes maps each span's processID ("p1") to its service name;
# error is null for spans without an error tag
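The same payload can be rolled up into a per-service latency breakdown. A sketch against a stubbed response (the real JSON comes from the curl above; in Jaeger's API, `processes` maps span `processID`s to service names):

```shell
# Per-service time from a Jaeger trace payload (stub stands in for the API response)
cat > /tmp/trace.json <<'EOF'
{"data":[{"processes":{"p1":{"serviceName":"api-gateway"},"p2":{"serviceName":"auth-svc"}},
"spans":[{"processID":"p1","operationName":"HTTP GET","duration":7000},
{"processID":"p2","operationName":"validateToken","duration":5000},
{"processID":"p2","operationName":"checkCache","duration":1500}]}]}
EOF
jq -r '.data[0] as $t
  | [$t.spans[] | {svc: $t.processes[.processID].serviceName, d: .duration}]
  | group_by(.svc)[]
  | "\(.[0].svc): \(map(.d) | add)us"' /tmp/trace.json
# api-gateway: 7000us
# auth-svc: 6500us
```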
Query Jaeger for slow traces
# Find traces over 5 seconds in the last hour
curl -s "http://jaeger.monitoring:16686/api/traces?service=api-gateway&minDuration=5s&lookback=1h&limit=20" | \
jq '.data[] | {traceID: .traceID, spans: .spans | length, duration: ((.spans | map(.duration) | max / 1000 | floor | tostring) + "ms")}'
# duration uses the longest span, since span order in the API response is not guaranteed
# {"traceID":"abc123","spans":8,"duration":"7234ms"}
# {"traceID":"def456","spans":12,"duration":"12001ms"}
# Find traces with errors
curl -s "http://jaeger.monitoring:16686/api/traces?service=api-gateway&tags=%7B%22error%22%3A%22true%22%7D&lookback=1h&limit=20" | \
jq '.data[].traceID'
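The percent-encoded `tags` value is easier to generate than to hand-write; jq's `@uri` filter produces exactly the encoding used above:

```shell
# Build the URL-encoded tags={"error":"true"} parameter with jq's @uri filter
TAGS=$(jq -rn '{"error": "true"} | tostring | @uri')
echo "$TAGS"
# %7B%22error%22%3A%22true%22%7D
# then: curl -s "http://jaeger.monitoring:16686/api/traces?service=api-gateway&tags=${TAGS}&lookback=1h&limit=20"
```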
Check OpenTelemetry Collector health
# Verify the collector is running
kubectl get pods -n observability -l app=otel-collector
# NAME READY STATUS RESTARTS
# otel-collector-7b9d4c5f6-x2k8q 1/1 Running 0
# Check collector metrics
kubectl port-forward -n observability svc/otel-collector 8888:8888 &  # or run in a second terminal
curl -s http://localhost:8888/metrics | grep -E "otelcol_receiver_accepted|otelcol_exporter_sent"
# otelcol_receiver_accepted_spans_total{receiver="otlp"} 142857
# otelcol_exporter_sent_spans_total{exporter="jaeger"} 142850
# ← 7 dropped spans, investigate if this grows
# Check collector logs for errors
kubectl logs -n observability -l app=otel-collector --tail=30 | grep -i error
Debug clue: When `otelcol_receiver_accepted_spans` and `otelcol_exporter_sent_spans` diverge significantly, check `otelcol_processor_dropped_spans` to find which processor is dropping them. Common cause: a tail-sampling processor running out of memory because its `decision_wait` window is too long for your span volume.
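A sketch that turns the two counters into a drop count, run against a saved scrape (the heredoc stands in for the curl output above):

```shell
# Drop count = receiver accepted - exporter sent (sample scrape stands in
# for: curl -s http://localhost:8888/metrics)
cat > /tmp/otelcol.metrics <<'EOF'
otelcol_receiver_accepted_spans_total{receiver="otlp"} 142857
otelcol_exporter_sent_spans_total{exporter="jaeger"} 142850
EOF
accepted=$(awk '/^otelcol_receiver_accepted_spans/ {print $2}' /tmp/otelcol.metrics)
sent=$(awk '/^otelcol_exporter_sent_spans/ {print $2}' /tmp/otelcol.metrics)
echo "dropped: $((accepted - sent)) of $accepted spans"
# dropped: 7 of 142857 spans
```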
Verify trace propagation across services
# Send a test request and follow the trace
TRACE_ID=$(python3 -c "import uuid; print(uuid.uuid4().hex)")
curl -v -H "traceparent: 00-${TRACE_ID}-0000000000000001-01" \
http://api-gateway.production:8080/api/health
# The traceparent header propagates through the call chain
# Check that all services reported spans for this trace
sleep 5 # wait for spans to flush
curl -s "http://jaeger.monitoring:16686/api/traces/${TRACE_ID}" | \
jq '.data[0] as $t | $t.spans[] | {service: $t.processes[.processID].serviceName, operation: .operationName}'
# {"service":"api-gateway","operation":"HTTP GET /api/health"}
# {"service":"auth-svc","operation":"validateToken"}
# {"service":"user-svc","operation":"getUser"}
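If python3 is not on the box, the full traceparent (trace-id plus a parent span-id) can be built in pure shell. The W3C format is version-traceid-spanid-flags:

```shell
# Build a W3C traceparent: version(00)-traceid(32 hex)-spanid(16 hex)-flags(01)
TRACE_ID=$(od -An -N16 -tx1 /dev/urandom | tr -d ' \n')
SPAN_ID=$(od -An -N8 -tx1 /dev/urandom | tr -d ' \n')
TRACEPARENT="00-${TRACE_ID}-${SPAN_ID}-01"   # 01 = sampled flag
echo "$TRACEPARENT"
# then: curl -s -H "traceparent: $TRACEPARENT" http://api-gateway.production:8080/api/health
```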
Debug a broken trace (missing spans)
# Trace shows api-gateway -> auth-svc but then stops
# auth-svc should call user-svc but there is no span
# 1. Check if user-svc has OTel instrumentation
kubectl exec -n production deploy/user-svc -- env | grep OTEL
# OTEL_EXPORTER_OTLP_ENDPOINT=http://otel-collector.observability:4317
# OTEL_SERVICE_NAME=user-svc
# OTEL_TRACES_SAMPLER=parentbased_traceidratio
# 2. Check if the SDK is loaded
kubectl logs -n production -l app=user-svc --tail=20 | grep -i "opentelemetry\|otel\|tracer"
# 3. Check if spans are reaching the collector
kubectl logs -n observability -l app=otel-collector --tail=50 | grep "user-svc"
# 4. Common cause: HTTP client not instrumented
# The app makes outbound calls with a raw HTTP client that does not propagate headers
Gotcha: The most common cause of broken traces is context propagation failure. Auto-instrumentation handles inbound requests, but outbound HTTP calls need the `traceparent` header injected. In Go, use `otelhttp.NewTransport()`; in Python, `requests` needs the `opentelemetry-instrumentation-requests` package. One un-instrumented HTTP client breaks the entire downstream trace chain.
Grafana Tempo query by trace ID
# Query Tempo API directly
curl -s "http://tempo.monitoring:3200/api/traces/abc123def456" | \
jq '.batches[].scopeSpans[].spans[] | {name: .name, duration: ((((.endTimeUnixNano | tonumber) - (.startTimeUnixNano | tonumber)) / 1000000 | floor | tostring) + "ms")}'
# timestamps arrive as string nanos (protobuf JSON), hence tonumber
# Search traces by service and duration (TraceQL)
curl -s "http://tempo.monitoring:3200/api/search?q=%7Bspan.http.status_code%3D500%7D&limit=10" | \
jq '.traces[] | {traceID, rootServiceName, durationMs}'
Check sampling configuration
# What sampling rate is configured?
kubectl get configmap otel-collector-config -n observability -o yaml | \
grep -A10 "processors:" | grep -A5 "probabilistic_sampler"
#   probabilistic_sampler:
#     sampling_percentage: 10   # ← 10% of traces kept
# Check if tail sampling is configured
kubectl get configmap otel-collector-config -n observability -o yaml | \
grep -A15 "tail_sampling"
# tail_sampling:
#   policies:
#   - name: error-policy
#     type: status_code
#     status_code: {status_codes: [ERROR]}
#   - name: latency-policy
#     type: latency
#     latency: {threshold_ms: 5000}
Under the hood: Head-based sampling (decided at trace start) is cheap but blind — it drops traces before knowing if they are interesting. Tail-based sampling (decided after all spans arrive) captures all errors and slow traces but requires the collector to buffer complete traces in memory. For a service doing 10K req/s, tail sampling at a 30-second decision window needs ~300K traces in memory. Size your collector accordingly.
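The 300K figure falls straight out of rate × decision window; a quick back-of-envelope script makes the memory side explicit too (the per-trace span count and per-span byte size below are illustrative assumptions, not measurements):

```shell
# Tail-sampling buffer estimate: traces held = trace rate * decision_wait
RATE=10000           # traces per second (10K req/s)
DECISION_WAIT=30     # seconds (decision_wait in the tail_sampling processor)
SPANS_PER_TRACE=10   # illustrative average
BYTES_PER_SPAN=500   # illustrative average; varies with attribute count
TRACES=$((RATE * DECISION_WAIT))
MB=$((TRACES * SPANS_PER_TRACE * BYTES_PER_SPAN / 1024 / 1024))
echo "${TRACES} traces buffered, roughly ${MB} MiB of span data"
# 300000 traces buffered, roughly 1430 MiB of span data
```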
Service dependency map from traces
# Jaeger service dependencies
curl -s "http://jaeger.monitoring:16686/api/dependencies?endTs=$(date +%s000)&lookback=86400000" | \
jq '.data[] | "\(.parent) -> \(.child) (\(.callCount) calls)"'
# "api-gateway -> auth-svc (14523 calls)"
# "api-gateway -> user-svc (12001 calls)"
# "api-gateway -> payment-svc (3456 calls)"
# "payment-svc -> stripe-api (3450 calls)"
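The parent→child pairs render nicely as a graph. A sketch that turns the API response into Graphviz DOT, using a stubbed payload in place of the curl above:

```shell
# Render the dependency list as Graphviz DOT (stub stands in for the curl above)
cat > /tmp/deps.json <<'EOF'
{"data":[{"parent":"api-gateway","child":"auth-svc","callCount":14523},
{"parent":"api-gateway","child":"user-svc","callCount":12001},
{"parent":"payment-svc","child":"stripe-api","callCount":3450}]}
EOF
{ echo "digraph deps {"
  jq -r '.data[] | "  \"\(.parent)\" -> \"\(.child)\" [label=\"\(.callCount)\"];"' /tmp/deps.json
  echo "}"; } > /tmp/deps.dot
cat /tmp/deps.dot
# render with: dot -Tpng /tmp/deps.dot -o deps.png
```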
Correlate traces with metrics
# Find the trace ID from a spike in latency dashboard
# In Grafana: use the "Traces" data source linked to your metrics panel
# Click on a latency spike → "View traces" → opens trace search for that time window
# Manual correlation: find traces during a metric spike
curl -s "http://jaeger.monitoring:16686/api/traces?service=api-gateway&start=$(date -d '10 min ago' +%s)000000&end=$(date +%s)000000&minDuration=5s&limit=10" | \
jq '.data[].traceID'
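The `date -d` arithmetic above is GNU-specific; the window can also be computed portably, since Jaeger's `start`/`end` params are epoch microseconds:

```shell
# Compute a 10-minute query window in epoch microseconds (portable shell arithmetic)
END_US=$(($(date +%s) * 1000000))
START_US=$((END_US - 10 * 60 * 1000000))
echo "start=${START_US}&end=${END_US}"
# then: curl -s "http://jaeger.monitoring:16686/api/traces?service=api-gateway&start=${START_US}&end=${END_US}&minDuration=5s&limit=10"
```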
Deploy Jaeger all-in-one for debugging
# Quick deployment for a dev/staging environment
kubectl create namespace tracing
kubectl apply -f - <<'EOF'
apiVersion: apps/v1
kind: Deployment
metadata:
  name: jaeger
  namespace: tracing
spec:
  replicas: 1
  selector:
    matchLabels:
      app: jaeger
  template:
    metadata:
      labels:
        app: jaeger
    spec:
      containers:
      - name: jaeger
        image: jaegertracing/all-in-one:latest
        ports:
        - containerPort: 16686   # UI
        - containerPort: 4317    # OTLP gRPC
        - containerPort: 4318    # OTLP HTTP
---
apiVersion: v1
kind: Service
metadata:
  name: jaeger
  namespace: tracing
spec:
  selector:
    app: jaeger
  ports:
  - name: ui
    port: 16686
  - name: otlp-grpc
    port: 4317
  - name: otlp-http
    port: 4318
EOF
# Port forward to access UI
kubectl port-forward -n tracing svc/jaeger 16686:16686
# Open http://localhost:16686
Default trap: Jaeger all-in-one uses in-memory storage by default — all traces are lost on restart. For staging, set `SPAN_STORAGE_TYPE=badger` for persistent local storage. For production, use a dedicated backend (Elasticsearch, Cassandra, or Tempo with object storage).
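For the staging route, the deployment above can be pointed at Badger with env vars and a volume. A sketch (the mount paths and the emptyDir volume are illustrative; emptyDir only survives container restarts, so use a PersistentVolumeClaim if traces must outlive pod rescheduling):

```yaml
# Add to the jaeger container spec in the Deployment above
        env:
        - name: SPAN_STORAGE_TYPE
          value: badger
        - name: BADGER_EPHEMERAL
          value: "false"
        - name: BADGER_DIRECTORY_KEY
          value: /badger/key
        - name: BADGER_DIRECTORY_VALUE
          value: /badger/data
        volumeMounts:
        - name: badger-data
          mountPath: /badger
      # and at pod spec level:
      volumes:
      - name: badger-data
        emptyDir: {}
```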