
Distributed Tracing - Street-Level Ops

Real-world workflows for debugging latency and failures using distributed tracing.

Find a trace from an error log

# Step 1: Extract trace ID from structured logs
kubectl logs -l app=api-server -n production --since=1h | \
  jq -r 'select(.level == "error") | "\(.timestamp) \(.trace_id) \(.error)"' | head -5
# 2024-03-14T03:22:41Z abc123def456 connection refused: upstream db-primary:5432
# 2024-03-14T03:23:01Z 789ghi012jkl timeout waiting for payment-svc

# Step 2: Open trace in Jaeger UI
# http://jaeger.monitoring:16686/trace/abc123def456

# Step 3: Or query Jaeger API directly
# .processID is an opaque key (p1, p2, ...); resolve it via .data[0].processes
curl -s "http://jaeger.monitoring:16686/api/traces/abc123def456" | \
  jq '.data[0] | .processes as $p | .spans[] | {service: $p[.processID].serviceName, operation: .operationName, duration: .duration, error: ([.tags[] | select(.key == "error") | .value] | first)}'
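Jaeger's API keys each span to an opaque process ID (p1, p2, ...); the actual service name lives in `.data[0].processes`. Here is a sketch of that join run against a canned response of the same shape, so the jq filter can be tried without a cluster:

```shell
# Canned response with the same shape as /api/traces/<trace-id> (abridged)
cat > /tmp/trace.json <<'EOF'
{"data":[{"processes":{"p1":{"serviceName":"api-gateway"},"p2":{"serviceName":"auth-svc"}},
"spans":[{"processID":"p1","operationName":"HTTP GET /api/health","duration":4200},
{"processID":"p2","operationName":"validateToken","duration":1100}]}]}
EOF

# Bind the processes map to $p, then resolve each span's processID to a name
jq '.data[0] | .processes as $p | .spans[]
    | {service: $p[.processID].serviceName, operation: .operationName, us: .duration}' /tmp/trace.json
```

The same filter works unchanged on the live curl output, since the response shape is identical.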

Query Jaeger for slow traces

# Find traces over 5 seconds in the last hour (minDuration takes Go-style durations)
curl -s "http://jaeger.monitoring:16686/api/traces?service=api-gateway&minDuration=5s&lookback=1h&limit=20" | \
  jq '.data[] | {traceID: .traceID, spans: (.spans | length), duration: ((.spans | map(.duration) | max) / 1000 | floor | tostring + "ms")}'
# {"traceID":"abc123","spans":8,"duration":"7234ms"}
# {"traceID":"def456","spans":12,"duration":"12001ms"}

# Find traces with errors
curl -s "http://jaeger.monitoring:16686/api/traces?service=api-gateway&tags=%7B%22error%22%3A%22true%22%7D&lookback=1h&limit=20" | \
  jq '.data[].traceID'

Check OpenTelemetry Collector health

# Verify the collector is running
kubectl get pods -n observability -l app=otel-collector
# NAME                              READY   STATUS    RESTARTS
# otel-collector-7b9d4c5f6-x2k8q   1/1     Running   0

# Check collector metrics
kubectl port-forward -n observability svc/otel-collector 8888:8888 &  # backgrounded; it blocks otherwise
curl -s http://localhost:8888/metrics | grep -E "otelcol_receiver_accepted|otelcol_exporter_sent"
# otelcol_receiver_accepted_spans_total{receiver="otlp"} 142857
# otelcol_exporter_sent_spans_total{exporter="jaeger"} 142850
# ← 7 dropped spans, investigate if this grows

# Check collector logs for errors
kubectl logs -n observability -l app=otel-collector --tail=30 | grep -i error

Debug clue: When otelcol_receiver_accepted_spans and otelcol_exporter_sent_spans diverge significantly, check otelcol_processor_dropped_spans to find which processor is dropping them. Common cause: a tail-sampling processor running out of memory because its decision_wait window is too long for your span volume.
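The divergence check above can be scripted. A hedged sketch: the awk matching is loose and assumes a single receiver and a single exporter, so aggregate first if you run several.

```shell
# Pull accepted/sent counters from the collector and print the gap
kubectl port-forward -n observability svc/otel-collector 8888:8888 &
PF_PID=$!
sleep 2
curl -s http://localhost:8888/metrics |
  awk '/^otelcol_receiver_accepted_spans/ {acc = $NF}
       /^otelcol_exporter_sent_spans/    {sent = $NF}
       END {printf "accepted=%d sent=%d gap=%d\n", acc, sent, acc - sent}'
kill $PF_PID
```

A persistently growing gap is the signal to go look at the processor drop metrics.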

Verify trace propagation across services

# Send a test request and follow the trace
TRACE_ID=$(python3 -c "import uuid; print(uuid.uuid4().hex)")
curl -v -H "traceparent: 00-${TRACE_ID}-0000000000000001-01" \
  http://api-gateway.production:8080/api/health
# The traceparent header propagates through the call chain

# Check that all services reported spans for this trace
sleep 5  # wait for spans to flush
curl -s "http://jaeger.monitoring:16686/api/traces/${TRACE_ID}" | \
  jq '.data[0] | .processes as $p | .spans[] | {service: $p[.processID].serviceName, operation: .operationName}'
# {"service":"api-gateway","operation":"HTTP GET /api/health"}
# {"service":"auth-svc","operation":"validateToken"}
# {"service":"user-svc","operation":"getUser"}
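To make this check repeatable, compare the services that actually reported spans against the hops you expect. A sketch; EXPECTED is an assumption here, substitute your real call chain:

```shell
EXPECTED="api-gateway auth-svc user-svc"
# Resolve each span's processID to a service name, deduplicate
GOT=$(curl -s "http://jaeger.monitoring:16686/api/traces/${TRACE_ID}" |
  jq -r '.data[0] | .processes as $p | .spans[] | $p[.processID].serviceName' | sort -u)
# Report any expected hop that produced no span
for svc in $EXPECTED; do
  echo "$GOT" | grep -qx "$svc" || echo "MISSING span from: $svc"
done
```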

Debug a broken trace (missing spans)

# Trace shows api-gateway -> auth-svc but then stops
# auth-svc should call user-svc but there is no span

# 1. Check if user-svc has OTel instrumentation
kubectl exec -n production deploy/user-svc -- env | grep OTEL
# OTEL_EXPORTER_OTLP_ENDPOINT=http://otel-collector.observability:4317
# OTEL_SERVICE_NAME=user-svc
# OTEL_TRACES_SAMPLER=parentbased_traceidratio

# 2. Check if the SDK is loaded
kubectl logs -n production -l app=user-svc --tail=20 | grep -i "opentelemetry\|otel\|tracer"

# 3. Check if spans are reaching the collector
kubectl logs -n observability -l app=otel-collector --tail=50 | grep "user-svc"

# 4. Common cause: HTTP client not instrumented
# The app makes outbound calls with a raw HTTP client that does not propagate headers

Gotcha: The most common cause of broken traces is context propagation failure. Auto-instrumentation handles inbound requests, but outbound HTTP calls need the traceparent header injected. In Go, use otelhttp.NewTransport(); in Python, requests needs the opentelemetry-instrumentation-requests package. One un-instrumented HTTP client breaks the entire downstream trace chain.
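For Go services, one rough way to spot raw clients that bypass otelhttp.NewTransport() is to grep the source for bare http.Client construction. A heuristic sketch; the path is hypothetical:

```shell
# Flag files that construct an http.Client but never mention otelhttp
grep -rln "http\.Client{" ./services/user-svc/ |
while read -r f; do
  grep -q "otelhttp" "$f" || echo "un-instrumented client in: $f"
done
```

This is only a heuristic (a file can import otelhttp and still build one raw client), but it narrows the search fast.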

Grafana Tempo query by trace ID

# Query Tempo API directly (OTLP JSON encodes nanosecond timestamps as strings, hence tonumber)
curl -s "http://tempo.monitoring:3200/api/traces/abc123def456" | \
  jq '.batches[].scopeSpans[].spans[] | {name: .name, duration: ((((.endTimeUnixNano | tonumber) - (.startTimeUnixNano | tonumber)) / 1000000 | floor | tostring) + "ms")}'

# Search traces by service and duration (TraceQL)
curl -s "http://tempo.monitoring:3200/api/search?q=%7Bspan.http.status_code%3D500%7D&limit=10" | \
  jq '.traces[] | {traceID, rootServiceName, durationMs}'
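TraceQL can also filter on the intrinsic `duration` directly; `{duration > 5s}` URL-encodes as below. A sketch, with the jq field names matching the search response shown above:

```shell
# Search for traces longer than 5s and print one line per hit
curl -s "http://tempo.monitoring:3200/api/search?q=%7Bduration%20%3E%205s%7D&limit=10" |
  jq -r '.traces[] | "\(.traceID)  \(.rootServiceName)  \(.durationMs)ms"'
```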

Check sampling configuration

# What sampling rate is configured?
kubectl get configmap otel-collector-config -n observability -o yaml | \
  grep -A10 "processors:" | grep -A5 "probabilistic_sampler"
# probabilistic_sampler:
#   sampling_percentage: 10    ← 10% of traces kept

# Check if tail sampling is configured
kubectl get configmap otel-collector-config -n observability -o yaml | \
  grep -A15 "tail_sampling"
# tail_sampling:
#   policies:
#     - name: error-policy
#       type: status_code
#       status_code: {status_codes: [ERROR]}
#     - name: latency-policy
#       type: latency
#       latency: {threshold_ms: 5000}

Under the hood: Head-based sampling (decided at trace start) is cheap but blind — it drops traces before knowing if they are interesting. Tail-based sampling (decided after all spans arrive) captures all errors and slow traces but requires the collector to buffer complete traces in memory. For a service doing 10K req/s, tail sampling at a 30-second decision window needs ~300K traces in memory. Size your collector accordingly.
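The sizing arithmetic from the note above, as a back-of-envelope awk sketch. Spans-per-trace and bytes-per-span are assumptions; substitute your own measurements:

```shell
awk 'BEGIN {
  rps = 10000            # traces started per second
  wait_s = 30            # tail_sampling decision_wait
  spans_per_trace = 10   # assumed average
  bytes_per_span = 1024  # assumed average in-memory span size
  traces = rps * wait_s
  gib = traces * spans_per_trace * bytes_per_span / 1024 / 1024 / 1024
  printf "buffered traces: %d, approx span memory: %.1f GiB\n", traces, gib
}'
# buffered traces: 300000, approx span memory: 2.9 GiB
```

Span memory is only part of the footprint (indexes, queues, and GC headroom add more), so treat the result as a floor, not a budget.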

Service dependency map from traces

# Jaeger service dependencies
curl -s "http://jaeger.monitoring:16686/api/dependencies?endTs=$(date +%s000)&lookback=86400000" | \
  jq '.data[] | "\(.parent) -> \(.child) (\(.callCount) calls)"'
# "api-gateway -> auth-svc (14523 calls)"
# "api-gateway -> user-svc (12001 calls)"
# "api-gateway -> payment-svc (3456 calls)"
# "payment-svc -> stripe-api (3450 calls)"
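Sorting the same dependency data by call count surfaces the noisiest edges first; a sketch:

```shell
# Top 5 edges by call volume over the last 24h
curl -s "http://jaeger.monitoring:16686/api/dependencies?endTs=$(date +%s000)&lookback=86400000" |
  jq -r '.data | sort_by(-.callCount) | .[:5][] | "\(.callCount)\t\(.parent) -> \(.child)"'
```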

Correlate traces with metrics

# Find the trace ID from a spike in latency dashboard
# In Grafana: use the "Traces" data source linked to your metrics panel
# Click on a latency spike → "View traces" → opens trace search for that time window

# Manual correlation: find traces during a metric spike
# (GNU date shown; on macOS/BSD use: date -v-10M +%s)
curl -s "http://jaeger.monitoring:16686/api/traces?service=api-gateway&start=$(date -d '10 min ago' +%s)000000&end=$(date +%s)000000&minDuration=5s&limit=10" | \
  jq '.data[].traceID'

Deploy Jaeger all-in-one for debugging

# Quick deployment for a dev/staging environment
kubectl create namespace tracing
kubectl apply -f - <<'EOF'
apiVersion: apps/v1
kind: Deployment
metadata:
  name: jaeger
  namespace: tracing
spec:
  replicas: 1
  selector:
    matchLabels:
      app: jaeger
  template:
    metadata:
      labels:
        app: jaeger
    spec:
      containers:
      - name: jaeger
        image: jaegertracing/all-in-one:latest
        ports:
        - containerPort: 16686
        - containerPort: 4317
        - containerPort: 4318
---
apiVersion: v1
kind: Service
metadata:
  name: jaeger
  namespace: tracing
spec:
  selector:
    app: jaeger
  ports:
  - name: ui
    port: 16686
  - name: otlp-grpc
    port: 4317
  - name: otlp-http
    port: 4318
EOF

# Port forward to access UI
kubectl port-forward -n tracing svc/jaeger 16686:16686
# Open http://localhost:16686

Default trap: Jaeger all-in-one uses in-memory storage by default — all traces are lost on restart. For staging, set SPAN_STORAGE_TYPE=badger for persistent local storage. For production, use a dedicated backend (Elasticsearch, Cassandra, or Tempo with object storage).
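A sketch of the staging fix, using the Badger environment variables from the Jaeger all-in-one docs. Note you still need a persistent volume mounted at these paths for data to survive pod rescheduling:

```shell
kubectl set env deploy/jaeger -n tracing \
  SPAN_STORAGE_TYPE=badger \
  BADGER_EPHEMERAL=false \
  BADGER_DIRECTORY_VALUE=/badger/data \
  BADGER_DIRECTORY_KEY=/badger/key
```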