Thinking Out Loud: OpenTelemetry¶
A senior SRE's internal monologue while working through a real observability task. This isn't a tutorial — it's a window into how experienced engineers actually think.
The Situation¶
Users report that order submissions are intermittently slow — 3-5 seconds instead of the usual 300ms. The order flow involves 6 microservices. Traditional metrics show no single service is obviously slow. I need to use distributed tracing to find the bottleneck.
The Monologue¶
Intermittent slowness across 6 services, no obvious culprit in metrics. This is exactly what distributed tracing was built for. A trace will show me exactly which service and which span is consuming the time. Let me pull some slow traces.
```shell
curl -s "http://jaeger:16686/api/traces?service=order-gateway&operation=POST+%2Fapi%2Forders&limit=10&minDuration=3000000" | jq '.data[0].traceID'
```
Got a trace ID. Let me look at the full trace.
```shell
curl -s "http://jaeger:16686/api/traces/abc123def456" | jq '.data[0].spans[] | {service: .process.serviceName, operation: .operationName, duration_ms: (.duration/1000)}' | head -30
```
The trace shows: order-gateway (10ms) -> order-validator (45ms) -> inventory-check (2800ms) -> payment-processor (150ms) -> notification-service (50ms). There it is — inventory-check is taking 2.8 seconds. That's where the time goes.
But before I go yelling at the inventory team, let me check multiple slow traces. Is it always inventory-check?
```shell
curl -s "http://jaeger:16686/api/traces?service=order-gateway&operation=POST+%2Fapi%2Forders&limit=20&minDuration=3000000" | jq '[.data[].spans[] | select(.duration > 1000000) | {service: .process.serviceName, duration_ms: (.duration/1000)}] | group_by(.service) | map({service: .[0].service, avg_ms: (map(.duration_ms) | add / length), count: length})'
```
Across 20 slow traces, inventory-check appears in all 20 with an average duration of 2.5 seconds. Payment-processor appears in 3 with 800ms. So inventory-check is the primary culprit, but payment is occasionally slow too.
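The aggregation that jq pipeline performs can also be sketched in Python, which is easier to extend once you start slicing by more dimensions. The span fields mirror Jaeger's API shape; the sample data below is illustrative, not real trace output:

```python
from collections import defaultdict

def summarize_slow_spans(traces, threshold_us=1_000_000):
    """Group spans slower than `threshold_us` by service; report avg ms and count.

    `traces` mirrors the Jaeger API shape: each trace holds spans with a
    process.serviceName and a duration in microseconds.
    """
    buckets = defaultdict(list)
    for trace in traces:
        for span in trace["spans"]:
            if span["duration"] > threshold_us:
                buckets[span["process"]["serviceName"]].append(span["duration"] / 1000)
    return {
        svc: {"avg_ms": sum(ds) / len(ds), "count": len(ds)}
        for svc, ds in buckets.items()
    }

# Illustrative span data, not actual Jaeger output
traces = [
    {"spans": [
        {"process": {"serviceName": "inventory-check"}, "duration": 2_800_000},
        {"process": {"serviceName": "payment-processor"}, "duration": 150_000},
    ]},
    {"spans": [
        {"process": {"serviceName": "inventory-check"}, "duration": 2_200_000},
    ]},
]
print(summarize_slow_spans(traces))
# → {'inventory-check': {'avg_ms': 2500.0, 'count': 2}}
```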
Let me drill into a specific inventory-check span to see what it's doing internally.
```shell
curl -s "http://jaeger:16686/api/traces/abc123def456" | jq '.data[0].spans[] | select(.process.serviceName == "inventory-check") | {operation: .operationName, duration_ms: (.duration/1000), tags: [.tags[] | {key: .key, value: .value}]}'
```
Mental Model: Trace-Guided Debugging¶
Distributed tracing replaces "guessing which service is slow" with "seeing exactly where time is spent." The waterfall view shows sequential and parallel operations. Look for: (1) one span dominating the trace duration, (2) sequential calls that could be parallel, (3) spans with error tags, and (4) gaps between spans (indicates queueing or network latency).
The inventory-check span has child spans: db.query (2600ms) and cache.lookup (5ms). The cache missed and the DB query took 2.6 seconds. Let me check the query.
Looking at the span tags: db.statement: SELECT * FROM inventory WHERE product_id IN (...), db.type: postgresql. And the IN clause has... 350 product IDs. That's a bulk order checking 350 items against the inventory table.
A SELECT with 350 IDs isn't inherently slow. Let me check whether the query is actually using an index. The span tags should carry more detail about the query execution:
```shell
curl -s "http://jaeger:16686/api/traces/abc123def456" | jq '.data[0].spans[] | select(.operationName == "db.query") | .tags[] | select(.key | startswith("db."))'
```
db.rows_affected: 350, db.query_plan: Seq Scan. Sequential scan on the inventory table. There's no index on product_id. For small orders (5-10 items), the seq scan is fast enough. For bulk orders (350 items), it's brutal.
But wait — is this something that got worse recently, or has it always been slow for bulk orders? Let me check if the trace data shows a trend.
```shell
curl -s 'http://prometheus:9090/api/v1/query_range?query=histogram_quantile(0.99,rate(inventory_check_duration_seconds_bucket[5m]))&start='"$(date -d '7 days ago' +%s)"'&end='"$(date +%s)"'&step=3600' | jq '.data.result[0].values | .[-24:][] | .[1]'
```
P99 has been creeping up over the past week. It was 800ms a week ago, now it's 2.5 seconds. The table is growing and the seq scan is getting slower.
Mental Model: Traces for Diagnosis, Metrics for Trends¶
Traces tell you WHERE the problem is and WHAT it's doing. Metrics tell you WHEN it started and HOW FAST it's getting worse. Use traces to diagnose, metrics to trend. If you only have traces, you see the current state. If you only have metrics, you know there's a problem but not where. Together, they give you the full picture.
Okay, the fix is an index on the inventory table's product_id column. But I'm an SRE, not a DBA. Let me coordinate with the inventory team.
First, let me check if this is blocking enough users to warrant an emergency fix or if it can wait for the next sprint.
```shell
curl -s 'http://prometheus:9090/api/v1/query?query=sum(rate(order_submissions_total{status="slow"}[1h]))/sum(rate(order_submissions_total[1h]))' | jq '.data.result[0].value[1]'
```
8% of orders are taking more than 3 seconds. That's significant enough to justify an immediate fix. Let me also check if we can add the index without downtime.
For PostgreSQL, CREATE INDEX CONCURRENTLY adds an index without locking the table. The inventory team can do this without a maintenance window.
While they add the index, let me see if there's a quick caching fix on the application side. A cache layer is already there (I saw the cache.lookup span), but this request missed it. Why?
```shell
curl -s "http://jaeger:16686/api/traces/abc123def456" | jq '.data[0].spans[] | select(.operationName == "cache.lookup") | .tags[]'
```
cache.hit: false, cache.key: inventory:bulk:hash_abc123. The cache key is based on the exact set of product IDs. A bulk order with a unique combination of products will always miss the cache. The cache is effective for individual product lookups but useless for unique bulk combinations.
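One direction for that future redesign, sketched under the assumption that availability can be cached per product rather than per order: look up each product individually, so a bulk order with a novel combination of IDs still hits the per-product entries. The key format and `fetch_from_db` helper are hypothetical:

```python
def check_inventory(product_ids, cache, fetch_from_db):
    """Per-product cache: a novel bulk combination still reuses cached entries.

    `cache` is any dict-like store; `fetch_from_db` is a hypothetical helper
    that takes a list of ids and returns {id: quantity}.
    """
    result, misses = {}, []
    for pid in product_ids:
        key = f"inventory:product:{pid}"   # hypothetical per-product key format
        if key in cache:
            result[pid] = cache[key]
        else:
            misses.append(pid)
    # Only the uncached products go to the database
    for pid, qty in fetch_from_db(misses).items():
        cache[f"inventory:product:{pid}"] = qty
        result[pid] = qty
    return result
```

With this scheme, a 350-item order that shares 340 products with earlier orders only queries the database for the 10 new ones, instead of missing the cache entirely.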
The caching strategy needs rethinking, but that's a design discussion, not an incident fix. For now, the index will solve the performance issue.
Let me also make sure we have proper instrumentation going forward. I want to see the number of items in the inventory check as a span attribute, so we can correlate batch size with latency.
```shell
# Check OTel collector config for any sampling that might hide slow traces
kubectl get configmap otel-collector-config -n monitoring -o yaml | grep -A 10 "sampling\|probabilistic"
```
The collector is using probabilistic sampling at 10%, meaning we only see one in ten traces. That keeps storage costs down, but it also drops 90% of slow traces, exactly the data we need most when debugging. Let me add a tail-based sampling rule that keeps 100% of slow traces.
I'll update the collector config to add tail-based sampling for traces > 2 seconds. That way, we always capture slow traces even if the probabilistic sampler would have dropped them.
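A sketch of that collector change using the tail_sampling processor from the collector-contrib distribution (the policy names and the 10s decision window are illustrative, and the processor still has to be wired into the traces pipeline):

```yaml
processors:
  tail_sampling:
    decision_wait: 10s            # buffer spans until the whole trace arrives
    policies:
      - name: keep-slow-traces    # always keep traces slower than 2s
        type: latency
        latency:
          threshold_ms: 2000
      - name: baseline            # sample everything else at the current 10%
        type: probabilistic
        probabilistic:
          sampling_percentage: 10
```

A trace kept by any policy is exported, so slow traces survive even when the probabilistic policy would have dropped them.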
For now, I'll notify the inventory team about the missing index, confirm the 8% impact, and recommend they add the index with CREATE INDEX CONCURRENTLY today.
What Made This Senior-Level¶
| Junior Would... | Senior Does... | Why |
|---|---|---|
| Check each service's metrics individually, spending hours | Pull slow traces from Jaeger and immediately see which span is slow | Distributed tracing gives you the answer in seconds instead of hours |
| Look at the slow span and say "inventory is slow" | Drill into the span's child spans to find the specific DB query and missing index | The service isn't slow — a specific query is slow because of a missing index |
| Only look at the current state via traces | Combine traces (diagnosis) with metrics (trend) to see how long the issue has existed and how fast it's worsening | Trend data shapes the urgency and the fix — is this getting worse or stable? |
| Not think about sampling configuration | Check the OTel collector sampling config and add tail-based sampling for slow traces | Probabilistic sampling at 10% means you lose 90% of your debugging data. Keep 100% of slow traces. |
Key Heuristics Used¶
- Trace-Guided Debugging: Pull slow traces, find the dominant span, drill into its child spans and tags. The trace shows you exactly where time is spent.
- Traces + Metrics = Full Picture: Traces tell you WHERE and WHAT. Metrics tell you WHEN and HOW MUCH. Use both together for complete diagnosis.
- Tail-Based Sampling for Interesting Traces: Always keep 100% of slow or errored traces. Probabilistic sampling is fine for normal traffic but loses the data you need most.
Cross-References¶
- Primer — OpenTelemetry data model, trace/span/metric relationships, and collector architecture
- Street Ops — Jaeger queries, OTel collector debugging, and instrumentation patterns
- Footguns — Over-sampling dropping important traces, missing span attributes, and tracing without context propagation