Distributed Tracing Footguns¶
Mistakes that break trace chains, overwhelm storage, or hide the real problem.
1. One uninstrumented service breaks the entire trace¶
You have 12 services fully instrumented. Service #7 (an internal proxy) was never instrumented. Every trace that passes through it splits into two disconnected fragments. You see the frontend half and the backend half but cannot connect them.
Fix: Instrument every service in the request path, even simple proxies. At minimum, propagate the traceparent header. Use OpenTelemetry auto-instrumentation for low-effort coverage.
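At the wire level, propagation just means keeping the W3C traceparent header alive across the hop. A stdlib-only sketch of what an instrumented proxy does with it (the function name is illustrative; real services should use OpenTelemetry propagators rather than hand-rolling this):

```python
import re
import secrets

# W3C traceparent: version-traceid-parentid-flags
TRACEPARENT_RE = re.compile(r"^00-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})$")

def forward_traceparent(inbound_headers: dict) -> dict:
    """Build outbound headers that keep the caller's trace intact.

    Keeps the 32-hex trace-id, mints a fresh 16-hex parent-id for the
    hop through this service, and preserves the sampling flags.
    """
    match = TRACEPARENT_RE.match(inbound_headers.get("traceparent", ""))
    if not match:
        # No valid context; an instrumented service would start a new trace.
        return {}
    trace_id, _parent_id, flags = match.groups()
    new_span_id = secrets.token_hex(8)
    return {"traceparent": f"00-{trace_id}-{new_span_id}-{flags}"}

inbound = {"traceparent": "00-4bf92f3577b34da6a3ce929d0e0e4736-00f067aa0ba902b7-01"}
outbound = forward_traceparent(inbound)
```

As long as every hop forwards the same trace-id, the backend can stitch the fragments into one trace.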
2. Sampling 100% of traces in production¶
You set sampling to 100% because "we might need any trace." Your tracing backend receives 50 million spans per hour. Storage fills up in days. Query performance collapses. The tracing system itself becomes the incident.
Fix: Use head-based sampling at 1-10% for normal traffic. Add tail-based sampling to keep 100% of errors and slow traces. Prioritize high-value paths (payment, auth) at higher rates.
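Head-based sampling must make the same keep/drop decision at every service, or traces arrive with holes in them. A stdlib sketch of the idea behind OTel's TraceIdRatioBased sampler (a sketch only; real deployments should configure the SDK sampler):

```python
def should_sample(trace_id_hex: str, ratio: float) -> bool:
    """Deterministic head-based sampling keyed on the trace id.

    Every service computes the same answer for the same trace id,
    so traces are kept or dropped whole rather than fragmented.
    """
    bound = int(ratio * (1 << 64))
    # Compare the low 64 bits of the 128-bit trace id against the ratio bound.
    return int(trace_id_hex, 16) & ((1 << 64) - 1) < bound
```

Tail-based sampling (keep all errors and slow traces) is a collector-side decision and complements, rather than replaces, this head-based filter.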
Scale note: At 10,000 traces/sec with an average trace size of 50KB, you generate ~500MB/sec of trace data, about 43TB per day or ~1.3PB (~1,300,000GB) per month. At typical observability vendor pricing ($0.30-1.50/GB ingested), that is roughly $390,000-1,950,000/month. Sampling at 5% reduces this to roughly $19,500-97,500/month while keeping 100% of errors via tail-based sampling.
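Volume estimates like this are easy to get wrong by an order of magnitude, so it helps to make the assumptions explicit in a small helper (decimal units, a 30-day month, and per-GB ingest pricing are all illustrative assumptions):

```python
def trace_volume_cost(traces_per_sec, avg_trace_kb, price_per_gb, sample_rate=1.0):
    """Back-of-envelope monthly trace ingest volume (GB) and cost ($).

    Assumes decimal units (1 GB = 1e9 bytes) and a 30-day month.
    """
    bytes_per_sec = traces_per_sec * avg_trace_kb * 1_000 * sample_rate
    gb_per_month = bytes_per_sec * 86_400 * 30 / 1e9
    return gb_per_month, gb_per_month * price_per_gb

gb, cost = trace_volume_cost(10_000, 50, 0.30)          # full firehose
gb5, cost5 = trace_volume_cost(10_000, 50, 0.30, 0.05)  # 5% head sampling
```

At 10,000 traces/sec and 50KB per trace this comes out to ~1.3 million GB/month; 5% sampling cuts cost twentyfold.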
3. Trace context not propagated through message queues¶
Your service publishes to Kafka. The consumer processes the message. The trace stops at the publisher. The consumer starts a new, disconnected trace. You cannot follow a request from HTTP through the async pipeline.
Fix: Inject trace context into message headers when publishing. Extract it when consuming. OpenTelemetry has instrumentation for Kafka, RabbitMQ, and SQS that handles this automatically.
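The mechanics are the same as HTTP propagation, just with message headers as the carrier. A stdlib sketch using Kafka-style headers, i.e. a list of (key, bytes) tuples (function names are illustrative; OpenTelemetry's Kafka instrumentation does this for you):

```python
from typing import List, Optional, Tuple

Headers = List[Tuple[str, bytes]]

def inject_context(headers: Headers, traceparent: str) -> Headers:
    """Attach W3C trace context to the message headers before publishing."""
    return headers + [("traceparent", traceparent.encode("utf-8"))]

def extract_context(headers: Headers) -> Optional[str]:
    """Recover the trace context on the consumer side so the processing
    span joins the producer's trace instead of starting a new one."""
    for key, value in headers:
        if key == "traceparent":
            return value.decode("utf-8")
    return None

published = inject_context([("content-type", b"application/json")],
                           "00-4bf92f3577b34da6a3ce929d0e0e4736-00f067aa0ba902b7-01")
```

The consumer extracts the context, starts its span as a child of it, and the HTTP-to-async pipeline shows up as one trace.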
4. Clock skew makes traces unreadable¶
Your trace shows a child span starting before its parent. The timeline is garbled. Spans overlap in impossible ways. The cause: host clocks are off by seconds because NTP is not running or is misconfigured.
Fix: Ensure time synchronization is running on every host; check with timedatectl status. Use chrony or systemd-timesyncd. Verify clock sync is within milliseconds. Container hosts and VMs are especially prone to drift.
Debug clue: In Jaeger or Grafana Tempo, if you see child spans starting before parent spans, that is clock skew, not a bug in your code. Check chronyc sources -v on both hosts. VMs after live migration and containers on overloaded hosts are the most common offenders.
5. Sensitive data in span attributes¶
Your instrumentation records db.statement with the full SQL query, including user passwords in WHERE clauses. Span attributes show http.url with API keys in query parameters. Anyone with Jaeger access can see secrets.
Fix: Sanitize span attributes. Drop or redact query parameters, request bodies, and SQL values. Configure OTel SDK to exclude sensitive attributes. Review what your auto-instrumentation captures.
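A minimal stdlib sketch of one piece of this, scrubbing query parameters before a URL is recorded as an http.url attribute (the key list is illustrative; production setups should also handle bodies and SQL values, e.g. via an OTel SDK span processor):

```python
from urllib.parse import urlsplit, urlunsplit, parse_qsl, urlencode

# Illustrative deny-list; extend to match your own secret-bearing parameters.
SENSITIVE_KEYS = {"api_key", "token", "password", "secret"}

def redact_url(url: str) -> str:
    """Replace sensitive query parameter values with a placeholder
    so the recorded span attribute never contains the secret."""
    parts = urlsplit(url)
    query = [(k, "REDACTED" if k.lower() in SENSITIVE_KEYS else v)
             for k, v in parse_qsl(parts.query, keep_blank_values=True)]
    return urlunsplit(parts._replace(query=urlencode(query)))

clean = redact_url("https://api.example.com/v1/users?api_key=abc123&page=2")
```

Run the same kind of scrubbing over db.statement and any custom attributes before export.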
6. Wrong service name across environments¶
Dev, staging, and prod all report spans with service.name=api-gateway. Your Jaeger instance receives traces from all environments. Service dependency maps are nonsensical. You trace a production error and find dev spans mixed in.
Fix: Include environment in the service name or use resource attributes: OTEL_SERVICE_NAME=api-gateway with OTEL_RESOURCE_ATTRIBUTES=deployment.environment=production. Filter by environment in queries.
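In a container or deployment manifest this is two environment variables per environment, for example (values illustrative):

```shell
# Same binary, distinct identity per environment
export OTEL_SERVICE_NAME=api-gateway
export OTEL_RESOURCE_ATTRIBUTES=deployment.environment=production
```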
7. Collector as a single point of failure¶
Your single OTel Collector instance crashes. All services buffer spans locally until they overflow. Spans are dropped. During the outage, three errors occur that you will never be able to trace.
Fix: Run the collector as a DaemonSet or deploy multiple replicas behind a load balancer. Configure retry and queue settings in the exporter. Use a gateway collector pattern for resilience.
Gotcha: The OTel SDK's default exporter queue size is 2048 spans. When the collector is down, spans queue in the SDK. Once the queue fills, new spans are dropped silently. The otelcol_exporter_send_failed_spans metric tells you if your collector is dropping data. If you see it climbing during an incident, you're losing exactly the traces you need most.
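On the SDK side, the batch span processor's buffering is tunable through standard OpenTelemetry environment variables, which buys headroom while the collector is briefly unreachable (sizes below are illustrative, not recommendations):

```shell
# OTel SDK batch span processor tuning (spec-defined env vars)
export OTEL_BSP_MAX_QUEUE_SIZE=8192        # spans buffered in the SDK; default 2048
export OTEL_BSP_MAX_EXPORT_BATCH_SIZE=512  # spans per export call; default 512
export OTEL_BSP_SCHEDULE_DELAY=5000        # ms between exports; default 5000
```

A bigger queue delays, but does not prevent, silent drops; collector redundancy is the real fix.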
8. Ignoring span status — treating all spans as successful¶
Your instrumentation creates spans but never sets the status to Error when exceptions occur. In Jaeger, everything looks green. You cannot search for failed traces because no span is marked as an error.
Fix: Set span status to Error on exceptions. Record the exception message and stack trace as span events. In OpenTelemetry: span.set_status(StatusCode.ERROR, "description") and span.record_exception(e).
9. Too many low-value spans¶
Every function call is a span. A single request creates 200 spans. Trace views are unreadable. Storage costs are 10x what they should be. Finding the actual problem in a wall of spans takes longer than grepping logs.
Fix: Instrument at service boundaries (HTTP calls, database queries, queue operations), not internal functions. Use span events or logs for internal detail. Quality over quantity.
10. No connection between traces and logs¶
Your tracing backend has the trace. Your logging backend has the error details. There is no trace ID in the log line. Jumping from "I see a failed span" to "what exactly went wrong" requires timestamp correlation and guesswork.
Fix: Inject trace_id and span_id into every structured log line. Configure your log viewer (Kibana, Grafana Loki) to link directly to the trace backend. OpenTelemetry SDKs can inject trace context into log records automatically.
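A stdlib sketch of the injection side using a logging.Filter and a contextvar (the contextvar stands in for whatever your middleware or the OTel SDK sets; the logger and field names are illustrative):

```python
import contextvars
import io
import logging

# Holds the active (trace_id, span_id); middleware or instrumentation sets it.
current_trace = contextvars.ContextVar("current_trace", default=("-", "-"))

class TraceContextFilter(logging.Filter):
    """Copy the active trace context onto every log record so the log
    pipeline can link each line directly to its trace."""
    def filter(self, record):
        record.trace_id, record.span_id = current_trace.get()
        return True

stream = io.StringIO()  # stand-in for stdout/file in a real service
handler = logging.StreamHandler(stream)
handler.setFormatter(logging.Formatter(
    "%(levelname)s trace_id=%(trace_id)s span_id=%(span_id)s %(message)s"))
handler.addFilter(TraceContextFilter())

logger = logging.getLogger("payments")
logger.addHandler(handler)
logger.setLevel(logging.INFO)

current_trace.set(("4bf92f3577b34da6a3ce929d0e0e4736", "00f067aa0ba902b7"))
logger.info("charge failed")
```

Once every line carries trace_id, a log viewer can turn it into a one-click jump to the trace backend.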