
OpenTelemetry Footguns

  1. Running the collector as a single point of failure. You deploy one collector instance and route all telemetry through it. It crashes, and you lose every trace, metric, and log until it restarts. During an incident, you are blind exactly when you need visibility most.

Fix: Deploy collectors as a DaemonSet (agent tier) with a horizontally scaled gateway tier behind a headless service. Agents buffer locally; the gateway handles processing and export. Losing one agent affects one node, not the whole cluster.

War story: A fintech company ran a single OTel collector pod. During a production incident, the collector pod was evicted due to resource pressure — exactly when they needed traces to diagnose the issue. They had no telemetry for the 12-minute window that mattered most. DaemonSet deployment with priorityClassName: system-node-critical prevents this.
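A minimal sketch of the agent tier as a DaemonSet — the name, namespace, and image tag below are illustrative, not part of the original story:

```yaml
apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: otel-agent            # illustrative name
  namespace: observability    # illustrative namespace
spec:
  selector:
    matchLabels: {app: otel-agent}
  template:
    metadata:
      labels: {app: otel-agent}
    spec:
      # Keeps the agent scheduled even under node resource pressure,
      # which is exactly when you need telemetry most.
      priorityClassName: system-node-critical
      containers:
        - name: otel-agent
          image: otel/opentelemetry-collector-contrib:0.96.0  # example tag; pin your own
          args: ["--config=/etc/otel/config.yaml"]
```

The gateway tier is then a regular horizontally scaled Deployment behind a headless Service; agents export to it over OTLP.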

  2. No sampling strategy — exporting 100% of traces. Every request generates spans. At 10,000 RPS across 20 services, you are producing millions of spans per minute. Your backend costs explode. The collector OOMs. The network between collector and backend saturates.

Fix: Implement tail sampling at the gateway collector. Keep all errors and high-latency traces, probabilistically sample the rest at 5-10%. Review sampling rates quarterly as traffic grows.
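One way to express that policy with the contrib `tail_sampling` processor — the thresholds and policy names here are illustrative:

```yaml
processors:
  tail_sampling:
    decision_wait: 10s              # how long to hold a trace before deciding
    policies:
      - name: keep-errors
        type: status_code
        status_code: {status_codes: [ERROR]}
      - name: keep-slow
        type: latency
        latency: {threshold_ms: 2000}
      - name: baseline
        type: probabilistic
        probabilistic: {sampling_percentage: 10}
```

Note that tail sampling must run where whole traces converge, which is another reason it belongs on the gateway tier rather than the per-node agents.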

  3. Mixing SDK versions across services. Team A uses opentelemetry-sdk 1.20, Team B uses 1.12, and the shared library pins 0.43b0 of a pre-release API. Semantic conventions changed between versions, attribute names differ, and context propagation subtly breaks at service boundaries.

Fix: Pin OTel SDK versions in a shared dependency manifest. Upgrade all services together. Use a monorepo plugin or dependency bot to enforce version alignment across teams.
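For Python services, a shared pip constraints file is one way to enforce alignment — the versions below are examples, not recommendations:

```
# constraints.txt, consumed by every service's CI
# via: pip install -c constraints.txt -r requirements.txt
opentelemetry-api==1.20.0
opentelemetry-sdk==1.20.0
opentelemetry-exporter-otlp==1.20.0
```

Equivalent mechanisms exist elsewhere: a BOM in Maven/Gradle, a shared `go.mod` replace directive, or a Renovate/Dependabot group rule.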

  4. Ignoring resource attributes. Your spans arrive at the backend with service.name: unknown_service because nobody configured the resource. You cannot filter by service, environment, or version. Every dashboard query returns a wall of unlabeled data.

Fix: Always set service.name, service.version, and deployment.environment as resource attributes. Use the resourcedetection processor in the collector to auto-populate infrastructure attributes (host, k8s pod, cloud region).
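A sketch of the collector side, assuming the contrib distribution — the detector list is illustrative and should match where you actually run:

```yaml
processors:
  resourcedetection:
    detectors: [env, system]   # add k8snode, ec2, gcp, etc. as appropriate
    timeout: 5s
    override: false            # don't clobber attributes the SDK already set
```

On the SDK side, the standard environment variables cover the essentials without code changes, e.g. `OTEL_SERVICE_NAME=checkout` and `OTEL_RESOURCE_ATTRIBUTES=service.version=1.4.2,deployment.environment=prod` (values are examples).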

  5. Putting memory_limiter after batch in the processor chain. The batch processor accumulates data in memory before flushing. If it sits before the memory limiter, a traffic spike fills the batch buffer and the collector OOMs before the limiter ever checks memory usage.

Fix: Always order processors as memory_limiter first, then everything else, then batch last. The limiter acts as a circuit breaker; the batcher acts as an efficiency buffer. Guard first, optimize second.

Remember: The OTel collector processor chain is ordered. Data flows left to right through the list in your config. The canonical safe order is: memory_limiter -> k8sattributes/resource -> filter -> transform -> batch. Think of it as: guard, enrich, filter, optimize.
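That order, written out as a collector pipeline — the limiter and batch settings are illustrative defaults, and the middle processors are whichever enrich/filter steps you actually use:

```yaml
processors:
  memory_limiter:
    check_interval: 1s
    limit_percentage: 80         # hard ceiling as a % of available memory
    spike_limit_percentage: 20
  batch:
    send_batch_size: 8192
    timeout: 5s

service:
  pipelines:
    traces:
      receivers: [otlp]
      # guard -> enrich -> filter -> optimize
      processors: [memory_limiter, k8sattributes, filter, transform, batch]
      exporters: [otlp]
```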

  6. Using the debug exporter in production with verbosity: detailed. The debug exporter writes every span, metric, and log to stdout. In a high-throughput environment, this generates gigabytes of log output per hour, fills disk, overwhelms log aggregation, and creates backpressure in the pipeline.

Fix: Never run debug exporter with detailed verbosity in production. Use basic verbosity if you must, and only temporarily. For production debugging, use zpages or query the collector's self-metrics endpoint.
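If you must enable it temporarily, cap the verbosity and keep it out of your main pipeline's hot path — a sketch:

```yaml
exporters:
  debug:
    verbosity: basic   # one summary line per batch; never `detailed` in production
```

Remove the exporter again once you have what you need; even `basic` output adds log volume at high throughput.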

  7. Not testing collector config changes before rolling them out. You edit the collector ConfigMap, apply it, and the collector restarts with an invalid config. It crashes. All telemetry stops flowing. You scramble to revert while the incident that prompted the config change continues unmonitored.

Fix: Validate configs before applying: otelcol validate --config=config.yaml. Run a canary collector with the new config and compare output before rolling to the fleet. Treat collector config as code — review, test, deploy progressively.
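A hypothetical CI step that catches invalid configs before they reach the cluster — the step name and path are illustrative:

```yaml
# e.g. a GitHub Actions job step
- name: Validate collector config
  run: otelcol validate --config=collector/config.yaml
```

The same command works locally before you commit; the validate subcommand checks syntax and that every referenced component exists in the binary you run it against, so validate with the same distribution you deploy.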

  8. Forgetting context propagation at async boundaries. Your HTTP services propagate trace context automatically. But when Service A enqueues a message to Kafka and Service B consumes it, the trace breaks. Two halves of the same request appear as unrelated traces in your backend.

Fix: Manually inject trace context into message headers on the producer side and extract on the consumer side. Most OTel auto-instrumentation does NOT handle message queues. Write explicit propagation code for Kafka, RabbitMQ, SQS, and any other async transport.

Gotcha: Even when you inject context into message headers, the consumer must create a new span as a child of the extracted context — not as a root span. If you create a root span, you'll see two disconnected traces that look unrelated. The key API call is extract() on the consumer side, then start_span(context=extracted_context).
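A minimal sketch of the pattern using the W3C traceparent header format directly — real code would call `opentelemetry.propagate.inject` and `extract` instead of these hand-rolled helpers, and the producer/consumer wiring is illustrative:

```python
# Sketch of W3C trace context propagation across a message queue.
# In real code, use opentelemetry.propagate.inject/extract with your
# message headers as the carrier; these helpers just show the mechanics.

def inject(headers: dict, trace_id: str, span_id: str) -> None:
    """Producer side: write trace context into the message headers."""
    # version 00, sampled flag 01
    headers["traceparent"] = f"00-{trace_id}-{span_id}-01"

def extract(headers: dict):
    """Consumer side: recover (trace_id, parent_span_id) from headers."""
    tp = headers.get("traceparent")
    if tp is None:
        return None  # no context -> the consumer span becomes a new root
    _version, trace_id, span_id, _flags = tp.split("-")
    return trace_id, span_id

# Producer enqueues with context; consumer starts a CHILD span from it.
headers = {}
inject(headers, "4bf92f3577b34da6a3ce929d0e0e4736", "00f067aa0ba902b7")
assert extract(headers) == ("4bf92f3577b34da6a3ce929d0e0e4736",
                            "00f067aa0ba902b7")
```

The crucial step is what the Gotcha above describes: the consumer must pass the extracted context into its span creation call so the new span is a child, not a root.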

  9. Deploying the core collector when you need contrib components. Your config references the loki exporter, kafkareceiver, or awss3 exporter, but you deployed the core collector image. The collector exits on startup with "unknown component." You do not notice until someone checks why dashboards are empty.

Fix: Use otelcol-contrib if you need anything beyond the basic OTLP/Prometheus/Jaeger components. Document which collector distribution each environment runs. Pin the image tag explicitly in your deployment manifests.
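In the deployment manifest, that means the contrib image with an explicit tag — the tag shown is an example:

```yaml
containers:
  - name: otel-collector
    image: otel/opentelemetry-collector-contrib:0.96.0  # contrib distribution, pinned tag
```

Avoid `latest`: an unpinned tag means a node restart can silently pull a different collector version than the one you validated.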

  10. Treating OTel as a drop-in replacement overnight. You rip out Prometheus, Jaeger, and Fluentd in one sprint and replace everything with OTel. Half the team's dashboards break because metric names changed. Alert rules fire on missing metrics. On-call cannot find logs because the new pipeline is not fully wired.

Fix: Migrate incrementally. Run OTel in parallel with existing tooling. Dual-ship data to old and new backends. Migrate dashboards and alerts service-by-service. Cut over only when the new pipeline has proven stable under production load for at least two weeks.

Default trap: OTel's default OTLP metric temporality is "cumulative" but many backends (Datadog, New Relic) expect "delta." If you migrate from a Prometheus-based setup, your counter metrics may double-count or show incorrect rates in the new backend. Set OTEL_EXPORTER_OTLP_METRICS_TEMPORALITY_PREFERENCE=delta if your backend requires it.
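Set it wherever the SDK runs (pod spec, systemd unit, or shell), and only if your backend actually expects delta temporality:

```shell
export OTEL_EXPORTER_OTLP_METRICS_TEMPORALITY_PREFERENCE=delta
```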