Comparison: Tracing Platforms¶
Category: Observability
Last reviewed: 2026-03
Verdict (opinionated): Tempo if you are in the Grafana ecosystem — it is the cheapest to operate and queries by trace ID. Datadog APM if you are already paying Datadog. Jaeger is solid but Tempo is eating its lunch.
Quick Decision Matrix¶
| Factor | Jaeger | Tempo | Zipkin | Datadog APM |
|---|---|---|---|---|
| Learning curve | Medium | Low-Medium | Low | Low |
| Operational overhead | Medium-High | Low | Medium | None (SaaS) |
| Cost at small scale | Free | Free | Free | Expensive (per-host + span ingest) |
| Cost at large scale | Medium (storage) | Low (object storage) | Medium | Very expensive |
| Community/ecosystem | Large (CNCF) | Growing (Grafana Labs) | Moderate (legacy) | Vendor-controlled |
| Hiring | Moderate | Growing | Declining | Easy |
| Query capabilities | Search by tags, service, time | Trace ID lookup + TraceQL | Search by tags, service | Full search + analytics |
| Storage backend | Elasticsearch, Cassandra (Kafka as ingest buffer) | S3/GCS/Azure Blob | Elasticsearch, Cassandra, MySQL | Proprietary |
| K8s integration | Jaeger Operator | Helm chart, Grafana Alloy | Helm chart | Datadog Agent |
| OpenTelemetry | Native (OTLP receiver) | Native (OTLP receiver) | Requires translation | Native |
| Service map | Built-in | Via Grafana | Built-in | Excellent |
| Tail-based sampling | Via OTel Collector (remote sampling is head-based) | Grafana Alloy / OTel Collector | Limited | Built-in |
When to Pick Each¶
Pick Jaeger when:¶
- You need a mature, self-hosted tracing backend with strong search capabilities
- You already have Elasticsearch or Cassandra in your stack for trace storage
- You want the CNCF-backed standard and are not in the Grafana ecosystem
- Your team needs a standalone tracing UI with good trace comparison features
- You require adaptive sampling with the Jaeger remote sampling protocol
Pick Tempo when:¶
- You are already using Grafana + Prometheus + Loki and want traces in the same stack
- Cost is a priority — Tempo uses object storage (S3/GCS), which is dramatically cheaper
- Your primary query pattern is "give me the trace for this trace ID" (from logs or metrics exemplars)
- You want TraceQL for searching traces by span attributes without building a full index
- You are deploying OpenTelemetry and want a simple, OTLP-native backend
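The exemplar path mentioned above is worth seeing concretely. In the OpenMetrics exposition format, an exemplar rides along with a histogram bucket sample; the attached trace ID is what lets Grafana jump from a latency spike directly to the matching trace in Tempo. The metric name, label values, and trace ID below are illustrative, not from any real system:

```
# value 1247 for the le="0.5" bucket, with an exemplar:
# "# {labels} exemplar_value exemplar_timestamp"
http_request_duration_seconds_bucket{le="0.5"} 1247 # {trace_id="4bf92f3577b34da6a3ce929d0e0e4736"} 0.41 1700000000.0
```

If your client libraries are not emitting exemplars like this, the metrics-to-traces workflow that justifies Tempo does not exist yet.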
Pick Zipkin when:¶
- You have an existing Zipkin deployment and migration cost is not justified
- Your tracing needs are basic — simple request tracing across a few services
- You want the simplest possible self-hosted tracing setup
- (Counterpoint: do not pick Zipkin for new projects — Jaeger and Tempo are better choices)
Pick Datadog APM when:¶
- You are already paying for Datadog and want traces correlated with metrics and logs
- Your team wants code-level flame graphs and profiling without additional tooling
- You need automated service dependency maps without manual instrumentation
- Budget is approved and operational simplicity is the priority
- You want live tail and real-time trace search
Nobody Tells You¶
Jaeger¶
- Jaeger's storage backend choice is critical and hard to change later. Elasticsearch gives you search but is expensive to operate. Cassandra gives you scale but is even harder to operate.
- The Jaeger Operator simplifies K8s deployment but adds another operator to manage. Sidecar injection works but consumes resources on every pod.
- Jaeger UI is functional but has not seen major improvements in years. The trace comparison feature is useful but the overall UX feels dated compared to Datadog.
- Sampling strategy decisions are permanent in practice. Head-based sampling (decide at trace start) misses errors in unsampled traces. Tail-based sampling (decide at trace end) requires buffering all spans.
- At high volume, Jaeger ingestion pipelines (Collector → Kafka → Ingester → Storage) are complex. Each component can bottleneck, and debugging throughput issues requires understanding all of them.
- Jaeger and Tempo share the same problem: tracing only becomes valuable when every service is instrumented. One gap in the chain breaks the trace.
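The head-based vs tail-based tradeoff above hinges on one property: a head-based decision must be computable from the trace ID alone, so every service in the call chain reaches the same verdict without coordination. A minimal sketch, similar in spirit to (but not identical to) OpenTelemetry's `TraceIdRatioBased` sampler:

```python
# Minimal sketch of head-based, trace-ID-ratio sampling. This is an
# illustration of the idea, not the exact algorithm any SDK ships.
# Because the decision is a pure function of the trace ID, all services
# in the chain agree on it -- and, as noted above, errors that occur in
# an unsampled trace are simply never recorded.

def should_sample(trace_id: str, ratio: float) -> bool:
    """Keep roughly `ratio` of traces, decided once at trace start."""
    # Treat the low 8 hex chars of the 128-bit trace ID as a uniform
    # value in [0, 1] and compare against the target ratio.
    bucket = int(trace_id[-8:], 16) / 0xFFFFFFFF
    return bucket < ratio

# The same trace ID always yields the same decision:
tid = "4bf92f3577b34da6a3ce929d0e0e4736"
assert should_sample(tid, 1.0) is True
assert should_sample(tid, 0.0) is False
assert should_sample(tid, 0.5) == should_sample(tid, 0.5)
```

Tail-based sampling inverts this: the decision waits until the trace is complete, which is why it can keep every error trace but requires buffering all spans somewhere (e.g. in an OTel Collector tier) until the verdict.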
Tempo¶
- Tempo does NOT have a full search index. TraceQL helps, but searching for "all traces with HTTP 500 errors in the last hour" is slower than Jaeger with Elasticsearch. Tempo is optimized for "trace ID → full trace" lookup.
- The exemplar workflow (metrics → exemplar → trace ID → Tempo) is the intended path. If your metrics do not have exemplars, Tempo's value drops significantly.
- TraceQL is powerful but still maturing. Complex queries (e.g., comparing span durations across services) can be slow on large datasets.
- Tempo in microservices mode requires careful tuning of distributors, ingesters, compactors, and queriers. Start with monolithic mode.
- Object storage latency means trace retrieval is slower than Elasticsearch-backed Jaeger. 1-3 seconds for a trace lookup is normal.
- Grafana's trace-to-logs and trace-to-metrics features are the killer integration — but they require consistent label naming across all three systems (Prometheus, Loki, Tempo).
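To make the search tradeoff concrete, here is roughly what the "all traces with HTTP 500 errors" query looks like in TraceQL. Attribute names depend on your instrumentation conventions (these assume OTel semantic-convention attributes), and Tempo answers it by scanning blocks in object storage rather than hitting an inverted index — hence the latency gap vs Elasticsearch-backed Jaeger:

```
# Traces containing a span that returned HTTP 500:
{ span.http.status_code = 500 }

# Attribute filter combined with an intrinsic (duration):
{ resource.service.name = "checkout" && duration > 2s }
```

These queries work; they are just slower than the equivalent tag search against a fully indexed backend, which is the point of the bullet above.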
Zipkin¶
- Zipkin's B3 propagation headers are legacy. The industry has standardized on W3C Trace Context. If you are starting new, use W3C.
- The Zipkin community is small and updates are infrequent. It works for what it does but is not evolving.
- Zipkin's span model predates OpenTelemetry. Translation between Zipkin's format and OTLP is possible but lossy.
- Storage options (Elasticsearch, Cassandra, MySQL) are the same as Jaeger but with less optimization and fewer deployment patterns.
Datadog APM¶
- Datadog APM pricing is per-host plus per-span-ingestion. At high request volumes, span ingestion costs can exceed the per-host fee.
- Datadog's ingestion sampling is opaque. You configure a target rate, but Datadog decides which traces to keep. Debugging "why was this trace not captured" is frustrating.
- The continuous profiler is genuinely useful but adds CPU overhead on every service. In latency-sensitive paths, this matters.
- Datadog APM lock-in is strong. Datadog-specific instrumentation libraries, custom span tags, and dashboard formats do not export cleanly.
- When Datadog has an outage, you lose not just monitoring but also APM and trace data. There is no local fallback.
Migration Pain Assessment¶
| From → To | Effort | Risk | Timeline |
|---|---|---|---|
| Jaeger → Tempo | Medium | Low | 1-2 months |
| Zipkin → Jaeger | Low-Medium | Low | 2-4 weeks |
| Zipkin → Tempo | Medium | Low | 1-2 months |
| Datadog APM → Jaeger | High | Medium | 3-6 months |
| Datadog APM → Tempo | High | Medium | 3-6 months |
| Jaeger → Datadog APM | Medium | Low | 1-2 months |
The OpenTelemetry Collector is the key to painless migration. If you run an OTel Collector as your ingestion point, switching backends is a config change — you swap the exporter, not the instrumentation. Invest in OTel Collector first, choose the backend second.
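A sketch of what "swap the exporter, not the instrumentation" means in a Collector config. Endpoints and names here are placeholders; the receiver/exporter/pipeline structure is the standard Collector config shape:

```yaml
# Illustrative OTel Collector config. Apps always send OTLP to the
# Collector; changing the tracing backend means editing only the
# exporters and pipeline sections below.
receivers:
  otlp:
    protocols:
      grpc:
      http:

exporters:
  otlp/tempo:
    endpoint: tempo.example.internal:4317   # placeholder endpoint
    tls:
      insecure: true

service:
  pipelines:
    traces:
      receivers: [otlp]
      exporters: [otlp/tempo]   # swap this to migrate backends
```

Migrating to Jaeger (or running both during a cutover) is a matter of adding a second OTLP exporter and listing it in the pipeline — the instrumented services never change.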
The Interview Answer¶
"Tracing is the hardest pillar of observability to get right because it requires every service to be instrumented. My approach is to standardize on OpenTelemetry for instrumentation and use the OTel Collector as the ingestion point — this makes the backend choice reversible. For the backend, Tempo is the most cost-effective option if you are in the Grafana ecosystem because it uses object storage and integrates natively with Prometheus exemplars and Loki logs. The real value of tracing is not looking at individual traces — it is building service dependency maps, identifying latency bottlenecks, and drilling from a metric anomaly to the specific slow trace."
Cross-References¶
- Topic Packs: Tracing, OpenTelemetry, Observability Deep Dive
- Related Comparisons: Metrics Platforms, Logging Platforms, Alerting & Paging