Comparison: Tracing Platforms¶
Category: Observability
Last reviewed: 2026-03
Verdict (opinionated): Tempo if you are in the Grafana ecosystem — it is the cheapest to operate and queries by trace ID. Datadog APM if you are already paying Datadog. Jaeger is solid but Tempo is eating its lunch.
Quick Decision Matrix¶
| Factor | Jaeger | Tempo | Zipkin | Datadog APM |
|---|---|---|---|---|
| Learning curve | Medium | Low-Medium | Low | Low |
| Operational overhead | Medium-High | Low | Medium | None (SaaS) |
| Cost at small scale | Free | Free | Free | Expensive (per-host + span ingest) |
| Cost at large scale | Medium (storage) | Low (object storage) | Medium | Very expensive |
| Community/ecosystem | Large (CNCF) | Growing (Grafana Labs) | Moderate (legacy) | Vendor-controlled |
| Hiring | Moderate | Growing | Declining | Easy |
| Query capabilities | Search by tags, service, time | Trace ID lookup + TraceQL | Search by tags, service | Full search + analytics |
| Storage backend | Elasticsearch, Cassandra (Kafka as ingest buffer) | S3/GCS/Azure Blob | Elasticsearch, Cassandra, MySQL | Proprietary |
| K8s integration | Jaeger Operator | Helm chart, Grafana Alloy | Helm chart | Datadog Agent |
| OpenTelemetry | Native (OTLP receiver) | Native (OTLP receiver) | Requires translation | Native |
| Service map | Built-in | Via Grafana | Built-in | Excellent |
| Tail-based sampling | Via OTel Collector (remote sampling is head-based) | Grafana Alloy / OTel Collector | Limited | Built-in |
When to Pick Each¶
Pick Jaeger when:¶
- You need a mature, self-hosted tracing backend with strong search capabilities
- You already have Elasticsearch or Cassandra in your stack for trace storage
- You want the CNCF-backed standard and are not in the Grafana ecosystem
- Your team needs a standalone tracing UI with good trace comparison features
- You require adaptive sampling with the Jaeger remote sampling protocol
Pick Tempo when:¶
- You are already using Grafana + Prometheus + Loki and want traces in the same stack
- Cost is a priority — Tempo uses object storage (S3/GCS), which is dramatically cheaper
- Your primary query pattern is "give me the trace for this trace ID" (from logs or metrics exemplars)
- You want TraceQL for searching traces by span attributes without building a full index
- You are deploying OpenTelemetry and want a simple, OTLP-native backend
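The exemplar path mentioned above is worth seeing concretely. In the OpenMetrics exposition format, an exemplar rides along with a histogram bucket sample; the attached trace ID is what lets Grafana jump from a latency spike directly to the matching trace in Tempo. The metric name, label values, and trace ID below are illustrative, not from any real system:

```
# value 1247 for the le="0.5" bucket, with an exemplar:
# "# {labels} exemplar_value exemplar_timestamp"
http_request_duration_seconds_bucket{le="0.5"} 1247 # {trace_id="4bf92f3577b34da6a3ce929d0e0e4736"} 0.41 1700000000.0
```

If your client libraries are not emitting exemplars like this, the metrics-to-traces workflow that justifies Tempo does not exist yet.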
Pick Zipkin when:¶
- You have an existing Zipkin deployment and migration cost is not justified
- Your tracing needs are basic — simple request tracing across a few services
- You want the simplest possible self-hosted tracing setup
- (Counterpoint: do not pick Zipkin for new projects — Jaeger and Tempo are better choices)
Pick Datadog APM when:¶
- You are already paying for Datadog and want traces correlated with metrics and logs
- Your team wants code-level flame graphs and profiling without additional tooling
- You need automated service dependency maps without manual instrumentation
- Budget is approved and operational simplicity is the priority
- You want live tail and real-time trace search
Nobody Tells You¶
Jaeger¶
- Jaeger's storage backend choice is critical and hard to change later. Elasticsearch gives you search but is expensive to operate. Cassandra gives you scale but is even harder to operate.
- The Jaeger Operator simplifies K8s deployment but adds another operator to manage. Sidecar injection works but consumes resources on every pod.
- Jaeger UI is functional but has not seen major improvements in years. The trace comparison feature is useful but the overall UX feels dated compared to Datadog.
- Sampling strategy decisions are permanent in practice. Head-based sampling (decide at trace start) misses errors in unsampled traces. Tail-based sampling (decide at trace end) requires buffering all spans.
- At high volume, Jaeger ingestion pipelines (Collector → Kafka → Ingester → Storage) are complex. Each component can bottleneck, and debugging throughput issues requires understanding all of them.
- Jaeger and Tempo share the same problem: tracing only becomes valuable when every service is instrumented. One gap in the chain breaks the trace.
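The head-based vs tail-based tradeoff above hinges on one property: a head-based decision must be computable from the trace ID alone, so every service in the call chain reaches the same verdict without coordination. A minimal sketch, similar in spirit to (but not identical to) OpenTelemetry's `TraceIdRatioBased` sampler:

```python
# Minimal sketch of head-based, trace-ID-ratio sampling. This is an
# illustration of the idea, not the exact algorithm any SDK ships.
# Because the decision is a pure function of the trace ID, all services
# in the chain agree on it -- and, as noted above, errors that occur in
# an unsampled trace are simply never recorded.

def should_sample(trace_id: str, ratio: float) -> bool:
    """Keep roughly `ratio` of traces, decided once at trace start."""
    # Treat the low 8 hex chars of the 128-bit trace ID as a uniform
    # value in [0, 1] and compare against the target ratio.
    bucket = int(trace_id[-8:], 16) / 0xFFFFFFFF
    return bucket < ratio

# The same trace ID always yields the same decision:
tid = "4bf92f3577b34da6a3ce929d0e0e4736"
assert should_sample(tid, 1.0) is True
assert should_sample(tid, 0.0) is False
assert should_sample(tid, 0.5) == should_sample(tid, 0.5)
```

Tail-based sampling inverts this: the decision waits until the trace is complete, which is why it can keep every error trace but requires buffering all spans somewhere (e.g. in an OTel Collector tier) until the verdict.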
Tempo¶
- Tempo does NOT have a full search index. TraceQL helps, but searching for "all traces with HTTP 500 errors in the last hour" is slower than Jaeger with Elasticsearch. Tempo is optimized for "trace ID → full trace" lookup.
- The exemplar workflow (metrics → exemplar → trace ID → Tempo) is the intended path. If your metrics do not have exemplars, Tempo's value drops significantly.
- TraceQL is powerful but still maturing. Complex queries (e.g., comparing span durations across services) can be slow on large datasets.
- Tempo in microservices mode requires careful tuning of distributors, ingesters, compactors, and queriers. Start with monolithic mode.
- Object storage latency means trace retrieval is slower than Elasticsearch-backed Jaeger. 1-3 seconds for a trace lookup is normal.
- Grafana's trace-to-logs and trace-to-metrics features are the killer integration — but they require consistent label naming across all three systems (Prometheus, Loki, Tempo).
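To make the search tradeoff concrete, here is roughly what the "all traces with HTTP 500 errors" query looks like in TraceQL. Attribute names depend on your instrumentation conventions (these assume OTel semantic-convention attributes), and Tempo answers it by scanning blocks in object storage rather than hitting an inverted index — hence the latency gap vs Elasticsearch-backed Jaeger:

```
# Traces containing a span that returned HTTP 500:
{ span.http.status_code = 500 }

# Attribute filter combined with an intrinsic (duration):
{ resource.service.name = "checkout" && duration > 2s }
```

These queries work; they are just slower than the equivalent tag search against a fully indexed backend, which is the point of the bullet above.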
Zipkin¶
- Zipkin's B3 propagation headers are legacy. The industry has standardized on W3C Trace Context. If you are starting new, use W3C.
- The Zipkin community is small and updates are infrequent. It works for what it does but is not evolving.
- Zipkin's span model predates OpenTelemetry. Translation between Zipkin's format and OTLP is possible but lossy.
- Storage options (Elasticsearch, Cassandra, MySQL) are the same as Jaeger but with less optimization and fewer deployment patterns.
Datadog APM¶
- Datadog APM pricing is per-host plus per-span-ingestion. At high request volumes, span ingestion costs can exceed the per-host fee.
- Datadog's ingestion sampling is opaque. You configure a target rate, but Datadog decides which traces to keep. Debugging "why was this trace not captured" is frustrating.
- The continuous profiler is genuinely useful but adds CPU overhead on every service. In latency-sensitive paths, this matters.
- Datadog APM lock-in is strong. Datadog-specific instrumentation libraries, custom span tags, and dashboard formats do not export cleanly.
- When Datadog has an outage, you lose not just monitoring but also APM and trace data. There is no local fallback.
Migration Pain Assessment¶
| From → To | Effort | Risk | Timeline |
|---|---|---|---|
| Jaeger → Tempo | Medium | Low | 1-2 months |
| Zipkin → Jaeger | Low-Medium | Low | 2-4 weeks |
| Zipkin → Tempo | Medium | Low | 1-2 months |
| Datadog APM → Jaeger | High | Medium | 3-6 months |
| Datadog APM → Tempo | High | Medium | 3-6 months |
| Jaeger → Datadog APM | Medium | Low | 1-2 months |
The OpenTelemetry Collector is the key to painless migration. If you run an OTel Collector as your ingestion point, switching backends is a config change — you swap the exporter, not the instrumentation. Invest in OTel Collector first, choose the backend second.
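A sketch of what "swap the exporter, not the instrumentation" means in a Collector config. Endpoints and names here are placeholders; the receiver/exporter/pipeline structure is the standard Collector config shape:

```yaml
# Illustrative OTel Collector config. Apps always send OTLP to the
# Collector; changing the tracing backend means editing only the
# exporters and pipeline sections below.
receivers:
  otlp:
    protocols:
      grpc:
      http:

exporters:
  otlp/tempo:
    endpoint: tempo.example.internal:4317   # placeholder endpoint
    tls:
      insecure: true

service:
  pipelines:
    traces:
      receivers: [otlp]
      exporters: [otlp/tempo]   # swap this to migrate backends
```

Migrating to Jaeger (or running both during a cutover) is a matter of adding a second OTLP exporter and listing it in the pipeline — the instrumented services never change.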
The Interview Answer¶
"Tracing is the hardest pillar of observability to get right because it requires every service to be instrumented. My approach is to standardize on OpenTelemetry for instrumentation and use the OTel Collector as the ingestion point — this makes the backend choice reversible. For the backend, Tempo is the most cost-effective option if you are in the Grafana ecosystem because it uses object storage and integrates natively with Prometheus exemplars and Loki logs. The real value of tracing is not looking at individual traces — it is building service dependency maps, identifying latency bottlenecks, and drilling from a metric anomaly to the specific slow trace."
Cross-References¶
- Topic Packs: Tracing, OpenTelemetry, Observability Deep Dive
- Related Comparisons: Metrics Platforms, Logging Platforms, Alerting & Paging