Comparison: Metrics Platforms
Category: Observability
Last meaningful update consideration: 2026-03
Verdict (opinionated): Prometheus + Grafana Cloud for cost control and ecosystem fit. Datadog if budget allows and you want one pane of glass without running anything yourself.
Quick Decision Matrix
| Factor | Prometheus + Grafana | Datadog | New Relic | Grafana Cloud |
|---|---|---|---|---|
| Learning curve | Medium-High | Low | Low | Medium |
| Operational overhead | High (self-hosted) | None (SaaS) | None (SaaS) | Low (managed) |
| Cost at small scale | Free (self-hosted) | Expensive ($15-23/host/mo) | Free tier generous | Free tier (10k series) |
| Cost at large scale | Medium (storage) | Very expensive | Expensive | Moderate |
| Community/ecosystem | Massive (CNCF) | Vendor-controlled | Vendor-controlled | Large (Grafana Labs) |
| Hiring | Easy — standard skill | Easy — many know it | Moderate | Easy — Prometheus-compatible |
| Query language | PromQL | Proprietary | NRQL | PromQL |
| Cardinality management | Your problem | Managed (but costs $$$) | Managed | Managed + adaptive metrics |
| K8s integration | kube-prometheus-stack | Datadog Agent (DaemonSet) | K8s integration | Grafana Agent / Alloy |
| Custom metrics | Client libraries (free) | Custom metrics (charged per metric) | Events API | Client libraries (free) |
| Alerting | Alertmanager | Built-in monitors | Built-in alerts | Grafana Alerting |
| Long-term storage | Thanos / Cortex / Mimir | Included | Included | Included (Mimir) |
When to Pick Each
Pick Prometheus (self-hosted) when:
- Cost control is paramount and you have the ops capacity
- You want full control over retention, federation, and recording rules
- Your team already knows PromQL and the Prometheus ecosystem
- You are building a platform team that can operate Thanos/Mimir for long-term storage
- Vendor independence is a hard requirement
Pick Datadog when:
- Budget is approved and you want the least operational burden
- You want metrics, logs, traces, and APM in a single platform
- Non-technical stakeholders need dashboards without learning PromQL
- You need infrastructure monitoring beyond just K8s (cloud services, databases, third-party integrations)
- Time-to-value matters more than cost optimization
Pick New Relic when:
- You want a generous free tier to get started (100GB/mo free ingest)
- Your team is APM-focused and wants code-level performance insights
- You prefer NRQL (SQL-like) over PromQL for querying
- You want full-stack observability with browser, mobile, and serverless in one place
Pick Grafana Cloud when:
- You want Prometheus-compatible metrics without running Prometheus
- You are already in the Grafana ecosystem (Loki for logs, Tempo for traces)
- You want managed Mimir for long-term storage with PromQL
- The adaptive metrics feature (auto-aggregating unused series) appeals to your cost concerns
- You want the best of both worlds: OSS compatibility with managed convenience
Nobody Tells You
Prometheus
- Cardinality explosions will ruin your day. One bad label (like `user_id` or `request_id`) on a metric can generate millions of time series and OOMKill your Prometheus server.
- Prometheus is not designed for long-term storage. Default retention is 15 days. For longer retention, you need Thanos, Cortex, or Mimir, each of which is its own operational project.
- Federation (scraping one Prometheus from another) sounds elegant but creates fragile dependency chains and query latency issues.
- Recording rules are essential at scale but create a secondary system you must maintain. If your recording rules drift from what your dashboards actually query, the dashboards end up hitting raw series and query performance suffers.
- The kube-prometheus-stack Helm chart is the standard K8s deploy, but it bundles so many components (Prometheus, Alertmanager, Grafana, node-exporter, kube-state-metrics, exporters) that upgrades require careful planning.
- Alertmanager routing configuration is its own subfield. The YAML routing tree is powerful but confusing: one missing or misplaced `continue: true` and alerts stop at the first matching route and never reach the receiver you intended.
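The `continue` pitfall in the last bullet can be sketched with a small simulation. This is a simplified model of how a routing tree is walked, not Alertmanager's own code: the `Route` class, the receiver names, and the alert labels are all hypothetical, and matching is reduced to exact label equality.

```python
from dataclasses import dataclass, field


@dataclass
class Route:
    """Simplified routing node; real Alertmanager routes have many more options."""
    receiver: str
    match: dict = field(default_factory=dict)    # exact label matches (assumption)
    continue_: bool = False                      # models `continue: true`
    routes: list = field(default_factory=list)   # child routes, evaluated in order


def matches(route: Route, labels: dict) -> bool:
    # An empty match dict matches every alert.
    return all(labels.get(k) == v for k, v in route.match.items())


def receivers_for(route: Route, labels: dict) -> list:
    """Return the receivers an alert reaches, walking child routes in order."""
    if not matches(route, labels):
        return []
    found = []
    for child in route.routes:
        hit = receivers_for(child, labels)
        if hit:
            found.extend(hit)
            if not child.continue_:
                break  # first matching sibling wins unless it sets continue
    # If no child matched, the alert is handled by this node's receiver.
    return found or [route.receiver]


def build_tree(audit_continues: bool) -> Route:
    # Hypothetical tree: a catch-all audit route placed before the pager route.
    return Route(receiver="default", routes=[
        Route(receiver="audit-log", continue_=audit_continues),
        Route(receiver="pager", match={"severity": "critical"}),
    ])


alert = {"alertname": "HighErrorRate", "severity": "critical"}
print(receivers_for(build_tree(True), alert))   # ['audit-log', 'pager']
print(receivers_for(build_tree(False), alert))  # ['audit-log']
```

In the second tree the catch-all route swallows the critical alert, so the pager never fires. Running real alert labels through a model like this (or through the routing tree itself in a test environment) before deploying a routing change is one way to catch that class of mistake.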
Datadog
- Datadog pricing is genuinely hard to predict. Custom metrics, indexed logs, APM spans, Synthetic tests, and infrastructure hosts all have separate pricing dimensions. The bill will be higher than your estimate.
- The Datadog Agent is a DaemonSet that runs on every node. It consumes non-trivial CPU and memory. On small nodes, this matters.
- Vendor lock-in is deep. Datadog's query language, dashboard format, and monitor definitions are all proprietary. Migration means rebuilding everything.
- Datadog acquisitions (Sqreen, CoScreen, Cloudcraft) mean the platform sprawls. The UI has gotten busier and harder to navigate.
- Custom metrics pricing ($5/100 custom metrics/month) makes teams afraid to instrument their code. This is the opposite of what an observability platform should encourage.
- When Datadog has an outage, you lose visibility into your own systems. This has happened multiple times.
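To make the pricing point concrete, here is a back-of-envelope sketch that uses only the list prices quoted in this comparison ($15-23 per host per month, $5 per 100 custom metrics per month). The function name and the fleet size are hypothetical, and the estimate deliberately ignores any included allotments, indexed logs, APM spans, and Synthetic tests, which is part of why real bills run higher.

```python
def estimate_monthly_cost(hosts: int, custom_metrics: int,
                          per_host: float = 15.0,
                          per_100_custom_metrics: float = 5.0) -> float:
    """Rough monthly estimate from two pricing dimensions only (assumption)."""
    host_cost = hosts * per_host
    metric_cost = custom_metrics / 100 * per_100_custom_metrics
    return host_cost + metric_cost


# Hypothetical fleet: 50 hosts emitting 20,000 custom metrics in total.
low = estimate_monthly_cost(hosts=50, custom_metrics=20_000)
high = estimate_monthly_cost(hosts=50, custom_metrics=20_000, per_host=23.0)
print(f"${low:,.0f} to ${high:,.0f} per month")  # $1,750 to $2,150 per month
```

In this example the custom metrics line ($1,000) is as large as or larger than the host line at the lower price point, which illustrates why teams become cautious about adding instrumentation.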
New Relic
- New Relic reinvented itself with the "one data model" approach and generous free tier. The product is better than its reputation from the 2015-era agents.
- NRQL is powerful but is not PromQL. If your team standardizes on PromQL, New Relic adds a translation burden.
- The free tier is generous but the jump to paid is steep. Watch the 100GB ingest limit carefully.
- New Relic's K8s monitoring is decent but not as mature as Datadog's or the Prometheus ecosystem's.
- UI redesigns happen frequently. Bookmarked dashboard URLs break, and the learning curve resets periodically.
Grafana Cloud
- Grafana Cloud is essentially managed Mimir + managed Loki + managed Tempo + Grafana. Understanding this architecture helps you reason about limits and costs.
- The free tier is limited to 10,000 active series. Real-world K8s clusters easily exceed this. Budget for paid tier from the start.
- Adaptive metrics (auto-aggregating unused series) is genuinely innovative for cost control, but it means some series get aggregated without you explicitly choosing.
- Grafana Alloy (the new all-in-one collector, replacing Grafana Agent) is changing rapidly. Migration from standalone Prometheus to Alloy has rough edges.
- You are still writing PromQL, configuring recording rules, and managing cardinality — Grafana Cloud manages the storage, not the metric design.
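Because metric design stays your responsibility, a worst-case series estimate is worth doing before you ship a new metric. The sketch below assumes the upper bound for one metric is the product of the distinct values of each label; the label names and counts are hypothetical. It also shows the effect of aggregating away a label nobody queries, which is the idea behind adaptive metrics (this illustrates the arithmetic only, not how Grafana Cloud implements it).

```python
from math import prod

FREE_TIER_ACTIVE_SERIES = 10_000  # free-tier limit quoted above


def max_series(label_values: dict) -> int:
    """Upper bound on series for one metric: product of distinct values per label."""
    return prod(label_values.values())


# Hypothetical label cardinalities for a single HTTP request counter.
labels = {"service": 20, "endpoint": 30, "status": 5, "pod": 40}

before = max_series(labels)
# Aggregating away a label nobody queries (here `pod`) removes its factor.
after = max_series({k: v for k, v in labels.items() if k != "pod"})

for name, count in (("with pod", before), ("without pod", after)):
    verdict = "over" if count > FREE_TIER_ACTIVE_SERIES else "within"
    print(f"{name}: {count:,} series ({verdict} the free tier)")
# with pod: 120,000 series (over the free tier)
# without pod: 3,000 series (within the free tier)
```

The same arithmetic explains why a label such as `user_id` is dangerous: its factor grows with your user base rather than with your infrastructure.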
Migration Pain Assessment
| From → To | Effort | Risk | Timeline |
|---|---|---|---|
| Prometheus → Grafana Cloud | Low | Low | 1-2 weeks |
| Prometheus → Datadog | High | Medium | 2-4 months |
| Datadog → Prometheus | High | Medium | 3-6 months |
| Datadog → Grafana Cloud | High | Medium | 2-4 months |
| New Relic → Datadog | Medium | Low | 1-3 months |
| New Relic → Grafana Cloud | Medium-High | Medium | 2-3 months |
| CloudWatch → Prometheus | Medium | Low | 1-2 months |
The hardest part of metrics migration is rebuilding dashboards and alerting rules. Export formats are incompatible. Budget for recreating every dashboard and monitor by hand, plus a parallel-run period to validate parity.
The Interview Answer
"I default to the Prometheus ecosystem because PromQL is the industry standard, the instrumentation libraries are open, and you avoid vendor lock-in. For teams that need managed Prometheus without the ops burden, Grafana Cloud gives you Mimir-backed storage with PromQL compatibility. Datadog is excellent if budget allows and you want metrics, logs, and traces in one SaaS — but the pricing model discourages heavy instrumentation, which undermines observability culture. The most important thing is not which platform you pick but whether your teams are actually instrumenting their code with meaningful metrics."
Cross-References
- Topic Packs: Prometheus Deep Dive, Monitoring Fundamentals, Observability Deep Dive
- Related Comparisons: Logging Platforms, Tracing Platforms, Alerting & Paging