Comparison: Metrics Platforms

Category: Observability
Last meaningful update consideration: 2026-03
Verdict (opinionated): Prometheus + Grafana Cloud for cost control and ecosystem fit. Datadog if budget allows and you want one pane of glass without running anything yourself.

Quick Decision Matrix

Factor	Prometheus + Grafana	Datadog	New Relic	Grafana Cloud
Learning curve	Medium-High	Low	Low	Medium
Operational overhead	High (self-hosted)	None (SaaS)	None (SaaS)	Low (managed)
Cost at small scale	Free (self-hosted)	Expensive ($15-23/host/mo)	Free tier generous	Free tier (10k series)
Cost at large scale	Medium (storage)	Very expensive	Expensive	Moderate
Community/ecosystem	Massive (CNCF)	Vendor-controlled	Vendor-controlled	Large (Grafana Labs)
Hiring	Easy — standard skill	Easy — many know it	Moderate	Easy — Prometheus-compatible
Query language	PromQL	Proprietary	NRQL	PromQL
Cardinality management	Your problem	Managed (but costs $$$)	Managed	Managed + adaptive metrics
K8s integration	kube-prometheus-stack	Datadog Agent (DaemonSet)	K8s integration	Grafana Agent / Alloy
Custom metrics	Client libraries (free)	Custom metrics (charged per metric)	Events API	Client libraries (free)
Alerting	Alertmanager	Built-in monitors	Built-in alerts	Grafana Alerting
Long-term storage	Thanos / Cortex / Mimir	Included	Included	Included (Mimir)

When to Pick Each

Pick Prometheus (self-hosted) when:

Cost control is paramount and you have the ops capacity
You want full control over retention, federation, and recording rules
Your team already knows PromQL and the Prometheus ecosystem
You are building a platform team that can operate Thanos/Mimir for long-term storage
Vendor independence is a hard requirement

Pick Datadog when:

Budget is approved and you want the least operational burden
You want metrics, logs, traces, and APM in a single platform
Non-technical stakeholders need dashboards without learning PromQL
You need infrastructure monitoring beyond just K8s (cloud services, databases, third-party integrations)
Time-to-value matters more than cost optimization

Pick New Relic when:

You want a generous free tier to get started (100GB/mo free ingest)
Your team is APM-focused and wants code-level performance insights
You prefer NRQL (SQL-like) over PromQL for querying
You want full-stack observability with browser, mobile, and serverless in one place

Pick Grafana Cloud when:

You want Prometheus-compatible metrics without running Prometheus
You are already in the Grafana ecosystem (Loki for logs, Tempo for traces)
You want managed Mimir for long-term storage with PromQL
The adaptive metrics feature (auto-aggregating unused series) appeals to your cost concerns
You want the best of both worlds: OSS compatibility with managed convenience

Nobody Tells You

Prometheus

Cardinality explosions will ruin your day. One bad label (like user_id or request_id) on a metric can generate millions of time series and OOMKill your Prometheus server.
Prometheus is not designed for long-term storage. Default retention is 15 days. For longer retention, you need Thanos, Cortex, or Mimir — each of which is its own operational project.
Federation (scraping one Prometheus from another) sounds elegant but creates fragile dependency chains and query latency issues.
Recording rules are essential at scale but create a secondary system you must maintain. If your recording rules drift from your dashboards, the dashboards fall back to querying raw series and query performance collapses.
The kube-prometheus-stack Helm chart is the standard K8s deploy, but it bundles so many components (Prometheus Operator, Prometheus, Alertmanager, Grafana, node-exporter, kube-state-metrics) that upgrades require careful planning.
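Alertmanager routing configuration is its own subfield. The YAML routing tree is powerful but confusing: routing stops at the first matching child route unless that route sets continue: true, so one missing continue: true means an alert never reaches the receiver you expected, and one stray continue: true means duplicate notifications.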
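The effect of continue in the routing tree is easier to see in a small model. The following is a minimal sketch, not Alertmanager's code: it assumes equality matchers only, and the Route class, receiver names, and labels are hypothetical. It walks child routes in order and stops at the first matching child unless that child sets continue.

```python
from dataclasses import dataclass, field


@dataclass
class Route:
    """Simplified stand-in for an Alertmanager route (equality matchers only)."""
    receiver: str
    match: dict[str, str] = field(default_factory=dict)
    continue_: bool = False  # mirrors `continue: false`, the default
    routes: list["Route"] = field(default_factory=list)


def matching_receivers(route: Route, labels: dict[str, str]) -> list[str]:
    """Return the receivers an alert with these labels would be sent to."""
    if any(labels.get(k) != v for k, v in route.match.items()):
        return []
    found: list[str] = []
    for child in route.routes:
        hits = matching_receivers(child, labels)
        found.extend(hits)
        # Without `continue: true`, the first matching child ends the search.
        if hits and not child.continue_:
            break
    # If no child matched, the current node itself handles the alert.
    return found or [route.receiver]


def build_tree(first_child_continues: bool) -> Route:
    """Hypothetical tree: a team route listed before a paging route."""
    return Route(
        receiver="default",
        routes=[
            Route("team-db-slack", {"team": "db"}, continue_=first_child_continues),
            Route("pagerduty", {"severity": "critical"}),
        ],
    )


alert = {"team": "db", "severity": "critical"}
print(matching_receivers(build_tree(False), alert))  # ['team-db-slack']
print(matching_receivers(build_tree(True), alert))   # ['team-db-slack', 'pagerduty']
```

In the first case the critical alert never pages anyone, because the team route matched first and did not continue. Reordering the routes or setting continue on the team route are both plausible fixes; which one is right depends on the tree.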

Datadog

Datadog pricing is genuinely hard to predict. Custom metrics, indexed logs, APM spans, Synthetic tests, and infrastructure hosts all have separate pricing dimensions. The bill will be higher than your estimate.
The Datadog Agent is a DaemonSet that runs on every node. It consumes non-trivial CPU and memory. On small nodes, this matters.
Vendor lock-in is deep. Datadog's query language, dashboard format, and monitor definitions are all proprietary. Migration means rebuilding everything.
Datadog acquisitions (Sqreen, CoScreen, Cloudcraft) mean the platform sprawls. The UI has gotten busier and harder to navigate.
Custom metrics pricing ($5/100 custom metrics/month) makes teams afraid to instrument their code. This is the opposite of what an observability platform should encourage.
When Datadog has an outage, you lose visibility into your own systems. This has happened multiple times.
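The custom-metrics price quoted above can be turned into rough arithmetic. This is a back-of-the-envelope sketch under stated assumptions: it treats every unique combination of tag values on a metric name as one billable custom metric, uses the $5 per 100 figure from this page as a flat rate, and ignores any allotment included with a plan. The tag names and cardinalities are made up for illustration.

```python
from math import prod


def custom_metric_count(tag_cardinalities: dict[str, int]) -> int:
    """Worst-case series for one metric name: product of distinct values per tag."""
    return prod(tag_cardinalities.values())


def monthly_cost_usd(metric_count: int, usd_per_100: float = 5.0) -> float:
    """Cost at the flat rate quoted on this page, ignoring included allotments."""
    return metric_count / 100 * usd_per_100


# Hypothetical tag cardinalities for a single request counter.
tags = {"service": 20, "endpoint": 30, "status_code": 5}
count = custom_metric_count(tags)
print(count, monthly_cost_usd(count))  # 3000 150.0

# Adding one high-cardinality tag multiplies everything.
tags_with_customer = {**tags, "customer_id": 1000}
count = custom_metric_count(tags_with_customer)
print(count, monthly_cost_usd(count))  # 3000000 150000.0
```

The product is an upper bound, since not every tag combination occurs in practice, but it shows why a single extra tag changes how teams feel about instrumenting. The same multiplication explains the Prometheus cardinality explosions described earlier.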

New Relic

New Relic reinvented itself with the "one data model" approach and generous free tier. The product is better than its reputation from the 2015-era agents.
NRQL is powerful but is not PromQL. If your team standardizes on PromQL, New Relic adds a translation burden.
The free tier is generous but the jump to paid is steep. Watch the 100GB ingest limit carefully.
New Relic's K8s monitoring is decent but not as mature as Datadog's or the Prometheus ecosystem.
UI redesigns happen frequently. Bookmarked dashboard URLs break, and the learning curve resets periodically.

Grafana Cloud

Grafana Cloud is essentially managed Mimir + managed Loki + managed Tempo + Grafana. Understanding this architecture helps you reason about limits and costs.
The free tier is limited to 10,000 active series. Real-world K8s clusters easily exceed this. Budget for paid tier from the start.
Adaptive metrics (auto-aggregating unused series) is genuinely innovative for cost control, but it means some series get aggregated into coarser ones, with labels dropped, without you explicitly choosing each one.
Grafana Alloy (the new all-in-one collector, replacing Grafana Agent) is changing rapidly. Migration from standalone Prometheus to Alloy has rough edges.
You are still writing PromQL, configuring recording rules, and managing cardinality — Grafana Cloud manages the storage, not the metric design.
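Since metric design stays your job, it helps to know which metric names own the most series before comparing against the 10,000-series free tier. The sketch below counts distinct label sets per metric name from an in-memory list; the series data, label names, and helper functions are hypothetical, and in practice the input would come from your own Prometheus-compatible backend.

```python
from collections import Counter

FREE_TIER_ACTIVE_SERIES = 10_000  # limit quoted on this page


def series_per_metric(series: list[dict[str, str]]) -> Counter:
    """Count distinct label sets per metric name (the `__name__` label)."""
    unique = {frozenset(s.items()) for s in series}
    return Counter(dict(labels)["__name__"] for labels in unique)


def over_free_tier(series: list[dict[str, str]],
                   limit: int = FREE_TIER_ACTIVE_SERIES) -> bool:
    """True if the total number of distinct series exceeds the limit."""
    return sum(series_per_metric(series).values()) > limit


# Hypothetical inventory: 50 pods x 60 paths x 4 status codes, plus `up`.
series = [
    {"__name__": "http_requests_total", "pod": f"pod-{p}", "path": f"/r{r}", "code": c}
    for p in range(50) for r in range(60) for c in ("200", "400", "404", "500")
]
series += [{"__name__": "up", "pod": f"pod-{p}"} for p in range(50)]

counts = series_per_metric(series)
print(counts.most_common(1))   # [('http_requests_total', 12000)]
print(over_free_tier(series))  # True
```

Here a single request counter on a modest 50-pod deployment already exceeds the free tier, which is the usual pattern: a handful of metrics account for most of the series. Prometheus exposes similar per-metric counts through its TSDB status endpoint (/api/v1/status/tsdb), which is a reasonable place to start a real audit.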

Migration Pain Assessment

From → To	Effort	Risk	Timeline
Prometheus → Grafana Cloud	Low	Low	1-2 weeks
Prometheus → Datadog	High	Medium	2-4 months
Datadog → Prometheus	High	Medium	3-6 months
Datadog → Grafana Cloud	High	Medium	2-4 months
New Relic → Datadog	Medium	Low	1-3 months
New Relic → Grafana Cloud	Medium-High	Medium	2-3 months
CloudWatch → Prometheus	Medium	Low	1-2 months

The hardest part of metrics migration is rebuilding dashboards and alerting rules. Outside the Prometheus-compatible paths, dashboard and alert export formats are incompatible between platforms. Budget for recreating every dashboard and monitor by hand, plus a parallel-run period to validate parity.

The Interview Answer

"I default to the Prometheus ecosystem because PromQL is the industry standard, the instrumentation libraries are open, and you avoid vendor lock-in. For teams that need managed Prometheus without the ops burden, Grafana Cloud gives you Mimir-backed storage with PromQL compatibility. Datadog is excellent if budget allows and you want metrics, logs, and traces in one SaaS — but the pricing model discourages heavy instrumentation, which undermines observability culture. The most important thing is not which platform you pick but whether your teams are actually instrumenting their code with meaningful metrics."

Cross-References

Topic Packs: Prometheus Deep Dive, Monitoring Fundamentals, Observability Deep Dive
Related Comparisons: Logging Platforms, Tracing Platforms, Alerting & Paging