
Prometheus: Under the Hood


Topics: Prometheus TSDB, PromQL, cardinality, service discovery, alerting pipeline, long-term storage, high availability
Level: L2 (Operations)
Time: 90--120 minutes
Strategy: Build-up + incident-driven


The Mission

It's 2 AM. PagerDuty fires: Prometheus is using 80 GB of RAM on a box with 96 GB. Queries to Grafana are timing out. The on-call Slack channel is full of "dashboards are blank" messages. Your monitoring system -- the thing that watches everything else -- is about to fall over.

You need to figure out why Prometheus is eating memory, stop the bleeding, and make sure it never happens again. To do that, you need to understand how Prometheus actually works inside -- not the marketing overview, but the storage engine, the query model, and the places where things go wrong.

Let's build that understanding from the ground up, then use it to save the night.


Part 1: The TSDB -- Where Your Metrics Live

Every sample Prometheus scrapes lands in its local time-series database. Understanding this storage engine is the difference between "restart it and pray" and "I know exactly what's wrong."

The Write Path: WAL, Head Block, Persistent Blocks

When Prometheus scrapes a target, the sample doesn't go straight to disk as a nice compressed file. It takes a journey:

Scrape → WAL (write-ahead log) → Head Block (in memory) → Persistent Block (on disk)

Step 1: The WAL. Every incoming sample is first appended to the Write-Ahead Log -- a sequential, append-only file on disk. This is your crash recovery insurance. If Prometheus dies mid-scrape, it replays the WAL on startup to recover samples that hadn't been persisted yet.

# The WAL lives here
ls -lh /prometheus/wal/
# You'll see numbered segment files: 00000001, 00000002, ...
# Each segment is up to 128 MB

Under the Hood: The WAL design is borrowed from databases like PostgreSQL and LevelDB. The idea: sequential writes to an append-only log are fast and durable. Random writes to a structured database are slow. So write fast first (WAL), structure later (compaction).
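The write-fast-first idea fits in a short Python sketch -- a toy append-and-replay log, nothing like Prometheus's real binary segment format, with a made-up sample record:

```python
import json, os, tempfile

class ToyWAL:
    """Toy write-ahead log: append records sequentially, replay on startup.
    Illustrative only -- Prometheus's real WAL uses binary 128 MB segments."""

    def __init__(self, path):
        self.path = path
        self.f = open(path, "a")

    def append(self, sample):
        # Sequential appends are fast; no random I/O, no index updates.
        self.f.write(json.dumps(sample) + "\n")
        self.f.flush()
        os.fsync(self.f.fileno())  # durable before the write is acknowledged

    def replay(self):
        # After a crash, rebuild in-memory state by re-reading the log.
        with open(self.path) as f:
            return [json.loads(line) for line in f if line.strip()]

wal = ToyWAL(os.path.join(tempfile.mkdtemp(), "wal.log"))
wal.append({"series": "http_requests_total", "t": 1711036800, "v": 145232})
recovered = wal.replay()   # what startup recovery would see
```

The structure-later half (compaction into sorted, compressed blocks) is what Step 3 below describes.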

Step 2: The Head Block. Samples accumulate in memory in the "head block" -- a structure optimized for recent data that's still being written to. The head block covers roughly the last 2 hours of data (configurable via --storage.tsdb.min-block-duration).

This is where your RAM goes. Every active time series has an in-memory representation in the head block. More series = more memory.

Step 3: Compaction. Every 2 hours, the head block is "cut" -- its contents are compressed and written to a persistent block on disk. Prometheus also merges smaller blocks into larger ones over time (compaction), which improves query performance and reduces disk usage.

Time →
┌─────────┐  ┌─────────┐  ┌─────────┐  ┌──────────────────┐
│ Block 1 │  │ Block 2 │  │ Block 3 │  │   Head Block     │
│ 0-2h    │  │ 2-4h    │  │ 4-6h    │  │   (in memory)    │
│ (disk)  │  │ (disk)  │  │ (disk)  │  │   6h-now         │
└─────────┘  └─────────┘  └─────────┘  └──────────────────┘
                                         ↑ WAL backs this up

Gotcha: The WAL can grow very large during high churn (lots of new series appearing and disappearing). Kubernetes environments with frequent pod churn are especially vulnerable. A 10 GB WAL is a sign something is wrong. Monitor it: du -sh /prometheus/wal/

Why This Matters for Our Incident

At 80 GB of RAM, the head block is enormous. That means either:

  1. There are a massive number of active time series, or
  2. The head block hasn't been compacted and covers too long a time range

In practice, it's almost always #1. Let's find out how many series we have.

curl -s http://prometheus:9090/api/v1/status/tsdb | jq '.data.headStats'
{
  "numSeries": 12847291,
  "numLabelPairs": 38541873,
  "chunkCount": 51389164,
  "minTime": 1711036800000,
  "maxTime": 1711044000000
}

12.8 million active time series. For a cluster running a few hundred services, a healthy number is 200K--1M. We're more than 10x over. Something is creating series like they're free.


Flashcard Check: TSDB Basics

| Question | Answer |
| --- | --- |
| What are the three stages of Prometheus's write path? | WAL (append-only log on disk) -> Head Block (in memory) -> Persistent Block (compressed on disk) |
| Why does the WAL exist? | Crash recovery. If Prometheus dies, it replays the WAL to recover uncompacted samples. |
| What determines Prometheus's memory usage? | Primarily the number of active time series in the head block. More series = more RAM. |
| How often does the head block get compacted to disk? | Approximately every 2 hours (controlled by --storage.tsdb.min-block-duration). |

Part 2: Metric Types -- The Building Blocks

Before we hunt the cardinality bomb, you need to know what kinds of metrics exist and how they behave. There are four types, and picking the wrong one is a common source of confusion.

Counter

A number that only goes up. Resets to zero when the process restarts.

http_requests_total{method="GET", status="200"} 145232

You never alert on the raw value. A counter of 145,232 tells you nothing. The rate of change tells you everything:

rate(http_requests_total[5m])       # requests per second, averaged over 5 minutes
increase(http_requests_total[1h])   # total increase over the past hour

Name Origin: The term "counter" in Prometheus comes from the same concept in hardware performance counters -- CPU registers that only increment when an event occurs (cache miss, branch misprediction). You never read the raw counter; you read the difference between two readings.

The Counter Reset Problem

What happens when a service restarts and the counter drops from 150,000 to 0?

rate() handles this. It detects when a value decreases (which should never happen for a counter) and assumes a reset occurred. It calculates the rate using only the post-reset samples.

But irate() -- which uses only the last two data points -- is much noisier: a momentary burst between two scrapes shows up at full strength, so brief blips around restarts and resets can register as spikes.

| Function | How it works | Use for |
| --- | --- | --- |
| rate() | Average rate across the full range | Alerting, recording rules |
| irate() | Instantaneous rate from last two points | Dashboards (responsive but noisy) |
| increase() | Total increase over range | "How many errors in the last hour?" |

Remember: rate() for alerting, irate() for dashboards. If you alert on irate(), counter resets will page you at 3 AM for nothing.
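The reset handling rate() performs can be sketched in a few lines of Python -- a simplified model (the real function also extrapolates to the edges of the range window), using made-up sample values:

```python
def reset_adjusted_increase(samples):
    """Total increase of a counter, treating any decrease as a reset.
    Simplified sketch of rate()'s reset handling -- the real function
    also extrapolates to the boundaries of the range window."""
    total = 0.0
    for prev, cur in zip(samples, samples[1:]):
        if cur >= prev:
            total += cur - prev
        else:
            total += cur  # reset detected: counter restarted from zero
    return total

def simple_rate(samples, window_seconds):
    return reset_adjusted_increase(samples) / window_seconds

# The counter climbs, the process restarts (150000 -> 0), climbs again.
samples = [149_000, 150_000, 0, 500, 1_200]
simple_rate(samples, 60)   # 2200 / 60 ≈ 36.7 req/s, no negative-rate garbage
```

Without the decrease check, the 150,000 -> 0 drop would contribute a huge negative increase and the computed rate would be nonsense.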

Gauge

A value that goes up and down. Current temperature, memory in use, queue depth.

node_memory_MemAvailable_bytes 4294967296
# Alert when memory is low
(1 - node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes) > 0.9

# Is the disk filling up? (rate of change of a gauge)
deriv(node_filesystem_free_bytes{mountpoint="/"}[1h])

# Predict when disk hits zero (linear extrapolation)
predict_linear(node_filesystem_free_bytes{mountpoint="/"}[6h], 24*3600) < 0

Gotcha: Never use rate() on a gauge. rate() assumes values only go up and treats decreases as counter resets. On a gauge, a decrease from 8 to 2 looks like a "reset" and produces garbage. Use deriv() for rate of change on gauges.

Histogram

This is where things get interesting (and where cardinality gets expensive). A histogram counts observations in pre-defined buckets.

http_request_duration_seconds_bucket{handler="/api/users", le="0.005"} 12000
http_request_duration_seconds_bucket{handler="/api/users", le="0.01"}  14500
http_request_duration_seconds_bucket{handler="/api/users", le="0.025"} 15200
http_request_duration_seconds_bucket{handler="/api/users", le="0.05"}  15400
http_request_duration_seconds_bucket{handler="/api/users", le="0.1"}   15450
http_request_duration_seconds_bucket{handler="/api/users", le="0.25"}  15480
http_request_duration_seconds_bucket{handler="/api/users", le="0.5"}   15490
http_request_duration_seconds_bucket{handler="/api/users", le="1"}     15495
http_request_duration_seconds_bucket{handler="/api/users", le="+Inf"}  15500
http_request_duration_seconds_sum{handler="/api/users"} 103.42
http_request_duration_seconds_count{handler="/api/users"} 15500

That's 11 time series for one handler on one instance: nine buckets (counting +Inf) plus _sum and _count. Across 50 handlers and 20 pods, that's 11,000 series from a single histogram metric.

# p99 latency
histogram_quantile(0.99, rate(http_request_duration_seconds_bucket[5m]))

# p50 (median)
histogram_quantile(0.50, rate(http_request_duration_seconds_bucket[5m]))

# Average latency (no buckets needed)
rate(http_request_duration_seconds_sum[5m])
/ rate(http_request_duration_seconds_count[5m])

Why Bucket Boundaries Matter

histogram_quantile() linearly interpolates between bucket boundaries. If your SLO says "99% of requests under 200ms" but your nearest buckets are le="0.1" (100ms) and le="0.25" (250ms), the p99 calculation is an approximation that can be wildly off.

# Fix: add buckets at your SLO boundaries
from prometheus_client import Histogram

request_latency = Histogram(
    'http_request_duration_seconds',
    'Request latency',
    ['method', 'handler'],
    buckets=[.005, .01, .025, .05, .1, .15, .2, .25, .3, .5, 1, 2.5, 5, 10]
    #                                   ^^^  ^^^  ^^^
    #                           Added around the 200ms SLO boundary
)

Mental Model: Think of histogram buckets like a ruler. If your ruler only has marks at 1cm and 10cm, measuring something 3.7cm long gives you a bad answer. Put marks where you actually need precision.
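To see why coarse buckets hurt, here's a toy Python version of the interpolation histogram_quantile() performs -- heavily simplified, with made-up bucket counts:

```python
def bucket_quantile(q, buckets):
    """Linear interpolation within cumulative buckets -- the same idea
    histogram_quantile() uses, heavily simplified. `buckets` is a sorted
    list of (upper_bound, cumulative_count) ending with +Inf."""
    total = buckets[-1][1]           # the +Inf bucket holds every observation
    rank = q * total
    prev_bound, prev_count = 0.0, 0
    for bound, count in buckets:
        if count >= rank:
            if bound == float("inf"):
                return prev_bound    # can't interpolate into +Inf
            # Assume observations are spread uniformly inside the bucket.
            frac = (rank - prev_count) / (count - prev_count)
            return prev_bound + (bound - prev_bound) * frac
        prev_bound, prev_count = bound, count

# Coarse buckets around a 200 ms SLO: nearest bounds are 100 ms and 250 ms.
coarse = [(0.1, 9800), (0.25, 9990), (float("inf"), 10000)]
bucket_quantile(0.99, coarse)   # ~0.179 -- but the true p99 could be
                                # anywhere between 100 ms and 250 ms
```

The uniform-distribution assumption inside a bucket is exactly where the error comes from: the wider the bucket straddling your SLO, the less the interpolated value means.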

Summary

Pre-computes quantiles client-side. Cheaper for Prometheus to store, but you cannot aggregate summaries across instances. The average of p99 values from 10 instances is not the p99 of the combined distribution. That's just math.
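A quick numeric demonstration of why averaging per-instance p99s fails -- toy data and a nearest-rank percentile, not Prometheus's math:

```python
import statistics

def p99(values):
    # Nearest-rank p99 of a raw sample (illustrative, not Prometheus code).
    s = sorted(values)
    return s[min(len(s) - 1, int(0.99 * len(s)))]

# Instance A is fast and busy; instance B is slow but nearly idle.
a = [0.010] * 990 + [0.050] * 10   # 1000 requests, p99 = 50 ms
b = [0.900] * 10                   # 10 requests,   p99 = 900 ms

statistics.mean([p99(a), p99(b)])  # 0.475 -- wildly misleading
p99(a + b)                         # 0.050 -- the actual combined p99
```

The naive average is nearly 10x off because it weights the idle instance equally with the busy one. Histogram buckets, by contrast, sum correctly across instances before the quantile is computed.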

| | Histogram | Summary |
| --- | --- | --- |
| Aggregation | Yes (aggregate buckets, then compute quantile) | No (pre-computed quantiles can't be combined) |
| Bucket config | Server-side, changeable without redeploy | Client-side, fixed at instrumentation time |
| Cost | More series (one per bucket) | Fewer series |
| Use when | You need cross-instance percentiles (almost always) | Per-instance quantiles and you'll never aggregate |

Default choice: histogram. Unless you have a specific reason, always use histograms.


Part 3: The Cardinality Bomb -- Diagnosing the Incident

Back to our 2 AM crisis. 12.8 million series. Let's find the offender.

Step 1: Find the Top Metrics

curl -s http://prometheus:9090/api/v1/status/tsdb | \
  jq '[.data.seriesCountByMetricName[:10][] | {name: .name, count: .value}]'
[
  { "name": "http_request_duration_seconds_bucket", "count": 8547000 },
  { "name": "http_request_duration_seconds_count",  "count": 854700 },
  { "name": "http_request_duration_seconds_sum",    "count": 854700 },
  { "name": "node_cpu_seconds_total",               "count": 128000 },
  { "name": "container_memory_working_set_bytes",    "count": 95000 }
]

8.5 million series from one histogram's bucket metric. That's our cardinality bomb.

Step 2: Find the Exploding Label

# Which label has the most unique values?
curl -s http://prometheus:9090/api/v1/status/tsdb | \
  jq '[.data.labelValueCountByLabelName[:10][] | {label: .name, values: .value}]'
[
  { "label": "request_path", "values": 847000 },
  { "label": "pod",          "values": 1200 },
  { "label": "le",           "values": 11 },
  { "label": "method",       "values": 5 }
]

847,000 unique values for request_path. Someone instrumented their HTTP middleware to use the raw URL path -- /api/v1/users/12345, /api/v1/users/67890 -- instead of the route template /api/v1/users/:id.

Mental Model: Cardinality is multiplicative. Labels don't add -- they multiply. 10 histogram buckets x 5 methods x 847,000 paths x 2 instances = 84.7 million potential series. The request_path label alone turned a 1,000-series metric into a multi-million series monster.
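That multiplication is worth making concrete. A tiny Python sketch -- the "healthy" route_template cardinality of 40 is a hypothetical comparison, not from the incident:

```python
from math import prod

def potential_series(label_cardinalities):
    """Worst-case series count for one metric name: the product of every
    label's unique-value count. Labels multiply; they never just add."""
    return prod(label_cardinalities.values())

incident = {"le": 10, "method": 5, "request_path": 847_000, "instance": 2}
healthy  = {"le": 10, "method": 5, "route_template": 40, "instance": 2}

potential_series(incident)   # 84,700,000 potential series
potential_series(healthy)    # 4,000 -- same metric with a bounded label
```

Swapping one unbounded label for a bounded one is a factor of ~21,000 in this example, which is why route templates, not raw paths, belong in labels.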

Step 3: Find Which Service Is Doing This

curl -s 'http://prometheus:9090/api/v1/query?query=count(http_request_duration_seconds_bucket) by (job)' | \
  jq '[.data.result[] | {job: .metric.job, series: .value[1]}] | sort_by(-.series)'
[
  { "job": "user-service", "series": "8200000" },
  { "job": "api-gateway",  "series": "45000" },
  { "job": "auth-service", "series": "12000" }
]

It's user-service. 8.2 million series from one service.

Step 4: Stop the Bleeding (Now)

The instrumentation fix needs a code deploy. That's hours away. We stop the ingestion immediately with a metric relabel config:

# Add to the user-service scrape config in prometheus.yml
metric_relabel_configs:
  - source_labels: [__name__, request_path]
    regex: 'http_request_duration_seconds_(bucket|count|sum);/api/v1/users/[0-9]+'
    action: drop
# Reload the config (Prometheus must have --web.enable-lifecycle)
curl -XPOST http://prometheus:9090/-/reload

Dropped series stop being ingested on the next scrape cycle. Memory won't drop immediately -- the head block keeps existing series until the next compaction -- but growth stops.

War Story: At one company, a single developer added a trace_id label to a request counter during a debugging session and forgot to remove it. Each request generated a unique trace ID. Within 6 hours, the metric had created over 100,000 time series. Prometheus went from 4 GB to 35 GB of RAM, queries started timing out, and every Grafana dashboard went blank. The monitoring system that was supposed to detect problems became the problem. The fix was a two-line metric_relabel_config -- but finding the cause took 45 minutes of panic at 3 AM. The postmortem action item: a standing alert on cardinality growth.

Step 5: Prevent It From Happening Again

# Alert when any single metric has too many series
groups:
  - name: cardinality-watchdog
    rules:
      - alert: HighCardinalityMetric
        expr: count by (__name__) ({__name__=~".+"}) > 50000
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: "Metric {{ $labels.__name__ }} has {{ $value }} series"
          runbook: "https://wiki.internal/runbooks/cardinality-explosion"

Flashcard Check: Cardinality

| Question | Answer |
| --- | --- |
| What makes label cardinality dangerous? | Cardinality is multiplicative. Each label's unique values multiply with every other label's values to determine total series count. |
| Name three label values that should never be Prometheus labels. | User IDs, request IDs/trace IDs, UUIDs, email addresses, raw URL paths -- anything unbounded. |
| How do you find the top metrics by series count? | curl http://prometheus:9090/api/v1/status/tsdb and inspect seriesCountByMetricName. |
| How do you stop a cardinality explosion without a code deploy? | Add metric_relabel_configs with action: drop to the scrape config and reload Prometheus. |

Part 4: PromQL Deep Dive

Now that we've saved the night, let's go deeper on the query language. PromQL is deceptively simple -- until you need to write a real alert.

rate() vs irate() -- When It Matters

Both compute per-second rates from counters. The difference is in how much data they use:

Samples:  ·  ·  ·  ·  ·  ·  ·  ·  ·  ·  ·  ·
          |← ——— rate() uses all of these ———→|
                                         |←→|
                                    irate() uses these two

rate() averages across the entire range, smoothing out spikes. irate() reacts instantly to the latest change but is noisy.

# Smooth, stable -- good for alerting
rate(http_requests_total{job="api-server"}[5m])

# Responsive, spiky -- good for dashboards
irate(http_requests_total{job="api-server"}[5m])

Gotcha: rate() needs at least two samples in the range window. With a 15-second scrape interval, rate(metric[30s]) gives you exactly two samples -- and if one scrape is late, you get zero. Use a range of at least 4x your scrape interval. The safe default for 15-second scrapes: [1m] minimum, [5m] for alerting.

Aggregation Operators

# Total request rate across all instances
sum(rate(http_requests_total[5m]))

# Grouped by status code
sum by (status) (rate(http_requests_total[5m]))

# Everything EXCEPT the instance label
sum without (instance) (rate(http_requests_total[5m]))

# Top 5 handlers by request rate
topk(5, sum by (handler) (rate(http_requests_total[5m])))

# How many targets are up?
count(up == 1)

histogram_quantile() -- The Function Everyone Gets Wrong

# p99 latency across all instances (correct)
histogram_quantile(0.99,
  sum by (le) (rate(http_request_duration_seconds_bucket[5m]))
)

The by (le) is critical. histogram_quantile needs the le (less-than-or-equal) label to know the bucket boundaries. If you aggregate away le, you get garbage.

Want per-handler p99?

histogram_quantile(0.99,
  sum by (handler, le) (rate(http_request_duration_seconds_bucket[5m]))
)

The rule: always keep le in your by clause when using histogram_quantile.

Recording Rules: Pre-Computing Expensive Queries

If that histogram_quantile query takes 8 seconds to evaluate, pre-compute it:

groups:
  - name: latency-recording
    interval: 30s
    rules:
      - record: job:http_request_duration_seconds:p99_5m
        expr: |
          histogram_quantile(0.99,
            sum by (job, le) (rate(http_request_duration_seconds_bucket[5m]))
          )

      - record: job:http_error_rate:ratio_5m
        expr: |
          sum by (job) (rate(http_requests_total{status=~"5.."}[5m]))
          /
          sum by (job) (rate(http_requests_total[5m]))

Trivia: The naming convention level:metric:operations (like job:http_requests_total:rate5m) was established in the Prometheus documentation and follows a pattern borrowed from Borgmon, Google's internal monitoring system that inspired Prometheus. The convention makes it immediately clear what aggregation level and operations were applied.


Part 5: Service Discovery -- How Prometheus Finds Targets

Static configs work for 5 servers. In Kubernetes with pods spinning up and down every minute, you need dynamic discovery.

Kubernetes Service Discovery

scrape_configs:
  - job_name: "kubernetes-pods"
    kubernetes_sd_configs:
      - role: pod
    relabel_configs:
      # Only scrape pods with the annotation prometheus.io/scrape: "true"
      - source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_scrape]
        action: keep
        regex: true
      # Use the pod's prometheus.io/port annotation
      - source_labels: [__address__, __meta_kubernetes_pod_annotation_prometheus_io_port]
        action: replace
        regex: ([^:]+)(?::\d+)?;(\d+)
        replacement: $1:$2
        target_label: __address__
      # Carry pod metadata as labels
      - source_labels: [__meta_kubernetes_namespace]
        target_label: namespace
      - source_labels: [__meta_kubernetes_pod_name]
        target_label: pod

The Prometheus Operator Way (ServiceMonitors)

In most Kubernetes clusters, you don't write raw scrape configs. The Prometheus Operator manages everything via CRDs:

apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: api-service
  labels:
    team: platform
spec:
  selector:
    matchLabels:
      app: api-service
  endpoints:
    - port: metrics
      interval: 15s
      path: /metrics
  namespaceSelector:
    matchNames:
      - production

Gotcha: ServiceMonitor not being picked up is the #1 Prometheus Operator debugging question. It's a two-level selector problem: 1. The Prometheus CR selects which ServiceMonitors to load (serviceMonitorSelector) 2. Each ServiceMonitor selects which Services to scrape (selector)

Missing either level = zero targets, zero errors. Check both:

kubectl get prometheus -n monitoring -o yaml | grep -A5 serviceMonitorSelector
kubectl get servicemonitors -A --show-labels

Other Discovery Mechanisms

| Mechanism | Use case |
| --- | --- |
| static_configs | Small, fixed fleets |
| kubernetes_sd_configs | Kubernetes pods, services, nodes |
| ec2_sd_configs | AWS EC2 instances by tag |
| consul_sd_configs | Consul service registry |
| file_sd_configs | JSON/YAML files (good for custom scripts that output targets) |
| dns_sd_configs | DNS SRV records |

Relabeling: The Swiss Army Knife

Relabeling transforms labels at two stages:

  • relabel_configs -- before scraping (controls what gets scraped)
  • metric_relabel_configs -- after scraping (controls what gets stored)

# Drop all go_gc internal metrics to save cardinality
metric_relabel_configs:
  - source_labels: [__name__]
    regex: "go_gc_.*"
    action: drop
  # Remove an unbounded label
  - regex: "request_id"
    action: labeldrop

Part 6: The Alerting Pipeline

Prometheus doesn't send alerts to Slack directly. The pipeline has distinct stages, and each one can fail silently if misconfigured.

Alert Rules (Prometheus) → Alertmanager → Receivers (Slack, PagerDuty, email)
       ↓                      ↓
  "Is condition true        "Route, group,
   for 5 minutes?"          deduplicate, silence"

Alert Rules

groups:
  - name: api-alerts
    rules:
      - alert: HighErrorRate
        expr: |
          sum(rate(http_requests_total{status=~"5.."}[5m]))
          /
          sum(rate(http_requests_total[5m]))
          > 0.05
        for: 5m
        labels:
          severity: critical
          team: platform
        annotations:
          summary: "High 5xx error rate: {{ $value | humanizePercentage }}"
          runbook: "https://wiki.internal/runbooks/high-error-rate"

The for: 5m is your debounce. The condition must be true for 5 continuous minutes before the alert fires. Without it, a single bad scrape pages you at 3 AM.

Trivia: The Three Mile Island nuclear accident in 1979 was worsened by over 100 simultaneous alarms, many contradictory. The alarm printer fell 2 hours behind real-time. Operators couldn't distinguish critical warnings from noise. This incident became a foundational case study in alarm management and directly influenced how modern alerting systems use severity levels, grouping, and deduplication.

Alertmanager: Routing, Grouping, Inhibition

# alertmanager.yml
route:
  receiver: default-slack
  group_by: [alertname, cluster, namespace]
  group_wait: 30s       # Wait before sending the first notification
  group_interval: 5m    # Wait between subsequent notifications for same group
  repeat_interval: 4h   # Don't re-notify for the same alert before this
  routes:
    - match:
        severity: critical
      receiver: pagerduty-critical
      continue: true     # Also match the next route
    - match:
        severity: critical
      receiver: slack-critical

inhibit_rules:
  - source_match:
      severity: critical
    target_match:
      severity: warning
    equal: [alertname, cluster, namespace]

Grouping batches related alerts. Without it, 50 pods OOMKilling in one namespace sends 50 separate Slack messages.

Inhibition suppresses downstream alerts. If NodeDown fires critical, suppress all warning-level pod alerts on that node -- the pods can't run on a dead node.

Silences temporarily mute alerts during maintenance:

amtool silence add alertname="DiskFillingUp" instance="node3:9100" \
  --comment="Replacing disk on node3" --duration=4h

Gotcha: Unbounded silences (no expiry) are the #1 cause of missed incidents. Always set a duration. Review active silences weekly: amtool silence query --alertmanager.url=http://alertmanager:9093

Debugging Alert Routing

# Test which receiver an alert would match
amtool config routes test --config.file=/etc/alertmanager/alertmanager.yml \
  severity=critical team=platform alertname=HighErrorRate

# Show the full routing tree
amtool config routes show --config.file=/etc/alertmanager/alertmanager.yml

Flashcard Check: Alerting Pipeline

| Question | Answer |
| --- | --- |
| What does for: 5m do in an alert rule? | The condition must be continuously true for 5 minutes before the alert fires. It debounces transient spikes. |
| What's the difference between relabel_configs and metric_relabel_configs? | relabel_configs runs before scraping (controls what targets get scraped). metric_relabel_configs runs after scraping (controls what metrics get stored). |
| How does Alertmanager inhibition work? | When a higher-severity alert fires, it suppresses matching lower-severity alerts. Example: NodeDown (critical) suppresses pod alerts (warning) on the same node. |
| How do you test Alertmanager routing without waiting for a real alert? | amtool config routes test with the alert labels you want to test. |

Part 7: Storage, Retention, and Long-Term Solutions

Prometheus was designed for real-time monitoring, not as a data warehouse. Its local TSDB has limits.

Retention

# Default: 15 days
# Set via CLI flags:
--storage.tsdb.retention.time=60d
--storage.tsdb.retention.size=100GB   # whichever limit hits first

Gotcha: If your SLO is measured over a 30-day window but your retention is 15 days, your error budget calculations use incomplete data. Set retention to at least 2x your longest SLO window.

Storage Sizing

storage = series_count x samples_per_day x retention_days x bytes_per_sample

Example: 500,000 series, 15s scrape interval, 15 days retention
= 500,000 x (86,400 / 15) x 15 x 1.7 bytes
= 500,000 x 5,760 x 15 x 1.7
~ 73 GB

Prometheus compresses samples to about 1.5--2 bytes each. That's remarkably efficient, but at millions of series it adds up fast.
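The sizing formula above as a small Python helper -- the 1.7 bytes/sample midpoint and decimal-GB units follow the worked example:

```python
def tsdb_disk_estimate_gb(series, scrape_interval_s, retention_days,
                          bytes_per_sample=1.7):
    """Back-of-envelope TSDB disk estimate from the sizing formula above.
    Compressed samples run ~1.5-2 bytes; 1.7 is a reasonable midpoint."""
    samples_per_day = 86_400 / scrape_interval_s
    total_bytes = series * samples_per_day * retention_days * bytes_per_sample
    return total_bytes / 1e9   # decimal GB, matching the worked example

tsdb_disk_estimate_gb(500_000, 15, 15)   # ~73 GB
```

Vary the inputs to see the levers: doubling retention doubles disk; doubling the scrape interval halves it.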

Remote Write and Long-Term Backends

For retention beyond weeks, push data to a remote backend:

# prometheus.yml
remote_write:
  - url: "http://mimir:9009/api/v1/push"
    queue_config:
      max_samples_per_send: 5000
      batch_send_deadline: 5s

| Backend | Architecture | Key Feature |
| --- | --- | --- |
| Thanos | Sidecar per Prometheus + object storage (S3/GCS) | Global query view, deduplication, downsampling |
| Cortex | Multi-tenant, horizontally scalable | Managed-service compatible, HA |
| Mimir | Cortex successor (Grafana Labs) | Better performance, simpler ops, native multi-tenancy |

Under the Hood: Thanos works by attaching a sidecar to each Prometheus instance. The sidecar uploads compacted blocks to object storage (S3, GCS) and exposes a gRPC Store API. A Thanos Querier federates queries across all Prometheus instances and object storage, deduplicating samples from HA pairs. This means you can run two Prometheus instances scraping the same targets (for redundancy) and Thanos handles the overlap.

Federation (Simpler, Smaller Scale)

A top-level Prometheus scrapes /federate from leaf instances:

- job_name: "federate-cluster-east"
  honor_labels: true
  metrics_path: /federate
  params:
    match[]:
      - 'job:http_requests_total:rate5m'    # Only federate recording rules
      - 'job:http_error_rate:ratio_5m'
      - 'up'
  static_configs:
    - targets: ["prometheus-east.internal:9090"]

Gotcha: Never federate raw metrics with match[]={__name__=~".+"}. Each federation scrape evaluates the match selectors -- on a Prometheus with 1M+ series, that's 10--30 seconds of CPU per scrape. Federate recording rules only. For full-fidelity cross-cluster queries, use Thanos or Mimir.


Part 8: High Availability

A single Prometheus is a single point of failure for your monitoring. Here's how to fix that.

The Simple Approach: Two Identical Prometheus Instances

Run two Prometheus servers with the same config, scraping the same targets. Both independently collect and store all data. If one goes down, the other continues.

The problem: queries hit one instance, and their data diverges slightly (scrape timing differences, brief outages on one side). You need a query layer that deduplicates.

Thanos for Deduplication

┌──────────────┐   ┌──────────────┐
│ Prometheus-0 │   │ Prometheus-1 │   (same config, same targets)
│  + Sidecar   │   │  + Sidecar   │
└──────┬───────┘   └──────┬───────┘
       │                  │
       └────────┬─────────┘
        ┌───────▼────────┐
        │ Thanos Querier │  ← deduplicates overlapping samples
        └───────┬────────┘
        ┌───────▼────────┐
        │    Grafana     │
        └────────────────┘

Thanos Querier knows that both Prometheus instances are replicas (via the replica label) and deduplicates their samples. Grafana points at Thanos Querier instead of Prometheus directly. If Prometheus-0 has a gap, Prometheus-1 fills it in.
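The gap-filling idea can be sketched in a few lines -- a toy model with made-up timestamps and values, not Thanos's actual algorithm:

```python
def dedupe_replicas(primary, secondary):
    """Merge two HA replicas' samples, keyed by timestamp. Prefer the
    primary and fill its gaps from the secondary -- a heavily simplified
    sketch of what Thanos Querier does using the replica label."""
    merged = dict(secondary)   # start with the secondary's samples...
    merged.update(primary)     # ...primary wins wherever both have data
    return sorted(merged.items())

# Prometheus-0 missed two scrapes during a restart; Prometheus-1 has them.
prom0 = {1000: 4.0, 1015: 4.1, 1060: 4.4}
prom1 = {1000: 4.0, 1015: 4.1, 1030: 4.2, 1045: 4.3, 1060: 4.4}
dedupe_replicas(prom0, prom1)   # gap at t=1030 and t=1045 filled in
```

The real querier also has to handle slightly offset scrape timestamps between replicas, which is why it deduplicates by proximity rather than exact-match timestamps.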

Interview Bridge: "How do you make Prometheus highly available?" is a common interview question. The answer isn't "cluster Prometheus" (it doesn't cluster). It's "run two independent instances and deduplicate with Thanos or a similar query layer."


Exercises

Exercise 1: Read the TSDB Status (2 minutes)

If you have a Prometheus instance running:

curl -s http://localhost:9090/api/v1/status/tsdb | jq '.data.headStats'

Questions:

  • How many active series does your instance have?
  • What's the ratio of chunks to series? (Roughly 4:1 is normal for a 2-hour head block with 15s scrapes)

Don't have a running Prometheus? Here's what to look for. `numSeries` is the count of active time series in the head block. Multiply by roughly 2 KB per series for a rough memory estimate of the head block's contribution. `chunkCount` is the number of in-memory compressed sample chunks. Each chunk holds ~120 samples.
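Given those numbers, a rough memory estimate is a one-line calculation -- the 2 KB/series figure is the heuristic above, not an exact constant:

```python
def head_memory_estimate_gb(head_stats, bytes_per_series=2048):
    """Rough head-block memory estimate from /api/v1/status/tsdb output.
    ~2 KB per active series is a heuristic, not an exact figure."""
    return head_stats["numSeries"] * bytes_per_series / 1024**3

# The incident's headStats from Part 1:
head_memory_estimate_gb({"numSeries": 12_847_291})   # ~24.5 GB
```

That's the head block's series overhead alone; chunks, the WAL, index structures, and query evaluation push the real process footprint well beyond it.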

Exercise 2: Find Your Cardinality Hogs (5 minutes)

curl -s http://localhost:9090/api/v1/status/tsdb | \
  jq -r '.data.seriesCountByMetricName[:10][] | "\(.value)\t\(.name)"' | sort -rn

  • Identify the top 3 metrics by series count.
  • For each, explain why they have that many series (how many label dimensions, how many values per dimension).
  • Is any metric suspiciously high?

Hint: Histograms naturally have more series (one per bucket). A histogram with 10 buckets, across 50 pods, is 500+ series just from the buckets. That's expected. What's NOT expected is a single metric with 100K+ series -- investigate its labels.

Exercise 3: Write a Cardinality Alert (10 minutes)

Write a Prometheus alerting rule that fires when any single metric name has more than 100,000 time series. Include:

  • A for duration
  • A severity label
  • An annotation with a summary that includes the metric name and series count

Solution
groups:
  - name: meta-monitoring
    rules:
      - alert: CardinalityExplosion
        expr: count by (__name__) ({__name__=~".+"}) > 100000
        for: 15m
        labels:
          severity: warning
        annotations:
          summary: "Metric {{ $labels.__name__ }} has {{ $value }} series -- investigate label cardinality"
          runbook: "https://wiki.internal/runbooks/cardinality-explosion"

Exercise 4: Design Histogram Buckets (10 minutes)

Your service has an SLO of "99.5% of requests complete in under 300ms." The current histogram uses default buckets: [.005, .01, .025, .05, .1, .25, .5, 1, 2.5, 5, 10].

  1. Why are the default buckets bad for this SLO?
  2. Design better bucket boundaries.
  3. How many additional time series do your new buckets create per label combination?
Solution

  1. The default buckets have no boundary at 300ms. The nearest are 250ms and 500ms. `histogram_quantile` linearly interpolates, so the p99.5 calculation between these wide boundaries is inaccurate.
  2. Better buckets:

buckets=[.005, .01, .025, .05, .1, .2, .25, .3, .35, .5, 1, 2.5, 5, 10]
#                                  ^^       ^^   ^^^
#                    Added .2, .3, .35 around the 300ms SLO boundary

  3. Three additional buckets (200ms, 300ms, 350ms) = 3 additional time series per unique label combination.

Cheat Sheet

TSDB Diagnostics

| Command | What it tells you |
| --- | --- |
| curl prometheus:9090/api/v1/status/tsdb \| jq '.data.headStats' | Active series, chunk count, head block time range |
| curl prometheus:9090/api/v1/status/tsdb \| jq '.data.seriesCountByMetricName[:10]' | Top 10 metrics by series count |
| curl prometheus:9090/api/v1/status/tsdb \| jq '.data.labelValueCountByLabelName[:10]' | Top 10 labels by unique value count |
| du -sh /prometheus/wal/ | WAL size (>5 GB = investigate) |
| du -sh /prometheus/chunks_head/ | Head block chunks on disk |

PromQL Quick Reference

| Pattern | Example |
| --- | --- |
| Counter rate | rate(http_requests_total[5m]) |
| Error ratio | sum(rate(errors[5m])) / sum(rate(total[5m])) |
| p99 latency | histogram_quantile(0.99, sum by (le) (rate(duration_bucket[5m]))) |
| Average latency | rate(duration_sum[5m]) / rate(duration_count[5m]) |
| Disk full prediction | predict_linear(node_filesystem_free_bytes[6h], 24*3600) < 0 |
| Missing metric | absent(up{job="my-service"}) |

Metric Types at a Glance

| Type | Goes up/down? | Use rate() on it? | Example |
| --- | --- | --- | --- |
| Counter | Up only (resets on restart) | Yes, always | http_requests_total |
| Gauge | Both | No -- use deriv() or threshold | node_memory_MemAvailable_bytes |
| Histogram | N/A (buckets are counters) | Yes, on the buckets | http_request_duration_seconds_bucket |
| Summary | N/A (quantiles are gauges) | No | rpc_duration_seconds{quantile="0.99"} |

Cardinality Rules of Thumb

| Guideline | Number |
| --- | --- |
| Healthy series per service | 1,000--5,000 |
| Max unique values per label | ~100 (strongly bounded) |
| Alert threshold for a single metric | >50,000 series |
| Series per histogram bucket per label combo | 1 |

Alerting Pipeline

Alert rule (Prometheus) → for duration → Alertmanager
  → route matching (label tree) → grouping (batch related alerts)
  → inhibition (suppress downstream) → receiver (Slack, PagerDuty)
  → silence check → deliver or mute

Takeaways

  1. Prometheus memory is driven by active time series count. The head block keeps every active series in RAM. More series = more memory. The TSDB status API is your diagnostic starting point.

  2. Cardinality is multiplicative, not additive. One unbounded label (user IDs, raw paths) combined with histogram buckets creates millions of series. Never use unbounded values as Prometheus labels.

  3. rate() for alerting, irate() for dashboards. rate() smooths over the full range and handles counter resets. irate() reacts instantly but is noisy and produces false spikes at reset boundaries.

  4. Histogram bucket boundaries must include your SLO thresholds. histogram_quantile interpolates linearly between buckets. No bucket near your SLO = inaccurate percentile calculations.

  5. The alerting pipeline has four distinct failure points: rule evaluation, Alertmanager routing, receiver delivery, and silence/inhibition misconfiguration. Test each one independently.

  6. Prometheus doesn't cluster -- it replicates. For high availability, run two independent instances and deduplicate with Thanos. For long-term storage, use remote write to Mimir, Thanos, or Cortex.