Prometheus: Under the Hood
- lesson
- prometheus-tsdb
- promql
- cardinality
- service-discovery
- alerting-pipeline
- long-term-storage
- high-availability
- l2

# Prometheus -- Under the Hood

Topics: Prometheus TSDB, PromQL, cardinality, service discovery, alerting pipeline, long-term storage, high availability
Level: L2 (Operations)
Time: 90--120 minutes
Strategy: Build-up + incident-driven
The Mission¶
It's 2 AM. PagerDuty fires: Prometheus is using 80 GB of RAM on a box with 96 GB. Queries to Grafana are timing out. The on-call Slack channel is full of "dashboards are blank" messages. Your monitoring system -- the thing that watches everything else -- is about to fall over.
You need to figure out why Prometheus is eating memory, stop the bleeding, and make sure it never happens again. To do that, you need to understand how Prometheus actually works inside -- not the marketing overview, but the storage engine, the query model, and the places where things go wrong.
Let's build that understanding from the ground up, then use it to save the night.
Part 1: The TSDB -- Where Your Metrics Live¶
Every sample Prometheus scrapes lands in its local time-series database. Understanding this storage engine is the difference between "restart it and pray" and "I know exactly what's wrong."
The Write Path: WAL, Head Block, Persistent Blocks¶
When Prometheus scrapes a target, the sample doesn't go straight to disk as a nice compressed file. It takes a journey:
Step 1: The WAL. Every incoming sample is first appended to the Write-Ahead Log -- a sequential, append-only file on disk. This is your crash recovery insurance. If Prometheus dies mid-scrape, it replays the WAL on startup to recover samples that hadn't been persisted yet.
# The WAL lives here
ls -lh /prometheus/wal/
# You'll see numbered segment files: 00000001, 00000002, ...
# Each segment is up to 128 MB
Under the Hood: The WAL design is borrowed from databases like PostgreSQL and LevelDB. The idea: sequential writes to an append-only log are fast and durable. Random writes to a structured database are slow. So write fast first (WAL), structure later (compaction).
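The write-fast-first, structure-later idea can be sketched in a few lines of Python. This is a toy illustration only -- Prometheus's real WAL uses checksummed binary records and segment files, not JSON lines -- but the append-and-replay shape is the same:

```python
import json
import os
import tempfile

class TinyWAL:
    """Toy write-ahead log: append-only lines on disk, replayable after a crash."""

    def __init__(self, path):
        self.path = path
        self.f = open(path, "a")

    def append(self, series, ts, value):
        # Sequential append + flush + fsync: fast and durable
        self.f.write(json.dumps([series, ts, value]) + "\n")
        self.f.flush()
        os.fsync(self.f.fileno())

    def replay(self):
        # On startup, rebuild in-memory state by re-reading the log
        with open(self.path) as f:
            return [tuple(json.loads(line)) for line in f]

path = os.path.join(tempfile.mkdtemp(), "wal.log")
wal = TinyWAL(path)
wal.append("http_requests_total", 1711036800, 42)
wal.append("http_requests_total", 1711036815, 43)

# Simulate a crash: a fresh "process" replays the log and recovers both samples
recovered = TinyWAL(path).replay()
```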
Step 2: The Head Block. Samples accumulate in memory in the "head block" -- a
structure optimized for recent data that's still being written to. The head block covers
roughly the last 2 hours of data (configurable via --storage.tsdb.min-block-duration).
This is where your RAM goes. Every active time series has an in-memory representation in the head block. More series = more memory.
Step 3: Compaction. Every 2 hours, the head block is "cut" -- its contents are compressed and written to a persistent block on disk. Prometheus also merges smaller blocks into larger ones over time (compaction), which improves query performance and reduces disk usage.
Time →
┌─────────┐ ┌─────────┐ ┌─────────┐ ┌──────────────────┐
│ Block 1 │ │ Block 2 │ │ Block 3 │ │ Head Block │
│ 0-2h │ │ 2-4h │ │ 4-6h │ │ (in memory) │
│ (disk) │ │ (disk) │ │ (disk) │ │ 6h-now │
└─────────┘ └─────────┘ └─────────┘ └──────────────────┘
↑ WAL backs this up
Gotcha: The WAL can grow very large during high churn (lots of new series appearing and disappearing). Kubernetes environments with frequent pod churn are especially vulnerable. A 10 GB WAL is a sign something is wrong. Monitor it:
du -sh /prometheus/wal/
Why This Matters for Our Incident¶
At 80 GB of RAM, the head block is enormous. That means either:
1. There are a massive number of active time series, or
2. The head block hasn't been compacted and covers too long a time range

In practice, it's almost always #1. Let's check the TSDB status API:
curl -s http://prometheus:9090/api/v1/status/tsdb | jq '.data.headStats'
{
"numSeries": 12847291,
"numLabelPairs": 38541873,
"chunkCount": 51389164,
"minTime": 1711036800000,
"maxTime": 1711044000000
}
12.8 million active time series. For a cluster running a few hundred services, a healthy number is 200K--1M. We're 10x over. Something is creating series like they're free.
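A back-of-envelope check makes the memory picture concrete, using the rough heuristic of ~2 KB of head-block overhead per active series (a heuristic, not an exact figure -- actual per-series cost depends on label sizes and chunk counts):

```python
# Rough floor on head-block memory: ~2 KB per active series (assumed heuristic)
num_series = 12_847_291
bytes_per_series = 2048

floor_gib = num_series * bytes_per_series / 2**30  # ~24.5 GiB
```

Even this floor is roughly 25 GiB for series bookkeeping alone; in-memory chunks, WAL replay, and query buffers multiply it, so 80 GB of RAM is entirely consistent with 12.8 million series.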
Flashcard Check: TSDB Basics¶
| Question | Answer |
|---|---|
| What are the three stages of Prometheus's write path? | WAL (append-only log on disk) -> Head Block (in memory) -> Persistent Block (compressed on disk) |
| Why does the WAL exist? | Crash recovery. If Prometheus dies, it replays the WAL to recover uncompacted samples. |
| What determines Prometheus's memory usage? | Primarily the number of active time series in the head block. More series = more RAM. |
| How often does the head block get compacted to disk? | Approximately every 2 hours (controlled by --storage.tsdb.min-block-duration). |
Part 2: Metric Types -- The Building Blocks¶
Before we hunt the cardinality bomb, you need to know what kinds of metrics exist and how they behave. There are four types, and picking the wrong one is a common source of confusion.
Counter¶
A number that only goes up. Resets to zero when the process restarts.
You never alert on the raw value. A counter of 145,232 tells you nothing. The rate of change tells you everything:
rate(http_requests_total[5m]) # requests per second, averaged over 5 minutes
increase(http_requests_total[1h]) # total increase over the past hour
Name Origin: The term "counter" in Prometheus comes from the same concept in hardware performance counters -- CPU registers that only increment when an event occurs (cache miss, branch misprediction). You never read the raw counter; you read the difference between two readings.
The Counter Reset Problem¶
What happens when a service restarts and the counter drops from 150,000 to 0?
rate() handles this. It detects when a value decreases (which should never happen for a
counter) and assumes a reset occurred. It calculates the rate using only the post-reset
samples.
But irate() -- which uses only the last two data points -- can produce a brief spike at
the reset boundary because of interpolation artifacts.
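The reset detection is easy to sketch. This simplified helper (a hypothetical name, not the real implementation -- Prometheus's rate() also extrapolates to the edges of the range window) shows only the reset-handling logic:

```python
def counter_increase(samples):
    """Total increase over a window of counter samples, tolerating resets.

    On a decrease (impossible for a real counter), assume the process
    restarted from zero, so the post-reset value is all new increase.
    """
    total = 0.0
    prev = samples[0]
    for s in samples[1:]:
        if s < prev:          # reset detected
            total += s        # counter restarted at 0 and climbed to s
        else:
            total += s - prev
        prev = s
    return total

# 100 -> 150 (+50), restart drops to 10 (+10), 10 -> 60 (+50) = 110
counter_increase([100, 150, 10, 60])
```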
| Function | How it works | Use for |
|---|---|---|
| rate() | Average rate across the full range | Alerting, recording rules |
| irate() | Instantaneous rate from last two points | Dashboards (responsive but noisy) |
| increase() | Total increase over range | "How many errors in the last hour?" |

Remember: rate() for alerting, irate() for dashboards. If you alert on irate(), counter resets will page you at 3 AM for nothing.
Gauge¶
A value that goes up and down. Current temperature, memory in use, queue depth.
# Alert when memory is low
(1 - node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes) > 0.9
# Is the disk filling up? (rate of change of a gauge)
deriv(node_filesystem_free_bytes{mountpoint="/"}[1h])
# Predict when disk hits zero (linear extrapolation)
predict_linear(node_filesystem_free_bytes{mountpoint="/"}[6h], 24*3600) < 0
Gotcha: Never use rate() on a gauge. rate() assumes values only go up and treats decreases as counter resets. On a gauge, a decrease from 8 to 2 looks like a "reset" and produces garbage. Use deriv() for rate of change on gauges.
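To see why, apply the reset assumption to a gauge that genuinely dropped from 8 to 2 (hypothetical helper, showing only the core assumption):

```python
def counter_delta(prev, cur):
    # rate()'s core assumption: a decrease can only mean the counter reset to 0
    return cur if cur < prev else cur - prev

# Gauge drops 8 -> 2: the true delta is -6, but the reset assumption
# interprets it as "restarted at 0, then climbed to 2" and reports +2.
counter_delta(8, 2)
```

A line fit over the raw samples, which is what deriv() does, would correctly report a negative slope instead.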
Histogram¶
This is where things get interesting (and where cardinality gets expensive). A histogram counts observations in pre-defined buckets.
http_request_duration_seconds_bucket{handler="/api/users", le="0.005"} 12000
http_request_duration_seconds_bucket{handler="/api/users", le="0.01"} 14500
http_request_duration_seconds_bucket{handler="/api/users", le="0.025"} 15200
http_request_duration_seconds_bucket{handler="/api/users", le="0.05"} 15400
http_request_duration_seconds_bucket{handler="/api/users", le="0.1"} 15450
http_request_duration_seconds_bucket{handler="/api/users", le="0.25"} 15480
http_request_duration_seconds_bucket{handler="/api/users", le="0.5"} 15490
http_request_duration_seconds_bucket{handler="/api/users", le="1"} 15495
http_request_duration_seconds_bucket{handler="/api/users", le="+Inf"} 15500
http_request_duration_seconds_sum{handler="/api/users"} 103.42
http_request_duration_seconds_count{handler="/api/users"} 15500
That's 11 time series for one handler on one instance: nine buckets plus _sum and
_count. Across 50 handlers and 20 pods, that's 11,000 series from a single histogram
metric.
# p99 latency
histogram_quantile(0.99, rate(http_request_duration_seconds_bucket[5m]))
# p50 (median)
histogram_quantile(0.50, rate(http_request_duration_seconds_bucket[5m]))
# Average latency (no buckets needed)
rate(http_request_duration_seconds_sum[5m])
/ rate(http_request_duration_seconds_count[5m])
Why Bucket Boundaries Matter¶
histogram_quantile() linearly interpolates between bucket boundaries. If your SLO says
"99% of requests under 200ms" but your nearest buckets are le="0.1" (100ms) and
le="0.25" (250ms), the p99 calculation is an approximation that can be wildly off.
# Fix: add buckets at your SLO boundaries
from prometheus_client import Histogram
request_latency = Histogram(
'http_request_duration_seconds',
'Request latency',
['method', 'handler'],
buckets=[.005, .01, .025, .05, .1, .15, .2, .25, .3, .5, 1, 2.5, 5, 10]
    # .15, .2, and .3 added around the 200ms SLO boundary
)
Mental Model: Think of histogram buckets like a ruler. If your ruler only has marks at 1cm and 10cm, measuring something 3.7cm long gives you a bad answer. Put marks where you actually need precision.
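The interpolation itself is simple to sketch. This is a simplified stand-in for what histogram_quantile does with cumulative (le, count) buckets -- toy code, not the real implementation:

```python
import math

def bucket_quantile(q, buckets):
    """Estimate a quantile from cumulative (le, count) histogram buckets.

    Linearly interpolates inside the bucket containing the target rank,
    which is exactly why accuracy depends on where the boundaries sit.
    """
    total = buckets[-1][1]
    rank = q * total
    prev_le, prev_count = 0.0, 0.0
    for le, count in buckets:
        if count >= rank:
            if math.isinf(le):
                return prev_le  # can't interpolate into the +Inf bucket
            return prev_le + (le - prev_le) * (rank - prev_count) / (count - prev_count)
        prev_le, prev_count = le, count

# 90 requests under 100ms, 9 more under 250ms, 1 under 500ms.
# p95 lands inside the wide 100-250ms bucket and gets interpolated:
buckets = [(0.1, 90), (0.25, 99), (0.5, 100), (math.inf, 100)]
p95 = bucket_quantile(0.95, buckets)  # ~183ms, regardless of where the
                                      # real samples sit inside that bucket
```

If the true latencies cluster near 110ms, the interpolated 183ms answer is badly wrong -- the ruler had no mark where you needed one.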
Summary¶
Pre-computes quantiles client-side. Cheaper for Prometheus to store, but you cannot aggregate summaries across instances. The average of p99 values from 10 instances is not the p99 of the combined distribution. That's just math.
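That "just math" claim is easy to demonstrate with a hypothetical two-instance example, using nearest-rank quantiles for simplicity:

```python
def quantile(samples, q):
    # Nearest-rank quantile: simple, good enough for the demonstration
    s = sorted(samples)
    return s[min(len(s) - 1, int(q * len(s)))]

# Instance A is uniformly fast; instance B has a slow tail
a = [0.010] * 100                 # p99 = 10ms
b = [0.010] * 80 + [0.500] * 20   # p99 = 500ms

avg_of_p99s = (quantile(a, 0.99) + quantile(b, 0.99)) / 2  # 255ms
true_p99 = quantile(a + b, 0.99)                           # 500ms
```

Averaging the per-instance p99s reports 255ms; the real p99 of the combined traffic is 500ms. Histograms avoid this by summing buckets first and computing the quantile last.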
| | Histogram | Summary |
|---|---|---|
| Aggregation | Yes (aggregate buckets, then compute quantile) | No (pre-computed quantiles can't be combined) |
| Bucket config | Server-side, changeable without redeploy | Client-side, fixed at instrumentation time |
| Cost | More series (one per bucket) | Fewer series |
| Use when | You need cross-instance percentiles (almost always) | Per-instance quantiles and you'll never aggregate |
Default choice: histogram. Unless you have a specific reason, always use histograms.
Part 3: The Cardinality Bomb -- Diagnosing the Incident¶
Back to our 2 AM crisis. 12.8 million series. Let's find the offender.
Step 1: Find the Top Metrics¶
curl -s http://prometheus:9090/api/v1/status/tsdb | \
jq '[.data.seriesCountByMetricName[:10][] | {name: .name, count: .value}]'
[
{ "name": "http_request_duration_seconds_bucket", "count": 8547000 },
{ "name": "http_request_duration_seconds_count", "count": 854700 },
{ "name": "http_request_duration_seconds_sum", "count": 854700 },
{ "name": "node_cpu_seconds_total", "count": 128000 },
{ "name": "container_memory_working_set_bytes", "count": 95000 }
]
8.5 million series from one histogram's bucket metric. That's our cardinality bomb.
Step 2: Find the Exploding Label¶
# Which label has the most unique values?
curl -s http://prometheus:9090/api/v1/status/tsdb | \
jq '[.data.labelValueCountByLabelName[:10][] | {label: .name, values: .value}]'
[
{ "label": "request_path", "values": 847000 },
{ "label": "pod", "values": 1200 },
{ "label": "le", "values": 11 },
{ "label": "method", "values": 5 }
]
847,000 unique values for request_path. Someone instrumented their HTTP middleware to use
the raw URL path -- /api/v1/users/12345, /api/v1/users/67890 -- instead of the route
template /api/v1/users/:id.
Mental Model: Cardinality is multiplicative. Labels don't add -- they multiply. 10 histogram buckets x 5 methods x 847,000 paths x 2 instances = 84.7 million potential series. The request_path label alone turned a 1,000-series metric into a multi-million series monster.
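The multiplication is worth writing out, alongside what the same metric would cost with a route template instead of raw paths (the ~50-route figure is an assumption matching the earlier handler count):

```python
buckets, methods, paths, instances = 10, 5, 847_000, 2
potential_series = buckets * methods * paths * instances  # 84,700,000

# Same metric with route templates (/api/v1/users/:id), ~50 routes assumed:
bounded_series = buckets * methods * 50 * instances       # 5,000
```

Same instrumentation, four orders of magnitude apart -- the only difference is whether one label is bounded.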
Step 3: Find Which Service Is Doing This¶
curl -s http://prometheus:9090/api/v1/query \
  --data-urlencode 'query=count(http_request_duration_seconds_bucket) by (job)' | \
  jq '[.data.result[] | {job: .metric.job, series: (.value[1] | tonumber)}] | sort_by(-.series)'
[
  { "job": "user-service", "series": 8200000 },
  { "job": "api-gateway", "series": 45000 },
  { "job": "auth-service", "series": 12000 }
]
It's user-service. 8.2 million series from one service.
Step 4: Stop the Bleeding (Now)¶
The instrumentation fix needs a code deploy. That's hours away. We stop the ingestion immediately with a metric relabel config:
# Add to the user-service scrape config in prometheus.yml
metric_relabel_configs:
- source_labels: [__name__, request_path]
regex: 'http_request_duration_seconds_(bucket|count|sum);/api/v1/users/[0-9]+'
action: drop
# Reload the config (Prometheus must have --web.enable-lifecycle)
curl -XPOST http://prometheus:9090/-/reload
Dropped series stop being ingested on the next scrape cycle. Memory won't drop immediately -- the head block keeps existing series until the next compaction -- but growth stops.
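Prometheus joins the source_labels values with ";" and matches the regex fully anchored (RE2). You can sanity-check the drop rule offline before reloading -- Python's re is close enough to RE2 for a pattern like this:

```python
import re

# The regex from the metric_relabel_configs drop rule above
PATTERN = r"http_request_duration_seconds_(bucket|count|sum);/api/v1/users/[0-9]+"

def would_drop(name, request_path):
    # Prometheus concatenates source_labels with ';' and anchors the match
    return re.fullmatch(PATTERN, f"{name};{request_path}") is not None

would_drop("http_request_duration_seconds_bucket", "/api/v1/users/12345")  # dropped
would_drop("http_request_duration_seconds_bucket", "/api/v1/users/:id")    # kept
would_drop("http_requests_total", "/api/v1/users/12345")                   # kept
```

Testing the pattern this way is much cheaper than discovering a typo after you've reloaded production Prometheus.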
War Story: At one company, a single developer added a trace_id label to a request counter during a debugging session and forgot to remove it. Each request generated a unique trace ID. Within 6 hours, the metric had created over 100,000 time series. Prometheus went from 4 GB to 35 GB of RAM, queries started timing out, and every Grafana dashboard went blank. The monitoring system that was supposed to detect problems became the problem. The fix was a two-line metric_relabel_config -- but finding the cause took 45 minutes of panic at 3 AM. The postmortem action item: a standing alert on cardinality growth.
Step 5: Prevent It From Happening Again¶
# Alert when any single metric has too many series
groups:
- name: cardinality-watchdog
rules:
- alert: HighCardinalityMetric
expr: count by (__name__) ({__name__=~".+"}) > 50000
for: 10m
labels:
severity: warning
annotations:
summary: "Metric {{ $labels.__name__ }} has {{ $value }} series"
runbook: "https://wiki.internal/runbooks/cardinality-explosion"
Flashcard Check: Cardinality¶
| Question | Answer |
|---|---|
| What makes label cardinality dangerous? | Cardinality is multiplicative. Each label's unique values multiply with every other label's values to determine total series count. |
| Name three label values that should never be Prometheus labels. | User IDs, request IDs/trace IDs, UUIDs, email addresses, raw URL paths -- anything unbounded. |
| How do you find the top metrics by series count? | curl http://prometheus:9090/api/v1/status/tsdb and inspect seriesCountByMetricName. |
| How do you stop a cardinality explosion without a code deploy? | Add metric_relabel_configs with action: drop to the scrape config and reload Prometheus. |
Part 4: PromQL Deep Dive¶
Now that we've saved the night, let's go deeper on the query language. PromQL is deceptively simple -- until you need to write a real alert.
rate() vs irate() -- When It Matters¶
Both compute per-second rates from counters. The difference is in how much data they use:
rate() averages across the entire range, smoothing out spikes. irate() reacts
instantly to the latest change but is noisy.
# Smooth, stable -- good for alerting
rate(http_requests_total{job="api-server"}[5m])
# Responsive, spiky -- good for dashboards
irate(http_requests_total{job="api-server"}[5m])
Gotcha: rate() needs at least two samples in the range window. With a 15-second scrape interval, rate(metric[30s]) gives you exactly two samples -- and if one scrape is late, you get zero. Use a range of at least 4x your scrape interval. The safe default for 15-second scrapes: [1m] minimum, [5m] for alerting.
Aggregation Operators¶
# Total request rate across all instances
sum(rate(http_requests_total[5m]))
# Grouped by status code
sum by (status) (rate(http_requests_total[5m]))
# Everything EXCEPT the instance label
sum without (instance) (rate(http_requests_total[5m]))
# Top 5 handlers by request rate
topk(5, sum by (handler) (rate(http_requests_total[5m])))
# How many targets are up?
count(up == 1)
histogram_quantile() -- The Function Everyone Gets Wrong¶
# p99 latency across all instances (correct)
histogram_quantile(0.99,
sum by (le) (rate(http_request_duration_seconds_bucket[5m]))
)
The by (le) is critical. histogram_quantile needs the le (less-than-or-equal)
label to know the bucket boundaries. If you aggregate away le, you get garbage.
Want per-handler p99? Add handler to the grouping, but keep le:
histogram_quantile(0.99,
  sum by (handler, le) (rate(http_request_duration_seconds_bucket[5m]))
)
The rule: always keep le in your by clause when using histogram_quantile.
Recording Rules: Pre-Computing Expensive Queries¶
If that histogram_quantile query takes 8 seconds to evaluate, pre-compute it:
groups:
- name: latency-recording
interval: 30s
rules:
- record: job:http_request_duration_seconds:p99_5m
expr: |
histogram_quantile(0.99,
sum by (job, le) (rate(http_request_duration_seconds_bucket[5m]))
)
- record: job:http_error_rate:ratio_5m
expr: |
sum by (job) (rate(http_requests_total{status=~"5.."}[5m]))
/
sum by (job) (rate(http_requests_total[5m]))
Trivia: The naming convention level:metric:operations (like job:http_requests_total:rate5m) was established in the Prometheus documentation and follows a pattern borrowed from Borgmon, Google's internal monitoring system that inspired Prometheus. The convention makes it immediately clear what aggregation level and operations were applied.
Part 5: Service Discovery -- How Prometheus Finds Targets¶
Static configs work for 5 servers. In Kubernetes with pods spinning up and down every minute, you need dynamic discovery.
Kubernetes Service Discovery¶
scrape_configs:
- job_name: "kubernetes-pods"
kubernetes_sd_configs:
- role: pod
relabel_configs:
# Only scrape pods with the annotation prometheus.io/scrape: "true"
- source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_scrape]
action: keep
regex: true
# Use the pod's prometheus.io/port annotation
- source_labels: [__address__, __meta_kubernetes_pod_annotation_prometheus_io_port]
action: replace
regex: ([^:]+)(?::\d+)?;(\d+)
replacement: $1:$2
target_label: __address__
# Carry pod metadata as labels
- source_labels: [__meta_kubernetes_namespace]
target_label: namespace
- source_labels: [__meta_kubernetes_pod_name]
target_label: pod
The Prometheus Operator Way (ServiceMonitors)¶
In most Kubernetes clusters, you don't write raw scrape configs. The Prometheus Operator manages everything via CRDs:
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
name: api-service
labels:
team: platform
spec:
selector:
matchLabels:
app: api-service
endpoints:
- port: metrics
interval: 15s
path: /metrics
namespaceSelector:
matchNames:
- production
Gotcha: A ServiceMonitor not being picked up is the #1 Prometheus Operator debugging question. It's a two-level selector problem:
1. The Prometheus CR selects which ServiceMonitors to load (serviceMonitorSelector)
2. Each ServiceMonitor selects which Services to scrape (selector)
Missing either level = zero targets, zero errors. Check both:
kubectl get prometheus -o yaml | grep -A 3 serviceMonitorSelector
kubectl get servicemonitor api-service -o yaml | grep -A 3 'selector:'
Other Discovery Mechanisms¶
| Mechanism | Use case |
|---|---|
| static_configs | Small, fixed fleets |
| kubernetes_sd_configs | Kubernetes pods, services, nodes |
| ec2_sd_configs | AWS EC2 instances by tag |
| consul_sd_configs | Consul service registry |
| file_sd_configs | JSON/YAML files (good for custom scripts that output targets) |
| dns_sd_configs | DNS SRV records |
Relabeling: The Swiss Army Knife¶
Relabeling transforms labels at two stages:
- relabel_configs -- before scraping (controls what gets scraped)
- metric_relabel_configs -- after scraping (controls what gets stored)
# Drop all go_gc internal metrics to save cardinality
metric_relabel_configs:
- source_labels: [__name__]
regex: "go_gc_.*"
action: drop
# Remove an unbounded label
- regex: "request_id"
action: labeldrop
Part 6: The Alerting Pipeline¶
Prometheus doesn't send alerts to Slack directly. The pipeline has distinct stages, and each one can fail silently if misconfigured.
Alert Rules (Prometheus) → Alertmanager → Receivers (Slack, PagerDuty, email)
↓ ↓
"Is condition true "Route, group,
for 5 minutes?" deduplicate, silence"
Alert Rules¶
groups:
- name: api-alerts
rules:
- alert: HighErrorRate
expr: |
sum(rate(http_requests_total{status=~"5.."}[5m]))
/
sum(rate(http_requests_total[5m]))
> 0.05
for: 5m
labels:
severity: critical
team: platform
annotations:
summary: "High 5xx error rate: {{ $value | humanizePercentage }}"
runbook: "https://wiki.internal/runbooks/high-error-rate"
The for: 5m is your debounce. The condition must be true for 5 continuous minutes
before the alert fires. Without it, a single bad scrape pages you at 3 AM.
Trivia: The Three Mile Island nuclear accident in 1979 was worsened by over 100 simultaneous alarms, many contradictory. The alarm printer fell 2 hours behind real-time. Operators couldn't distinguish critical warnings from noise. This incident became a foundational case study in alarm management and directly influenced how modern alerting systems use severity levels, grouping, and deduplication.
Alertmanager: Routing, Grouping, Inhibition¶
# alertmanager.yml
route:
receiver: default-slack
group_by: [alertname, cluster, namespace]
group_wait: 30s # Wait before sending the first notification
group_interval: 5m # Wait between subsequent notifications for same group
repeat_interval: 4h # Don't re-notify for the same alert before this
routes:
- match:
severity: critical
receiver: pagerduty-critical
continue: true # Also match the next route
- match:
severity: critical
receiver: slack-critical
inhibit_rules:
- source_match:
severity: critical
target_match:
severity: warning
equal: [alertname, cluster, namespace]
Grouping batches related alerts. Without it, 50 pods OOMKilling in one namespace sends 50 separate Slack messages.
Inhibition suppresses downstream alerts. If NodeDown fires critical, suppress all
warning-level pod alerts on that node -- the pods can't run on a dead node.
Silences temporarily mute alerts during maintenance:
amtool silence add alertname="DiskFillingUp" instance="node3:9100" \
--comment="Replacing disk on node3" --duration=4h
Gotcha: Unbounded silences (no expiry) are the #1 cause of missed incidents. Always set a duration. Review active silences weekly:
amtool silence query --alertmanager.url=http://alertmanager:9093
Debugging Alert Routing¶
# Test which receiver an alert would match
amtool config routes test --config.file=/etc/alertmanager/alertmanager.yml \
severity=critical team=platform alertname=HighErrorRate
# Show the full routing tree
amtool config routes show --config.file=/etc/alertmanager/alertmanager.yml
Flashcard Check: Alerting Pipeline¶
| Question | Answer |
|---|---|
| What does for: 5m do in an alert rule? | The condition must be continuously true for 5 minutes before the alert fires. It debounces transient spikes. |
| What's the difference between relabel_configs and metric_relabel_configs? | relabel_configs runs before scraping (controls what targets get scraped). metric_relabel_configs runs after scraping (controls what metrics get stored). |
| How does Alertmanager inhibition work? | When a higher-severity alert fires, it suppresses matching lower-severity alerts. Example: NodeDown (critical) suppresses pod alerts (warning) on the same node. |
| How do you test Alertmanager routing without waiting for a real alert? | amtool config routes test with the alert labels you want to test. |
Part 7: Storage, Retention, and Long-Term Solutions¶
Prometheus was designed for real-time monitoring, not as a data warehouse. Its local TSDB has limits.
Retention¶
# Default: 15 days
# Set via CLI flags:
--storage.tsdb.retention.time=60d
--storage.tsdb.retention.size=100GB # whichever limit hits first
Gotcha: If your SLO is measured over a 30-day window but your retention is 15 days, your error budget calculations use incomplete data. Set retention to at least 2x your longest SLO window.
Storage Sizing¶
storage = series_count x samples_per_day x retention_days x bytes_per_sample
Example: 500,000 series, 15s scrape interval, 15 days retention
= 500,000 x (86,400 / 15) x 15 x 1.7 bytes
= 500,000 x 5,760 x 15 x 1.7
~ 73 GB
Prometheus compresses samples to about 1.5--2 bytes each. That's remarkably efficient, but at millions of series it adds up fast.
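The sizing formula drops straight into a small helper. The 1.7 bytes/sample figure is the assumed compression ratio from the worked example above; real-world values land between roughly 1.5 and 2:

```python
def tsdb_bytes(series, scrape_interval_s, retention_days, bytes_per_sample=1.7):
    """Estimate local TSDB disk usage from series count and scrape cadence."""
    samples_per_day = 86_400 / scrape_interval_s
    return series * samples_per_day * retention_days * bytes_per_sample

# The worked example: 500K series, 15s scrapes, 15 days retention
gb = tsdb_bytes(500_000, 15, 15) / 1e9
print(f"~{gb:.0f} GB")  # ~73 GB
```

Handy for answering "what happens to disk if we double retention?" before you actually do it.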
Remote Write and Long-Term Backends¶
For retention beyond weeks, push data to a remote backend:
# prometheus.yml
remote_write:
- url: "http://mimir:9009/api/v1/push"
queue_config:
max_samples_per_send: 5000
batch_send_deadline: 5s
| Backend | Architecture | Key Feature |
|---|---|---|
| Thanos | Sidecar per Prometheus + object storage (S3/GCS) | Global query view, deduplication, downsampling |
| Cortex | Multi-tenant, horizontally scalable | Managed-service compatible, HA |
| Mimir | Cortex successor (Grafana Labs) | Better performance, simpler ops, native multi-tenancy |
Under the Hood: Thanos works by attaching a sidecar to each Prometheus instance. The sidecar uploads compacted blocks to object storage (S3, GCS) and exposes a gRPC Store API. A Thanos Querier federates queries across all Prometheus instances and object storage, deduplicating samples from HA pairs. This means you can run two Prometheus instances scraping the same targets (for redundancy) and Thanos handles the overlap.
Federation (Simpler, Smaller Scale)¶
A top-level Prometheus scrapes /federate from leaf instances:
- job_name: "federate-cluster-east"
honor_labels: true
metrics_path: /federate
params:
match[]:
- 'job:http_requests_total:rate5m' # Only federate recording rules
- 'job:http_error_rate:ratio_5m'
- 'up'
static_configs:
- targets: ["prometheus-east.internal:9090"]
Gotcha: Never federate raw metrics with
match[]={__name__=~".+"}. Each federation scrape evaluates the match selectors -- on a Prometheus with 1M+ series, that's 10--30 seconds of CPU per scrape. Federate recording rules only. For full-fidelity cross-cluster queries, use Thanos or Mimir.
Part 8: High Availability¶
A single Prometheus is a single point of failure for your monitoring. Here's how to fix that.
The Simple Approach: Two Identical Prometheus Instances¶
Run two Prometheus servers with the same config, scraping the same targets. Both independently collect and store all data. If one goes down, the other continues.
The problem: queries hit one instance, and their data diverges slightly (scrape timing differences, brief outages on one side). You need a query layer that deduplicates.
Thanos for Deduplication¶
┌─────────────┐ ┌─────────────┐
│ Prometheus-0 │ │ Prometheus-1 │ (same config, same targets)
│ + Sidecar │ │ + Sidecar │
└──────┬───────┘ └──────┬───────┘
│ │
└────────┬────────┘
│
┌────────▼────────┐
│ Thanos Querier │ ← deduplicates overlapping samples
└─────────────────┘
│
┌────────▼────────┐
│ Grafana │
└─────────────────┘
Thanos Querier knows that both Prometheus instances are replicas (via the replica label)
and deduplicates their samples. Grafana points at Thanos Querier instead of Prometheus
directly. If Prometheus-0 has a gap, Prometheus-1 fills it in.
Interview Bridge: "How do you make Prometheus highly available?" is a common interview question. The answer isn't "cluster Prometheus" (it doesn't cluster). It's "run two independent instances and deduplicate with Thanos or a similar query layer."
Exercises¶
Exercise 1: Read the TSDB Status (2 minutes)¶
If you have a Prometheus instance running:
curl -s http://localhost:9090/api/v1/status/tsdb | jq '.data.headStats'
Questions:
- How many active series does your instance have?
- What's the ratio of chunks to series? (Roughly 4:1 is normal for a 2-hour head block with 15s scrapes)
Don't have a running Prometheus? Here's what to look for.
`numSeries` is the count of active time series in the head block. Multiply by roughly 2 KB per series for a rough memory estimate of the head block's contribution. `chunkCount` is the number of in-memory compressed sample chunks. Each chunk holds ~120 samples.

Exercise 2: Find Your Cardinality Hogs (5 minutes)¶
curl -s http://localhost:9090/api/v1/status/tsdb | \
jq -r '.data.seriesCountByMetricName[:10][] | "\(.value)\t\(.name)"' | sort -rn
- Identify the top 3 metrics by series count.
- For each, explain why they have that many series (how many label dimensions, how many values per dimension).
- Is any metric suspiciously high?
Hint

Histograms naturally have more series (one per bucket). A histogram with 10 buckets, across 50 pods, is 500+ series just from the buckets. That's expected. What's NOT expected is a single metric with 100K+ series -- investigate its labels.

Exercise 3: Write a Cardinality Alert (10 minutes)¶
Write a Prometheus alerting rule that fires when any single metric name has more than
100,000 time series. Include:
- A for duration
- A severity label
- An annotation with a summary that includes the metric name and series count
Solution
groups:
- name: meta-monitoring
rules:
- alert: CardinalityExplosion
expr: count by (__name__) ({__name__=~".+"}) > 100000
for: 15m
labels:
severity: warning
annotations:
summary: "Metric {{ $labels.__name__ }} has {{ $value }} series -- investigate label cardinality"
runbook: "https://wiki.internal/runbooks/cardinality-explosion"
Exercise 4: Design Histogram Buckets (10 minutes)¶
Your service has an SLO of "99.5% of requests complete in under 300ms." The current
histogram uses default buckets: [.005, .01, .025, .05, .1, .25, .5, 1, 2.5, 5, 10].
- Why are the default buckets bad for this SLO?
- Design better bucket boundaries.
- How many additional time series do your new buckets create per label combination?
Solution
1. The default buckets have no boundary at 300ms. The nearest are 250ms and 500ms. `histogram_quantile` linearly interpolates, so the p99.5 calculation between these wide boundaries is inaccurate.
2. Better buckets: [.005, .01, .025, .05, .1, .2, .25, .3, .35, .5, 1, 2.5, 5, 10]
3. Three additional buckets (0.2, 0.3, 0.35) = 3 additional time series per unique label combination.

Cheat Sheet¶
TSDB Diagnostics¶
| Command | What it tells you |
|---|---|
| curl prometheus:9090/api/v1/status/tsdb \| jq '.data.headStats' | Active series, chunk count, head block time range |
| curl prometheus:9090/api/v1/status/tsdb \| jq '.data.seriesCountByMetricName[:10]' | Top 10 metrics by series count |
| curl prometheus:9090/api/v1/status/tsdb \| jq '.data.labelValueCountByLabelName[:10]' | Top 10 labels by unique value count |
| du -sh /prometheus/wal/ | WAL size (>5 GB = investigate) |
| du -sh /prometheus/chunks_head/ | Head block chunks on disk |
PromQL Quick Reference¶
| Pattern | Example |
|---|---|
| Counter rate | rate(http_requests_total[5m]) |
| Error ratio | sum(rate(errors[5m])) / sum(rate(total[5m])) |
| p99 latency | histogram_quantile(0.99, sum by (le) (rate(duration_bucket[5m]))) |
| Average latency | rate(duration_sum[5m]) / rate(duration_count[5m]) |
| Disk full prediction | predict_linear(node_filesystem_free_bytes[6h], 24*3600) < 0 |
| Missing metric | absent(up{job="my-service"}) |
Metric Types at a Glance¶
| Type | Goes up/down? | Use rate() on it? | Example |
|---|---|---|---|
| Counter | Up only (resets on restart) | Yes, always | http_requests_total |
| Gauge | Both | No -- use deriv() or threshold | node_memory_MemAvailable_bytes |
| Histogram | N/A (buckets are counters) | Yes, on the buckets | http_request_duration_seconds_bucket |
| Summary | N/A (quantiles are gauges) | No | rpc_duration_seconds{quantile="0.99"} |
Cardinality Rules of Thumb¶
| Guideline | Number |
|---|---|
| Healthy series per service | 1,000--5,000 |
| Max unique values per label | ~100 (strongly bounded) |
| Alert threshold for a single metric | >50,000 series |
| Series per histogram bucket per label combo | 1 |
Alerting Pipeline¶
Alert rule (Prometheus) → for duration → Alertmanager
→ route matching (label tree) → grouping (batch related alerts)
→ inhibition (suppress downstream) → receiver (Slack, PagerDuty)
→ silence check → deliver or mute
Takeaways¶
- Prometheus memory is driven by active time series count. The head block keeps every active series in RAM. More series = more memory. The TSDB status API is your diagnostic starting point.
- Cardinality is multiplicative, not additive. One unbounded label (user IDs, raw paths) combined with histogram buckets creates millions of series. Never use unbounded values as Prometheus labels.
- rate() for alerting, irate() for dashboards. rate() smooths over the full range and handles counter resets. irate() reacts instantly but is noisy and produces false spikes at reset boundaries.
- Histogram bucket boundaries must include your SLO thresholds. histogram_quantile interpolates linearly between buckets. No bucket near your SLO = inaccurate percentile calculations.
- The alerting pipeline has four distinct failure points: rule evaluation, Alertmanager routing, receiver delivery, and silence/inhibition misconfiguration. Test each one independently.
- Prometheus doesn't cluster -- it replicates. For high availability, run two independent instances and deduplicate with Thanos. For long-term storage, use remote write to Mimir, Thanos, or Cortex.
Related Lessons¶
- Prometheus and the Art of Not Alerting -- Alert design philosophy, SLO-based alerting, reducing alert fatigue
- Grafana Dashboards That Don't Lie -- Dashboard design, panel types, avoiding misleading visualizations
- The Monitoring That Lied -- Incident-driven lesson on monitoring failures and blind spots
- SLOs: When Good Enough Is a Number -- Error budgets, burn-rate alerting, SLI/SLO/SLA
- OpenTelemetry: Following a Request Across Services -- Distributed tracing, the third observability pillar
- Deploy a Web App From Nothing -- Build-up lesson that includes setting up monitoring from scratch