Portal | Level: L2: Operations | Topics: Prometheus Deep Dive, Prometheus | Domain: Observability
Prometheus Deep Dive - Primer¶
Why This Matters¶
Timeline: Prometheus was created at SoundCloud in 2012 by ex-Googlers Matt T. Proud and Julius Volz, inspired by Google's internal Borgmon monitoring system. It joined the CNCF as its second hosted project (after Kubernetes) in 2016 and graduated in 2018.
Prometheus is the dominant open-source monitoring system for infrastructure and application metrics. It powers alerting at companies running Kubernetes, microservices, and cloud-native stacks. When Prometheus works, you have dashboards showing request rates, error percentages, and latency percentiles updating in real time. When it breaks — targets disappear, queries time out, cardinality explodes, or alerts misfire — you lose visibility into everything else. Understanding Prometheus internals is the difference between operating a monitoring system and being operated by it.
Pull-Based Model¶
Prometheus scrapes metrics from targets over HTTP. Each target exposes a /metrics endpoint in a text-based exposition format. Prometheus periodically pulls from every target, parses the response, and stores the samples in its local time-series database (TSDB).
┌───────────────┐    GET /metrics     ┌──────────────┐
│  Prometheus   │ ──────────────────► │ Application  │
│    Server     │     every 15s       │ :8080/metrics│
└───────────────┘                     └──────────────┘
        │
        ├── GET /metrics ──── node_exporter:9100
        ├── GET /metrics ──── cadvisor:8080
        └── GET /metrics ──── blackbox_exporter:9115
Why pull instead of push? Pull-based collection means Prometheus controls the rate, can detect when a target is down (scrape failure = target unhealthy), and does not require targets to know where Prometheus lives. Targets are stateless metric endpoints.
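The target side of this contract is just an HTTP handler returning plain text. Here is a minimal sketch using only the Python standard library; a real service would use an official Prometheus client library, and the port and metric value are illustrative:

```python
from http.server import BaseHTTPRequestHandler, HTTPServer

REQUESTS_TOTAL = 145  # illustrative counter value; a real app increments this

def render_metrics() -> str:
    # Prometheus text exposition format: HELP/TYPE comments, then samples
    return (
        "# HELP http_requests_total Total HTTP requests\n"
        "# TYPE http_requests_total counter\n"
        f'http_requests_total{{method="GET"}} {REQUESTS_TOTAL}\n'
    )

class MetricsHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        if self.path == "/metrics":
            body = render_metrics().encode()
            self.send_response(200)
            self.send_header("Content-Type", "text/plain; version=0.0.4")
            self.send_header("Content-Length", str(len(body)))
            self.end_headers()
            self.wfile.write(body)
        else:
            self.send_response(404)
            self.end_headers()

def serve(port: int = 8080) -> None:
    HTTPServer(("", port), MetricsHandler).serve_forever()

# serve()  # blocks; call this in a real service
```

Note the endpoint is stateless and knows nothing about the Prometheus server scraping it.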
Metric Types¶
Counter¶
A monotonically increasing value. It only goes up (or resets to zero on process restart). Use for: total requests, total errors, bytes sent.
# HELP http_requests_total Total HTTP requests
# TYPE http_requests_total counter
http_requests_total{method="GET",handler="/api/users",status="200"} 145232
http_requests_total{method="GET",handler="/api/users",status="500"} 37
http_requests_total{method="POST",handler="/api/orders",status="201"} 8842
You almost never alert on the raw counter value. You alert on the rate:
# Requests per second over the last 5 minutes
rate(http_requests_total[5m])
# Error rate as a percentage
sum(rate(http_requests_total{status=~"5.."}[5m]))
/
sum(rate(http_requests_total[5m]))
Gauge¶
A value that can go up and down. Use for: current temperature, memory usage, queue depth, active connections.
# HELP node_memory_MemAvailable_bytes Available memory
# TYPE node_memory_MemAvailable_bytes gauge
node_memory_MemAvailable_bytes 4.294967296e+09
# Current memory usage percentage
1 - (node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes)
# Rate of change of disk free space (is it shrinking?)
deriv(node_filesystem_free_bytes{mountpoint="/"}[1h])
Histogram¶
Samples observations and counts them in configurable buckets. Tracks count, sum, and bucket boundaries. Server-side aggregatable.
# HELP http_request_duration_seconds Request latency
# TYPE http_request_duration_seconds histogram
http_request_duration_seconds_bucket{handler="/api/users",le="0.005"} 12000
http_request_duration_seconds_bucket{handler="/api/users",le="0.01"} 14500
http_request_duration_seconds_bucket{handler="/api/users",le="0.025"} 15200
http_request_duration_seconds_bucket{handler="/api/users",le="0.05"} 15400
http_request_duration_seconds_bucket{handler="/api/users",le="0.1"} 15450
http_request_duration_seconds_bucket{handler="/api/users",le="0.25"} 15480
http_request_duration_seconds_bucket{handler="/api/users",le="0.5"} 15490
http_request_duration_seconds_bucket{handler="/api/users",le="1"} 15495
http_request_duration_seconds_bucket{handler="/api/users",le="+Inf"} 15500
http_request_duration_seconds_sum{handler="/api/users"} 103.42
http_request_duration_seconds_count{handler="/api/users"} 15500
# p99 latency over the last 5 minutes
histogram_quantile(0.99, rate(http_request_duration_seconds_bucket[5m]))
# p50 latency (median)
histogram_quantile(0.50, rate(http_request_duration_seconds_bucket[5m]))
# Average latency
rate(http_request_duration_seconds_sum[5m]) / rate(http_request_duration_seconds_count[5m])
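histogram_quantile() works by finding the bucket the requested rank falls into, then linearly interpolating within that bucket. A simplified Python sketch of the algorithm, using the cumulative bucket counts from the exposition example above (real Prometheus operates on rates of these counters and handles edge cases this sketch omits):

```python
def histogram_quantile(q, buckets):
    """buckets: sorted (upper_bound, cumulative_count) pairs, ending at +Inf."""
    total = buckets[-1][1]
    rank = q * total
    prev_bound, prev_count = 0.0, 0  # lowest bucket is assumed to start at 0
    for bound, count in buckets:
        if count >= rank:
            if bound == float("inf"):
                return prev_bound  # quantile falls in the +Inf bucket
            # Linear interpolation within the bucket
            return prev_bound + (bound - prev_bound) * (rank - prev_count) / (count - prev_count)
        prev_bound, prev_count = bound, count

buckets = [(0.005, 12000), (0.01, 14500), (0.025, 15200), (0.05, 15400),
           (0.1, 15450), (0.25, 15480), (0.5, 15490), (1.0, 15495),
           (float("inf"), 15500)]
print(histogram_quantile(0.99, buckets))  # ≈ 0.043: p99 falls in the (0.025, 0.05] bucket
```

The interpolation is why bucket boundaries matter: a p99 landing in a wide bucket is only as precise as that bucket.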
Summary¶
Like a histogram, but computes quantiles client-side. The quantile values are pre-calculated by the application and cannot be aggregated across instances.
# HELP rpc_duration_seconds RPC latency
# TYPE rpc_duration_seconds summary
rpc_duration_seconds{quantile="0.5"} 0.012
rpc_duration_seconds{quantile="0.9"} 0.035
rpc_duration_seconds{quantile="0.99"} 0.148
rpc_duration_seconds_sum 2847.3
rpc_duration_seconds_count 98000
Histogram vs Summary: Use histograms unless you have a specific reason for summaries. Histograms allow server-side percentile calculation, aggregation across instances, and changing bucket boundaries without redeploying. Summaries are cheaper on the server but cannot be aggregated (you cannot combine p99 from 10 instances into a global p99).
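A tiny numeric demonstration of why per-instance summary quantiles cannot be combined: averaging two p99s is not the p99 of the combined traffic. The data and the nearest-rank quantile here are illustrative:

```python
import math

def p99(values):
    values = sorted(values)
    return values[math.ceil(0.99 * len(values)) - 1]  # nearest-rank p99

fast = [0.010] * 1000                # instance A: uniformly fast
slow = [0.010] * 900 + [2.0] * 100   # instance B: 10% very slow requests

avg_of_p99s = (p99(fast) + p99(slow)) / 2
true_p99 = p99(fast + slow)
print(avg_of_p99s, true_p99)  # averaged: 1.005, true combined p99: 2.0
```

The averaged value understates the real tail by half. With histograms, Prometheus can sum the bucket counters across instances first and compute the quantile once, which gives the correct combined result.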
PromQL¶
Selectors¶
# Exact match
http_requests_total{method="GET"}
# Regex match
http_requests_total{status=~"5.."}
# Negative match
http_requests_total{handler!="/health"}
# Combine
http_requests_total{method="POST", status=~"4..", handler!~"/internal/.*"}
Range Vectors and rate()¶
# rate(): per-second rate of counter increase, averaged over the range
rate(http_requests_total[5m])
# irate(): per-second rate using the last two data points only
# More responsive to spikes, noisier — use for dashboards, not alerting
irate(http_requests_total[5m])
# increase(): total increase over the range (= rate * range_seconds)
increase(http_requests_total[1h])
rate() handles counter resets gracefully. It detects when a counter drops (process restart) and compensates. Always use rate() or increase() on counters — never use raw counter values in alerts.
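The reset compensation can be sketched in a few lines, assuming (timestamp, value) samples from a single counter series. Real rate() also extrapolates to the window boundaries, which this sketch skips:

```python
def counter_increase(samples):
    """Total increase over the window, compensating for counter resets."""
    increase = 0.0
    for (_, prev), (_, curr) in zip(samples, samples[1:]):
        if curr >= prev:
            increase += curr - prev
        else:
            # Counter dropped: the process restarted and the counter reset to ~0,
            # so the whole new value counts as increase.
            increase += curr
    return increase

def per_second_rate(samples):
    span = samples[-1][0] - samples[0][0]
    return counter_increase(samples) / span

# Reset between t=15 and t=30: counter fell from 130 to 10
samples = [(0, 100), (15, 130), (30, 10)]
print(per_second_rate(samples))  # ≈ 1.33 requests/s: (30 + 10) / 30s
```

Without the compensation branch, the restart would produce a large negative rate and a false alert.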
Aggregations¶
# Total request rate across all instances
sum(rate(http_requests_total[5m]))
# Request rate grouped by status code
sum by (status) (rate(http_requests_total[5m]))
# Top 5 handlers by request rate
topk(5, sum by (handler) (rate(http_requests_total[5m])))
# 95th percentile of busy-CPU rate across instances (it is a counter, so take rate() first)
quantile(0.95, sum by (instance) (rate(node_cpu_seconds_total{mode!="idle"}[5m])))
# Count how many targets are up
count(up == 1)
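sum by (...) groups series on the kept labels and sums within each group. A minimal sketch of those semantics; the series data is illustrative:

```python
from collections import defaultdict

def sum_by(series, keep):
    """series: list of (labels_dict, value) pairs.
    Returns {kept_label_pairs_tuple: summed_value}."""
    out = defaultdict(float)
    for labels, value in series:
        key = tuple(sorted((k, labels[k]) for k in keep))
        out[key] += value
    return dict(out)

series = [
    ({"status": "200", "instance": "a"}, 120.0),
    ({"status": "200", "instance": "b"}, 80.0),
    ({"status": "500", "instance": "a"}, 3.0),
]
print(sum_by(series, ["status"]))  # 200.0 for status="200", 3.0 for status="500"
```

Labels not listed in the grouping key (here, instance) are discarded, which is exactly how per-instance series collapse into one series per status code.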
Functions¶
# Predict when disk fills up (linear extrapolation)
predict_linear(node_filesystem_free_bytes{mountpoint="/"}[6h], 24*3600) < 0
# Check if a metric exists (useful for alerting on missing scrapes)
absent(up{job="api-server"})
# Clamp values
clamp_min(free_memory_bytes, 0)
# Label manipulation
label_replace(up, "short_instance", "$1", "instance", "(.*):.*")
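Under the hood, predict_linear is a least-squares fit over the range window, extrapolated forward. A standard-library sketch of the idea, with illustrative sample data:

```python
def predict_linear(samples, t_ahead):
    """samples: (timestamp_seconds, value) pairs. Returns the predicted value
    t_ahead seconds after the last sample, via simple linear regression."""
    n = len(samples)
    xs = [ts for ts, _ in samples]
    ys = [v for _, v in samples]
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    slope = (sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
             / sum((x - mean_x) ** 2 for x in xs))
    intercept = mean_y - slope * mean_x
    return slope * (xs[-1] + t_ahead) + intercept

# Disk free space shrinking 1 GB/hour over a 6-hour window
samples = [(h * 3600, 50e9 - h * 1e9) for h in range(6)]
print(predict_linear(samples, 24 * 3600))  # ≈ 21e9 bytes still free in 24h
```

Because it is a linear fit, the prediction is only as good as the trend: a sudden log flood will fill the disk far sooner than the extrapolation suggests, which is why the DiskFillingUp alert later in this page pairs predict_linear with a current-usage condition.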
Scrape Configuration¶
# prometheus.yml
global:
  scrape_interval: 15s
  evaluation_interval: 15s
scrape_configs:
  # Static targets
  - job_name: "node-exporter"
    static_configs:
      - targets:
          - "node1:9100"
          - "node2:9100"
          - "node3:9100"
  # Kubernetes service discovery (pods)
  - job_name: "kubernetes-pods"
    kubernetes_sd_configs:
      - role: pod
    relabel_configs:
      # Only scrape pods with the annotation prometheus.io/scrape: "true"
      - source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_scrape]
        action: keep
        regex: true
      # Use the prometheus.io/path annotation as the metrics path
      - source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_path]
        action: replace
        target_label: __metrics_path__
        regex: (.+)
      # Use the prometheus.io/port annotation for the target port
      - source_labels: [__address__, __meta_kubernetes_pod_annotation_prometheus_io_port]
        action: replace
        regex: ([^:]+)(?::\d+)?;(\d+)
        replacement: $1:$2
        target_label: __address__
      # Add pod namespace and name as labels
      - source_labels: [__meta_kubernetes_namespace]
        target_label: namespace
      - source_labels: [__meta_kubernetes_pod_name]
        target_label: pod
  # Consul service discovery
  - job_name: "consul-services"
    consul_sd_configs:
      - server: "consul.service.consul:8500"
        services: []
    relabel_configs:
      - source_labels: [__meta_consul_tags]
        regex: .*,prometheus,.*
        action: keep
Relabeling¶
Relabeling transforms labels before (relabel_configs) or after (metric_relabel_configs) scraping:
# Drop high-cardinality metrics at scrape time
metric_relabel_configs:
  - source_labels: [__name__]
    regex: "go_gc_.*"
    action: drop
  # Remove a label to reduce cardinality
  - regex: "instance_id"
    action: labeldrop
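The keep/drop/labeldrop actions can be simulated in a few lines. Note that Prometheus fully anchors relabel regexes, which is why fullmatch is used here; the series and rules are illustrative:

```python
import re

def relabel(labels, rules):
    """Apply keep/drop/labeldrop rules to one series' label dict.
    Returns the (possibly reduced) labels, or None if the series is dropped."""
    for rule in rules:
        if rule["action"] in ("keep", "drop"):
            # Source label values are joined with ';' (the Prometheus default separator)
            value = ";".join(labels.get(l, "") for l in rule["source_labels"])
            matched = re.fullmatch(rule["regex"], value) is not None
            if rule["action"] == "keep" and not matched:
                return None
            if rule["action"] == "drop" and matched:
                return None
        elif rule["action"] == "labeldrop":
            labels = {k: v for k, v in labels.items()
                      if not re.fullmatch(rule["regex"], k)}
    return labels

series = {"__name__": "go_gc_duration_seconds", "instance_id": "i-abc123"}
rules = [{"action": "drop", "source_labels": ["__name__"], "regex": "go_gc_.*"}]
print(relabel(series, rules))  # None: series dropped at scrape time
```

The anchoring matters in practice: a relabel regex of `prometheus` does not match the tag list `,web,prometheus,`, which is why the Consul example above wraps it as `.*,prometheus,.*`.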
Alerting¶
Alerting Rules¶
# rules/api-alerts.yml
groups:
  - name: api-alerts
    rules:
      - alert: HighErrorRate
        expr: |
          sum(rate(http_requests_total{status=~"5.."}[5m]))
            /
          sum(rate(http_requests_total[5m]))
          > 0.05
        for: 5m
        labels:
          severity: critical
          team: platform
        annotations:
          summary: "High 5xx error rate: {{ $value | humanizePercentage }}"
          runbook: "https://wiki.internal/runbooks/high-error-rate"
      - alert: TargetDown
        expr: up == 0
        for: 3m
        labels:
          severity: warning
        annotations:
          summary: "Target {{ $labels.instance }} is down"
      - alert: DiskFillingUp
        expr: |
          predict_linear(node_filesystem_free_bytes{mountpoint="/"}[6h], 24*3600) < 0
          and
          node_filesystem_free_bytes{mountpoint="/"} / node_filesystem_size_bytes{mountpoint="/"} < 0.3
        for: 30m
        labels:
          severity: warning
        annotations:
          summary: "Disk on {{ $labels.instance }} predicted full within 24h"
Alertmanager¶
Alertmanager receives alerts from Prometheus and routes, groups, deduplicates, and delivers notifications.
# alertmanager.yml
global:
  resolve_timeout: 5m
  slack_api_url: "https://hooks.slack.com/services/T00/B00/xxxx"
route:
  receiver: default-slack
  group_by: [alertname, cluster, namespace]
  group_wait: 30s
  group_interval: 5m
  repeat_interval: 4h
  routes:
    - match:
        severity: critical
      receiver: pagerduty-critical
      continue: true
    - match:
        severity: critical
      receiver: slack-critical
    - match:
        team: database
      receiver: slack-database
receivers:
  - name: default-slack
    slack_configs:
      - channel: "#alerts"
        title: '{{ .GroupLabels.alertname }}'
        text: '{{ range .Alerts }}{{ .Annotations.summary }}{{ end }}'
  - name: pagerduty-critical
    pagerduty_configs:
      - service_key: "<pagerduty-integration-key>"
        severity: critical
  - name: slack-critical
    slack_configs:
      - channel: "#alerts-critical"
  - name: slack-database
    slack_configs:
      - channel: "#database-alerts"
inhibit_rules:
  - source_match:
      severity: critical
    target_match:
      severity: warning
    equal: [alertname, cluster, namespace]
Grouping: Combines multiple alerts with the same labels into a single notification. Prevents notification storms.
Inhibition: When a critical alert fires, suppress the corresponding warning. Reduces noise during incidents.
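The inhibition decision reduces to three checks: the target alert matches target_match, some currently firing alert matches source_match, and the two agree on every equal label. A minimal sketch with illustrative alert label sets:

```python
def inhibited(target, sources, rule):
    """True if `target` should be suppressed given firing `sources`."""
    if not all(target.get(k) == v for k, v in rule["target_match"].items()):
        return False
    for src in sources:
        if (all(src.get(k) == v for k, v in rule["source_match"].items())
                and all(src.get(k) == target.get(k) for k in rule["equal"])):
            return True
    return False

rule = {"source_match": {"severity": "critical"},
        "target_match": {"severity": "warning"},
        "equal": ["alertname", "cluster", "namespace"]}
firing = [{"alertname": "HighErrorRate", "severity": "critical",
           "cluster": "prod", "namespace": "api"}]
warning = {"alertname": "HighErrorRate", "severity": "warning",
           "cluster": "prod", "namespace": "api"}
print(inhibited(warning, firing, rule))  # True: the warning is suppressed
```

The equal labels are what scope the suppression: a critical alert in prod does not silence the same warning in staging.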
Silences: Temporary mutes for planned maintenance:
# Create a silence via the API
amtool silence add alertname="DiskFillingUp" instance="node3:9100" \
--comment="Replacing disk on node3" --duration=4h
Recording Rules¶
Pre-compute expensive queries and store results as new time series. Reduces query time for dashboards and alerts.
# rules/recording-rules.yml
groups:
  - name: request-rate-recording
    interval: 30s
    rules:
      - record: job:http_requests_total:rate5m
        expr: sum by (job) (rate(http_requests_total[5m]))
      - record: job:http_request_duration_seconds:p99_5m
        expr: histogram_quantile(0.99, sum by (job, le) (rate(http_request_duration_seconds_bucket[5m])))
      - record: job:http_error_rate:ratio_5m
        expr: |
          sum by (job) (rate(http_requests_total{status=~"5.."}[5m]))
            /
          sum by (job) (rate(http_requests_total[5m]))
Use recording rules when:
- A query is used in multiple dashboards or alerts
- The query takes >2 seconds to evaluate
- The query uses histogram_quantile across many series
Prometheus in Kubernetes¶
Prometheus Operator¶
The Prometheus Operator manages Prometheus instances via Kubernetes CRDs:
# ServiceMonitor — tells Prometheus to scrape a service
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: api-service
  labels:
    team: platform
spec:
  selector:
    matchLabels:
      app: api-service
  endpoints:
    - port: metrics
      interval: 15s
      path: /metrics
  namespaceSelector:
    matchNames:
      - production
---
# PodMonitor — scrape pods directly (no service required)
apiVersion: monitoring.coreos.com/v1
kind: PodMonitor
metadata:
  name: batch-jobs
spec:
  selector:
    matchLabels:
      app: batch-processor
  podMetricsEndpoints:
    - port: metrics
      interval: 30s
# Check if Prometheus is discovering the ServiceMonitor
kubectl get servicemonitors -A
kubectl get prometheuses -A -o yaml | grep serviceMonitorSelector
# If a ServiceMonitor is not being picked up:
# 1. Check the Prometheus CR's serviceMonitorSelector matches the SM labels
# 2. Check the SM's namespaceSelector includes the target namespace
# 3. Check Prometheus has RBAC to read endpoints in the target namespace
Long-Term Storage¶
Prometheus local TSDB retains data for a configured period (default 15 days). For longer retention, use a remote backend.
Remote Write/Read¶
# prometheus.yml
remote_write:
  - url: "http://mimir:9009/api/v1/push"
    queue_config:
      max_samples_per_send: 5000
      batch_send_deadline: 5s
remote_read:
  - url: "http://mimir:9009/prometheus/api/v1/read"
    read_recent: false
Thanos, Cortex, and Mimir¶
| System | Architecture | Key Feature |
|---|---|---|
| Thanos | Sidecar per Prometheus, object storage | Global query view, deduplication, downsampling |
| Cortex | Multi-tenant, horizontally scalable | Managed service compatible, high availability |
| Mimir | Cortex successor (Grafana Labs) | Better performance, simpler ops, native multi-tenancy |
Federation is a simpler alternative for small-scale multi-cluster setups: a top-level Prometheus scrapes /federate from leaf Prometheus instances. Works for <10 clusters; beyond that, use Thanos or Mimir.
Cardinality Management¶
Cardinality = the total number of unique time series. Each unique combination of metric name + label values is a separate series.
http_requests_total{method="GET", handler="/api/users", status="200", instance="10.0.1.5:8080"}
http_requests_total{method="GET", handler="/api/users", status="200", instance="10.0.1.6:8080"}
These are 2 separate time series. If you add a user_id label with 100,000 unique users, you suddenly have 100,000x more series. This is a cardinality explosion.
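Because every label multiplies the series count, a back-of-envelope product is often enough to catch an explosion before it ships. The label counts below are illustrative:

```python
# Plausible bounded labels for http_requests_total
methods, handlers, statuses, instances = 5, 40, 10, 20
base_series = methods * handlers * statuses * instances
print(base_series)  # 40000 series: manageable

# Adding a user_id label with 100,000 distinct values multiplies everything
print(base_series * 100_000)  # 4000000000 series: a cardinality explosion
```

The product is a worst case (not every combination occurs in practice), but it is the right mental model: one unbounded label dominates everything else.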
# Top metrics by series count (TSDB status API)
curl -s http://prometheus:9090/api/v1/status/tsdb | jq '.data.seriesCountByMetricName[:10]'
# Overall head-block stats (total series, chunks, label pairs)
curl -s http://prometheus:9090/api/v1/status/tsdb | jq '.data.headStats'
Rules:
- Never use unbounded labels (user IDs, request IDs, UUIDs, email addresses)
- Keep label value sets small and bounded (HTTP methods, status codes, handler paths)
- Use metric_relabel_configs to drop or aggregate high-cardinality labels at scrape time
Instrumentation Best Practices¶
RED Method (for request-driven services)¶
- Rate: requests per second — rate(http_requests_total[5m])
- Errors: error rate — rate(http_requests_total{status=~"5.."}[5m])
- Duration: latency — histogram_quantile(0.99, rate(http_request_duration_seconds_bucket[5m]))
USE Method (for resources: CPU, memory, disk, network)¶
- Utilization: how full is it — node_cpu_seconds_total, node_memory_MemAvailable_bytes
- Saturation: how overloaded — node_load1, queue lengths
- Errors: error counts — node_disk_io_errors_total, node_network_receive_errs_total
Metric Naming Conventions¶
# Good: namespace_subsystem_name_unit
http_request_duration_seconds
node_disk_read_bytes_total
process_resident_memory_bytes
# Bad: no unit, ambiguous
request_latency # seconds? milliseconds?
disk_usage # bytes? percent?
memory # available? used? total?
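Beyond units, names in the classic Prometheus data model must match a fixed pattern: letters, digits, underscores, and colons, with colons conventionally reserved for recording rules. A quick validator sketch of that pattern:

```python
import re

# Metric name pattern from the classic Prometheus data model
METRIC_NAME_RE = re.compile(r"[a-zA-Z_:][a-zA-Z0-9_:]*")

def valid_metric_name(name: str) -> bool:
    return METRIC_NAME_RE.fullmatch(name) is not None

print(valid_metric_name("http_request_duration_seconds"))  # True
print(valid_metric_name("http.request.latency"))           # False: dots not allowed
```

(Recent Prometheus releases add opt-in UTF-8 metric names, but instrumenting to the classic pattern keeps metrics portable across the ecosystem.)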
Pushgateway¶
For short-lived batch jobs that cannot be scraped (they exit before Prometheus scrapes them).
# Push a metric from a batch job
echo "batch_job_duration_seconds 42.5" | curl --data-binary @- http://pushgateway:9091/metrics/job/nightly-etl/instance/etl-server
# Push multiple metrics
cat <<'EOF' | curl --data-binary @- http://pushgateway:9091/metrics/job/nightly-etl
# TYPE batch_job_duration_seconds gauge
batch_job_duration_seconds 42.5
# TYPE batch_job_records_processed gauge
batch_job_records_processed 150000
# TYPE batch_job_last_success_timestamp gauge
batch_job_last_success_timestamp 1711036800
EOF
Default trap: The Pushgateway has no built-in TTL for metrics. A batch job that crashes before pushing a "last success" timestamp will leave stale metrics that look healthy indefinitely. Always push a batch_job_last_success_timestamp and alert when it is too old.
Warning: Pushgateway metrics are sticky. If the batch job stops pushing, the last value persists indefinitely. The Pushgateway is not a substitute for scraping long-running services. Use it only for true batch/cron workloads.
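For the last-success pattern, a push from a batch job needs nothing beyond the standard library. A sketch assuming a Pushgateway at pushgateway:9091 (hypothetical address, matching the examples above):

```python
import time
import urllib.request

PUSHGATEWAY = "http://pushgateway:9091"  # assumed address

def last_success_body() -> bytes:
    # Gauge holding the unix time of the last successful run
    return f"batch_job_last_success_timestamp {time.time()}\n".encode()

def push_last_success(job: str) -> None:
    req = urllib.request.Request(
        f"{PUSHGATEWAY}/metrics/job/{job}",
        data=last_success_body(),
        method="POST",
    )
    urllib.request.urlopen(req)

# push_last_success("nightly-etl")  # requires a reachable Pushgateway
```

The matching alert is then a staleness check, along the lines of time() - batch_job_last_success_timestamp > 86400 for a daily job.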
Wiki Navigation¶
Prerequisites¶
- Observability Deep Dive (Topic Pack, L2)
Related Content¶
- Adversarial Interview Gauntlet (30 sequences) (Scenario, L2) — Prometheus
- Alerting Rules (Topic Pack, L2) — Prometheus
- Alerting Rules Drills (Drill, L2) — Prometheus
- Capacity Planning (Topic Pack, L2) — Prometheus
- Case Study: Disk Full — Runaway Logs, Fix Is Loki Retention (Case Study, L2) — Prometheus
- Case Study: Grafana Dashboard Empty — Prometheus Blocked by NetworkPolicy (Case Study, L2) — Prometheus
- Datadog Flashcards (CLI) (flashcard_deck, L1) — Prometheus
- Incident Simulator (18 scenarios) (CLI) (Exercise Set, L2) — Prometheus
- Interview: Prometheus Target Down (Scenario, L2) — Prometheus
- Lab: Prometheus Target Down (CLI) (Lab, L2) — Prometheus
Pages that link here¶
- Anti-Primer: Prometheus Deep Dive
- Certification Prep: PCA — Prometheus Certified Associate
- Comparison: Metrics Platforms
- Production Readiness Review: Answer Key
- Production Readiness Review: Study Plans
- Prometheus Deep Dive
- Scenario: Prometheus Says Target Down
- Thinking Out Loud: Prometheus Deep Dive