Prometheus Deep Dive - Primer

Why This Matters

Timeline: Prometheus was created at SoundCloud in 2012 by ex-Googlers Matt T. Proud and Julius Volz, inspired by Google's internal Borgmon monitoring system. It joined the CNCF as its second hosted project (after Kubernetes) in 2016 and graduated in 2018.

Prometheus is the dominant open-source monitoring system for infrastructure and application metrics. It powers alerting at companies running Kubernetes, microservices, and cloud-native stacks. When Prometheus works, you have dashboards showing request rates, error percentages, and latency percentiles updating in real time. When it breaks — targets disappear, queries time out, cardinality explodes, or alerts misfire — you lose visibility into everything else. Understanding Prometheus internals is the difference between operating a monitoring system and being operated by it.

Pull-Based Model

Prometheus scrapes metrics from targets over HTTP. Each target exposes a /metrics endpoint in a text-based exposition format. Prometheus periodically pulls from every target, parses the response, and stores the samples in its local time-series database (TSDB).

┌───────────────┐     GET /metrics      ┌──────────────┐
│  Prometheus   │ ────────────────────── │  Application  │
│  Server       │      every 15s        │  :8080/metrics│
└───────────────┘                        └──────────────┘
       ├── GET /metrics ──── node_exporter:9100
       ├── GET /metrics ──── cadvisor:8080
       └── GET /metrics ──── blackbox_exporter:9115

Why pull instead of push? Pull-based collection means Prometheus controls the rate, can detect when a target is down (scrape failure = target unhealthy), and does not require targets to know where Prometheus lives. Targets are stateless metric endpoints.
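The target side of this contract is just an HTTP handler that returns plain text. A minimal sketch using only the Python standard library (a real service would use the official prometheus_client library; the port, metric name, and counter wiring here are illustrative):

```python
# Minimal scrape target: serves the Prometheus text exposition format
# at /metrics using only the standard library.
from http.server import BaseHTTPRequestHandler, HTTPServer

REQUEST_COUNT = 0  # in a real app, incremented by request-handling code


def render_metrics() -> str:
    # One HELP/TYPE pair per metric, then one line per labeled series.
    return (
        "# HELP http_requests_total Total HTTP requests\n"
        "# TYPE http_requests_total counter\n"
        f'http_requests_total{{method="GET"}} {REQUEST_COUNT}\n'
    )


class MetricsHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        if self.path != "/metrics":
            self.send_error(404)
            return
        body = render_metrics().encode()
        self.send_response(200)
        self.send_header("Content-Type", "text/plain; version=0.0.4")
        self.end_headers()
        self.wfile.write(body)


# To serve: HTTPServer(("", 8080), MetricsHandler).serve_forever()
```

Pointing a scrape job at this endpoint is all Prometheus needs; the server keeps no scrape state of its own.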

Metric Types

Counter

A monotonically increasing value. It only goes up (or resets to zero on process restart). Use for: total requests, total errors, bytes sent.

# HELP http_requests_total Total HTTP requests
# TYPE http_requests_total counter
http_requests_total{method="GET",handler="/api/users",status="200"} 145232
http_requests_total{method="GET",handler="/api/users",status="500"} 37
http_requests_total{method="POST",handler="/api/orders",status="201"} 8842

You almost never alert on the raw counter value. You alert on the rate:

# Requests per second over the last 5 minutes
rate(http_requests_total[5m])

# Error rate as a percentage
sum(rate(http_requests_total{status=~"5.."}[5m]))
/
sum(rate(http_requests_total[5m]))

Gauge

A value that can go up and down. Use for: current temperature, memory usage, queue depth, active connections.

# HELP node_memory_MemAvailable_bytes Available memory
# TYPE node_memory_MemAvailable_bytes gauge
node_memory_MemAvailable_bytes 4.294967296e+09

# Current memory usage percentage
1 - (node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes)

# Rate of change of disk free space (is it shrinking?)
deriv(node_filesystem_free_bytes{mountpoint="/"}[1h])

Histogram

Samples observations and counts them in configurable buckets. Tracks count, sum, and bucket boundaries. Server-side aggregatable.

# HELP http_request_duration_seconds Request latency
# TYPE http_request_duration_seconds histogram
http_request_duration_seconds_bucket{handler="/api/users",le="0.005"} 12000
http_request_duration_seconds_bucket{handler="/api/users",le="0.01"} 14500
http_request_duration_seconds_bucket{handler="/api/users",le="0.025"} 15200
http_request_duration_seconds_bucket{handler="/api/users",le="0.05"} 15400
http_request_duration_seconds_bucket{handler="/api/users",le="0.1"} 15450
http_request_duration_seconds_bucket{handler="/api/users",le="0.25"} 15480
http_request_duration_seconds_bucket{handler="/api/users",le="0.5"} 15490
http_request_duration_seconds_bucket{handler="/api/users",le="1"} 15495
http_request_duration_seconds_bucket{handler="/api/users",le="+Inf"} 15500
http_request_duration_seconds_sum{handler="/api/users"} 103.42
http_request_duration_seconds_count{handler="/api/users"} 15500

# p99 latency over the last 5 minutes
histogram_quantile(0.99, rate(http_request_duration_seconds_bucket[5m]))

# p50 latency (median)
histogram_quantile(0.50, rate(http_request_duration_seconds_bucket[5m]))

# Average latency
rate(http_request_duration_seconds_sum[5m]) / rate(http_request_duration_seconds_count[5m])

Summary

Like a histogram, but computes quantiles client-side. The quantile values are pre-calculated by the application and cannot be aggregated across instances.

# HELP rpc_duration_seconds RPC latency
# TYPE rpc_duration_seconds summary
rpc_duration_seconds{quantile="0.5"} 0.012
rpc_duration_seconds{quantile="0.9"} 0.035
rpc_duration_seconds{quantile="0.99"} 0.148
rpc_duration_seconds_sum 2847.3
rpc_duration_seconds_count 98000

Histogram vs Summary: Use histograms unless you have a specific reason for summaries. Histograms allow server-side percentile calculation, aggregation across instances, and changing bucket boundaries without redeploying. Summaries are cheaper on the server but cannot be aggregated (you cannot combine p99 from 10 instances into a global p99).
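The aggregation claim can be made concrete: cumulative bucket counts from several instances simply add together per le boundary, after which a quantile can be estimated with the same linear interpolation that histogram_quantile() performs. A simplified sketch with made-up bucket data (real Prometheus also handles edge cases like empty or mismatched buckets):

```python
# Why histograms aggregate and summaries do not: bucket counts are
# additive, pre-computed quantile values are not.

def merge_buckets(a: dict, b: dict) -> dict:
    # Assumes both instances expose the same bucket layout (same le values).
    return {le: a[le] + b[le] for le in a}


def quantile(q: float, buckets: dict) -> float:
    """buckets maps upper bound -> cumulative count; must include float('inf')."""
    bounds = sorted(buckets)
    rank = q * buckets[bounds[-1]]  # target cumulative count
    prev_bound, prev_count = 0.0, 0
    for le in bounds:
        count = buckets[le]
        if count >= rank:
            if le == float("inf"):
                return prev_bound  # cap at the last finite bound
            # Linear interpolation inside the bucket, as Prometheus does.
            frac = (rank - prev_count) / (count - prev_count) if count > prev_count else 0.0
            return prev_bound + frac * (le - prev_bound)
        prev_bound, prev_count = le, count
    return prev_bound


inst_a = {0.1: 80, 0.5: 95, float("inf"): 100}
inst_b = {0.1: 10, 0.5: 60, float("inf"): 100}
merged = merge_buckets(inst_a, inst_b)  # {0.1: 90, 0.5: 155, inf: 200}
global_p50 = quantile(0.50, merged)
```

There is no analogous operation for two summaries: knowing that each instance's p50 is 0.012 tells you nothing exact about the combined p50.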

PromQL

Selectors

# Exact match
http_requests_total{method="GET"}

# Regex match
http_requests_total{status=~"5.."}

# Negative match
http_requests_total{handler!="/health"}

# Combine
http_requests_total{method="POST", status=~"4..", handler!~"/internal/.*"}

Range Vectors and rate()

# rate(): per-second rate of counter increase, averaged over the range
rate(http_requests_total[5m])

# irate(): per-second rate using the last two data points only
# More responsive to spikes, noisier — use for dashboards, not alerting
irate(http_requests_total[5m])

# increase(): total increase over the range (= rate * range_seconds)
increase(http_requests_total[1h])

rate() handles counter resets gracefully. It detects when a counter drops (process restart) and compensates. Always use rate() or increase() on counters — never use raw counter values in alerts.
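The reset handling can be sketched in a few lines. This is a simplification: the real rate() also extrapolates to the edges of the range window, but the reset logic is the core idea — a sample lower than its predecessor means the counter restarted from zero, so the new value counts as fresh increase:

```python
# Reset-aware counter increase over a list of successive samples.
def reset_aware_increase(samples: list) -> float:
    total = 0.0
    for prev, cur in zip(samples, samples[1:]):
        # On a reset (cur < prev), the counter restarted from zero,
        # so everything accumulated since the restart is "cur".
        total += cur if cur < prev else cur - prev
    return total


# A restart between 1500 and 40 makes the naive delta negative.
samples = [1000, 1500, 40, 90]
naive = samples[-1] - samples[0]          # -910, nonsense
correct = reset_aware_increase(samples)   # 500 + 40 + 50 = 590
```

This is also why raw counter values are useless in alerts: they encode process uptime as much as traffic.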

Aggregations

# Total request rate across all instances
sum(rate(http_requests_total[5m]))

# Request rate grouped by status code
sum by (status) (rate(http_requests_total[5m]))

# Top 5 handlers by request rate
topk(5, sum by (handler) (rate(http_requests_total[5m])))

# 95th percentile of CPU utilization across nodes (rate the counter first;
# a raw counter value is meaningless here)
quantile(0.95, 1 - avg by (instance) (rate(node_cpu_seconds_total{mode="idle"}[5m])))

# Count how many targets are up
count(up == 1)

Functions

# Predict when disk fills up (linear extrapolation)
predict_linear(node_filesystem_free_bytes{mountpoint="/"}[6h], 24*3600) < 0

# Check if a metric exists (useful for alerting on missing scrapes)
absent(up{job="api-server"})

# Clamp values
clamp_min(free_memory_bytes, 0)

# Label manipulation
label_replace(up, "short_instance", "$1", "instance", "(.*):.*")

Scrape Configuration

# prometheus.yml
global:
  scrape_interval: 15s
  evaluation_interval: 15s

scrape_configs:
  # Static targets
  - job_name: "node-exporter"
    static_configs:
      - targets:
          - "node1:9100"
          - "node2:9100"
          - "node3:9100"

  # Kubernetes service discovery (pods)
  - job_name: "kubernetes-pods"
    kubernetes_sd_configs:
      - role: pod
    relabel_configs:
      # Only scrape pods with the annotation prometheus.io/scrape: "true"
      - source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_scrape]
        action: keep
        regex: true
      # Use the prometheus.io/path annotation as the metrics path
      - source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_path]
        action: replace
        target_label: __metrics_path__
        regex: (.+)
      # Use the prometheus.io/port annotation for the target port
      - source_labels: [__address__, __meta_kubernetes_pod_annotation_prometheus_io_port]
        action: replace
        regex: ([^:]+)(?::\d+)?;(\d+)
        replacement: $1:$2
        target_label: __address__
      # Add pod namespace and name as labels
      - source_labels: [__meta_kubernetes_namespace]
        target_label: namespace
      - source_labels: [__meta_kubernetes_pod_name]
        target_label: pod

  # Consul service discovery
  - job_name: "consul-services"
    consul_sd_configs:
      - server: "consul.service.consul:8500"
        services: []
    relabel_configs:
      - source_labels: [__meta_consul_tags]
        regex: .*,prometheus,.*
        action: keep

Relabeling

Relabeling transforms labels before (relabel_configs) or after (metric_relabel_configs) scraping:

# Drop high-cardinality metrics at scrape time
metric_relabel_configs:
  - source_labels: [__name__]
    regex: "go_gc_.*"
    action: drop
  # Remove a label to reduce cardinality
  - regex: "instance_id"
    action: labeldrop

Alerting

Alerting Rules

# rules/api-alerts.yml
groups:
  - name: api-alerts
    rules:
      - alert: HighErrorRate
        expr: |
          sum(rate(http_requests_total{status=~"5.."}[5m]))
          /
          sum(rate(http_requests_total[5m]))
          > 0.05
        for: 5m
        labels:
          severity: critical
          team: platform
        annotations:
          summary: "High 5xx error rate: {{ $value | humanizePercentage }}"
          runbook: "https://wiki.internal/runbooks/high-error-rate"

      - alert: TargetDown
        expr: up == 0
        for: 3m
        labels:
          severity: warning
        annotations:
          summary: "Target {{ $labels.instance }} is down"

      - alert: DiskFillingUp
        expr: |
          predict_linear(node_filesystem_free_bytes{mountpoint="/"}[6h], 24*3600) < 0
          and
          node_filesystem_free_bytes{mountpoint="/"} / node_filesystem_size_bytes{mountpoint="/"} < 0.3
        for: 30m
        labels:
          severity: warning
        annotations:
          summary: "Disk on {{ $labels.instance }} predicted full within 24h"

Alertmanager

Alertmanager receives alerts from Prometheus and routes, groups, deduplicates, and delivers notifications.

# alertmanager.yml
global:
  resolve_timeout: 5m
  slack_api_url: "https://hooks.slack.com/services/T00/B00/xxxx"

route:
  receiver: default-slack
  group_by: [alertname, cluster, namespace]
  group_wait: 30s
  group_interval: 5m
  repeat_interval: 4h
  routes:
    - match:
        severity: critical
      receiver: pagerduty-critical
      continue: true
    - match:
        severity: critical
      receiver: slack-critical
    - match:
        team: database
      receiver: slack-database

receivers:
  - name: default-slack
    slack_configs:
      - channel: "#alerts"
        title: '{{ .GroupLabels.alertname }}'
        text: '{{ range .Alerts }}{{ .Annotations.summary }}{{ end }}'

  - name: pagerduty-critical
    pagerduty_configs:
      - service_key: "<pagerduty-integration-key>"
        severity: critical

  - name: slack-critical
    slack_configs:
      - channel: "#alerts-critical"

  - name: slack-database
    slack_configs:
      - channel: "#database-alerts"

inhibit_rules:
  - source_match:
      severity: critical
    target_match:
      severity: warning
    equal: [alertname, cluster, namespace]

Grouping: Combines multiple alerts with the same labels into a single notification. Prevents notification storms.

Inhibition: When a critical alert fires, suppress the corresponding warning. Reduces noise during incidents.

Silences: Temporary mutes for planned maintenance:

# Create a silence with amtool (which calls the Alertmanager API)
amtool silence add alertname="DiskFillingUp" instance="node3:9100" \
  --comment="Replacing disk on node3" --duration=4h

Recording Rules

Pre-compute expensive queries and store results as new time series. Reduces query time for dashboards and alerts.

# rules/recording-rules.yml
groups:
  - name: request-rate-recording
    interval: 30s
    rules:
      - record: job:http_requests_total:rate5m
        expr: sum by (job) (rate(http_requests_total[5m]))

      - record: job:http_request_duration_seconds:p99_5m
        expr: histogram_quantile(0.99, sum by (job, le) (rate(http_request_duration_seconds_bucket[5m])))

      - record: job:http_error_rate:ratio_5m
        expr: |
          sum by (job) (rate(http_requests_total{status=~"5.."}[5m]))
          /
          sum by (job) (rate(http_requests_total[5m]))

Use recording rules when:

  • A query is used in multiple dashboards or alerts
  • The query takes >2 seconds to evaluate
  • The query uses histogram_quantile across many series

Prometheus in Kubernetes

Prometheus Operator

The Prometheus Operator manages Prometheus instances via Kubernetes CRDs:

# ServiceMonitor — tells Prometheus to scrape a service
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: api-service
  labels:
    team: platform
spec:
  selector:
    matchLabels:
      app: api-service
  endpoints:
    - port: metrics
      interval: 15s
      path: /metrics
  namespaceSelector:
    matchNames:
      - production
---
# PodMonitor — scrape pods directly (no service required)
apiVersion: monitoring.coreos.com/v1
kind: PodMonitor
metadata:
  name: batch-jobs
spec:
  selector:
    matchLabels:
      app: batch-processor
  podMetricsEndpoints:
    - port: metrics
      interval: 30s

# Check if Prometheus is discovering the ServiceMonitor
kubectl get servicemonitors -A
kubectl get prometheuses -A -o yaml | grep serviceMonitorSelector

# If a ServiceMonitor is not being picked up:
# 1. Check the Prometheus CR's serviceMonitorSelector matches the SM labels
# 2. Check the SM's namespaceSelector includes the target namespace
# 3. Check Prometheus has RBAC to read endpoints in the target namespace

Long-Term Storage

Prometheus local TSDB retains data for a configured period (default 15 days). For longer retention, use a remote backend.

Remote Write/Read

# prometheus.yml
remote_write:
  - url: "http://mimir:9009/api/v1/push"
    queue_config:
      max_samples_per_send: 5000
      batch_send_deadline: 5s

remote_read:
  - url: "http://mimir:9009/prometheus/api/v1/read"
    read_recent: false

Thanos, Cortex, and Mimir

System  | Architecture                            | Key Feature
--------|-----------------------------------------|------------------------------------------------------
Thanos  | Sidecar per Prometheus, object storage  | Global query view, deduplication, downsampling
Cortex  | Multi-tenant, horizontally scalable     | Managed-service compatible, high availability
Mimir   | Cortex successor (Grafana Labs)         | Better performance, simpler ops, native multi-tenancy

Federation is a simpler alternative for small-scale multi-cluster setups: a top-level Prometheus scrapes /federate from leaf Prometheus instances. Works for <10 clusters; beyond that, use Thanos or Mimir.

Cardinality Management

Cardinality = the total number of unique time series. Each unique combination of metric name + label values is a separate series.

http_requests_total{method="GET", handler="/api/users", status="200", instance="10.0.1.5:8080"}
http_requests_total{method="GET", handler="/api/users", status="200", instance="10.0.1.6:8080"}

These are 2 separate time series. If you add a user_id label with 100,000 unique users, you suddenly have 100,000x more series. This is a cardinality explosion.
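A quick back-of-envelope check makes the multiplication obvious. The label counts below are illustrative, but the upper bound is what bites you in practice:

```python
# Total series for a metric is bounded by the product of its label
# cardinalities (observed combinations may be fewer, but the upper
# bound is what capacity planning must assume).
from math import prod


def series_upper_bound(label_cardinalities: dict) -> int:
    return prod(label_cardinalities.values())


safe = {"method": 5, "handler": 40, "status": 8, "instance": 20}
exploded = dict(safe, user_id=100_000)  # one unbounded label added

print(series_upper_bound(safe))       # 32000
print(series_upper_bound(exploded))   # 3200000000
```

One unbounded label turns a comfortable 32,000 series into 3.2 billion, which no single Prometheus can ingest.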

# Top metrics by series count (the usual cardinality culprits)
curl -s http://prometheus:9090/api/v1/status/tsdb | jq '.data.seriesCountByMetricName[:10]'

# Overall head-block stats (total series, chunks, label pairs)
curl -s http://prometheus:9090/api/v1/status/tsdb | jq '.data.headStats'

Rules:

  • Never use unbounded labels (user IDs, request IDs, UUIDs, email addresses)
  • Keep label value sets small and bounded (HTTP methods, status codes, handler paths)
  • Use metric_relabel_configs to drop or aggregate high-cardinality labels at scrape time

Instrumentation Best Practices

RED Method (for request-driven services)

  • Rate: requests per second — rate(http_requests_total[5m])
  • Errors: error rate — rate(http_requests_total{status=~"5.."}[5m])
  • Duration: latency — histogram_quantile(0.99, rate(http_request_duration_seconds_bucket[5m]))

USE Method (for resources: CPU, memory, disk, network)

  • Utilization: how full is it — node_cpu_seconds_total, node_memory_MemAvailable_bytes
  • Saturation: how overloaded — node_load1, queue lengths
  • Errors: error counts — node_disk_io_errors_total, node_network_receive_errs_total

Metric Naming Conventions

# Good: namespace_subsystem_name_unit
http_request_duration_seconds
node_disk_read_bytes_total
process_resident_memory_bytes

# Bad: no unit, ambiguous
request_latency          # seconds? milliseconds?
disk_usage               # bytes? percent?
memory                   # available? used? total?
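Valid names must also match the pattern [a-zA-Z_:][a-zA-Z0-9_:]* from the Prometheus data model (colons are conventionally reserved for recording rules). A tiny validator makes this checkable in CI; the sample names below come from the lists above:

```python
import re

# Metric-name pattern from the Prometheus data model documentation.
METRIC_NAME_RE = re.compile(r"^[a-zA-Z_:][a-zA-Z0-9_:]*$")


def valid_metric_name(name: str) -> bool:
    return bool(METRIC_NAME_RE.match(name))


print(valid_metric_name("http_request_duration_seconds"))  # True
print(valid_metric_name("node_disk_read_bytes_total"))     # True
print(valid_metric_name("request-latency"))                # False: hyphens not allowed
print(valid_metric_name("2xx_responses"))                  # False: cannot start with a digit
```

Names that fail this check are rejected at exposition time anyway, so catching them in review is cheaper.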

Pushgateway

For short-lived batch jobs that cannot be scraped (they exit before Prometheus scrapes them).

# Push a metric from a batch job
echo "batch_job_duration_seconds 42.5" | curl --data-binary @- http://pushgateway:9091/metrics/job/nightly-etl/instance/etl-server

# Push multiple metrics
cat <<'EOF' | curl --data-binary @- http://pushgateway:9091/metrics/job/nightly-etl
# TYPE batch_job_duration_seconds gauge
batch_job_duration_seconds 42.5
# TYPE batch_job_records_processed gauge
batch_job_records_processed 150000
# TYPE batch_job_last_success_timestamp gauge
batch_job_last_success_timestamp 1711036800
EOF

Warning: Pushgateway metrics are sticky. There is no built-in TTL, so if a batch job stops pushing, or crashes before it pushes, the last value persists indefinitely and can look healthy forever. Always push a batch_job_last_success_timestamp and alert when it grows too old. The Pushgateway is not a substitute for scraping long-running services; use it only for true batch/cron workloads.

