Prometheus Deep Dive - Street-Level Ops¶
Real-world workflows for operating Prometheus, debugging missing metrics, and writing production PromQL.
Debugging Missing Metrics¶
A dashboard panel shows "No Data." Work through this systematically:
# 1. Is the target being scraped at all?
curl -s http://prometheus:9090/api/v1/targets | jq '.data.activeTargets[] | select(.labels.job=="api-server") | {instance: .labels.instance, health: .health, lastError: .lastError}'
# Output when healthy:
# { "instance": "10.0.1.5:8080", "health": "up", "lastError": "" }
# Output when broken:
# { "instance": "10.0.1.5:8080", "health": "down", "lastError": "Get http://10.0.1.5:8080/metrics: dial tcp 10.0.1.5:8080: connect: connection refused" }
# 2. Can you reach the metrics endpoint directly?
curl -s http://10.0.1.5:8080/metrics | head -20
# 3. Is the metric being scraped but relabeled away?
# Check relabel_configs and metric_relabel_configs in the scrape config
# A common mistake: a metric_relabel_configs drop rule is too broad
curl -s 'http://prometheus:9090/api/v1/targets/metadata?metric=http_requests_total'
# 4. Is the metric present but the label selector in the query is wrong?
# Query without any label filters to see if the metric exists at all
curl -s 'http://prometheus:9090/api/v1/query?query=http_requests_total' | jq '.data.result | length'
# 5. Is the metric present but the time range is wrong?
# Staleness: by default a series stops returning from queries 5 minutes
# (the lookback delta) after its last sample
# Check the last sample timestamp
curl -s 'http://prometheus:9090/api/v1/query?query=http_requests_total' | jq '.data.result[0].value'
# value[0] is the Unix timestamp, value[1] is the value
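To sanity-check the jq filters above before pointing them at a live server, run them against a canned payload. A minimal sketch, assuming only the /api/v1/targets response shape shown above (the JSON below is a hand-written sample, not real API output):

```shell
# Illustrative sample of the targets API response (trimmed)
payload='{"data":{"activeTargets":[
  {"labels":{"job":"api-server","instance":"10.0.1.5:8080"},
   "health":"down","lastError":"connection refused"},
  {"labels":{"job":"api-server","instance":"10.0.1.6:8080"},
   "health":"up","lastError":""}]}}'

# Same filter as step 1, restricted to unhealthy targets only
unhealthy=$(echo "$payload" | jq -r '
  .data.activeTargets[]
  | select(.labels.job == "api-server" and .health != "up")
  | "\(.labels.instance): \(.lastError)"')
echo "$unhealthy"   # -> 10.0.1.5:8080: connection refused
```

Swap the canned `payload` for the live `curl` once the filter does what you expect.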
PromQL for SLOs and SLIs¶
Error Rate SLI (Request Success Rate)¶
# Request success ratio (1 minus the 5xx error ratio) over the last 30 minutes
1 - (
sum(rate(http_requests_total{status=~"5.."}[30m]))
/
sum(rate(http_requests_total[30m]))
)
# Multi-window alert: fire only when both short AND long windows breach
# Short window catches acute incidents; long window prevents flapping
# Alert: SLO is 99.9% (error budget = 0.1%)
groups:
  - name: slo-alerts
    rules:
      - alert: HighErrorRate_MultiWindow
        expr: |
          (
            sum(rate(http_requests_total{status=~"5.."}[5m]))
            / sum(rate(http_requests_total[5m]))
            > 0.01
          )
          and
          (
            sum(rate(http_requests_total{status=~"5.."}[1h]))
            / sum(rate(http_requests_total[1h]))
            > 0.005
          )
        for: 2m
        labels:
          severity: critical
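The two thresholds map to burn rates against the 0.1% budget: a 1% error rate is a 10x burn (the 30-day budget exhausted in 3 days), 0.5% is 5x. Checking the short-window arithmetic with awk:

```shell
# SLO 99.9% over 30 days => error budget = 0.1% of requests
budget=0.001

# Short-window threshold: 1% errors = 10x burn rate
short_burn=$(awk -v e=0.01 -v b=$budget 'BEGIN { printf "%.0f", e / b }')
# At 10x burn, the 30-day budget is gone in 30/10 = 3 days
days_to_empty=$(awk -v burn=$short_burn 'BEGIN { printf "%.0f", 30 / burn }')
echo "burn=${short_burn}x, budget gone in ${days_to_empty} days"
# -> burn=10x, budget gone in 3 days
```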
Latency SLI (p99 < 500ms)¶
# What fraction of requests complete under 500ms?
sum(rate(http_request_duration_seconds_bucket{le="0.5"}[5m]))
/
sum(rate(http_request_duration_seconds_count[5m]))
# Alert when <99% of requests are under 500ms
- alert: LatencySLOBreach
  expr: |
    sum(rate(http_request_duration_seconds_bucket{le="0.5"}[10m]))
    /
    sum(rate(http_request_duration_seconds_count[10m]))
    < 0.99
  for: 5m
Availability SLI (Synthetic/Blackbox)¶
# Blackbox exporter probe success rate
avg_over_time(probe_success{job="blackbox-http"}[1h])
# Alert when availability drops below 99.9% over 1 hour
- alert: EndpointAvailabilityLow
  expr: avg_over_time(probe_success{job="blackbox-http", target="https://api.example.com/health"}[1h]) < 0.999
  for: 5m
High Cardinality Detection¶
# TSDB status page — shows top metrics by series count
curl -s http://prometheus:9090/api/v1/status/tsdb | jq '{
totalSeries: .data.headStats.numSeries,
topMetrics: [.data.seriesCountByMetricName[:10][] | {name: .name, count: .value}],
topLabels: [.data.labelValueCountByLabelName[:10][] | {label: .name, values: .value}]
}'
# Example output:
# {
# "totalSeries": 1250000,
# "topMetrics": [
# { "name": "container_network_receive_bytes_total", "count": 45000 },
# { "name": "http_requests_total", "count": 32000 }
# ],
# "topLabels": [
# { "label": "pod", "values": 8500 },
# { "label": "container_id", "values": 8500 }
# ]
# }
# Find which label is causing the explosion on a specific metric
curl -s http://prometheus:9090/api/v1/query \
  --data-urlencode 'query=count by (handler) (http_requests_total)' | \
  jq '[.data.result[] | {handler: .metric.handler, series: (.value[1] | tonumber)}] | sort_by(-.series) | .[:10]'
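Note that the query API returns sample values as JSON strings, so jq's `sort_by` needs `tonumber` first or it sorts lexically. Verifying against a canned payload (hand-written sample, illustrative counts):

```shell
# Illustrative sample of the query API response shape
resp='{"data":{"result":[
  {"metric":{"handler":"/api/users/:id"},"value":[1700000000,"950"]},
  {"metric":{"handler":"/healthz"},"value":[1700000000,"2"]},
  {"metric":{"handler":"/api/orders"},"value":[1700000000,"410"]}]}}'

# Convert the string-valued samples before sorting descending
top=$(echo "$resp" | jq -r '
  [.data.result[] | {handler: .metric.handler, series: (.value[1] | tonumber)}]
  | sort_by(-.series) | .[0].handler')
echo "$top"   # -> /api/users/:id
```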
# Drop a high-cardinality label at scrape time
# In prometheus.yml. Note: labeldrop applies to every series in the scrape;
# it cannot be scoped to one metric, and an `action: keep` on __name__ here
# would silently drop every other metric from the scrape.
metric_relabel_configs:
  - regex: "request_id"
    action: labeldrop
Alertmanager Routing Debugging¶
# Test which receiver an alert would route to
amtool config routes test --config.file=/etc/alertmanager/alertmanager.yml \
severity=critical team=platform alertname=HighErrorRate
# Output:
# pagerduty-critical
# Show the full routing tree
amtool config routes show --config.file=/etc/alertmanager/alertmanager.yml
# List active alerts
amtool alert query --alertmanager.url=http://alertmanager:9093
# List active silences
amtool silence query --alertmanager.url=http://alertmanager:9093
Silencing Alerts During Maintenance¶
# Silence all alerts for a specific instance for 2 hours
amtool silence add instance="node3:9100" \
--alertmanager.url=http://alertmanager:9093 \
--author="oncall@example.com" \
--comment="Scheduled kernel upgrade on node3" \
--duration=2h
# Silence a specific alert across all instances
amtool silence add alertname="DiskFillingUp" \
--alertmanager.url=http://alertmanager:9093 \
--author="oncall@example.com" \
--comment="Adding new storage nodes, disk will fill temporarily" \
--duration=4h
# List active silences
amtool silence query --alertmanager.url=http://alertmanager:9093
# Expire a silence early
amtool silence expire <silence-id> --alertmanager.url=http://alertmanager:9093
Grafana Dashboard Patterns¶
Request Rate Panel (RED)¶
# Graph: request rate by status code
sum by (status) (rate(http_requests_total{job="api-server"}[5m]))
# Single stat: total RPS
sum(rate(http_requests_total{job="api-server"}[5m]))
# Heatmap: latency distribution
sum by (le) (rate(http_request_duration_seconds_bucket{job="api-server"}[5m]))
Resource Utilization Panel (USE)¶
# CPU usage percentage per node
100 - (avg by (instance) (rate(node_cpu_seconds_total{mode="idle"}[5m])) * 100)
# Memory usage percentage per node
(1 - node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes) * 100
# Disk usage percentage
(1 - node_filesystem_avail_bytes{mountpoint="/"} / node_filesystem_size_bytes{mountpoint="/"}) * 100
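Sanity check for the memory expression above: with 2 GiB available out of 8 GiB total, the query should report 75% used. The same arithmetic in shell (values are illustrative):

```shell
avail=$((2 * 1024 * 1024 * 1024))   # node_memory_MemAvailable_bytes
total=$((8 * 1024 * 1024 * 1024))   # node_memory_MemTotal_bytes
used_pct=$(awk -v a=$avail -v t=$total 'BEGIN { printf "%.0f", (1 - a / t) * 100 }')
echo "${used_pct}% used"   # -> 75% used
```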
Kubernetes Pod Metrics¶
# Pod CPU usage (cores)
sum by (pod) (rate(container_cpu_usage_seconds_total{namespace="production", container!="POD", container!=""}[5m]))
# Pod memory usage (bytes)
sum by (pod) (container_memory_working_set_bytes{namespace="production", container!="POD", container!=""})
# Pod restart count
sum by (pod) (kube_pod_container_status_restarts_total{namespace="production"})
Federation for Multi-Cluster¶
# On the global Prometheus, scrape /federate from each cluster Prometheus
scrape_configs:
  - job_name: "federate-cluster-east"
    honor_labels: true
    metrics_path: /federate
    params:
      "match[]":
        - '{job=~".+"}'  # scrape all jobs (be selective in production)
        - "up"
    static_configs:
      - targets:
          - "prometheus-east.internal:9090"
    metric_relabel_configs:
      - target_label: cluster
        replacement: east

  - job_name: "federate-cluster-west"
    honor_labels: true
    metrics_path: /federate
    params:
      "match[]":
        - "job:http_requests_total:rate5m"  # only federate recording rules
        - "job:http_error_rate:ratio_5m"
        - "up"
    static_configs:
      - targets:
          - "prometheus-west.internal:9090"
    metric_relabel_configs:
      - target_label: cluster
        replacement: west
Best practice: Federate recording rules, not raw metrics. Raw metric federation at scale overwhelms the global Prometheus.
Scale note: Federation scrapes the /federate endpoint, which evaluates the match[] selectors on every scrape. With match[]={job=~".+"} on a Prometheus with 1M+ series, each federation scrape can take 10-30 seconds and consume significant CPU. Always federate narrow recording rule names, not broad selectors.
Recording Rules for Expensive Queries¶
# If your dashboard p99 query takes 8 seconds to evaluate,
# pre-compute it as a recording rule
groups:
  - name: latency-recording
    interval: 30s
    rules:
      # Per-handler p99
      - record: handler:http_request_duration_seconds:p99_5m
        expr: |
          histogram_quantile(0.99,
            sum by (handler, le) (rate(http_request_duration_seconds_bucket[5m]))
          )
      # Global p99
      - record: job:http_request_duration_seconds:p99_5m
        expr: |
          histogram_quantile(0.99,
            sum by (job, le) (rate(http_request_duration_seconds_bucket[5m]))
          )
      # Error budget remaining (30-day window, 99.9% SLO)
      - record: job:error_budget_remaining:ratio
        expr: |
          1 - (
            sum by (job) (increase(http_requests_total{status=~"5.."}[30d]))
            /
            sum by (job) (increase(http_requests_total[30d]))
          ) / 0.001
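A worked instance of the error-budget rule, with hypothetical 30-day counts: a 0.04% error ratio against the 0.1% budget leaves 60% of the budget.

```shell
errors=40000; total=100000000   # hypothetical 30-day increase() values
budget=0.001                    # 99.9% SLO
remaining=$(awk -v e=$errors -v t=$total -v b=$budget \
  'BEGIN { printf "%.2f", 1 - (e / t) / b }')
echo "$remaining"   # -> 0.60
```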
Target Discovery Debugging (Kubernetes)¶
# Check what targets Prometheus has discovered
curl -s http://prometheus:9090/api/v1/targets | jq '.data.activeTargets | length'
# See discovered but dropped targets (relabeling filtered them out)
curl -s http://prometheus:9090/api/v1/targets | jq '.data.droppedTargets | length'
# Find targets in a specific job
curl -s http://prometheus:9090/api/v1/targets | jq '.data.activeTargets[] | select(.labels.job=="kubernetes-pods") | .labels.instance'
# Verify Kubernetes ServiceMonitor is being picked up
kubectl get servicemonitors -A
kubectl get prometheus -n monitoring -o yaml | grep -A5 serviceMonitorSelector
# Check Prometheus Operator logs for reconciliation errors
kubectl logs -n monitoring -l app.kubernetes.io/name=prometheus-operator --tail=50
# Verify the target service has the right labels
kubectl get svc -n production api-service --show-labels
# Common issue: ServiceMonitor selector doesn't match service labels
# The SM selector matches on SERVICE labels, not pod labels
Gotcha: The Prometheus Operator's serviceMonitorSelector on the Prometheus CR must match the ServiceMonitor's labels, not the target Service's labels. This is a two-level selector: the Prometheus CR selects which ServiceMonitors to load, and each ServiceMonitor selects which Services to scrape. A mismatch at either level results in zero targets with no error message.
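A minimal sketch of the two levels (all names and label values below are illustrative):

```yaml
# Level 1: the Prometheus CR selects ServiceMonitors by THEIR labels
apiVersion: monitoring.coreos.com/v1
kind: Prometheus
metadata:
  name: k8s
  namespace: monitoring
spec:
  serviceMonitorSelector:
    matchLabels:
      release: prometheus            # must match the ServiceMonitor's labels
  serviceMonitorNamespaceSelector: {}  # empty selector = all namespaces
---
# Level 2: the ServiceMonitor selects Services by the SERVICE's labels
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: api-service
  namespace: production
  labels:
    release: prometheus              # satisfies level 1
spec:
  selector:
    matchLabels:
      app: api-service               # must match the Service's labels, not the pods'
  endpoints:
    - port: http-metrics
```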
Prometheus Storage Sizing¶
# Check current TSDB stats
curl -s http://prometheus:9090/api/v1/status/tsdb | jq '{
  headSeries: .data.headStats.numSeries,
  headChunks: .data.headStats.numChunks,
  minTime: .data.headStats.minTime,
  maxTime: .data.headStats.maxTime
}'
# Estimate storage:
# bytes_per_sample ≈ 1.5-2 bytes (with compression)
# storage = series_count * samples_per_day * retention_days * bytes_per_sample
#
# Example: 500,000 series, 15s interval, 15 days retention
# = 500,000 * (86400/15) * 15 * 1.7 bytes
# = 500,000 * 5,760 * 15 * 1.7
# ≈ 73 GB
# Monitor WAL size (write-ahead log can grow during high churn)
du -sh /prometheus/wal/
# Check chunk and block sizes on disk
du -sh /prometheus/chunks_head/
ls -lh /prometheus/ | grep -E "^d.*01"
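The sizing formula wrapped as a reusable helper (1.7 bytes/sample is the midpoint assumption from the estimate above):

```shell
# storage_gb <series> <scrape_interval_seconds> <retention_days> [bytes_per_sample]
storage_gb() {
  awk -v s="$1" -v i="$2" -v d="$3" -v b="${4:-1.7}" \
    'BEGIN { printf "%.0f\n", s * (86400 / i) * d * b / 1e9 }'
}

storage_gb 500000 15 15   # -> 73
```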
Metric Relabeling to Drop High-Cardinality Labels¶
# prometheus.yml scrape config for a service with noisy metrics
scrape_configs:
  - job_name: "api-server"
    static_configs:
      - targets: ["api:8080"]
    metric_relabel_configs:
      # Drop go_gc internal metrics (reduce noise)
      - source_labels: [__name__]
        regex: "go_gc_duration_seconds.*"
        action: drop
      # Drop the request_id label (unbounded cardinality);
      # labeldrop applies to every series in the scrape
      - regex: "request_id"
        action: labeldrop
      # Aggregate path labels: /api/users/123 -> /api/users/:id
      # (prefer fixing this in the application instrumentation; in emergencies:)
      - source_labels: [handler]
        regex: "/api/users/[0-9]+"
        target_label: handler
        replacement: "/api/users/:id"