Prometheus Deep Dive - Street-Level Ops¶
Real-world workflows for operating Prometheus, debugging missing metrics, and writing production PromQL.
Debugging Missing Metrics¶
A dashboard panel shows "No Data." Work through this systematically:
# 1. Is the target being scraped at all?
curl -s http://prometheus:9090/api/v1/targets | jq '.data.activeTargets[] | select(.labels.job=="api-server") | {instance: .labels.instance, health: .health, lastError: .lastError}'
# Output when healthy:
# { "instance": "10.0.1.5:8080", "health": "up", "lastError": "" }
# Output when broken:
# { "instance": "10.0.1.5:8080", "health": "down", "lastError": "Get http://10.0.1.5:8080/metrics: dial tcp 10.0.1.5:8080: connect: connection refused" }
# 2. Can you reach the metrics endpoint directly?
curl -s http://10.0.1.5:8080/metrics | head -20
# 3. Is the metric being scraped but relabeled away?
# Check relabel_configs and metric_relabel_configs in the scrape config
# A common mistake: a metric_relabel_configs drop rule is too broad
curl -s 'http://prometheus:9090/api/v1/targets/metadata?metric=http_requests_total'
# 4. Is the metric present but the label selector in the query is wrong?
# Query without any label filters to see if the metric exists at all
curl -s 'http://prometheus:9090/api/v1/query?query=http_requests_total' | jq '.data.result | length'
# 5. Is the metric present but the time range is wrong?
# Staleness: by default a series stops returning from queries 5 minutes
# (the lookback delta) after its last sample
# Check the last sample timestamp
curl -s 'http://prometheus:9090/api/v1/query?query=http_requests_total' | jq '.data.result[0].value'
# value[0] is the Unix timestamp, value[1] is the value
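To sanity-check the jq filters above before pointing them at a live server, run them against a canned payload. A minimal sketch, assuming only the /api/v1/targets response shape shown above (the JSON below is a hand-written sample, not real API output):

```shell
# Illustrative sample of the targets API response (trimmed)
payload='{"data":{"activeTargets":[
  {"labels":{"job":"api-server","instance":"10.0.1.5:8080"},
   "health":"down","lastError":"connection refused"},
  {"labels":{"job":"api-server","instance":"10.0.1.6:8080"},
   "health":"up","lastError":""}]}}'

# Same filter as step 1, restricted to unhealthy targets only
unhealthy=$(echo "$payload" | jq -r '
  .data.activeTargets[]
  | select(.labels.job == "api-server" and .health != "up")
  | "\(.labels.instance): \(.lastError)"')
echo "$unhealthy"   # -> 10.0.1.5:8080: connection refused
```

Swap the canned `payload` for the live `curl` once the filter does what you expect.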
PromQL for SLOs and SLIs¶
Error Rate SLI (Request Success Rate)¶
# Request success ratio (1 minus the 5xx error ratio) over the last 30 minutes
1 - (
sum(rate(http_requests_total{status=~"5.."}[30m]))
/
sum(rate(http_requests_total[30m]))
)
# Multi-window alert: fire only when both short AND long windows breach
# Short window catches acute incidents; long window prevents flapping
# Alert: SLO is 99.9% (error budget = 0.1%)
groups:
  - name: slo-alerts
    rules:
      - alert: HighErrorRate_MultiWindow
        expr: |
          (
            sum(rate(http_requests_total{status=~"5.."}[5m]))
            / sum(rate(http_requests_total[5m]))
            > 0.01
          )
          and
          (
            sum(rate(http_requests_total{status=~"5.."}[1h]))
            / sum(rate(http_requests_total[1h]))
            > 0.005
          )
        for: 2m
        labels:
          severity: critical
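The two thresholds map to burn rates against the 0.1% budget: a 1% error rate is a 10x burn (the 30-day budget exhausted in 3 days), 0.5% is 5x. Checking the short-window arithmetic with awk:

```shell
# SLO 99.9% over 30 days => error budget = 0.1% of requests
budget=0.001

# Short-window threshold: 1% errors = 10x burn rate
short_burn=$(awk -v e=0.01 -v b=$budget 'BEGIN { printf "%.0f", e / b }')
# At 10x burn, the 30-day budget is gone in 30/10 = 3 days
days_to_empty=$(awk -v burn=$short_burn 'BEGIN { printf "%.0f", 30 / burn }')
echo "burn=${short_burn}x, budget gone in ${days_to_empty} days"
# -> burn=10x, budget gone in 3 days
```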
Latency SLI (p99 < 500ms)¶
# What fraction of requests complete under 500ms?
sum(rate(http_request_duration_seconds_bucket{le="0.5"}[5m]))
/
sum(rate(http_request_duration_seconds_count[5m]))
# Alert when <99% of requests are under 500ms
- alert: LatencySLOBreach
  expr: |
    sum(rate(http_request_duration_seconds_bucket{le="0.5"}[10m]))
    /
    sum(rate(http_request_duration_seconds_count[10m]))
    < 0.99
  for: 5m
Availability SLI (Synthetic/Blackbox)¶
# Blackbox exporter probe success rate
avg_over_time(probe_success{job="blackbox-http"}[1h])
# Alert when availability drops below 99.9% over 1 hour
- alert: EndpointAvailabilityLow
  expr: avg_over_time(probe_success{job="blackbox-http", target="https://api.example.com/health"}[1h]) < 0.999
  for: 5m
High Cardinality Detection¶
# TSDB status page — shows top metrics by series count
curl -s http://prometheus:9090/api/v1/status/tsdb | jq '{
totalSeries: .data.headStats.numSeries,
topMetrics: [.data.seriesCountByMetricName[:10][] | {name: .name, count: .value}],
topLabels: [.data.labelValueCountByLabelName[:10][] | {label: .name, values: .value}]
}'
# Example output:
# {
# "totalSeries": 1250000,
# "topMetrics": [
# { "name": "container_network_receive_bytes_total", "count": 45000 },
# { "name": "http_requests_total", "count": 32000 }
# ],
# "topLabels": [
# { "label": "pod", "values": 8500 },
# { "label": "container_id", "values": 8500 }
# ]
# }
# Find which label is causing the explosion on a specific metric
curl -s http://prometheus:9090/api/v1/query \
  --data-urlencode 'query=count by (handler) (http_requests_total)' | \
  jq '[.data.result[] | {handler: .metric.handler, series: (.value[1] | tonumber)}] | sort_by(-.series) | .[:10]'
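Note that the query API returns sample values as JSON strings, so jq's `sort_by` needs `tonumber` first or it sorts lexically. Verifying against a canned payload (hand-written sample, illustrative counts):

```shell
# Illustrative sample of the query API response shape
resp='{"data":{"result":[
  {"metric":{"handler":"/api/users/:id"},"value":[1700000000,"950"]},
  {"metric":{"handler":"/healthz"},"value":[1700000000,"2"]},
  {"metric":{"handler":"/api/orders"},"value":[1700000000,"410"]}]}}'

# Convert the string-valued samples before sorting descending
top=$(echo "$resp" | jq -r '
  [.data.result[] | {handler: .metric.handler, series: (.value[1] | tonumber)}]
  | sort_by(-.series) | .[0].handler')
echo "$top"   # -> /api/users/:id
```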
# Drop a high-cardinality label at scrape time
# In prometheus.yml. Note: labeldrop applies to every series in the scrape;
# it cannot be scoped to one metric, and an `action: keep` on __name__ here
# would silently drop every other metric from the scrape.
metric_relabel_configs:
  - regex: "request_id"
    action: labeldrop
Alertmanager Routing Debugging¶
# Test which receiver an alert would route to
amtool config routes test --config.file=/etc/alertmanager/alertmanager.yml \
severity=critical team=platform alertname=HighErrorRate
# Output:
# pagerduty-critical
# Show the full routing tree
amtool config routes show --config.file=/etc/alertmanager/alertmanager.yml
# List active alerts
amtool alert query --alertmanager.url=http://alertmanager:9093
# List active silences
amtool silence query --alertmanager.url=http://alertmanager:9093
Silencing Alerts During Maintenance¶
# Silence all alerts for a specific instance for 2 hours
amtool silence add instance="node3:9100" \
--alertmanager.url=http://alertmanager:9093 \
--author="oncall@example.com" \
--comment="Scheduled kernel upgrade on node3" \
--duration=2h
# Silence a specific alert across all instances
amtool silence add alertname="DiskFillingUp" \
--alertmanager.url=http://alertmanager:9093 \
--author="oncall@example.com" \
--comment="Adding new storage nodes, disk will fill temporarily" \
--duration=4h
# List active silences
amtool silence query --alertmanager.url=http://alertmanager:9093
# Expire a silence early
amtool silence expire <silence-id> --alertmanager.url=http://alertmanager:9093
Grafana Dashboard Patterns¶
Request Rate Panel (RED)¶
# Graph: request rate by status code
sum by (status) (rate(http_requests_total{job="api-server"}[5m]))
# Single stat: total RPS
sum(rate(http_requests_total{job="api-server"}[5m]))
# Heatmap: latency distribution
sum by (le) (rate(http_request_duration_seconds_bucket{job="api-server"}[5m]))
Resource Utilization Panel (USE)¶
# CPU usage percentage per node
100 - (avg by (instance) (rate(node_cpu_seconds_total{mode="idle"}[5m])) * 100)
# Memory usage percentage per node
(1 - node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes) * 100
# Disk usage percentage
(1 - node_filesystem_avail_bytes{mountpoint="/"} / node_filesystem_size_bytes{mountpoint="/"}) * 100
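Sanity check for the memory expression above: with 2 GiB available out of 8 GiB total, the query should report 75% used. The same arithmetic in shell (values are illustrative):

```shell
avail=$((2 * 1024 * 1024 * 1024))   # node_memory_MemAvailable_bytes
total=$((8 * 1024 * 1024 * 1024))   # node_memory_MemTotal_bytes
used_pct=$(awk -v a=$avail -v t=$total 'BEGIN { printf "%.0f", (1 - a / t) * 100 }')
echo "${used_pct}% used"   # -> 75% used
```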
Kubernetes Pod Metrics¶
# Pod CPU usage (cores)
sum by (pod) (rate(container_cpu_usage_seconds_total{namespace="production", container!="POD", container!=""}[5m]))
# Pod memory usage (bytes)
sum by (pod) (container_memory_working_set_bytes{namespace="production", container!="POD", container!=""})
# Pod restart count
sum by (pod) (kube_pod_container_status_restarts_total{namespace="production"})
Federation for Multi-Cluster¶
# On the global Prometheus, scrape /federate from each cluster Prometheus
scrape_configs:
  - job_name: "federate-cluster-east"
    honor_labels: true
    metrics_path: /federate
    params:
      "match[]":
        - '{job=~".+"}'  # scrape all jobs (be selective in production)
        - "up"
    static_configs:
      - targets:
          - "prometheus-east.internal:9090"
    metric_relabel_configs:
      - target_label: cluster
        replacement: east

  - job_name: "federate-cluster-west"
    honor_labels: true
    metrics_path: /federate
    params:
      "match[]":
        - "job:http_requests_total:rate5m"  # only federate recording rules
        - "job:http_error_rate:ratio_5m"
        - "up"
    static_configs:
      - targets:
          - "prometheus-west.internal:9090"
    metric_relabel_configs:
      - target_label: cluster
        replacement: west
Best practice: Federate recording rules, not raw metrics. Raw metric federation at scale overwhelms the global Prometheus.
Scale note: Federation scrapes the /federate endpoint, which evaluates the match[] selectors on every scrape. With match[]={job=~".+"} on a Prometheus with 1M+ series, each federation scrape can take 10-30 seconds and consume significant CPU. Always federate narrow recording rule names, not broad selectors.
Recording Rules for Expensive Queries¶
# If your dashboard p99 query takes 8 seconds to evaluate,
# pre-compute it as a recording rule
groups:
  - name: latency-recording
    interval: 30s
    rules:
      # Per-handler p99
      - record: handler:http_request_duration_seconds:p99_5m
        expr: |
          histogram_quantile(0.99,
            sum by (handler, le) (rate(http_request_duration_seconds_bucket[5m]))
          )
      # Global p99
      - record: job:http_request_duration_seconds:p99_5m
        expr: |
          histogram_quantile(0.99,
            sum by (job, le) (rate(http_request_duration_seconds_bucket[5m]))
          )
      # Error budget remaining (30-day window, 99.9% SLO)
      - record: job:error_budget_remaining:ratio
        expr: |
          1 - (
            sum by (job) (increase(http_requests_total{status=~"5.."}[30d]))
            /
            sum by (job) (increase(http_requests_total[30d]))
          ) / 0.001
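A worked instance of the error-budget rule, with hypothetical 30-day counts: a 0.04% error ratio against the 0.1% budget leaves 60% of the budget.

```shell
errors=40000; total=100000000   # hypothetical 30-day increase() values
budget=0.001                    # 99.9% SLO
remaining=$(awk -v e=$errors -v t=$total -v b=$budget \
  'BEGIN { printf "%.2f", 1 - (e / t) / b }')
echo "$remaining"   # -> 0.60
```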
Target Discovery Debugging (Kubernetes)¶
# Check what targets Prometheus has discovered
curl -s http://prometheus:9090/api/v1/targets | jq '.data.activeTargets | length'
# See discovered but dropped targets (relabeling filtered them out)
curl -s http://prometheus:9090/api/v1/targets | jq '.data.droppedTargets | length'
# Find targets in a specific job
curl -s http://prometheus:9090/api/v1/targets | jq '.data.activeTargets[] | select(.labels.job=="kubernetes-pods") | .labels.instance'
# Verify Kubernetes ServiceMonitor is being picked up
kubectl get servicemonitors -A
kubectl get prometheus -n monitoring -o yaml | grep -A5 serviceMonitorSelector
# Check Prometheus Operator logs for reconciliation errors
kubectl logs -n monitoring -l app.kubernetes.io/name=prometheus-operator --tail=50
# Verify the target service has the right labels
kubectl get svc -n production api-service --show-labels
# Common issue: ServiceMonitor selector doesn't match service labels
# The SM selector matches on SERVICE labels, not pod labels
Gotcha: The Prometheus Operator's serviceMonitorSelector on the Prometheus CR must match the ServiceMonitor's labels, not the target Service's labels. This is a two-level selector: the Prometheus CR selects which ServiceMonitors to load, and each ServiceMonitor selects which Services to scrape. A mismatch at either level results in zero targets with no error message.
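A minimal sketch of the two levels (all names and label values below are illustrative):

```yaml
# Level 1: the Prometheus CR selects ServiceMonitors by THEIR labels
apiVersion: monitoring.coreos.com/v1
kind: Prometheus
metadata:
  name: k8s
  namespace: monitoring
spec:
  serviceMonitorSelector:
    matchLabels:
      release: prometheus            # must match the ServiceMonitor's labels
  serviceMonitorNamespaceSelector: {}  # empty selector = all namespaces
---
# Level 2: the ServiceMonitor selects Services by the SERVICE's labels
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: api-service
  namespace: production
  labels:
    release: prometheus              # satisfies level 1
spec:
  selector:
    matchLabels:
      app: api-service               # must match the Service's labels, not the pods'
  endpoints:
    - port: http-metrics
```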
Prometheus Storage Sizing¶
# Check current TSDB stats
curl -s http://prometheus:9090/api/v1/status/tsdb | jq '{
  headSeries: .data.headStats.numSeries,
  headChunks: .data.headStats.numChunks,
  minTime: .data.headStats.minTime,
  maxTime: .data.headStats.maxTime
}'
# Estimate storage:
# bytes_per_sample ≈ 1.5-2 bytes (with compression)
# storage = series_count * samples_per_day * retention_days * bytes_per_sample
#
# Example: 500,000 series, 15s interval, 15 days retention
# = 500,000 * (86400/15) * 15 * 1.7 bytes
# = 500,000 * 5,760 * 15 * 1.7
# ≈ 73 GB
# Monitor WAL size (write-ahead log can grow during high churn)
du -sh /prometheus/wal/
# Check chunk and block sizes on disk
du -sh /prometheus/chunks_head/
ls -lh /prometheus/ | grep -E "^d.*01"
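The sizing formula wrapped as a reusable helper (1.7 bytes/sample is the midpoint assumption from the estimate above):

```shell
# storage_gb <series> <scrape_interval_seconds> <retention_days> [bytes_per_sample]
storage_gb() {
  awk -v s="$1" -v i="$2" -v d="$3" -v b="${4:-1.7}" \
    'BEGIN { printf "%.0f\n", s * (86400 / i) * d * b / 1e9 }'
}

storage_gb 500000 15 15   # -> 73
```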
Metric Relabeling to Drop High-Cardinality Labels¶
# prometheus.yml scrape config for a service with noisy metrics
scrape_configs:
  - job_name: "api-server"
    static_configs:
      - targets: ["api:8080"]
    metric_relabel_configs:
      # Drop go_gc internal metrics (reduce noise)
      - source_labels: [__name__]
        regex: "go_gc_duration_seconds.*"
        action: drop
      # Drop the request_id label (unbounded cardinality);
      # labeldrop applies to every series in the scrape
      - regex: "request_id"
        action: labeldrop
      # Aggregate path labels: /api/users/123 -> /api/users/:id
      # (prefer fixing this in the application instrumentation; in emergencies:)
      - source_labels: [handler]
        regex: "/api/users/[0-9]+"
        target_label: handler
        replacement: "/api/users/:id"