
Prometheus and the Art of Not Alerting

  • lesson
  • prometheus
  • promql
  • alerting
  • alert-fatigue
  • slos
  • error-budgets
  • observability-culture
  • l2

Topics: Prometheus, PromQL, alerting, alert fatigue, SLOs, error budgets, observability culture
Level: L2 (Operations)
Time: 75–90 minutes
Prerequisites: None (Prometheus concepts explained from scratch)


The Mission

Your team has 847 alert rules across 12 services. You get 300+ notifications per week. On-call engineers have developed "alert yoga" — page buzzes, glance at title, acknowledge, go back to sleep. Mean time to acknowledge: 11 seconds. Mean time to actually investigate: 4 hours 12 minutes.

The alerting system, in its current state, is functionally identical to having no monitoring at all.

This lesson teaches you how monitoring and alerting should work — not more alerts, but better ones. Fewer rules that fire less often and always mean something.


How Prometheus Works (60-Second Version)

Prometheus pulls metrics from your services by scraping HTTP endpoints at regular intervals (default 15 seconds):

Your app exposes → GET /metrics
Prometheus scrapes → stores time series (metric name + labels + timestamp + value)
You query → PromQL
You alert → recording rules + alert rules → Alertmanager → PagerDuty/Slack
# What /metrics looks like
curl http://myapp:8000/metrics
# → http_requests_total{method="GET",status="200"} 12345
# → http_requests_total{method="POST",status="500"} 42
# → http_request_duration_seconds_bucket{le="0.1"} 11000
# → http_request_duration_seconds_bucket{le="0.5"} 12100
# → http_request_duration_seconds_bucket{le="1.0"} 12300

Four metric types:

| Type | What it does | Example |
|---|---|---|
| Counter | Only goes up (resets on restart) | http_requests_total |
| Gauge | Goes up and down | temperature_celsius, active_connections |
| Histogram | Counts observations in buckets | http_request_duration_seconds |
| Summary | Client-side quantiles | Avoid unless you have specific reasons |

Name Origin: Prometheus was created at SoundCloud in 2012 by Matt T. Proud and Julius Volz, inspired by Google's Borgmon. Named after the Greek titan who stole fire from the gods and gave it to humanity — fitting for a tool that makes observability accessible. It joined CNCF as its second project (after Kubernetes) in 2016.


The Cardinal Sins of Alerting

Sin 1: Alerting on Infrastructure, Not Customer Impact

# BAD — alerts on CPU utilization
- alert: HighCPU
  expr: avg by (instance) (rate(node_cpu_seconds_total{mode!="idle"}[5m])) > 0.8
  for: 5m
  labels:
    severity: critical

This pages you for batch jobs, garbage collection, builds, and anything else that legitimately uses CPU. None of these affect customers. You learn to ignore "HighCPU" alerts. Then one day high CPU actually IS causing user-visible latency, and you ignore that too.

# GOOD — alerts on customer impact
- alert: HighErrorRate
  expr: |
    rate(http_requests_total{status=~"5.."}[5m])
    / rate(http_requests_total[5m]) > 0.01
  for: 5m
  labels:
    severity: critical
  annotations:
    summary: "Error rate exceeds 1% for {{ $labels.service }}"

Mental Model: Monitor infrastructure on dashboards, alert on customer impact. A high CPU gauge is useful context during an investigation. But it should never be the trigger for a page. The trigger should be: "users are experiencing errors/latency."

Sin 2: Missing the for: Duration

# BAD — fires on a single evaluation cycle
- alert: HighLatency
  expr: histogram_quantile(0.99, rate(http_request_duration_seconds_bucket[5m])) > 0.5

A 30-second traffic spike triggers this. You get paged. By the time you look, it's resolved. Repeat 3 times per night.

# GOOD — condition must be true for 5 continuous minutes
- alert: HighLatency
  expr: histogram_quantile(0.99, rate(http_request_duration_seconds_bucket[5m])) > 0.5
  for: 5m

Sin 3: Alerting on Low-Traffic Error Rate

# BAD — 1 error out of 10 requests = 10% error rate
- alert: HighErrorRate
  expr: rate(http_requests_total{status=~"5.."}[5m]) / rate(http_requests_total[5m]) > 0.01

At 3 AM, your service gets 10 requests per minute. One timeout = 10% error rate. Alert fires. There is no real problem.

# GOOD — require minimum traffic volume
- alert: HighErrorRate
  expr: |
    rate(http_requests_total{status=~"5.."}[5m])
    / rate(http_requests_total[5m]) > 0.01
    and rate(http_requests_total[5m]) > 1
  for: 5m

Sin 4: Missing absent() for Critical Metrics

Your app crashes. It stops emitting metrics. The error rate alert sees "no data" and doesn't fire — because there's nothing to calculate a rate from. The app is completely down, and your monitoring shows green.

# Add absent() alerts for critical services
- alert: ServiceDown
  expr: absent(up{job="api"} == 1)
  for: 2m
  labels:
    severity: critical
  annotations:
    summary: "API service has stopped emitting metrics"

Gotcha: absent() fires only when a metric completely disappears. If you have 10 instances and 9 crash, absent() doesn't fire because 1 instance still emits. Use count(up{job="api"} == 0) > 0 to detect partial outages.
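The difference can be sketched with plain Python standing in for the two queries (the instance values are made up for illustration):

```python
# Toy comparison of the two detection strategies. The "up" values are
# made-up samples for a 10-instance job, 9 of them crashed.
up_values = [0] * 9 + [1]                     # up{job="api"} per instance

# absent(up{job="api"}): fires only when NO series exists at all
absent_fires = len(up_values) == 0

# count(up{job="api"} == 0) > 0: fires as soon as any instance is down
partial_fires = sum(1 for v in up_values if v == 0) > 0

print(absent_fires, partial_fires)            # → False True
```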


PromQL: The Essential Patterns

Rate (the most important function)

# Convert counter to per-second rate over 5 minutes
rate(http_requests_total[5m])

# Error rate as a percentage
rate(http_requests_total{status=~"5.."}[5m])
/ rate(http_requests_total[5m])

# Request rate per service
sum by (service) (rate(http_requests_total[5m]))

Gotcha: rate() needs a range of at least 4x your scrape interval. For a 15-second scrape, minimum is [1m], safe default is [5m]. Shorter ranges return nothing when a scrape is missed.
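A toy model of the failure mode (real Prometheus also extrapolates to the window boundaries and handles resets more carefully; this sketch only captures the sample-count rule):

```python
# Toy model of rate(): per-second increase between the first and last
# counter samples in the range. This is a simplification of the real
# implementation, which extrapolates to the window edges.
def toy_rate(samples):
    """samples: list of (timestamp_seconds, counter_value) inside the range."""
    if len(samples) < 2:
        return None                    # no result without at least 2 points
    (t0, v0), (t1, v1) = samples[0], samples[-1]
    increase = v1 - v0
    if increase < 0:                   # counter reset (process restarted)
        increase = v1                  # treat counter as restarted from zero
    return increase / (t1 - t0)

# 15s scrape interval, [30s] range: at most two samples fit
print(toy_rate([(0, 100), (15, 130)]))  # → 2.0 requests/sec
print(toy_rate([(0, 100)]))             # → None (one scrape missed)
```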

Histogram Percentiles

# p99 latency
histogram_quantile(0.99, rate(http_request_duration_seconds_bucket[5m]))

# p50 (median)
histogram_quantile(0.50, rate(http_request_duration_seconds_bucket[5m]))

Gotcha: Histogram accuracy depends on bucket boundaries. If your SLO is 200ms but your nearest buckets are 100ms and 500ms, the percentile calculation interpolates linearly between them — potentially very wrong. Add bucket boundaries at 150ms, 200ms, 250ms for accurate SLO tracking.
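A simplified sketch of that interpolation (not Prometheus's exact implementation, which special-cases the +Inf bucket, but the same linear logic), using hypothetical cumulative counts:

```python
# Sketch of histogram_quantile's linear interpolation. Bucket counts are
# cumulative, as in Prometheus: le="0.1" counts all observations <= 0.1s.
def quantile_from_buckets(q, buckets):
    """buckets: sorted list of (upper_bound_seconds, cumulative_count)."""
    total = buckets[-1][1]
    rank = q * total
    prev_bound, prev_count = 0.0, 0
    for bound, count in buckets:
        if rank <= count:
            # linearly interpolate inside this bucket
            return prev_bound + (bound - prev_bound) * (rank - prev_count) / (count - prev_count)
        prev_bound, prev_count = bound, count
    return buckets[-1][0]

# Coarse buckets around a 200ms SLO: nearest boundaries are 100ms and 500ms
coarse = [(0.1, 9000), (0.5, 9990), (1.0, 10000)]
print(quantile_from_buckets(0.99, coarse))
# ≈ 0.464s — even if real observations cluster just above 100ms,
# the estimate lands mid-bucket
```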


SLOs and Error Budgets — Alerting That Works

Instead of hundreds of symptom-based alerts, define what "good" means and alert when you're running out of goodness.

Step 1: Define the SLO

Our API will serve 99.9% of requests successfully in under 500ms
over a 30-day rolling window.

This gives you an error budget: 0.1% of requests can fail or be slow.

Step 2: Calculate the budget

Over 30 days with 1,000,000 requests/day:

Total requests: 30,000,000
Error budget: 30,000,000 × 0.001 = 30,000 failed requests
Daily budget: 1,000 failed requests
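The same arithmetic as code (the traffic figures are this example's assumptions, not measurements):

```python
# Error-budget arithmetic for the example SLO above.
daily_requests = 1_000_000
window_days = 30
slo = 0.999                              # 99.9% success target

total_requests = daily_requests * window_days
error_budget = total_requests * (1 - slo)

print(total_requests)                    # → 30000000
print(int(error_budget))                 # → 30000 failed requests allowed
print(int(error_budget / window_days))   # → 1000 per day
```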

Step 3: Alert on budget burn rate

# Alert when burning budget 14x faster than sustainable (1h window)
# This would exhaust the monthly budget in ~2 days
- alert: SLOBurnRateHigh
  expr: |
    (
      rate(http_requests_total{status=~"5.."}[1h])
      / rate(http_requests_total[1h])
    ) > (14 * 0.001)
  for: 5m
  labels:
    severity: critical

# Alert when burning budget 3x faster (6h window) — sustained degradation
- alert: SLOBurnRateSlow
  expr: |
    (
      rate(http_requests_total{status=~"5.."}[6h])
      / rate(http_requests_total[6h])
    ) > (3 * 0.001)
  for: 30m
  labels:
    severity: warning

This is the Google SRE approach: alert on the rate of budget consumption, not on individual symptoms. A 1% error rate is fine if it's a 5-minute blip. A 0.2% error rate is a problem if it's sustained for 6 hours — because it'll exhaust the monthly budget.
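The multipliers can be sanity-checked with a little arithmetic: at N times the sustainable burn rate, a 30-day budget is gone in 30/N days.

```python
# Sanity-checking the burn-rate multipliers against the 30-day window.
def days_to_exhaustion(window_days, burn_rate):
    return window_days / burn_rate

print(round(days_to_exhaustion(30, 14), 1))  # → 2.1 (the "~2 days" critical case)
print(round(days_to_exhaustion(30, 3), 1))   # → 10.0 (sustained-degradation warning)
```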

Remember: the nines cheat sheet:

  • 99% = 7.3 hours/month of downtime
  • 99.9% = 43.8 minutes/month
  • 99.95% = 21.9 minutes/month
  • 99.99% = 4.4 minutes/month

Each additional nine costs ~10x more engineering effort.
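Those figures fall out of one line of arithmetic, using an average month of 365.25/12 days (which is what the 43.8-minute figure implies):

```python
# Deriving the cheat sheet: monthly downtime budget = (1 - SLO) x minutes
# in an average month.
MINUTES_PER_MONTH = 365.25 / 12 * 24 * 60    # 43830.0

for slo in (0.99, 0.999, 0.9995, 0.9999):
    print(slo, round((1 - slo) * MINUTES_PER_MONTH, 1))
# 99.9% works out to 43.8 minutes/month, matching the list above
```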


The Cardinality Trap

Prometheus stores one time series per unique combination of metric name + label values. High-cardinality labels explode your storage:

# BAD — user_id has unbounded cardinality
http_requests_total{user_id="user_12345", method="GET", status="200"}
# 200,000 users × 5 methods × 10 status codes = 10,000,000 series!

# GOOD — bounded labels only
http_requests_total{service="api", method="GET", status="200"}
# 1 service × 5 methods × 10 status codes = 50 series

Gotcha: Histograms generate 13+ series per unique label combination (one per bucket + _sum + _count). If you add a label with 100 values to a histogram, you get 1,300 series from that one metric. Always check cardinality before adding labels.

# Check top cardinality consumers
curl -s http://prometheus:9090/api/v1/status/tsdb | jq '.data.seriesCountByMetricName[:10]'
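The series-count arithmetic from the examples above, as code (the label value counts, 5 methods, 10 status codes, and 200,000 users, are this section's assumptions):

```python
# Cardinality arithmetic: series = product of label value counts.
methods, statuses = 5, 10
users = 200_000
series_per_histogram_combo = 13   # buckets + _sum + _count (this lesson's figure)

print(users * methods * statuses)          # → 10000000 series with a user_id label
print(methods * statuses)                  # → 50 series with bounded labels only
print(users * series_per_histogram_combo)  # → 2600000 series if user_id hits a histogram
```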

Alert Bankruptcy — When to Start Over

War Story: A team had 847 alerts, 300+ notifications per week. On-call acknowledged in 8 seconds (muscle memory for the button) without reading them. A disk_usage_critical alert on the primary database server was acknowledged and ignored at 11:42 PM. By 12:18 AM, the disk was 100% full, PostgreSQL went read-only, and the write outage lasted 64 minutes. Detection time: 45 of those 64 minutes — because the alert was buried in noise.

The team declared alert bankruptcy: deleted 680 of 847 alerts. Every remaining alert had to meet three criteria: (1) indicates customer-facing impact, (2) requires human action, (3) responder knows what action to take. Alerts dropped from 847 to 167. Pages per week: 300+ to 15. Mean time to engage on real alerts: 3 minutes.

The three-question test for every alert

Before creating an alert rule, answer:

  1. Does this indicate customer impact? If not, it's a dashboard metric, not an alert.
  2. Does it require human action? If it auto-heals (Kubernetes restarts, auto-scaling), it shouldn't page anyone.
  3. Does the responder know what to do? Every alert should link to a runbook. If you can't write the runbook, you can't write the alert.

If you can't answer yes to all three, the alert shouldn't exist.


Flashcard Check

Q1: rate() on a counter with [30s] range and 15s scrape interval returns nothing. Why?

rate() needs at least two samples in the range. A [30s] window at a 15s scrape interval holds at most two samples, so a single missed scrape leaves one sample and no result. Hence the rule of thumb: use a range of at least 4x the scrape interval (60s minimum here; [5m] is a safe default).

Q2: absent(up{job="api"}) — when does this fire?

Only when the metric completely disappears (all instances down). If 9 of 10 instances crash, absent() doesn't fire. Use count(up{job="api"} == 0) > 0 instead.

Q3: What is a burn rate alert?

Alerts on how fast you're consuming your error budget. A 14x burn rate means you'll exhaust the monthly budget in ~2 days. This catches both acute spikes and sustained degradation.

Q4: Why is alerting on "CPU > 80%" bad?

CPU usage doesn't indicate customer impact. Batch jobs, GC, and builds legitimately use CPU. You learn to ignore CPU alerts, then miss the one time it matters.

Q5: 200,000 users × histogram = how many time series?

200,000 × 13 (buckets + sum + count) = 2,600,000 series from ONE metric. This will OOM your Prometheus. Never use unbounded labels (user IDs, request IDs).

Q6: 99.9% SLO over 30 days — what's the monthly error budget?

43.8 minutes of downtime, or 0.1% of total requests. Each additional nine costs ~10x more engineering effort.


Exercises

Exercise 1: Fix these alert rules (refactor)

# What's wrong with each alert?

# Alert 1
- alert: HighCPU
  expr: node_cpu_seconds_total{mode="system"} > 0.9

# Alert 2
- alert: ServiceErrors
  expr: http_errors_total > 100

# Alert 3
- alert: DiskFull
  expr: node_filesystem_avail_bytes / node_filesystem_size_bytes < 0.15
Answers:

  1. HighCPU: Compares a raw counter (cumulative CPU seconds) to 0.9, so it is effectively always true. Alerts on infrastructure, not impact. No for: duration. Replace with an error rate or latency alert.
  2. ServiceErrors: Alerts on a raw counter (always increasing). Should use rate(). No for: duration. No minimum traffic threshold.
  3. DiskFull: Actually decent! But it could be better: add predict_linear() to catch growing disks, not just currently-full ones. Add for: 10m to avoid transient blips. Exclude read-only and tmpfs filesystems.

Exercise 2: Design an SLO (think)

Your API has:

  • Average response time: 120ms
  • p99 response time: 400ms
  • Error rate: 0.3%
  • Traffic: 50,000 requests/day

Design an SLO and calculate the error budget.

One approach:

SLO: 99.5% of requests return successfully (2xx) in under 500ms over a 30-day window.
Budget: 50,000 × 30 × 0.005 = 7,500 failed/slow requests per month, or 250/day.
Current: 0.3% errors = 150 errors/day, and p99 at 400ms is under the 500ms threshold. You're well within budget, with room for feature velocity and no reliability anxiety.
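A quick check of the sample answer's arithmetic (the 99.5% target is the answer's assumption, not a given):

```python
# Error-budget check for an assumed 99.5% SLO over 30 days.
daily_requests = 50_000
budget = daily_requests * 30 * 0.005      # 0.5% of monthly traffic
print(int(budget))                        # → 7500 failed/slow requests per month
print(int(budget / 30))                   # → 250 per day
print(int(daily_requests * 0.003))        # → 150 current errors/day, inside budget
```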

Cheat Sheet

PromQL Essentials

| Pattern | Query |
|---|---|
| Request rate | sum(rate(http_requests_total[5m])) |
| Error rate (%) | rate(errors[5m]) / rate(total[5m]) |
| p99 latency | histogram_quantile(0.99, rate(duration_bucket[5m])) |
| Saturation | avg(rate(node_cpu_seconds_total{mode!="idle"}[5m])) |
| Absent metric | absent(up{job="api"} == 1) |
| Prediction | predict_linear(node_filesystem_avail_bytes[6h], 4*3600) < 0 |

Alert Rule Checklist

Every alert MUST have:

  - [ ] for: duration (5m minimum for critical, 15m for warning)
  - [ ] Minimum traffic threshold (avoid low-volume false positives)
  - [ ] Link to runbook in annotations
  - [ ] Answers "yes" to: customer impact? requires action? responder knows what to do?

The Nines

| SLO | Monthly budget |
|---|---|
| 99% | 7h 18m |
| 99.9% | 43m 48s |
| 99.95% | 21m 54s |
| 99.99% | 4m 22s |

Takeaways

  1. Fewer alerts, not more. Every alert is a commitment to respond. 847 alerts you ignore is worse than 50 you act on.

  2. Alert on customer impact, not infrastructure. CPU, memory, and disk usage belong on dashboards. Error rate and latency trigger pages.

  3. SLO-based alerting beats symptom-based. Burn rate alerts catch both acute spikes and sustained degradation — without false positives.

  4. Cardinality kills Prometheus. One unbounded label can create millions of series. Check cardinality before adding labels. Histograms multiply the problem by 13.

  5. Every alert needs a runbook. If you can't write what the responder should do, the alert shouldn't exist.


Related Lessons

  • The Mysterious Latency Spike — diagnosing the problem your alert just told you about
  • How Incident Response Actually Works — what happens after the page
  • The Cascading Timeout — when one service's problems cascade to everything