Skip to content

Portal | Level: L2: Operations | Topics: Alerting Rules, Prometheus | Domain: Observability

Alerting Rules - Skill Check

Mental model (bottom-up)

Good alerts tell you about user-impacting problems, not infrastructure noise. Alert on symptoms (error rate, latency) not causes (CPU usage). Use recording rules for expensive queries, multi-window burn rate for SLO-based alerting, and inhibition to prevent alert storms. Every alert should have a runbook link.

Visual stack

[Prometheus          ]  evaluates rules every 15-30s
|
[Recording Rules     ]  pre-compute expensive queries
|
[Alert Rules         ]  expr + for duration → PENDING → FIRING
|
[Alertmanager        ]  group, route, dedupe, silence
|
[PagerDuty/Slack     ]  critical → page, warning → channel

Glossary

  • recording rule - pre-computed PromQL query stored as a new time series
  • for duration - how long condition must be true before alert fires (prevents flapping)
  • burn rate - how fast error budget is being consumed relative to the SLO window
  • inhibition - automatically suppress child alerts when parent alert fires
  • silence - manually mute a specific alert for a time window (maintenance)
  • absent() - PromQL function that returns 1 when a metric doesn't exist at all
  • predict_linear() - linear regression to predict future value (disk full alerts)

Core questions (easy -> hard)

  • What does for: 5m mean in an alert rule?
  • Condition must be continuously true for 5 minutes before the alert fires. Prevents alerting on brief spikes.
  • Why use recording rules?
  • Pre-compute expensive queries (like multi-label rates). Faster alert evaluation, consistent values across dashboards.
  • Write an alert for "5xx error rate above 5%."
  • sum(rate(http_requests_total{code=~"5.."}[5m])) / sum(rate(http_requests_total[5m])) > 0.05
  • What's wrong with alerting on CPU > 80%?
  • It's a cause, not a symptom. High CPU doesn't always mean user impact. Alert on latency/errors instead.
  • Explain multi-window burn rate alerting.
  • Check both a long window (trend) and short window (still happening). 14.4x burn rate in 1h AND 5m = page. Prevents false positives from brief spikes.
  • Silence vs inhibition?
  • Silence: manual, time-boxed (maintenance). Inhibition: automatic rule-based (suppress warnings when critical fires).
  • How do you alert when a metric disappears entirely?
  • absent(up{job="myapp"} == 1) — returns 1 when the series doesn't exist. Catches targets that stop reporting.
  • How do you predict disk will be full in 4 hours?
  • predict_linear(node_filesystem_avail_bytes[6h], 4*3600) < 0

Wiki Navigation

Prerequisites