Portal | Level: L2: Operations | Topics: Alerting Rules, Prometheus | Domain: Observability
Alerting Rules - Skill Check¶
Mental model (bottom-up)¶
Good alerts tell you about user-impacting problems, not infrastructure noise. Alert on symptoms (error rate, latency) not causes (CPU usage). Use recording rules for expensive queries, multi-window burn rate for SLO-based alerting, and inhibition to prevent alert storms. Every alert should have a runbook link.
Visual stack¶
[Prometheus ] evaluates rules every 15-30s
|
[Recording Rules ] pre-compute expensive queries
|
[Alert Rules ] expr + for duration → PENDING → FIRING
|
[Alertmanager ] group, route, dedupe, silence
|
[PagerDuty/Slack ] critical → page, warning → channel
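The stack above can be sketched as a minimal Prometheus rule file. This is a hedged example, not a canonical config: the job name, threshold, file path, and runbook URL are all placeholders.

```yaml
# rules/api-alerts.yml -- hypothetical file name
groups:
  - name: api-slo
    rules:
      - alert: HighErrorRate
        # symptom-based: ratio of 5xx responses to all responses
        expr: |
          sum(rate(http_requests_total{job="api", code=~"5.."}[5m]))
            / sum(rate(http_requests_total{job="api"}[5m])) > 0.05
        # condition must hold continuously for 5m: PENDING -> FIRING
        for: 5m
        labels:
          severity: critical        # Alertmanager routes this to the paging receiver
        annotations:
          summary: "API 5xx error rate above 5%"
          runbook_url: "https://wiki.example.com/runbooks/high-error-rate"
```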
Glossary¶
- recording rule - pre-computed PromQL query stored as a new time series
- for duration - how long a condition must be true before the alert fires (prevents flapping)
- burn rate - how fast the error budget is being consumed relative to the SLO window
- inhibition - automatically suppress child alerts when parent alert fires
- silence - manually mute a specific alert for a time window (maintenance)
- absent() - PromQL function that returns 1 when a metric doesn't exist at all
- predict_linear() - linear regression to predict a future value (disk-full alerts)
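Inhibition is configured in Alertmanager, not in the Prometheus rule files. A minimal sketch of an inhibition rule, assuming conventional severity labels (the equal labels are illustrative):

```yaml
# alertmanager.yml (fragment)
inhibit_rules:
  # when a critical alert fires, suppress matching warning-level alerts
  - source_matchers:
      - severity="critical"
    target_matchers:
      - severity="warning"
    # only inhibit when these labels match between source and target,
    # so a critical alert in one cluster doesn't mute warnings elsewhere
    equal: [alertname, cluster]
```

Silences, by contrast, are created at runtime (UI or amtool), not in this config file.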
Core questions (easy -> hard)¶
- What does for: 5m mean in an alert rule?
- The condition must be continuously true for 5 minutes before the alert fires. Prevents alerting on brief spikes.
- Why use recording rules?
- Pre-compute expensive queries (like multi-label rates). Faster alert evaluation, consistent values across dashboards.
- Write an alert for "5xx error rate above 5%."
- sum(rate(http_requests_total{code=~"5.."}[5m])) / sum(rate(http_requests_total[5m])) > 0.05
- What's wrong with alerting on CPU > 80%?
- It's a cause, not a symptom. High CPU doesn't always mean user impact. Alert on latency/errors instead.
- Explain multi-window burn rate alerting.
- Check both a long window (trend) and short window (still happening). 14.4x burn rate in 1h AND 5m = page. Prevents false positives from brief spikes.
- Silence vs inhibition?
- Silence: manual, time-boxed (maintenance). Inhibition: automatic rule-based (suppress warnings when critical fires).
- How do you alert when a metric disappears entirely?
- absent(up{job="myapp"} == 1) returns 1 when the series doesn't exist. Catches targets that stop reporting.
- How do you predict disk will be full in 4 hours?
- predict_linear(node_filesystem_avail_bytes[6h], 4*3600) < 0
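The recording-rule and burn-rate answers above can be combined into one sketch, assuming a 99.9% availability SLO (error budget 0.001) and the common 14.4x fast-burn threshold for a 1h/5m window pair; rule names are illustrative:

```yaml
groups:
  - name: slo-burn-rate
    rules:
      # recording rules pre-compute the error ratio at two window lengths
      - record: job:slo_errors:ratio_rate5m
        expr: |
          sum(rate(http_requests_total{code=~"5.."}[5m]))
            / sum(rate(http_requests_total[5m]))
      - record: job:slo_errors:ratio_rate1h
        expr: |
          sum(rate(http_requests_total{code=~"5.."}[1h]))
            / sum(rate(http_requests_total[1h]))
      # page only when BOTH windows burn: the long window shows the trend,
      # the short window confirms it is still happening
      - alert: ErrorBudgetFastBurn
        expr: |
          job:slo_errors:ratio_rate1h > (14.4 * 0.001)
            and
          job:slo_errors:ratio_rate5m > (14.4 * 0.001)
        labels:
          severity: critical
```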
Wiki Navigation¶
Prerequisites¶
- Alerting Rules (Topic Pack, L2)
Related Content¶
- Alerting Rules (Topic Pack, L2) — Alerting Rules, Prometheus
- Alerting Rules Drills (Drill, L2) — Alerting Rules, Prometheus
- Runbook: Alert Storm (Flapping / Too Many Alerts) (Runbook, L2) — Alerting Rules, Prometheus
- Adversarial Interview Gauntlet (30 sequences) (Scenario, L2) — Prometheus
- Alerting Flashcards (CLI) (flashcard_deck, L1) — Alerting Rules
- Capacity Planning (Topic Pack, L2) — Prometheus
- Case Study: Alert Storm — Flapping Health Checks (Case Study, L2) — Alerting Rules
- Case Study: Disk Full — Runaway Logs, Fix Is Loki Retention (Case Study, L2) — Prometheus
- Case Study: Grafana Dashboard Empty — Prometheus Blocked by NetworkPolicy (Case Study, L2) — Prometheus
- Datadog Flashcards (CLI) (flashcard_deck, L1) — Prometheus
Pages that link here¶
- Adversarial Interview Gauntlet
- Alerting Rules
- Alerting Rules Drills
- Capacity Planning
- Log Analysis & Alerting Rules (PromQL / LogQL) - Primer
- Observability Domain
- On-Call
- Runbook: Alert Storm (Flapping / Too Many Alerts)
- Scenario: Prometheus Says Target Down
- Symptoms: Alert Storm, Caused by Flapping Health Checks, Fix Is Probe Tuning
- Symptoms: Disk Full Alert, Cause Is Runaway Logs, Fix Is Loki Retention
- Symptoms: Grafana Dashboard Empty, Prometheus Scrape Blocked by NetworkPolicy