Portal | Level: L2: Operations | Topics: Alerting Rules, Prometheus | Domain: Observability
Alerting Rules - Skill Check¶
Mental model (bottom-up)¶
Good alerts tell you about user-impacting problems, not infrastructure noise. Alert on symptoms (error rate, latency) not causes (CPU usage). Use recording rules for expensive queries, multi-window burn rate for SLO-based alerting, and inhibition to prevent alert storms. Every alert should have a runbook link.
Visual stack¶
[Prometheus ] evaluates rules every 15-30s
|
[Recording Rules ] pre-compute expensive queries
|
[Alert Rules ] expr + for duration → PENDING → FIRING
|
[Alertmanager ] group, route, dedupe, silence
|
[PagerDuty/Slack ] critical → page, warning → channel
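The stack above can be sketched as a minimal Prometheus rule file. This is a hedged example, not a canonical config: the job name, threshold, file path, and runbook URL are all placeholders.

```yaml
# rules/api-alerts.yml -- hypothetical file name
groups:
  - name: api-slo
    rules:
      - alert: HighErrorRate
        # symptom-based: ratio of 5xx responses to all responses
        expr: |
          sum(rate(http_requests_total{job="api", code=~"5.."}[5m]))
            / sum(rate(http_requests_total{job="api"}[5m])) > 0.05
        # condition must hold continuously for 5m: PENDING -> FIRING
        for: 5m
        labels:
          severity: critical        # Alertmanager routes this to the paging receiver
        annotations:
          summary: "API 5xx error rate above 5%"
          runbook_url: "https://wiki.example.com/runbooks/high-error-rate"
```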
Glossary¶
- recording rule - pre-computed PromQL query stored as a new time series
- for duration - how long a condition must be true before the alert fires (prevents flapping)
- burn rate - how fast the error budget is being consumed relative to the SLO window
- inhibition - automatically suppress child alerts when parent alert fires
- silence - manually mute a specific alert for a time window (maintenance)
- absent() - PromQL function that returns 1 when a metric doesn't exist at all
- predict_linear() - linear regression to predict a future value (disk-full alerts)
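Inhibition is configured in Alertmanager, not in the Prometheus rule files. A minimal sketch of an inhibition rule, assuming conventional severity labels (the equal labels are illustrative):

```yaml
# alertmanager.yml (fragment)
inhibit_rules:
  # when a critical alert fires, suppress matching warning-level alerts
  - source_matchers:
      - severity="critical"
    target_matchers:
      - severity="warning"
    # only inhibit when these labels match between source and target,
    # so a critical alert in one cluster doesn't mute warnings elsewhere
    equal: [alertname, cluster]
```

Silences, by contrast, are created at runtime (UI or amtool), not in this config file.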
Core questions (easy -> hard)¶
- What does for: 5m mean in an alert rule?
- The condition must be continuously true for 5 minutes before the alert fires. Prevents alerting on brief spikes.
- Why use recording rules?
- Pre-compute expensive queries (like multi-label rates). Faster alert evaluation, consistent values across dashboards.
- Write an alert for "5xx error rate above 5%."
- sum(rate(http_requests_total{code=~"5.."}[5m])) / sum(rate(http_requests_total[5m])) > 0.05
- What's wrong with alerting on CPU > 80%?
- It's a cause, not a symptom. High CPU doesn't always mean user impact. Alert on latency/errors instead.
- Explain multi-window burn rate alerting.
- Check both a long window (trend) and short window (still happening). 14.4x burn rate in 1h AND 5m = page. Prevents false positives from brief spikes.
- Silence vs inhibition?
- Silence: manual, time-boxed (maintenance). Inhibition: automatic rule-based (suppress warnings when critical fires).
- How do you alert when a metric disappears entirely?
- absent(up{job="myapp"} == 1) returns 1 when the series doesn't exist. Catches targets that stop reporting.
- How do you predict disk will be full in 4 hours?
- predict_linear(node_filesystem_avail_bytes[6h], 4*3600) < 0
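The recording-rule and burn-rate answers above can be combined into one sketch, assuming a 99.9% availability SLO (error budget 0.001) and the common 14.4x fast-burn threshold for a 1h/5m window pair; rule names are illustrative:

```yaml
groups:
  - name: slo-burn-rate
    rules:
      # recording rules pre-compute the error ratio at two window lengths
      - record: job:slo_errors:ratio_rate5m
        expr: |
          sum(rate(http_requests_total{code=~"5.."}[5m]))
            / sum(rate(http_requests_total[5m]))
      - record: job:slo_errors:ratio_rate1h
        expr: |
          sum(rate(http_requests_total{code=~"5.."}[1h]))
            / sum(rate(http_requests_total[1h]))
      # page only when BOTH windows burn: the long window shows the trend,
      # the short window confirms it is still happening
      - alert: ErrorBudgetFastBurn
        expr: |
          job:slo_errors:ratio_rate1h > (14.4 * 0.001)
            and
          job:slo_errors:ratio_rate5m > (14.4 * 0.001)
        labels:
          severity: critical
```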
Wiki Navigation¶
Prerequisites¶
- Alerting Rules (Topic Pack, L2)
Related Content¶
- Alerting Rules (Topic Pack, L2) — Alerting Rules, Prometheus
- Alerting Rules Drills (Drill, L2) — Alerting Rules, Prometheus
- Runbook: Alert Storm (Flapping / Too Many Alerts) (Runbook, L2) — Alerting Rules, Prometheus
- Adversarial Interview Gauntlet (30 sequences) (Scenario, L2) — Prometheus
- Alerting Flashcards (CLI) (flashcard_deck, L1) — Alerting Rules
- Capacity Planning (Topic Pack, L2) — Prometheus
- Case Study: Alert Storm — Flapping Health Checks (Case Study, L2) — Alerting Rules
- Case Study: Disk Full — Runaway Logs, Fix Is Loki Retention (Case Study, L2) — Prometheus
- Case Study: Grafana Dashboard Empty — Prometheus Blocked by NetworkPolicy (Case Study, L2) — Prometheus
- Datadog Flashcards (CLI) (flashcard_deck, L1) — Prometheus
Pages that link here¶
- Adversarial Interview Gauntlet
- Alerting Rules
- Alerting Rules Drills
- Capacity Planning
- Log Analysis & Alerting Rules (PromQL / LogQL) - Primer
- Observability Domain
- On-Call
- Runbook: Alert Storm (Flapping / Too Many Alerts)
- Scenario: Prometheus Says Target Down
- Symptoms: Alert Storm, Caused by Flapping Health Checks, Fix Is Probe Tuning
- Symptoms: Disk Full Alert, Cause Is Runaway Logs, Fix Is Loki Retention
- Symptoms: Grafana Dashboard Empty, Prometheus Scrape Blocked by NetworkPolicy