Portal | Level: L2: Operations | Topics: Alerting Rules, Prometheus | Domain: Observability
Alerting Rules Drills¶
Remember: Good alerts have three properties: Actionable (someone can do something about it), Relevant (it signals real user impact), Contextualized (includes enough info to start debugging). If an alert fires and the on-call shrugs and silences it, the alert is noise, not signal. Mnemonic: "ARC" — every alert needs an Action, Relevance, and Context.
Gotcha: A `for: 5m` clause in a Prometheus alert means the expression must be continuously true for 5 minutes before firing. If the issue flaps (true-false-true), the timer resets each time the expression goes false. For flapping metrics, use `avg_over_time()` or `count_over_time()` to smooth the signal instead of relying on `for`.

Default trap: Prometheus evaluates alert rules at the `evaluation_interval` (default 1m). With `for: 5m`, the time from issue onset to firing can approach `evaluation_interval + for`, about 6 minutes, before notification delays in Alertmanager are added. If you need faster detection, reduce the evaluation interval for critical alert groups.
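As a sketch of the smoothing approach (the `up{job="api"}` selector and thresholds are illustrative assumptions):

```yaml
# Flap-prone: the `for` timer resets every time the target briefly recovers
- alert: TargetDown
  expr: up{job="api"} == 0
  for: 5m

# Smoothed: fires when the target was down for more than 20% of the last
# 10 minutes, regardless of intermediate recoveries
- alert: TargetFlappingOrDown
  expr: avg_over_time(up{job="api"}[10m]) < 0.8
  for: 5m
```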
Drill 1: Basic Rate Alert¶
Difficulty: Easy
Q: Write a Prometheus alert rule that fires when the HTTP 5xx error rate exceeds 5% for 5 minutes.
Answer
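One possible rule; the `http_requests_total` metric and `code` label follow common client-library conventions and are assumptions:

```yaml
- alert: HighErrorRate
  expr: |
    sum(rate(http_requests_total{code=~"5.."}[5m]))
    /
    sum(rate(http_requests_total[5m]))
    > 0.05
  for: 5m
  labels:
    severity: critical
  annotations:
    summary: "5xx error rate above 5% ({{ $value | humanizePercentage }})"
```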
The `for: 5m` clause means the expression must remain true for 5 minutes of consecutive evaluations before the alert fires. This prevents paging on brief spikes.

Drill 2: Absent Metric Alert¶
Difficulty: Easy
Q: Write an alert that fires when a target stops reporting metrics entirely (no data, not just zero).
Answer
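A minimal sketch, assuming the target exposes the standard `up` series for a hypothetical `job="api"`:

```yaml
- alert: MetricsAbsent
  expr: absent(up{job="api"})
  for: 5m
  labels:
    severity: critical
  annotations:
    summary: "No metrics received for job 'api' for 5 minutes"
```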
`absent()` returns 1 when the time series doesn't exist at all. This catches:
- The target is down
- Service discovery lost the target
- A network issue is preventing scraping

Note: `absent()` returns an empty result when the series exists (even if its value is 0).

Drill 3: Disk Full Prediction¶
Difficulty: Medium
Q: Write an alert that predicts when a disk will be full within 4 hours based on the current fill rate.
Answer
- alert: DiskWillFillIn4Hours
expr: |
predict_linear(
node_filesystem_avail_bytes{fstype!~"tmpfs|overlay"}[6h],
4 * 3600
) < 0
for: 30m
labels:
severity: warning
annotations:
summary: "Disk {{ $labels.mountpoint }} on {{ $labels.instance }} will be full within 4 hours"
description: "Current available: {{ $value | humanize1024 }}B"
`predict_linear()` fits a linear regression over the 6-hour window and extrapolates 4 hours (14400 s) into the future; the alert fires when the extrapolated available bytes drop below zero. The `tmpfs|overlay` exclusion filters out noisy virtual filesystems.
Drill 4: Pod Restart Alert¶
Difficulty: Easy
Q: Alert when any pod has restarted more than 3 times in the last hour.
Answer
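A sketch using the kube-state-metrics restart counter (threshold and labels per the question):

```yaml
- alert: PodRestartingFrequently
  expr: increase(kube_pod_container_status_restarts_total[1h]) > 3
  labels:
    severity: warning
  annotations:
    summary: "Pod {{ $labels.namespace }}/{{ $labels.pod }} restarted {{ $value }} times in the last hour"
```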
Use `increase()` for counters when you want the total increase over a time range; `rate()` gives a per-second rate.

Drill 5: Latency P99 Alert¶
Difficulty: Medium
Q: Alert when p99 latency exceeds 1 second, using histogram metrics.
Answer
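One possible rule; the `http_request_duration_seconds` histogram name is the common convention and an assumption:

```yaml
- alert: HighP99Latency
  expr: |
    histogram_quantile(0.99,
      sum by (le) (rate(http_request_duration_seconds_bucket[5m]))
    ) > 1
  for: 5m
  labels:
    severity: warning
  annotations:
    summary: "p99 latency is {{ $value }}s (threshold: 1s)"
```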
`histogram_quantile(φ, rate(bucket[range]))`:
- `φ` = quantile (0.99 = 99th percentile)
- Must use `rate()` on the `_bucket` series, not raw values
- Group by `le` (the less-than-or-equal bucket boundary label)
- Use `sum by (le)` to aggregate across instances

Drill 6: Recording Rules for Performance¶
Difficulty: Medium
Q: The alerting query `sum(rate(http_requests_total[5m]))` is expensive and evaluated every 15s. How do you optimize it?
Answer
groups:
- name: http-recording-rules
interval: 30s
rules:
# Recording rule: precompute the rate
- record: http_requests:rate5m
expr: sum(rate(http_requests_total[5m]))
- record: http_requests:error_rate5m
expr: |
sum(rate(http_requests_total{code=~"5.."}[5m]))
/
sum(rate(http_requests_total[5m]))
- record: http_requests:burnrate5m
expr: |
1 - (
sum(rate(http_requests_total{code!~"5.."}[5m]))
/
sum(rate(http_requests_total[5m]))
)
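The alert rule then queries the cheap precomputed series instead of re-running the aggregation on every evaluation; a sketch (the 5% threshold is an assumption):

```yaml
- alert: HighErrorRate
  expr: http_requests:error_rate5m > 0.05
  for: 5m
  labels:
    severity: critical
```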
Drill 7: Silence vs Inhibit¶
Difficulty: Easy
Q: What's the difference between silencing and inhibiting alerts in Alertmanager?
Answer
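An inhibition rule implementing the namespace example below might look like this (label names are conventional and an assumption):

```yaml
# alertmanager.yml
inhibit_rules:
  - source_match:
      severity: critical
    target_match:
      severity: warning
    # Only inhibit when these labels match between source and target alerts
    equal: [namespace, alertname]
```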
**Silence**: Manually suppress a specific alert for a time window.

**Inhibition**: Automatically suppress alerts when a related higher-severity alert is firing. This means: "If a critical alert fires for namespace X, suppress all warning alerts for the same namespace and alertname."

Use cases:
- Don't page for slow responses when the service is already down
- Don't alert on pod restarts when the node is unreachable

Drill 8: Alertmanager Routing¶
Difficulty: Medium
Q: Configure Alertmanager to route critical alerts to PagerDuty and warnings to Slack, with team-based routing.
Answer
# alertmanager.yml
route:
receiver: default-slack
group_by: [alertname, namespace]
group_wait: 30s
group_interval: 5m
repeat_interval: 4h
routes:
- match:
severity: critical
receiver: pagerduty-oncall
repeat_interval: 5m
- match:
severity: warning
team: platform
receiver: slack-platform
- match:
severity: warning
team: backend
receiver: slack-backend
receivers:
- name: default-slack
slack_configs:
- channel: '#alerts-general'
api_url: https://hooks.slack.com/services/xxx
- name: pagerduty-oncall
pagerduty_configs:
- service_key: xxx
severity: critical
- name: slack-platform
slack_configs:
- channel: '#platform-alerts'
api_url: https://hooks.slack.com/services/xxx
- name: slack-backend
slack_configs:
- channel: '#backend-alerts'
api_url: https://hooks.slack.com/services/xxx
Routes are matched top-down and, by default, the first match wins; set `continue: true` on a route if an alert should also be tested against subsequent routes. Warnings with no matching `team` label fall through to `default-slack`.
Drill 9: Node Pressure Alerts¶
Difficulty: Medium
Q: Write alerts for node memory pressure, disk pressure, and PID pressure.
Answer
- alert: NodeMemoryPressure
expr: kube_node_status_condition{condition="MemoryPressure", status="true"} == 1
for: 5m
labels:
severity: warning
annotations:
summary: "Node {{ $labels.node }} has MemoryPressure"
- alert: NodeDiskPressure
expr: kube_node_status_condition{condition="DiskPressure", status="true"} == 1
for: 5m
labels:
severity: warning
annotations:
summary: "Node {{ $labels.node }} has DiskPressure"
- alert: NodePIDPressure
expr: kube_node_status_condition{condition="PIDPressure", status="true"} == 1
for: 5m
labels:
severity: warning
annotations:
summary: "Node {{ $labels.node }} has PIDPressure"
- alert: NodeNotReady
expr: kube_node_status_condition{condition="Ready", status="true"} == 0
for: 5m
labels:
severity: critical
annotations:
summary: "Node {{ $labels.node }} is not Ready"
Drill 10: LogQL Alert¶
Difficulty: Medium
Q: Write a Loki/LogQL alert rule that fires when more than 10 error logs per minute are seen for any service.
Answer
# Loki ruler config
groups:
- name: log-alerts
rules:
- alert: HighErrorLogRate
expr: |
sum by(app)(
count_over_time({namespace="production"} |= "error" | logfmt | level="error" [1m])
) > 10
for: 5m
labels:
severity: warning
annotations:
summary: "{{ $labels.app }} generating {{ $value }} error logs/min"
- alert: OOMKilledDetected
expr: |
count_over_time({namespace="production"} |= "OOMKilled" [5m]) > 0
for: 0m
labels:
severity: critical
annotations:
summary: "OOMKilled event detected in production"
Wiki Navigation¶
Prerequisites¶
- Alerting Rules (Topic Pack, L2)
Related Content¶
- Alerting Rules (Topic Pack, L2) — Alerting Rules, Prometheus
- Runbook: Alert Storm (Flapping / Too Many Alerts) (Runbook, L2) — Alerting Rules, Prometheus
- Skillcheck: Alerting Rules (Assessment, L2) — Alerting Rules, Prometheus
- Adversarial Interview Gauntlet (30 sequences) (Scenario, L2) — Prometheus
- Alerting Flashcards (CLI) (flashcard_deck, L1) — Alerting Rules
- Capacity Planning (Topic Pack, L2) — Prometheus
- Case Study: Alert Storm — Flapping Health Checks (Case Study, L2) — Alerting Rules
- Case Study: Disk Full — Runaway Logs, Fix Is Loki Retention (Case Study, L2) — Prometheus
- Case Study: Grafana Dashboard Empty — Prometheus Blocked by NetworkPolicy (Case Study, L2) — Prometheus
- Datadog Flashcards (CLI) (flashcard_deck, L1) — Prometheus
Pages that link here¶
- Adversarial Interview Gauntlet
- Alerting Rules
- Alerting Rules - Skill Check
- Capacity Planning
- Capacity Planning - Primer
- Drills
- Log Analysis & Alerting Rules (PromQL / LogQL) - Primer
- Monitoring Fundamentals - Primer
- Observability Drills
- On-Call
- OpenTelemetry - Primer
- Ops Archaeology: The Alerts That Stopped Firing
- Ops Archaeology: The Slow Death Nobody Noticed
- Primer
- PromQL Drills