Alerting Rules Cheat Sheet
Remember: The `for` duration in an alert rule is the most important field for reducing false positives. Without it, the alert fires the first time the expression evaluates to true, so a single bad scrape is enough. With `for: 5m`, the condition must be continuously true for 5 minutes. Mnemonic: `for` = "filter out random blips." Set it based on the alert's severity: critical alerts can have a shorter `for` (1-5m), warnings a longer one (10-30m).
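As a rough illustration of matching `for` to severity, the sketch below alerts on the same condition at two thresholds. The metric `queue_depth`, the thresholds, and the group name are hypothetical placeholders, not values from any real system.

```yaml
groups:
  - name: for-duration-example          # hypothetical group name
    rules:
      - alert: QueueDepthCritical
        expr: queue_depth > 10000       # hypothetical metric and threshold
        for: 2m                         # critical: short window so pages arrive quickly
        labels:
          severity: critical
      - alert: QueueDepthWarning
        expr: queue_depth > 1000        # lower threshold, less urgent
        for: 15m                        # warning: longer window filters out brief spikes
        labels:
          severity: warning
```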
Alert Rule Anatomy
groups:
  - name: my-alerts
    rules:
      - alert: AlertName                # Name of the alert
        expr: metric > threshold        # PromQL expression
        for: 5m                         # Must be true this long before firing
        labels:
          severity: warning             # Used for routing
        annotations:
          summary: "Human-readable"     # Shown in notifications
          description: "Details"
Essential PromQL Functions for Alerts
rate(counter[5m])                            # Per-second average rate of increase of a counter
increase(counter[1h])                        # Total increase over the time window (extrapolated)
histogram_quantile(0.99, rate(bucket[5m]))   # Estimated 99th percentile from histogram buckets
absent(metric)                               # Returns 1 if no series match the selector
predict_linear(gauge[6h], 4*3600)            # Value predicted 4h ahead, from linear regression over 6h
changes(gauge[1h])                           # Number of value changes in the window
delta(gauge[1h])                             # Difference between first and last value (gauges only)
Common Alert Patterns
Error Rate
- alert: HighErrorRate
  expr: |
    sum(rate(http_requests_total{code=~"5.."}[5m]))
      / sum(rate(http_requests_total[5m])) > 0.05
  for: 5m
Latency
- alert: HighLatency
  expr: |
    histogram_quantile(0.99,
      sum by(le)(rate(http_request_duration_seconds_bucket[5m]))
    ) > 1
  for: 10m
Target Down
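A common way to express this pattern, not necessarily the rule the author had in mind, is to alert on the built-in `up` metric, which Prometheus sets to 0 when a scrape fails. The `for` duration, severity, and summary text below are assumptions.

```yaml
- alert: TargetDown
  expr: up == 0                 # `up` is recorded by Prometheus for every scrape target
  for: 5m                       # assumed duration; tolerates a few failed scrapes
  labels:
    severity: critical          # assumed severity
  annotations:
    summary: "Target {{ $labels.instance }} of job {{ $labels.job }} is down"
```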
Disk Prediction
- alert: DiskFullIn4Hours
  expr: predict_linear(node_filesystem_avail_bytes[6h], 4*3600) < 0
  for: 30m
Pod Restarts
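One plausible sketch of this pattern, assuming kube-state-metrics is installed and exposes `kube_pod_container_status_restarts_total`. The alert name, the threshold of 3 restarts per hour, and the `for` duration are illustrative assumptions.

```yaml
- alert: PodRestartingFrequently            # hypothetical alert name
  expr: increase(kube_pod_container_status_restarts_total[1h]) > 3   # assumed threshold
  for: 10m                                  # assumed duration
  labels:
    severity: warning
  annotations:
    summary: "Container {{ $labels.container }} in {{ $labels.namespace }}/{{ $labels.pod }} is restarting frequently"
```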
Node Pressure
- alert: NodeNotReady
  expr: kube_node_status_condition{condition="Ready",status="true"} == 0
  for: 5m
Certificate Expiry
- alert: CertExpiringSoon
  expr: certmanager_certificate_expiration_timestamp_seconds - time() < 14*24*3600
  for: 1h
Gotcha: `absent()` fires when a metric does not exist at all, but it also fires during scrape gaps, Prometheus restarts, or target downtime. Combine with `for: 10m` to avoid false pages during brief Prometheus maintenance. Also, `absent()` returns a single time series that carries only the labels given as equality matchers in its selector (none for a bare metric name), so routing rules that depend on other labels (like `namespace`) will not match.
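A minimal sketch of an `absent()` alert that follows this advice. The job name `api` and the static `namespace: prod` label are hypothetical; the labels needed for routing are attached in the rule itself because the result of `absent()` will not supply them.

```yaml
- alert: ApiMetricsAbsent            # hypothetical alert name
  expr: absent(up{job="api"})        # hypothetical job; true when no matching series exists
  for: 10m                           # rides out brief scrape gaps and restarts
  labels:
    severity: critical
    namespace: prod                  # set statically so namespace-based routes still match
  annotations:
    summary: "No `up` series found for job api"
```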
Recording Rules (Performance)
groups:
  - name: recording-rules
    rules:
      - record: http_requests:rate5m
        expr: sum(rate(http_requests_total[5m]))
      - record: http_requests:error_ratio5m
        expr: |
          sum(rate(http_requests_total{code=~"5.."}[5m]))
            / sum(rate(http_requests_total[5m]))
Use in alerts: `http_requests:error_ratio5m > 0.05`
Alertmanager Routing
route:
  receiver: default
  group_by: [alertname, namespace]
  group_wait: 30s         # Wait to batch related alerts
  group_interval: 5m      # Between batches
  repeat_interval: 4h     # Resend if still firing
  routes:
    - match: { severity: critical }
      receiver: pagerduty
    - match: { severity: warning }
      receiver: slack
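Every receiver named in the routing tree must also be defined under `receivers:` in the same `alertmanager.yml`. The sketch below shows one possible shape. The integration key, webhook URL, and channel are placeholders, not working values.

```yaml
receivers:
  - name: default                      # no integrations: alerts routed here are not sent anywhere
  - name: pagerduty
    pagerduty_configs:
      - routing_key: "<pagerduty-integration-key>"    # placeholder (Events API v2 key)
  - name: slack
    slack_configs:
      - api_url: "https://hooks.slack.com/services/<placeholder>"   # placeholder webhook URL
        channel: "#alerts"             # hypothetical channel
```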
Under the hood: Recording rules pre-compute expensive queries and store the results as new time series. The naming convention `level:metric:operations` (e.g., `job:http_requests:rate5m`) is a Prometheus community standard: `level` = aggregation labels, `metric` = base metric, `operations` = what was applied. Use recording rules whenever an alert query takes >1 second or is used in multiple places.
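To make the naming convention concrete, here is a small sketch of recording rules whose names spell out the aggregation level. The group name is hypothetical, and the `job` and `code` labels are assumed to exist on `http_requests_total`.

```yaml
groups:
  - name: naming-convention-example     # hypothetical group name
    rules:
      # level = job, metric = http_requests (_total dropped after rate()), operations = rate5m
      - record: job:http_requests:rate5m
        expr: sum by (job) (rate(http_requests_total[5m]))
      # level = job_code: aggregated down to the job and code labels
      - record: job_code:http_requests:rate5m
        expr: sum by (job, code) (rate(http_requests_total[5m]))
```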
Silence vs Inhibit
| Feature | Silence | Inhibition |
|---|---|---|
| Type | Manual, time-boxed | Automatic, rule-based |
| Use | Maintenance windows | Suppress cascading alerts |
| Config | Alertmanager UI or amtool | alertmanager.yml |
# Inhibition: suppress warnings when critical fires
inhibit_rules:
  - source_matchers: [severity = critical]
    target_matchers: [severity = warning]
    equal: [namespace, alertname]
LogQL Alerts (Loki)
- alert: HighErrorLogs
  expr: |
    sum by(app)(rate({namespace="prod"} |= "error" [1m])) > 10
  for: 5m
- alert: OOMDetected
  expr: count_over_time({namespace="prod"} |= "OOMKilled" [5m]) > 0
Alert Quality Checklist
[ ] Has a clear, actionable summary
[ ] Has `for` duration to prevent flapping
[ ] Severity matches actual impact
[ ] Links to a runbook in annotations
[ ] Tested in staging before production
[ ] Won't fire during normal operation
[ ] Has a clear owner/team label
[ ] Recording rules used for expensive queries
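As a sketch of what a rule that ticks most of these boxes might look like: the group name, alert name, `team` label value, and runbook URL are hypothetical, and the expression assumes the `http_requests:error_ratio5m` recording rule defined earlier.

```yaml
groups:
  - name: checkout-alerts                          # hypothetical group name
    rules:
      - alert: CheckoutHighErrorRate               # hypothetical alert name
        expr: http_requests:error_ratio5m > 0.05   # recorded series instead of the raw query
        for: 5m                                    # prevents flapping
        labels:
          severity: critical                       # should match actual impact
          team: payments                           # hypothetical owner label
        annotations:
          summary: "More than 5% of HTTP requests are failing"
          description: "Current error ratio is {{ $value | humanizePercentage }}."
          runbook_url: "https://runbooks.example.com/checkout-high-error-rate"   # placeholder URL
```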