
Alerting Rules Cheat Sheet

Remember: The for duration in an alert rule is the most important field for reducing false positives. Without it, a single bad scrape fires the alert. With for: 5m, the condition must be continuously true for 5 minutes. Mnemonic: for = "filter out random blips." Set it based on the alert's severity: critical alerts can have shorter for (1-5m), warnings longer (10-30m).
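A minimal sketch of severity-tiered `for` durations (alert names, jobs, and thresholds are illustrative, not from any standard mixin):

```yaml
# Hypothetical alerts showing severity-tiered `for` durations
- alert: APIDown
  expr: up{job="api"} == 0
  for: 2m                      # critical: short window, page quickly
  labels:
    severity: critical
- alert: APIElevatedErrors
  expr: |
    sum(rate(http_requests_total{job="api",code=~"5.."}[5m]))
    / sum(rate(http_requests_total{job="api"}[5m])) > 0.01
  for: 20m                     # warning: longer window, tolerate blips
  labels:
    severity: warning
```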

Alert Rule Anatomy

groups:
- name: my-alerts
  rules:
  - alert: AlertName              # Name of the alert
    expr: metric > threshold      # PromQL expression
    for: 5m                       # Must be true this long before firing
    labels:
      severity: warning           # Used for routing
    annotations:
      summary: "Human-readable"   # Shown in notifications
      description: "Details"
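Annotations support Prometheus's Go templating, so `{{ $labels.* }}` and `{{ $value }}` can make notifications self-describing. A sketch, assuming node_exporter's `node_cpu_seconds_total`:

```yaml
- alert: HighCPU
  expr: avg by(instance)(rate(node_cpu_seconds_total{mode!="idle"}[5m])) > 0.9
  for: 15m
  labels:
    severity: warning
  annotations:
    summary: "High CPU on {{ $labels.instance }}"
    description: "CPU has averaged {{ $value | humanizePercentage }} for 15 minutes."
```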

Essential PromQL Functions for Alerts

rate(counter[5m])          # Per-second rate of a counter
increase(counter[1h])      # Total increase over time window
histogram_quantile(0.99, rate(bucket[5m]))  # Percentile from histogram
absent(metric)             # Returns 1 if the metric has no series
predict_linear(gauge[6h], 4*3600)  # Value predicted 4h ahead, from 6h of history
changes(gauge[1h])         # Number of value changes
delta(gauge[1h])           # Difference between first and last
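These compose with aggregation, but order matters: apply rate() first, then aggregate, because rate() must see each counter's resets individually. A sketch (label names assumed):

```promql
sum by(job)(rate(http_requests_total[5m]))   # correct: rate, then sum
```

Summing counters before taking the rate masks individual counter resets and produces wrong spikes; avoid sum-then-rate.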

Common Alert Patterns

Error Rate

- alert: HighErrorRate
  expr: |
    sum(rate(http_requests_total{code=~"5.."}[5m]))
    / sum(rate(http_requests_total[5m])) > 0.05
  for: 5m

Latency

- alert: HighLatency
  expr: |
    histogram_quantile(0.99,
      sum by(le)(rate(http_request_duration_seconds_bucket[5m]))
    ) > 1
  for: 10m
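For a per-service p99 instead of a global one, add the grouping label alongside `le` — dropping `le` from the `sum by(...)` breaks histogram_quantile entirely. The `service` label here is an assumption about your instrumentation:

```yaml
- alert: HighLatencyPerService
  expr: |
    histogram_quantile(0.99,
      sum by(le, service)(rate(http_request_duration_seconds_bucket[5m]))
    ) > 1
  for: 10m
```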

Target Down

- alert: TargetDown
  expr: absent(up{job="myapp"} == 1)
  for: 5m

Disk Prediction

- alert: DiskFullIn4Hours
  expr: predict_linear(node_filesystem_avail_bytes[6h], 4*3600) < 0
  for: 30m
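In practice you usually exclude pseudo-filesystems and require the disk to already be fairly full, so the linear fit doesn't page on noisy, mostly-empty volumes. A sketch; the `fstype` matcher values are assumptions about node_exporter labels in your environment:

```yaml
- alert: DiskFullIn4Hours
  expr: |
    predict_linear(node_filesystem_avail_bytes{fstype!~"tmpfs|overlay"}[6h], 4*3600) < 0
    and node_filesystem_avail_bytes{fstype!~"tmpfs|overlay"}
        / node_filesystem_size_bytes < 0.2
  for: 30m
```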

Pod Restarts

- alert: PodCrashLooping
  expr: increase(kube_pod_container_status_restarts_total[1h]) > 3
  for: 10m

Node Pressure

- alert: NodeNotReady
  expr: kube_node_status_condition{condition="Ready",status="true"} == 0
  for: 5m

Certificate Expiry

- alert: CertExpiringSoon
  expr: certmanager_certificate_expiration_timestamp_seconds - time() < 14*24*3600
  for: 1h

Gotcha: absent() fires when a metric does not exist at all — but it also fires during scrape gaps, Prometheus restarts, or target downtime. Combine with for: 10m to avoid false pages during brief Prometheus maintenance. Also, absent() returns a single time series with no labels, so routing rules that depend on labels (like namespace) will not match.

Recording Rules (Performance)

groups:
- name: recording-rules
  rules:
  - record: http_requests:rate5m
    expr: sum(rate(http_requests_total[5m]))

  - record: http_requests:error_ratio5m
    expr: |
      sum(rate(http_requests_total{code=~"5.."}[5m]))
      / sum(rate(http_requests_total[5m]))

Use in alerts: http_requests:error_ratio5m > 0.05
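The recording rule then slots into an alert unchanged, for example:

```yaml
- alert: HighErrorRate
  expr: http_requests:error_ratio5m > 0.05
  for: 5m
  labels:
    severity: warning
```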

Alertmanager Routing

route:
  receiver: default
  group_by: [alertname, namespace]
  group_wait: 30s          # Wait to batch related alerts
  group_interval: 5m       # Between batches
  repeat_interval: 4h      # Resend if still firing
  routes:
  - matchers: [ 'severity = "critical"' ]   # `match:` is deprecated since v0.22; prefer matchers
    receiver: pagerduty
  - matchers: [ 'severity = "warning"' ]
    receiver: slack

Under the hood: Recording rules pre-compute expensive queries and store results as new time series. The naming convention level:metric:operations (e.g., job:http_requests:rate5m) is a Prometheus community standard: level = aggregation label, metric = base metric, operations = what was applied. Use recording rules whenever an alert query takes >1 second or is used in multiple places.

Silence vs Inhibit

| Feature | Silence                   | Inhibition                |
|---------|---------------------------|---------------------------|
| Type    | Manual, time-boxed        | Automatic, rule-based     |
| Use     | Maintenance windows       | Suppress cascading alerts |
| Config  | Alertmanager UI or amtool | alertmanager.yml          |
# Inhibition: suppress warnings when critical fires
inhibit_rules:
- source_matchers: [severity = critical]
  target_matchers: [severity = warning]
  equal: [namespace, alertname]
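For recurring maintenance windows, Alertmanager (v0.24+) also supports config-driven mutes via `time_intervals`, an alternative to creating silences by hand each time. A sketch; the interval name, team label, and schedule are illustrative:

```yaml
time_intervals:
- name: weekly-maintenance
  time_intervals:
  - weekdays: ['saturday']
    times:
    - start_time: '02:00'
      end_time: '04:00'

route:
  routes:
  - matchers: [ 'team = "platform"' ]
    mute_time_intervals: [weekly-maintenance]
    receiver: slack
```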

LogQL Alerts (Loki)

- alert: HighErrorLogs
  expr: |
    sum by(app)(rate({namespace="prod"} |= "error" [1m])) > 10
  for: 5m

- alert: OOMDetected
  expr: count_over_time({namespace="prod"} |= "OOMKilled" [5m]) > 0

Alert Quality Checklist

[ ] Has a clear, actionable summary
[ ] Has `for` duration to prevent flapping
[ ] Severity matches actual impact
[ ] Links to a runbook in annotations
[ ] Tested in staging before production
[ ] Won't fire during normal operation
[ ] Has a clear owner/team label
[ ] Recording rules used for expensive queries