
Alerting Rules Cheat Sheet

Remember: The for duration in an alert rule is the most important field for reducing false positives. Without it, a single bad scrape fires the alert. With for: 5m, the condition must be continuously true for 5 minutes. Mnemonic: for = "filter out random blips." Set it based on the alert's severity: critical alerts can have shorter for (1-5m), warnings longer (10-30m).
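A minimal sketch of severity-tiered `for` durations (alert names, jobs, and thresholds are illustrative, not from any standard mixin):

```yaml
# Hypothetical alerts showing severity-tiered `for` durations
- alert: APIDown
  expr: up{job="api"} == 0
  for: 2m                      # critical: short window, page quickly
  labels:
    severity: critical
- alert: APIElevatedErrors
  expr: |
    sum(rate(http_requests_total{job="api",code=~"5.."}[5m]))
    / sum(rate(http_requests_total{job="api"}[5m])) > 0.01
  for: 20m                     # warning: longer window, tolerate blips
  labels:
    severity: warning
```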

Alert Rule Anatomy

groups:
- name: my-alerts
  rules:
  - alert: AlertName              # Name of the alert
    expr: metric > threshold      # PromQL expression
    for: 5m                       # Must be true this long before firing
    labels:
      severity: warning           # Used for routing
    annotations:
      summary: "Human-readable"   # Shown in notifications
      description: "Details"
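Annotations support Prometheus's Go templating, so `{{ $labels.* }}` and `{{ $value }}` can make notifications self-describing. A sketch, assuming node_exporter's `node_cpu_seconds_total`:

```yaml
- alert: HighCPU
  expr: avg by(instance)(rate(node_cpu_seconds_total{mode!="idle"}[5m])) > 0.9
  for: 15m
  labels:
    severity: warning
  annotations:
    summary: "High CPU on {{ $labels.instance }}"
    description: "CPU has averaged {{ $value | humanizePercentage }} for 15 minutes."
```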

Essential PromQL Functions for Alerts

rate(counter[5m])          # Per-second rate of a counter
increase(counter[1h])      # Total increase over time window
histogram_quantile(0.99, rate(bucket[5m]))  # Percentile from histogram
absent(metric)             # Returns 1 if the metric has no series
predict_linear(gauge[6h], 4*3600)  # Value predicted 4h ahead, from 6h of history
changes(gauge[1h])         # Number of value changes
delta(gauge[1h])           # Difference between first and last
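These compose with aggregation, but order matters: apply rate() first, then aggregate, because rate() must see each counter's resets individually. A sketch (label names assumed):

```promql
sum by(job)(rate(http_requests_total[5m]))   # correct: rate, then sum
```

Summing counters before taking the rate masks individual counter resets and produces wrong spikes; avoid sum-then-rate.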

Common Alert Patterns

Error Rate

- alert: HighErrorRate
  expr: |
    sum(rate(http_requests_total{code=~"5.."}[5m]))
    / sum(rate(http_requests_total[5m])) > 0.05
  for: 5m

Latency

- alert: HighLatency
  expr: |
    histogram_quantile(0.99,
      sum by(le)(rate(http_request_duration_seconds_bucket[5m]))
    ) > 1
  for: 10m
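For a per-service p99 instead of a global one, add the grouping label alongside `le` — dropping `le` from the `sum by(...)` breaks histogram_quantile entirely. The `service` label here is an assumption about your instrumentation:

```yaml
- alert: HighLatencyPerService
  expr: |
    histogram_quantile(0.99,
      sum by(le, service)(rate(http_request_duration_seconds_bucket[5m]))
    ) > 1
  for: 10m
```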

Target Down

- alert: TargetDown
  expr: absent(up{job="myapp"} == 1)
  for: 5m

Disk Prediction

- alert: DiskFullIn4Hours
  expr: predict_linear(node_filesystem_avail_bytes[6h], 4*3600) < 0
  for: 30m
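In practice you usually exclude pseudo-filesystems and require the disk to already be fairly full, so the linear fit doesn't page on noisy, mostly-empty volumes. A sketch; the `fstype` matcher values are assumptions about node_exporter labels in your environment:

```yaml
- alert: DiskFullIn4Hours
  expr: |
    predict_linear(node_filesystem_avail_bytes{fstype!~"tmpfs|overlay"}[6h], 4*3600) < 0
    and node_filesystem_avail_bytes{fstype!~"tmpfs|overlay"}
        / node_filesystem_size_bytes < 0.2
  for: 30m
```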

Pod Restarts

- alert: PodCrashLooping
  expr: increase(kube_pod_container_status_restarts_total[1h]) > 3
  for: 10m

Node Pressure

- alert: NodeNotReady
  expr: kube_node_status_condition{condition="Ready",status="true"} == 0
  for: 5m

Certificate Expiry

- alert: CertExpiringSoon
  expr: certmanager_certificate_expiration_timestamp_seconds - time() < 14*24*3600
  for: 1h

Gotcha: absent() fires when a metric does not exist at all — but it also fires during scrape gaps, Prometheus restarts, or target downtime. Combine with for: 10m to avoid false pages during brief Prometheus maintenance. Also, absent() returns a single time series with no labels, so routing rules that depend on labels (like namespace) will not match.

Recording Rules (Performance)

groups:
- name: recording-rules
  rules:
  - record: http_requests:rate5m
    expr: sum(rate(http_requests_total[5m]))

  - record: http_requests:error_ratio5m
    expr: |
      sum(rate(http_requests_total{code=~"5.."}[5m]))
      / sum(rate(http_requests_total[5m]))

Use in alerts: http_requests:error_ratio5m > 0.05
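The recording rule then slots into an alert unchanged, for example:

```yaml
- alert: HighErrorRate
  expr: http_requests:error_ratio5m > 0.05
  for: 5m
  labels:
    severity: warning
```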

Alertmanager Routing

route:
  receiver: default
  group_by: [alertname, namespace]
  group_wait: 30s          # Wait to batch related alerts
  group_interval: 5m       # Between batches
  repeat_interval: 4h      # Resend if still firing
  routes:
  - matchers: [ 'severity = "critical"' ]   # `match:` is deprecated since v0.22; prefer matchers
    receiver: pagerduty
  - matchers: [ 'severity = "warning"' ]
    receiver: slack

Under the hood: Recording rules pre-compute expensive queries and store results as new time series. The naming convention level:metric:operations (e.g., job:http_requests:rate5m) is a Prometheus community standard: level = aggregation label, metric = base metric, operations = what was applied. Use recording rules whenever an alert query takes >1 second or is used in multiple places.

Silence vs Inhibit

| Feature | Silence                   | Inhibition                |
|---------|---------------------------|---------------------------|
| Type    | Manual, time-boxed        | Automatic, rule-based     |
| Use     | Maintenance windows       | Suppress cascading alerts |
| Config  | Alertmanager UI or amtool | alertmanager.yml          |
# Inhibition: suppress warnings when critical fires
inhibit_rules:
- source_matchers: [severity = critical]
  target_matchers: [severity = warning]
  equal: [namespace, alertname]
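For recurring maintenance windows, Alertmanager (v0.24+) also supports config-driven mutes via `time_intervals`, an alternative to creating silences by hand each time. A sketch; the interval name, team label, and schedule are illustrative:

```yaml
time_intervals:
- name: weekly-maintenance
  time_intervals:
  - weekdays: ['saturday']
    times:
    - start_time: '02:00'
      end_time: '04:00'

route:
  routes:
  - matchers: [ 'team = "platform"' ]
    mute_time_intervals: [weekly-maintenance]
    receiver: slack
```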

LogQL Alerts (Loki)

- alert: HighErrorLogs
  expr: |
    sum by(app)(rate({namespace="prod"} |= "error" [1m])) > 10
  for: 5m

- alert: OOMDetected
  expr: count_over_time({namespace="prod"} |= "OOMKilled" [5m]) > 0

Alert Quality Checklist

[ ] Has a clear, actionable summary
[ ] Has `for` duration to prevent flapping
[ ] Severity matches actual impact
[ ] Links to a runbook in annotations
[ ] Tested in staging before production
[ ] Won't fire during normal operation
[ ] Has a clear owner/team label
[ ] Recording rules used for expensive queries