
Thinking Out Loud: Alerting Rules

A senior SRE's internal monologue while working through a real alerting task. This isn't a tutorial — it's a window into how experienced engineers actually think.

The Situation

The on-call rotation is burning out. The team received 47 pages last week, but only 3 required actual intervention. I've been asked to fix the alerting — reduce noise while keeping the alerts that matter. The alerts are in Prometheus/Alertmanager.

The Monologue

47 alerts, 3 actionable. That's a 94% false positive rate. This is the #1 cause of on-call burnout and alert fatigue. When everything pages, nothing pages — people start ignoring alerts, and eventually a real incident gets missed.

Let me start by categorizing what fired last week.

curl -s 'http://alertmanager:9093/api/v2/alerts?silenced=false&inhibited=false' | jq '.[].labels.alertname' | sort | uniq -c | sort -rn | head -15

Wait, that shows current alerts. I need historical data. Let me check the alert history from Prometheus.

curl -s 'http://prometheus:9090/api/v1/query?query=ALERTS{alertstate="firing"}' | jq '.data.result | length'

12 currently firing. Let me look at the specific alert rules and their recent firing patterns.

curl -s 'http://prometheus:9090/api/v1/rules?type=alert' | jq '.data.groups[].rules[] | select(.state=="firing") | {name: .name, state: .state, duration: .duration, labels: .labels.severity}'

I see the usual suspects: HighCPUUsage (threshold too low), PodRestartCount (no window filter), DiskUsageHigh (fires on EBS burst balance changes), HighLatency (no burn-rate awareness). Let me fix these one by one.

Mental Model: The Alerting Quality Framework

Every alert should pass three tests: (1) Is it actionable? Can someone DO something when it fires? (2) Is it urgent? Does it need attention NOW or can it wait for business hours? (3) Is it real? Does it represent actual user impact, not just a metric crossing a threshold? If an alert fails any of these tests, it shouldn't page.

Let me pull up the actual alert rules.

kubectl get configmap prometheus-alerts -n monitoring -o yaml | head -100

Okay, here's the first offender:

- alert: HighCPUUsage
  expr: 100 - (avg by (instance) (rate(node_cpu_seconds_total{mode="idle"}[5m])) * 100) > 80
  for: 5m
  labels:
    severity: critical

CPU usage > 80% for 5 minutes = page. This is wrong in multiple ways. First, CPU at 80% is not an emergency — it's normal for a well-utilized cluster. Second, 5 minutes is too short — a traffic spike can cause a transient 80% and resolve on its own. Third, this doesn't distinguish between web servers (where high CPU = slow responses) and batch workers (where high CPU = working as designed).

Let me rewrite this:

- alert: HighCPUSaturation
  expr: |
    (
      1 - avg by (instance) (rate(node_cpu_seconds_total{mode="idle"}[5m]))
    ) > 0.95
  for: 15m
  labels:
    severity: warning
  annotations:
    summary: "CPU saturation above 95% for 15 minutes on {{ $labels.instance }}"
    runbook: "https://wiki.internal/runbooks/high-cpu"

Key changes: threshold raised to 95%, window extended to 15 minutes, severity downgraded from critical to warning (high CPU alone doesn't warrant a page — it needs to be combined with user-visible impact to be critical).
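Before shipping a rewrite like this, it's worth pinning its behavior down with a unit test. promtool (bundled with Prometheus) can evaluate alert rules against synthetic series. A sketch, assuming the rewritten rule lives in a file named alerts.yaml (both file names here are hypothetical):

```yaml
# cpu-alert-test.yaml
rule_files:
  - alerts.yaml          # hypothetical path to the rewritten rules
evaluation_interval: 1m

tests:
  - interval: 1m
    input_series:
      # Idle counter frozen at 0: the node is 100% busy the whole time.
      - series: 'node_cpu_seconds_total{mode="idle",instance="node1"}'
        values: '0+0x30'
    alert_rule_test:
      # By 20m the saturation has held above 95% past the 15m "for" clause.
      - eval_time: 20m
        alertname: HighCPUSaturation
        exp_alerts:
          - exp_labels:
              severity: warning
              instance: node1
```

Run it with promtool test rules cpu-alert-test.yaml; a passing test means the threshold, window, and "for" clause interact the way the comment claims.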

Next offender:

- alert: PodRestartCount
  expr: kube_pod_container_status_restarts_total > 5
  for: 1m
  labels:
    severity: critical

This fires when the TOTAL restart count exceeds 5, ever. So any pod that has restarted 5 times since it was created — even if those restarts were spread over weeks — triggers a critical page. This is one of the most common alerting mistakes in Kubernetes.

Mental Model: Rate of Change, Not Absolute Value

Alert on the rate of change, not absolute counters. A pod with 100 restarts over 6 months is stable. A pod with 3 restarts in 5 minutes is in a crash loop. The counter value is meaningless without a time window. Use increase() or rate() to measure the rate of change.
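A quick way to convince yourself: model increase() by hand. This Python sketch uses made-up sample data and a simplified increase() (last value in window minus first, ignoring Prometheus's extrapolation) to show why the raw counter value is meaningless:

```python
# Each sample is (timestamp_seconds, restart_count). Counters only increase.
# Simplified increase(): last value in the window minus the first.
def increase(samples, start, end):
    window = [v for t, v in samples if start <= t <= end]
    return window[-1] - window[0] if len(window) >= 2 else 0.0

HOUR = 3600

# Pod A: 100 restarts accumulated over ~6 months, one every ~1.8 days.
pod_a = [(i * 1.8 * 24 * HOUR, i) for i in range(101)]

# Pod B: only 12 restarts total, but 6 of them in the last hour.
pod_b = [(i * 30 * 24 * HOUR, i) for i in range(7)]      # old monthly restarts
last = pod_b[-1][0]
pod_b += [(last + i * 600, 6 + i) for i in range(1, 7)]  # 6 restarts in 1 hour

end_a = pod_a[-1][0]
end_b = pod_b[-1][0]
print(increase(pod_a, end_a - HOUR, end_a))  # 0.0: stable despite count=100
print(increase(pod_b, end_b - HOUR, end_b))  # 6: this pod is crash looping
```

Pod A would trip the old `> 5` rule forever despite being healthy; the windowed view pages only for Pod B.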

- alert: PodCrashLooping
  expr: |
    increase(kube_pod_container_status_restarts_total[1h]) > 3
  for: 10m
  labels:
    severity: warning
  annotations:
    summary: "Pod {{ $labels.pod }} restarted {{ $value }} times in the last hour"

Now it alerts when a pod has restarted more than 3 times in the last hour, sustained for 10 minutes. This catches actual crash loops while ignoring normal startup restarts and one-off crashes.

Third offender: HighLatency.

- alert: HighLatency
  expr: histogram_quantile(0.99, rate(http_request_duration_seconds_bucket[5m])) > 1
  for: 5m
  labels:
    severity: critical

P99 latency > 1 second for 5 minutes = page. The problem is that this doesn't account for the error budget. A brief spike in p99 that doesn't burn through a meaningful amount of your SLO is not an emergency. I need to use a burn-rate approach.

Mental Model: Multiwindow, Multi-Burn-Rate Alerts

Instead of alerting on raw metric thresholds, alert on SLO burn rate. If your SLO is 99.9% of requests under 500ms, calculate how fast you're burning your error budget. A 14x burn rate for 1 hour = you'll exhaust the monthly budget in ~2 days. A 2x burn rate for 6 hours = concerning but not urgent. This approach automatically scales: small blips don't page, sustained degradation does.
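The budget arithmetic is worth making concrete. A small Python sketch, assuming a 30-day budget window and a 99.9% SLO (both illustrative):

```python
# Burn rate = observed error rate divided by the allowed error rate.
# At burn rate N, the budget for a window lasts window / N.
def burn_rate(error_rate, slo=0.999):
    return error_rate / (1 - slo)

def days_to_exhaustion(burn, budget_window_days=30):
    return budget_window_days / burn

print(burn_rate(0.01))                      # 1% errors vs a 0.1% budget: 10x
print(days_to_exhaustion(14))               # ~2.1 days left: page now
print(days_to_exhaustion(burn_rate(0.01)))  # ~3 days left: also page-worthy
print(days_to_exhaustion(2))                # 15 days left: a ticket, not a page
```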

- alert: LatencyBudgetBurn
  expr: |
    (
      sum(rate(http_request_duration_seconds_bucket{le="0.5"}[1h]))
      /
      sum(rate(http_request_duration_seconds_count[1h]))
    ) < 0.99
    and
    (
      sum(rate(http_request_duration_seconds_bucket{le="0.5"}[5m]))
      /
      sum(rate(http_request_duration_seconds_count[5m]))
    ) < 0.99
  for: 2m
  labels:
    severity: critical
  annotations:
    summary: "Latency SLO violation: fast burn detected"

This uses a multiwindow approach: both the 1-hour and 5-minute windows must show SLO violation before it fires. The 1-hour window prevents transient spikes from paging. The 5-minute window ensures the problem is still active (not a past event). The threshold encodes the burn rate, too: with a 99.9% SLO, a success ratio below 99% means the error rate is more than ten times the 0.1% budget, which is what makes this a fast-burn signal rather than an ordinary threshold check.
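The critical rule only covers fast burn. The same pattern extends to a slow-burn tier that can route to Slack instead of PagerDuty. A sketch following the commonly used 6h/30m window pairing, where 0.994 encodes a 6x burn of a 99.9% SLO's 0.1% budget (the alert name and summary text are illustrative):

```yaml
- alert: LatencyBudgetSlowBurn
  expr: |
    (
      sum(rate(http_request_duration_seconds_bucket{le="0.5"}[6h]))
      /
      sum(rate(http_request_duration_seconds_count[6h]))
    ) < 0.994
    and
    (
      sum(rate(http_request_duration_seconds_bucket{le="0.5"}[30m]))
      /
      sum(rate(http_request_duration_seconds_count[30m]))
    ) < 0.994
  for: 15m
  labels:
    severity: warning
  annotations:
    summary: "Latency SLO slow burn: budget exhausts in days, investigate during business hours"
```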

Let me also fix the alerting architecture. Right now everything is severity: critical. We need proper severity levels.

kubectl get configmap alertmanager-config -n monitoring -o yaml | head -60

The Alertmanager config routes everything to PagerDuty. No severity-based routing. Let me fix this.

route:
  receiver: 'slack-warnings'
  group_wait: 30s
  group_interval: 5m
  repeat_interval: 4h
  routes:
    - match:
        severity: critical
      receiver: 'pagerduty-critical'
      repeat_interval: 15m
    - match:
        severity: warning
      receiver: 'slack-warnings'
      repeat_interval: 4h
      group_wait: 5m

Warnings go to Slack, critical goes to PagerDuty. This alone will dramatically reduce pages because most of the current alerts should be warnings, not critical.
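One thing the route tree doesn't show: the receivers it names still have to be defined in the same config. A minimal sketch of the matching receivers block, with placeholder credentials and a hypothetical channel name:

```yaml
receivers:
  - name: 'pagerduty-critical'
    pagerduty_configs:
      - routing_key: '<pagerduty-integration-key>'  # placeholder (Events API v2)
  - name: 'slack-warnings'
    slack_configs:
      - api_url: '<slack-incoming-webhook-url>'     # placeholder
        channel: '#oncall-warnings'                 # hypothetical channel
```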

Let me apply all these changes and verify the alert count drops. (The /-/reload endpoints below only work if Prometheus is running with --web.enable-lifecycle.)

kubectl apply -f /tmp/updated-prometheus-alerts.yaml
kubectl apply -f /tmp/updated-alertmanager-config.yaml
curl -XPOST http://prometheus:9090/-/reload
curl -XPOST http://alertmanager:9093/-/reload

After the reload, let me check what's firing now.

curl -s 'http://prometheus:9090/api/v1/rules?type=alert' | jq '[.data.groups[].rules[] | select(.state=="firing")] | length'

Down from 12 to 2 firing alerts. And those two are severity: warning, so they go to Slack, not PagerDuty. Zero current pages. Let me keep these changes running for a week and compare the page count.

What Made This Senior-Level

| Junior Would... | Senior Does... | Why |
| --- | --- | --- |
| Lower alert thresholds to "catch more issues" | Raise thresholds, extend windows, and add burn-rate calculations | More alerts doesn't mean better monitoring — it means more noise |
| Alert on absolute counter values | Alert on rate of change using increase() and rate() | Counters only grow — the rate of change is what indicates a problem |
| Have one severity level (critical) | Implement severity-based routing: warnings to Slack, critical to PagerDuty | Not every alert needs to wake someone up at 3 AM |
| Write alerts based on metric thresholds | Write alerts based on SLO burn rate with multiwindow validation | SLO-based alerts naturally scale — transient spikes don't page, sustained violations do |

Key Heuristics Used

  1. Three-Test Filter: Every alert must be actionable (someone can do something), urgent (it needs attention now), and real (it reflects user impact). If it fails any test, it shouldn't page.
  2. Rate Over Absolute: Alert on rate of change, not absolute values. Counters and cumulative metrics need rate() or increase() to be meaningful.
  3. Multiwindow Burn Rate: Use both a long window (1h) to avoid transient noise and a short window (5m) to confirm the issue is ongoing. This eliminates the majority of false positives.

Cross-References

  • Primer — PromQL fundamentals, alert rule syntax, and Alertmanager routing
  • Street Ops — Alert rule debugging, silencing, and Alertmanager config patterns
  • Footguns — Alerting on absolute counters, single-threshold alerts without burn rate, and everything-is-critical syndrome