Thinking Out Loud: Alerting Rules¶
A senior SRE's internal monologue while working through a real alerting task. This isn't a tutorial — it's a window into how experienced engineers actually think.
The Situation¶
The on-call rotation is burning out. The team received 47 pages last week, but only 3 required actual intervention. I've been asked to fix the alerting — reduce noise while keeping the alerts that matter. The alerts are in Prometheus/Alertmanager.
The Monologue¶
47 alerts, 3 actionable. That's a 94% false positive rate. This is the #1 cause of on-call burnout and alert fatigue. When everything pages, nothing pages — people start ignoring alerts, and eventually a real incident gets missed.
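Quick sanity check on that number — a throwaway calculation, not part of the fix:

```shell
# 44 of last week's 47 pages required no action
awk 'BEGIN { printf "%.1f%% noise\n", (47 - 3) / 47 * 100 }'
```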
Let me start by categorizing what fired last week.
```shell
curl -s 'http://alertmanager:9093/api/v2/alerts?silenced=false&inhibited=false' | jq -r '.[].labels.alertname' | sort | uniq -c | sort -rn | head -15
```
Wait, that shows current alerts. I need historical data. Let me check the alert history from Prometheus.
```shell
curl -s 'http://prometheus:9090/api/v1/query?query=ALERTS{alertstate="firing"}' | jq '.data.result | length'
```
12 currently firing. Let me look at the specific alert rules and their recent firing patterns.
```shell
curl -s 'http://prometheus:9090/api/v1/rules?type=alert' | jq '.data.groups[].rules[] | select(.state=="firing") | {name: .name, state: .state, duration: .duration, severity: .labels.severity}'
```
I see the usual suspects: HighCPUUsage (threshold too low), PodRestartCount (no window filter), DiskUsageHigh (fires on EBS burst balance changes), HighLatency (no burn-rate awareness). Let me fix these one by one.
Mental Model: The Alerting Quality Framework¶
Every alert should pass three tests: (1) Is it actionable? Can someone DO something when it fires? (2) Is it urgent? Does it need attention NOW or can it wait for business hours? (3) Is it real? Does it represent actual user impact, not just a metric crossing a threshold? If an alert fails any of these tests, it shouldn't page.
Let me pull up the actual alert rules.
Okay, here's the first offender:
```yaml
- alert: HighCPUUsage
  expr: avg by (instance) (rate(node_cpu_seconds_total{mode="idle"}[5m])) * 100 < 20
  for: 5m
  labels:
    severity: critical
```
CPU usage > 80% for 5 minutes = page. This is wrong in multiple ways. First, CPU at 80% is not an emergency — it's normal for a well-utilized cluster. Second, 5 minutes is too short — a traffic spike can cause a transient 80% and resolve on its own. Third, this doesn't distinguish between web servers (where high CPU = slow responses) and batch workers (where high CPU = working as designed).
Let me rewrite this:
```yaml
- alert: HighCPUSaturation
  expr: |
    (
      1 - avg by (instance) (rate(node_cpu_seconds_total{mode="idle"}[5m]))
    ) > 0.95
  for: 15m
  labels:
    severity: warning
  annotations:
    summary: "CPU saturation above 95% for 15 minutes on {{ $labels.instance }}"
    runbook: "https://wiki.internal/runbooks/high-cpu"
```
Key changes: threshold raised to 95%, window extended to 15 minutes, severity downgraded from critical to warning (high CPU alone doesn't warrant a page — it needs to be combined with user-visible impact to be critical).
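If CPU saturation ever does need to page, one way to express "combined with user-visible impact" is a single rule joining both conditions — a sketch only; the latency metric and the 1-second threshold here are assumptions, not from this cluster:

```yaml
# Sketch: page only when CPU saturation coincides with elevated p99 latency.
# "and on ()" keeps the instance-level CPU series if any latency violation exists.
- alert: CPUSaturationWithUserImpact
  expr: |
    (
      1 - avg by (instance) (rate(node_cpu_seconds_total{mode="idle"}[5m])) > 0.95
    )
    and on ()
    (
      histogram_quantile(0.99, sum by (le) (rate(http_request_duration_seconds_bucket[5m]))) > 1
    )
  for: 10m
  labels:
    severity: critical
```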
Next offender:
```yaml
- alert: PodRestartCount
  expr: kube_pod_container_status_restarts_total > 5
  for: 1m
  labels:
    severity: critical
```
This fires when the TOTAL restart count exceeds 5, ever. So any pod that has restarted 5 times since it was created — even if those restarts were spread over weeks — triggers a critical page. This is the most common alerting mistake in Kubernetes.
Mental Model: Rate of Change, Not Absolute Value¶
Alert on the rate of change, not absolute counters. A pod with 100 restarts over 6 months is stable. A pod with 3 restarts in 5 minutes is in a crash loop. The counter value is meaningless without a time window. Use `increase()` or `rate()` to measure the rate of change.
```yaml
- alert: PodCrashLooping
  expr: |
    increase(kube_pod_container_status_restarts_total[1h]) > 3
  for: 10m
  labels:
    severity: warning
  annotations:
    summary: "Pod {{ $labels.pod }} restarted {{ $value }} times in the last hour"
```
Now it alerts when a pod has restarted more than 3 times in the last hour, sustained for 10 minutes. This catches actual crash loops while ignoring normal startup restarts and one-off crashes.
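Worth locking in with a unit test before shipping. A sketch using `promtool test rules` — the rule file name, pod label, and timings are assumptions for illustration:

```yaml
# pod_crashloop_test.yml -- run with: promtool test rules pod_crashloop_test.yml
# Assumes the PodCrashLooping rule above lives in alerts.yml (made-up filename).
rule_files:
  - alerts.yml
evaluation_interval: 1m
tests:
  - interval: 1m
    input_series:
      # One restart per minute: an unmistakable crash loop
      - series: 'kube_pod_container_status_restarts_total{pod="api-7f9c"}'
        values: '0+1x60'
    alert_rule_test:
      - eval_time: 30m
        alertname: PodCrashLooping
        exp_alerts:
          - exp_labels:
              severity: warning
              pod: api-7f9c
```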
Third offender: HighLatency.
```yaml
- alert: HighLatency
  expr: histogram_quantile(0.99, rate(http_request_duration_seconds_bucket[5m])) > 1
  for: 5m
  labels:
    severity: critical
```
P99 latency > 1 second for 5 minutes = page. The problem is that this doesn't account for the error budget. A brief spike in p99 that doesn't burn through a meaningful amount of your SLO is not an emergency. I need to use a burn-rate approach.
Mental Model: Multiwindow, Multi-Burn-Rate Alerts¶
Instead of alerting on raw metric thresholds, alert on SLO burn rate. If your SLO is 99.9% requests under 500ms, calculate how fast you're burning your error budget. A 14x burn rate for 1 hour = you'll exhaust the monthly budget in ~2 days. A 2x burn rate for 6 hours = concerning but not urgent. This approach automatically scales: small blips don't page, sustained degradation does.
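The arithmetic behind that claim is just the budget window divided by the burn rate (14.4x, the conventional fast-burn threshold, rounds to the 14x quoted above):

```shell
# Days until a 30-day error budget is fully spent at a given burn rate
for burn in 1 2 14.4; do
  awk -v b="$burn" 'BEGIN { printf "burn %gx -> budget gone in %.1f days\n", b, 30 / b }'
done
```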
```yaml
- alert: LatencyBudgetBurn
  expr: |
    (
      sum(rate(http_request_duration_seconds_bucket{le="0.5"}[1h]))
      /
      sum(rate(http_request_duration_seconds_count[1h]))
    ) < 0.99
    and
    (
      sum(rate(http_request_duration_seconds_bucket{le="0.5"}[5m]))
      /
      sum(rate(http_request_duration_seconds_count[5m]))
    ) < 0.99
  for: 2m
  labels:
    severity: critical
  annotations:
    summary: "Latency SLO violation: fast burn detected"
```

Note the `le="0.5"` selector belongs on the `_bucket` series — the `_count` series has no `le` label, so filtering it that way silently matches nothing.
This uses a multiwindow approach: both the 1-hour and 5-minute windows must show SLO violation before it fires. The 1-hour window prevents transient spikes from paging. The 5-minute window ensures the problem is still active (not a past event).
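The fast-burn rule wants a slow-burn companion at warning severity. A sketch with 6h/30m windows per the usual multiwindow recipe; the thresholds are assumptions tied to the same 500ms target:

```yaml
# Sketch: slow burn -- sustained low-grade SLO erosion. Warning severity,
# so it lands in Slack during business hours instead of paging.
- alert: LatencyBudgetSlowBurn
  expr: |
    (
      sum(rate(http_request_duration_seconds_bucket{le="0.5"}[6h]))
      /
      sum(rate(http_request_duration_seconds_count[6h]))
    ) < 0.99
    and
    (
      sum(rate(http_request_duration_seconds_bucket{le="0.5"}[30m]))
      /
      sum(rate(http_request_duration_seconds_count[30m]))
    ) < 0.99
  for: 15m
  labels:
    severity: warning
  annotations:
    summary: "Latency SLO violation: slow burn, investigate during business hours"
```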
Let me also fix the alerting architecture. Right now everything is severity: critical. We need proper severity levels.
The Alertmanager config routes everything to PagerDuty. No severity-based routing. Let me fix this.
```yaml
route:
  receiver: 'slack-warnings'
  group_wait: 30s
  group_interval: 5m
  repeat_interval: 4h
  routes:
    - match:
        severity: critical
      receiver: 'pagerduty-critical'
      repeat_interval: 15m
    - match:
        severity: warning
      receiver: 'slack-warnings'
      repeat_interval: 4h
      group_wait: 5m
```
Warnings go to Slack, critical goes to PagerDuty. This alone will dramatically reduce pages because most of the current alerts should be warnings, not critical.
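While I'm in this file, inhibition is worth a few lines: silence a warning once its critical sibling fires, so nobody chases two notifications for one incident. A sketch — it assumes related alerts share a `service` label, which may not hold in this cluster:

```yaml
# Sketch: a firing critical alert suppresses warning-level notifications
# for the same service, so the page is the only active signal.
inhibit_rules:
  - source_match:
      severity: critical
    target_match:
      severity: warning
    # Only inhibit when both alerts are about the same thing (assumed label)
    equal: ['service']
```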
Let me apply all these changes and verify the alert count drops.
```shell
kubectl apply -f /tmp/updated-prometheus-alerts.yaml
kubectl apply -f /tmp/updated-alertmanager-config.yaml
curl -XPOST http://prometheus:9090/-/reload
curl -XPOST http://alertmanager:9093/-/reload
```
After the reload, let me check what's firing now.
```shell
curl -s 'http://prometheus:9090/api/v1/rules?type=alert' | jq '[.data.groups[].rules[] | select(.state=="firing")] | length'
```
Down from 12 to 2 firing alerts. And those two are severity: warning, so they go to Slack, not PagerDuty. Zero current pages. Let me keep these changes running for a week and compare the page count.
What Made This Senior-Level¶
| Junior Would... | Senior Does... | Why |
|---|---|---|
| Lower alert thresholds to "catch more issues" | Raise thresholds, extend windows, and add burn-rate calculations | More alerts doesn't mean better monitoring — it means more noise |
| Alert on absolute counter values | Alert on rate of change using `increase()` and `rate()` | Counters only grow — the rate of change is what indicates a problem |
| Have one severity level (critical) | Implement severity-based routing: warnings to Slack, critical to PagerDuty | Not every alert needs to wake someone up at 3 AM |
| Write alerts based on metric thresholds | Write alerts based on SLO burn rate with multiwindow validation | SLO-based alerts naturally scale — transient spikes don't page, sustained violations do |
Key Heuristics Used¶
- Three-Test Filter: Every alert must be actionable (someone can do something), urgent (it needs attention now), and real (it reflects user impact). If it fails any test, it shouldn't page.
- Rate Over Absolute: Alert on rate of change, not absolute values. Counters and cumulative metrics need `rate()` or `increase()` to be meaningful.
- Multiwindow Burn Rate: Use both a long window (1h) to avoid transient noise and a short window (5m) to confirm the issue is ongoing. This eliminates the majority of false positives.
Cross-References¶
- Primer — PromQL fundamentals, alert rule syntax, and Alertmanager routing
- Street Ops — Alert rule debugging, silencing, and Alertmanager config patterns
- Footguns — Alerting on absolute counters, single-threshold alerts without burn rate, and everything-is-critical syndrome