Alerting Rules Footguns
Mistakes that cause alert fatigue, missed incidents, or pages that wake you up for nothing.
1. Alerting on infrastructure instead of customer impact
You alert on "node CPU > 80%." CPU hits 85% at 3am from a batch job. You get paged. No customer impact. This happens 3 times a week. Eventually you stop responding to alerts. Then a real outage happens and you miss it.
Fix: Alert on error rate, latency, and availability — things customers feel. Monitor infrastructure metrics for dashboards and capacity planning, not paging.
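A symptom-based paging rule might be sketched like this; the http_requests_total metric, job label, and 1% threshold are placeholder assumptions, not a prescription:

```yaml
groups:
  - name: customer-impact
    rules:
      - alert: HighErrorRate
        # Page on what customers feel: the fraction of 5xx responses.
        expr: |
          sum(rate(http_requests_total{job="api", code=~"5.."}[5m]))
            / sum(rate(http_requests_total{job="api"}[5m])) > 0.01
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "API 5xx ratio above 1% for 5 minutes"
```

Node CPU still belongs on a dashboard and in capacity reports; it just never pages anyone.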
2. No for: duration — alerting on blips
Your alert fires immediately when the condition is true. A 1-second latency spike pages you. By the time you look, everything is normal.
Fix: Always set a for: duration — for: 5m or longer. The condition must stay continuously true for the whole duration before the alert fires: 5m for critical, 15m for warning.
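Sketched as a rule (the latency histogram name is assumed):

```yaml
- alert: HighP99Latency
  expr: |
    histogram_quantile(0.99,
      sum by (le) (rate(http_request_duration_seconds_bucket[5m]))) > 1
  for: 5m   # condition must hold continuously for 5 minutes before firing
  labels:
    severity: critical
```

A one-second blip clears long before the 5-minute clock runs out, so it never pages.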
3. Alerting on rate() with insufficient data
Your error rate alert: rate(errors[1m]) / rate(total[1m]) > 0.01. During low traffic (2 requests per minute), a single failed request gives an error ratio of 0.5 — 50x your threshold. You get paged because 1 out of 2 requests failed.
Fix: Add a minimum traffic threshold: rate(errors[5m]) / rate(total[5m]) > 0.01 and rate(total[5m]) > 1 (PromQL's and keyword is lowercase; the guard requires at least one request per second). Don't alert on error ratios when traffic is negligible.
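As a full rule (errors_total and requests_total are placeholder names for the errors and total series above):

```yaml
- alert: HighErrorRatio
  expr: |
    rate(errors_total[5m]) / rate(requests_total[5m]) > 0.01
      and rate(requests_total[5m]) > 1   # guard: more than 1 req/sec
  for: 5m
  labels:
    severity: critical
```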
4. Missing absent() for critical metrics
Your application crashes and stops emitting metrics. Your error rate alert shows "no data" — which is neither true nor false. The alert doesn't fire. Nobody knows the service is down.
Fix: Add an absent() alert for every critical metric: absent(up{job="api"} == 1). This fires when the metric itself disappears.
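A minimal absence rule, assuming a job named api:

```yaml
- alert: ApiTargetMissing
  # Fires when no up{job="api"} series with value 1 exists:
  # either the target vanished from service discovery
  # or every scrape is failing.
  expr: absent(up{job="api"} == 1)
  for: 5m
  labels:
    severity: critical
```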
5. Every alert goes to the same channel
All alerts route to #alerts in Slack. Critical database alerts are buried between resolved pod restart notifications. The channel is permanently unread.
Fix: Route by severity: Critical → PagerDuty (immediate response), Warning → team Slack channel, Info → suppressed or weekly digest. Use group_by to batch related alerts.
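A severity-based routing tree might be sketched like this; receiver names are placeholders and their notification configs are omitted (a receiver with no config acts as a blackhole):

```yaml
route:
  receiver: team-slack            # default: warnings land here
  group_by: ['alertname', 'cluster']
  routes:
    - matchers: ['severity="critical"']
      receiver: pagerduty
    - matchers: ['severity="info"']
      receiver: blackhole         # suppressed
receivers:
  - name: pagerduty
  - name: team-slack
  - name: blackhole
```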
6. Recording rule that hides the real metric
You create a recording rule job:http_requests:rate5m and use it in alerts and dashboards. Someone changes the recording rule expression. Now your alert threshold means something different, but the alert name and description haven't changed.
Fix: Document recording rule expressions. Version them. Alert on the raw metric when possible, use recording rules for dashboard performance. If alerting on a recording rule, keep it simple.
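A recording rule kept simple and documented, as a sketch:

```yaml
groups:
  - name: recording-rules
    rules:
      # Per-job HTTP request rate over 5m.
      # NOTE: alerts and dashboards depend on this exact meaning;
      # changing the expression silently changes every threshold
      # built on top of it, so version this file and review changes.
      - record: job:http_requests:rate5m
        expr: sum by (job) (rate(http_requests_total[5m]))
```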
7. Alert that can never resolve
Your alert: changes(kube_pod_container_status_restarts_total[1h]) > 5. A pod restarts 6 times and the alert fires. The pod stabilizes. But the changes() function looks at a 1-hour window. The alert stays active for an hour after the issue is fixed. Your on-call stares at a resolved problem.
Fix: Choose alert expressions that naturally resolve when the problem is fixed. Use rate() instead of changes() or increase() with long windows. Test alert resolution, not just firing.
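A restart alert that resolves on its own, sketched with increase() over a short window:

```yaml
- alert: PodRestartingFrequently
  # increase() over 15m falls back to 0 within 15 minutes of the
  # last restart, so the alert resolves once the pod stabilizes
  # instead of staying active for an hour.
  expr: increase(kube_pod_container_status_restarts_total[15m]) > 3
  for: 10m
  labels:
    severity: warning
```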
8. Alertmanager group_wait too long
You set group_wait: 5m to batch alerts. A critical outage fires an alert. Alertmanager waits 5 minutes to see if more alerts arrive before sending. You get paged 5 minutes late.
Fix: Set low group_wait for critical alerts (30s). Use longer group_wait for warning/info. Different route configurations can have different timing.
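Per-route timing, sketched (receiver names are placeholders):

```yaml
route:
  receiver: team-slack
  group_wait: 5m                  # batching is fine for warnings
  routes:
    - matchers: ['severity="critical"']
      receiver: pagerduty
      group_wait: 30s             # page fast
      group_interval: 5m
      repeat_interval: 1h
```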
9. Silencing all alerts for maintenance and forgetting
You silence all alerts for a 2-hour maintenance window. Maintenance finishes in 30 minutes. You forget to expire the silence early. A real incident happens during the remaining 90 minutes. No alerts.
Fix: Set silences with the minimum necessary duration. Use the createdBy and comment fields. Review active silences after maintenance. Alert if silences exceed their expected duration.
10. predict_linear() with bad extrapolation
You alert: predict_linear(node_filesystem_avail_bytes[1h], 4*3600) < 0. A log burst fills disk quickly for 10 minutes, then stops. The linear extrapolation predicts disk full in 4 hours. You get paged. But the burst is over and disk usage is stable.
Fix: Use longer lookback windows for prediction (6h, 24h). Combine with current utilization: predict_linear(...) < 0 and node_filesystem_avail_bytes < threshold. predict_linear() is a hint, not a certainty.
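The guarded prediction as a rule (the filesystem label matchers and the 20%-free guard are assumptions; tune them to your environment):

```yaml
- alert: DiskWillFillSoon
  expr: |
    predict_linear(node_filesystem_avail_bytes{fstype!~"tmpfs"}[6h], 4*3600) < 0
      and node_filesystem_avail_bytes{fstype!~"tmpfs"}
        / node_filesystem_size_bytes{fstype!~"tmpfs"} < 0.2
  for: 30m
  labels:
    severity: warning
```

The 6h lookback smooths out short bursts, and the utilization guard keeps a stable, mostly-empty disk from paging anyone.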