Observability Footguns
Mistakes that give you false confidence, alert fatigue, or blind spots during incidents.
1. Alerting on symptoms you've already handled
You alert on "pod restarted." Kubernetes already restarts crashlooping pods. You get paged, look at it, the pod restarted, it's fine now. This happens 20 times a day. You stop reading alerts. Then a real problem pages you and you ignore it.
Fix: Alert on customer impact, not infrastructure events. Alert on error rate, latency, and availability — not on pod restarts or node CPU.
Remember: Google's SRE book defines the "Four Golden Signals" for monitoring: latency, traffic, errors, and saturation. Alert on these, not on infrastructure events. A pod restart is an infrastructure event. A spike in p99 latency is a customer-impacting signal.
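As a sketch, a symptom-based rule in Prometheus alerting syntax might look like the following (the metric name http_requests_total and the 1% threshold are illustrative, not prescribed by this section):

```yaml
groups:
  - name: customer-impact
    rules:
      # Pages on customer-visible error rate, not on pod restarts.
      - alert: HighErrorRate
        expr: |
          sum(rate(http_requests_total{status=~"5.."}[5m]))
            / sum(rate(http_requests_total[5m])) > 0.01
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "Error rate above 1% for 5 minutes"
```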
2. No for clause on alerts
Your alert fires the instant a metric crosses a threshold. A single-second CPU spike at 3am pages your on-call. The spike was normal — a cron job starting up.
Fix: Always use for: 5m or longer. This means the condition must be true for 5 continuous minutes before alerting. Eliminates transient spikes.
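A minimal before/after sketch (the recording-rule name instance:cpu_utilization:ratio is hypothetical):

```yaml
# Without "for": pages on a single scrape above the threshold.
- alert: HighCPU
  expr: instance:cpu_utilization:ratio > 0.9

# With "for: 5m": the condition must hold for 5 continuous minutes,
# so a one-off spike from a cron job starting never pages anyone.
- alert: HighCPU
  expr: instance:cpu_utilization:ratio > 0.9
  for: 5m
```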
3. rate() over too short a window
You use rate(http_requests_total[30s]). With a 15-second scrape interval, you only have 2 data points. The rate calculation is noisy and unreliable. Dashboards show wild oscillations.
Fix: Use rate() over at least 4x your scrape interval. For a 15s scrape interval, use rate(metric[1m]) minimum. 5m is a safe default.
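For example, with a 15-second scrape interval:

```promql
# Too short: a 30s window holds ~2 samples, so the rate oscillates wildly.
rate(http_requests_total[30s])

# Minimum: window >= 4x the scrape interval.
rate(http_requests_total[1m])

# Safe default for dashboards and alerts.
rate(http_requests_total[5m])
```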
4. Cardinality explosion
You add a label user_id to your HTTP metrics. You have 1 million users. You now have 1 million time series per metric. Prometheus OOMs. Grafana times out. Your monitoring is down when you need it most.
Fix: Never use unbounded labels (user IDs, request IDs, IP addresses, full URLs). Aggregate in the application. Use logs for high-cardinality data, not metrics.
Under the hood: Prometheus stores each unique combination of metric name + label values as a separate time series. With 10 metrics and a user_id label across 1M users, you have 10M time series. Each series consumes ~1-3KB of RAM in Prometheus. At 10M series, Prometheus needs 10-30GB of RAM just for the TSDB head block. The prometheus_tsdb_head_series metric shows your current count.
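One way to keep cardinality bounded is to normalize label values in the application before they ever reach the metrics library. A minimal sketch in Go (the helper names and the route pattern are invented for illustration, not from any specific library):

```go
// Sketch: bound label cardinality by normalizing values in the
// application. Helper names and routes are illustrative.
package main

import (
	"fmt"
	"strings"
)

// statusClass collapses ~60 possible status codes into 5 label values.
func statusClass(code int) string {
	return fmt.Sprintf("%dxx", code/100)
}

// routeTemplate collapses unbounded per-user paths into one route label.
// In a real app the HTTP router usually provides the matched pattern.
func routeTemplate(path string) string {
	if strings.HasPrefix(path, "/users/") && len(path) > len("/users/") {
		return "/users/:id"
	}
	return path
}

func main() {
	fmt.Println(statusClass(503))           // 5xx
	fmt.Println(routeTemplate("/users/42")) // /users/:id
}
```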
5. Dashboards that show averages instead of percentiles
Your dashboard shows "average latency: 50ms." Looks great. But p99 is 5 seconds. 1% of your users are having a terrible experience and your dashboard doesn't show it.
Fix: Always display p50, p95, p99 percentiles alongside averages. Use histograms, not summaries, for latency metrics (histograms are aggregatable).
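In PromQL, assuming a conventional http_request_duration_seconds histogram (an illustrative metric name), the two views look like this:

```promql
# The average, which hides the tail:
sum(rate(http_request_duration_seconds_sum[5m]))
  / sum(rate(http_request_duration_seconds_count[5m]))

# p99 from histogram buckets, aggregatable across instances:
histogram_quantile(0.99,
  sum by (le) (rate(http_request_duration_seconds_bucket[5m])))
```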
6. Missing absent() alert
Your critical service stops exposing metrics. Maybe it crashed, maybe the scrape config is wrong. Your error rate alert evaluates to "no data" — and "no data" never fires. The service is down and no one knows.
Fix: Add absent() alerts for critical metrics: absent(up{job="api"}) == 1. This fires when the metric itself disappears.
Gotcha: absent() only works for metrics that should always exist. For metrics that appear intermittently (like error counters that only increment on errors), absent() will fire constantly. Use absent_over_time(up{job="api"}[5m]) to detect metrics that have been missing for a sustained period.
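A sketch of such a rule, reusing the expression from the fix above (the job label and timings are illustrative):

```yaml
- alert: ApiMetricsAbsent
  # Fires when the up metric for the job disappears entirely -
  # crashed service, deleted target, or broken scrape config.
  expr: absent(up{job="api"}) == 1
  for: 5m
  labels:
    severity: critical
  annotations:
    summary: "No metrics from job=api - target down or scrape broken"
```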
7. Logs with no structure
You log fmt.Println("processing request for user"). Good luck finding this in 500GB of logs when you need to filter by user, status code, or latency. You can't aggregate, you can't alert on log content.
Fix: Use structured logging (JSON). Include: timestamp, level, message, request_id, user_id, status_code, duration_ms. Every log line should be machine-parseable.
8. Grafana dashboard with 100 panels
You add every metric you can think of to one dashboard. It takes 30 seconds to load. During an incident, you scroll through 100 panels looking for the one that matters. You find it 15 minutes into the outage.
Fix: Build focused dashboards: Golden Signals (rate, errors, latency, saturation) on one dashboard. Drill-down dashboards for specific services. Use dashboard variables for filtering.
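A focused Golden Signals dashboard can often be driven by four queries. These are illustrative sketches assuming common metric names (cAdvisor and kube-state-metrics for saturation):

```promql
# Traffic
sum(rate(http_requests_total[5m]))

# Errors
sum(rate(http_requests_total{status=~"5.."}[5m]))

# Latency (p99)
histogram_quantile(0.99,
  sum by (le) (rate(http_request_duration_seconds_bucket[5m])))

# Saturation: CPU usage vs. requested limits (hypothetical ratio)
sum(rate(container_cpu_usage_seconds_total[5m]))
  / sum(kube_pod_container_resource_limits{resource="cpu"})
```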
9. Prometheus scraping too frequently
You set scrape_interval: 5s on all targets to get "real-time" data. Your Prometheus now makes 12 requests per minute per target. With 500 targets, that's 6000 scrapes per minute. Prometheus is overloaded, scrapes start timing out, and your metrics have gaps.
Fix: 15-30s scrape interval is fine for most workloads. Only reduce for high-frequency trading or similar. Monitor prometheus_target_scrape_pool_exceeded_target_limit and scrape_duration_seconds.
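A sketch of the corresponding prometheus.yml (job names and targets are made up):

```yaml
global:
  scrape_interval: 15s    # safe default for most workloads
  scrape_timeout: 10s     # must stay below the interval

scrape_configs:
  - job_name: api
    # inherits the 15s global interval
    static_configs:
      - targets: ["api:9100"]
  - job_name: trading-engine
    scrape_interval: 5s   # per-job override only where truly needed
    static_configs:
      - targets: ["trading:9100"]
```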
10. Alert routing that sends everything to one channel
All alerts go to #alerts in Slack. Critical database alerts are buried between informational disk warnings and resolved notifications. The channel has 500 unread messages. Nobody reads it.
Fix: Route by severity. Critical → PagerDuty. Warning → dedicated Slack channel. Info → suppressed or batched daily. Use group_by and repeat_interval to reduce noise.
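A minimal Alertmanager routing sketch along these lines (receiver names are placeholders; real receivers need pagerduty_configs or slack_configs blocks):

```yaml
route:
  receiver: slack-warnings          # default for anything unmatched
  group_by: [alertname, service]
  repeat_interval: 4h
  routes:
    - matchers: ['severity="critical"']
      receiver: pagerduty
    - matchers: ['severity="info"']
      receiver: "null"              # suppressed entirely

receivers:
  - name: pagerduty                 # would carry pagerduty_configs
  - name: slack-warnings            # would carry slack_configs
  - name: "null"                    # no configs = dropped
```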
11. Retention too short for trend analysis
You set Prometheus retention to 7 days to save disk. You can't do week-over-week comparisons, can't see seasonal patterns, can't answer "has this been degrading over the past month?"
Fix: Use recording rules for long-term aggregates. Ship to a long-term store (Thanos, Cortex, Mimir) with 1-year retention for aggregated data. Keep raw data for 15-30 days.
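A recording-rule sketch for such a long-term aggregate (rule and job names are illustrative): the precomputed series is cheap enough for a long-term store to retain for a year.

```yaml
groups:
  - name: long-term-aggregates
    interval: 1m
    rules:
      # One series per job instead of raw per-instance samples.
      - record: job:http_requests:rate5m
        expr: sum by (job) (rate(http_requests_total[5m]))
```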
Debug clue: prometheus_tsdb_retention_limit_bytes and prometheus_tsdb_storage_blocks_bytes show how much disk Prometheus is using. If blocks_bytes is at or near retention_limit_bytes, old data is being dropped. Use promtool tsdb analyze /path/to/data to see block-level detail, including time ranges and series counts per block.