Anti-Primer: eBPF Observability

Everything that can go wrong, will — and in this story, it does.

The Setup

An SRE team is setting up eBPF Observability for a distributed system serving 10,000 requests per second. The launch is in 3 days and monitoring is the last item on the checklist. The team copies configurations from a much smaller service.

The Timeline

Hour 0: Alert Fatigue from Noisy Thresholds

The team sets alert thresholds too low, generating hundreds of alerts per day. The deadline was looming, and this seemed like the fastest path forward. The result: the on-call engineer starts ignoring alerts, and a real incident goes unnoticed among the noise for 45 minutes.

Footgun #1: Alert Fatigue from Noisy Thresholds — thresholds set too low generate hundreds of alerts per day; the on-call engineer tunes them out, and a real incident is missed among the noise for 45 minutes.
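The primer's fix, multi-window burn-rate alerting, can be sketched in a few lines. This is a minimal illustration, not any specific vendor's implementation; the 14.4x threshold and window sizes are conventional fast-burn values assumed for the example.

```python
# Illustrative sketch of multi-window burn-rate alerting.
# Burn rate = observed error rate / error rate the SLO's budget allows.
# A page fires only when BOTH a long and a short window burn hot,
# so brief spikes do not wake anyone up.

SLO_TARGET = 0.999             # 99.9% availability
ERROR_BUDGET = 1 - SLO_TARGET  # 0.1% of requests may fail

def burn_rate(error_rate: float) -> float:
    """How many times faster than allowed the budget is being spent."""
    return error_rate / ERROR_BUDGET

def should_page(long_window_error_rate: float,
                short_window_error_rate: float,
                threshold: float = 14.4) -> bool:
    """Page only if both (e.g. 1h and 5m) windows exceed the threshold.

    Burning at 14.4x for one hour consumes ~2% of a 30-day budget,
    a commonly used fast-burn paging threshold.
    """
    return (burn_rate(long_window_error_rate) > threshold and
            burn_rate(short_window_error_rate) > threshold)

# A brief 5-minute spike alone does not page:
print(should_page(long_window_error_rate=0.002,
                  short_window_error_rate=0.05))   # False
# Sustained high burn in both windows does:
print(should_page(long_window_error_rate=0.02,
                  short_window_error_rate=0.03))   # True
```

Requiring agreement between two windows is what kills the noise: a static low threshold fires on every blip, while the two-window condition only fires when the budget is genuinely being spent fast.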

Nobody notices yet. The engineer moves on to the next task.

Hour 1: No Monitoring of Monitoring

The team sets up all the dashboards but never monitors the monitoring system itself. Under time pressure, they chose speed over caution. The result: the monitoring system goes down silently, and a production incident produces zero alerts.

Footgun #2: No Monitoring of Monitoring — with no health checks on the monitoring stack itself, it goes down silently and a production incident produces zero alerts.
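The dead man's switch the postmortem recommends inverts the usual logic: the monitoring stack emits a heartbeat that always fires, and a small external checker pages when the heartbeat stops arriving. A minimal sketch, with illustrative names and intervals:

```python
# Illustrative dead man's switch for a monitoring stack.
# The stack sends a heartbeat every HEARTBEAT_INTERVAL seconds to an
# EXTERNAL checker (it must live outside the stack it watches).
# Silence beyond a few missed beats means the monitor itself is down.

HEARTBEAT_INTERVAL = 60       # expected heartbeat period, seconds
MISSED_BEATS_BEFORE_PAGE = 3  # tolerate brief hiccups before paging

def monitoring_is_down(last_heartbeat_ts: float, now: float) -> bool:
    """True if the monitoring stack has missed too many heartbeats."""
    silence = now - last_heartbeat_ts
    return silence > HEARTBEAT_INTERVAL * MISSED_BEATS_BEFORE_PAGE

# Heartbeat seen 30 seconds ago: healthy.
print(monitoring_is_down(last_heartbeat_ts=1000.0, now=1030.0))  # False
# No heartbeat for 5 minutes: page a human -- the monitor is blind.
print(monitoring_is_down(last_heartbeat_ts=1000.0, now=1300.0))  # True
```

The key design choice is that the checker runs outside the monitored stack; a watchdog inside the stack dies with it and defeats the purpose.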

The first mistake is still invisible, making the next shortcut feel justified.

Hour 2: Cardinality Explosion

The team adds high-cardinality labels (user ID, request ID) to metrics. Nobody pushed back because the shortcut looked harmless in the moment. The result: metric storage fills up within days, queries time out, and dashboards become unusable.

Footgun #3: Cardinality Explosion — high-cardinality labels (user ID, request ID) multiply the number of time series; metric storage fills up within days, queries time out, and dashboards become unusable.
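The damage is easy to see with back-of-the-envelope arithmetic: each label multiplies the number of distinct time series. The figures below are illustrative, not taken from the incident.

```python
# Each label's cardinality multiplies the series count for a metric.
def series_count(label_cardinalities: list[int]) -> int:
    total = 1
    for c in label_cardinalities:
        total *= c
    return total

# Bounded labels: endpoint (10) x status class (5) x region (4)
print(series_count([10, 5, 4]))            # 200 series -- fine

# Add a user_id label with, say, 100,000 active users, and the store
# must now track 20 million series for this ONE metric:
print(series_count([10, 5, 4, 100_000]))   # 20000000 series
```

This is why the primer's rule is categorical: unbounded values belong in logs or traces, where each event is stored once, not in metric labels, where each distinct value spawns a series that lives forever.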

Pressure is mounting. The team is behind schedule and cutting more corners.

Hour 3: Dashboard Without Context

The team creates dashboards that show raw numbers without baselines or annotations. They had gotten away with similar shortcuts before, so nobody raised a flag. The result: during an incident, nobody can tell whether a graph shows normal or abnormal behavior.

Footgun #4: Dashboard Without Context — dashboards show raw numbers with no baselines or annotations, so during an incident nobody can tell whether a graph shows normal or abnormal behavior.
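A baseline band is cheap to compute from historical samples. A minimal sketch of the idea, using a mean-plus-k-standard-deviations band (the sample data and the choice of k=3 are assumptions for illustration):

```python
# Illustrative baseline band for a dashboard panel: derive a "normal"
# range from last week's samples so responders can see at a glance
# whether the current value is abnormal.
import statistics

def baseline_band(historical_samples: list[float],
                  k: float = 3.0) -> tuple[float, float]:
    """Return (low, high) = mean +/- k population standard deviations."""
    mean = statistics.mean(historical_samples)
    stdev = statistics.pstdev(historical_samples)
    return mean - k * stdev, mean + k * stdev

last_week_rps = [100, 105, 98, 102, 99, 101, 103]
low, high = baseline_band(last_week_rps)

print(low <= 101 <= high)  # True  -- within the band: normal
print(low <= 250 <= high)  # False -- far outside: clearly abnormal
```

Rendered as a shaded band behind the live series, plus deployment markers and the SLO target line, this removes the "is this number bad?" guesswork during an incident.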

By hour 3, the compounding failures have reached critical mass. Pages fire. The war room fills up. The team scrambles to understand what went wrong while the system burns.

The Postmortem

Root Cause Chain

1. Alert Fatigue from Noisy Thresholds
   Consequence: On-call engineer ignores alerts; a real incident is missed among the noise for 45 minutes.
   Prevented by (primer): Tune thresholds based on baseline data; use multi-window burn-rate alerting.

2. No Monitoring of Monitoring
   Consequence: Monitoring system goes down silently; a production incident produces zero alerts.
   Prevented by (primer): Dead man's switch alerts; external health check for the monitoring stack.

3. Cardinality Explosion
   Consequence: Metric storage fills up in days; queries time out; dashboards become unusable.
   Prevented by (primer): Never use unbounded values as metric labels; use logs for high-cardinality data.

4. Dashboard Without Context
   Consequence: During an incident, nobody can tell if the graph shows normal or abnormal behavior.
   Prevented by (primer): Include baseline bands, deployment markers, and SLO targets on all dashboards.

Damage Report

  • Downtime: Monitoring blind spot lasting 2-12 hours
  • Data loss: None directly, but undetected incidents may cause downstream data loss
  • Customer impact: If a real incident occurs during the blind spot, customer impact goes undetected
  • Engineering time to remediate: 8-16 engineer-hours to restore monitoring and backfill gaps
  • Reputation cost: On-call team loses confidence in alerting; incident response times degrade

What the Primer Teaches

  • Footgun #1: Had the engineer read the primer's section on alert fatigue from noisy thresholds, they would have learned to tune thresholds based on baseline data and to use multi-window burn-rate alerting.
  • Footgun #2: Had the engineer read the primer's section on monitoring the monitoring, they would have learned to set up dead man's switch alerts and an external health check for the monitoring stack.
  • Footgun #3: Had the engineer read the primer's section on cardinality explosion, they would have learned never to use unbounded values as metric labels and to use logs for high-cardinality data.
  • Footgun #4: Had the engineer read the primer's section on dashboard context, they would have learned to include baseline bands, deployment markers, and SLO targets on all dashboards.

Cross-References