The Monitoring That Lied
- lesson
- metric-lag
- counter-resets
- percentile-math
- stale-scrapes
- dashboard-design
- l2

Topics: metric lag, counter resets, percentile math, stale scrapes, dashboard design
Level: L2 (Operations)
Time: 45–60 minutes
Prerequisites: Basic Prometheus/monitoring familiarity helpful
The Mission¶
The dashboard is green. CPU at 30%. Memory at 60%. Error rate 0.1%. p99 latency 150ms. Everything looks fine.
But users are complaining. Orders are failing. The support queue is growing. You stare at the dashboard. It stares back. All green.
The monitoring is lying. Not maliciously — but the metrics you're looking at don't represent what users are experiencing. This lesson covers the ways monitoring deceives you and how to build dashboards you can actually trust.
Lie 1: Metric Lag¶
Your Prometheus scrapes every 15 seconds. Your alert requires the condition to hold for `for: 5m`. Your dashboard auto-refreshes every 30 seconds.
Problem starts: 14:00:00
First scrape catches: 14:00:15 (15s lag)
Alert condition met: 14:05:15 (5m for: duration)
Alert fires: 14:05:15
You see it: 14:05:30 (next dashboard refresh)
You investigate: 14:07:00 (open laptop, read alert)
Total: 7 minutes from problem to human awareness
In those 7 minutes, if your error rate was 10%, you served 42,000 errors (at 1,000 req/sec, that's 420,000 requests, 10% of them failing). Users experienced the outage for 7 minutes while your monitoring showed green for the first 5.
Fix: Reduce `for:` on critical alerts (3m instead of 5m). Use shorter scrape intervals for critical services (5s instead of 15s). Accept more false positives in exchange for faster detection. Or: use real-time alerting (Datadog, Honeycomb) where lag is under 30 seconds.
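These settings can be sketched as Prometheus configuration. The job names, metric names, and thresholds below are illustrative assumptions, not from this lesson, and the two fragments live in separate files (combined here for brevity):

```yaml
# prometheus.yml -- tighter scrape interval for a critical job (name assumed)
scrape_configs:
  - job_name: checkout
    scrape_interval: 5s          # default is often 15s; 5s cuts detection lag
    static_configs:
      - targets: ['checkout:9090']

# rules.yml -- shorter for: duration trades false positives for speed
groups:
  - name: critical
    rules:
      - alert: HighErrorRate
        expr: rate(http_requests_errors_total[2m]) / rate(http_requests_total[2m]) > 0.01
        for: 3m                  # was 5m; roughly two minutes faster to fire
```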
Lie 2: Rate() Over Counter Resets¶
Prometheus counters reset to zero when a process restarts. rate() handles this correctly
— it detects resets and calculates the rate across them. But:
Counter at 14:00: 50,000
Process restarts at 14:01
Counter at 14:02: 100 (reset to 0, then 100 new requests)
rate() detects the reset and counts only the 100 post-restart increments, so the computed rate over the window plunges. The dashboard shows the error rate DROPPING, and any increments between the last successful scrape and the restart are lost entirely. For a few scrape intervals after a restart, rate() produces misleading numbers: a service that restarted because of errors looks like it suddenly got better.
Fix: Alert on `changes(up[15m]) > 0` to detect restarts. Don't trust rate calculations for 2–3 scrape intervals after a restart.
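A minimal restart-detection rule along these lines (the job label and annotation text are assumptions):

```yaml
# rules.yml -- flag recent restarts so rate()-based panels are read with suspicion.
# Note: changes(up[15m]) only catches restarts that span at least one failed scrape.
groups:
  - name: restarts
    rules:
      - alert: RecentRestart
        expr: changes(up{job="checkout"}[15m]) > 0
        labels:
          severity: info
        annotations:
          summary: "Target restarted; rate() panels unreliable for 2-3 scrape intervals"
```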
Lie 3: Percentile Aggregation¶
You have 10 backend instances. Each reports p99 latency. Your dashboard averages them:
Instance 1 p99: 120ms
Instance 2 p99: 130ms
...
Instance 10 p99: 140ms
Dashboard "average p99": 130ms
Actual p99 (all requests combined): 450ms
The average of percentiles is NOT the percentile of the average. If one instance handles heavy requests (slow) and nine handle light requests (fast), averaging their percentiles hides the slow instance.
Fix: Use `histogram_quantile()` on aggregated histogram buckets, not `avg()` on pre-computed percentiles. This computes the true percentile across all instances:
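For example, assuming a standard `http_request_duration_seconds` histogram (the metric and recording-rule names are illustrative):

```promql
# WRONG: averaging pre-computed per-instance p99s hides slow instances
avg(http_request_duration_seconds:p99)

# RIGHT: aggregate the buckets across instances first, then take the
# quantile of the combined distribution
histogram_quantile(
  0.99,
  sum by (le) (rate(http_request_duration_seconds_bucket[5m]))
)
```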
Lie 4: The Dashboard Shows Green Because Nothing Is Emitting¶
Your service crashed. It stopped emitting metrics. Prometheus scrapes the target — no
response. The up metric goes to 0.
But your error rate alert calculates rate(errors[5m]) / rate(total[5m]). With no data,
both rates are undefined. The alert evaluates to "no data" — and many alert configurations
treat "no data" as "not firing."
The service is completely down. The error rate dashboard shows... nothing. Not red. Not green. Just empty. Nobody notices.
Fix: Always add an `absent()` alert for critical services:
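A sketch of such a rule, assuming a `checkout` job exporting `http_requests_total` (both names illustrative):

```yaml
# rules.yml -- fire when a critical service stops reporting entirely
groups:
  - name: liveness
    rules:
      - alert: CheckoutMetricsAbsent
        expr: absent(http_requests_total{job="checkout"})
        for: 2m
        labels:
          severity: critical
        annotations:
          summary: "No metrics from checkout: down, not healthy"
```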
Lie 5: Averages Hide Outliers¶
Average response time: 150ms (looks great!)
Distribution:
90% of requests: 50ms
9% of requests: 200ms
1% of requests: 10,000ms (10 seconds!)
At 1,000 req/sec, that 1% = 10 users per second waiting 10 seconds.
600 users per minute with terrible experience.
Average says: 150ms. Fine!
Fix: Never alert on averages. Use percentiles: p50 (median), p95, p99. And check the distribution, not just one percentile. A system with p99=200ms and p99.9=30s has a different problem than p99=200ms and p99.9=210ms.
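Sketched as PromQL against an assumed `http_request_duration_seconds` histogram, checking several points on the distribution:

```promql
# p50, p99, and p99.9 from the same aggregated histogram (metric name assumed)
histogram_quantile(0.50,  sum by (le) (rate(http_request_duration_seconds_bucket[5m])))
histogram_quantile(0.99,  sum by (le) (rate(http_request_duration_seconds_bucket[5m])))
histogram_quantile(0.999, sum by (le) (rate(http_request_duration_seconds_bucket[5m])))
```

If p99.9 sits orders of magnitude above p99, you have a long-tail problem that p99 alone will never show.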
Lie 6: Dashboard Time Range Hides Spikes¶
Your dashboard shows the last 1 hour with 5-minute resolution. A 30-second spike gets averaged into a 5-minute bucket and disappears:
5-minute bucket: 14:00 - 14:05
14:00-14:01: 50ms (normal)
14:01-14:02: 5000ms (spike!)
14:02-14:05: 50ms (normal)
Average for bucket: 1040ms (shows as a small bump)
Reality: users experienced 5-second latency for a full minute
Fix: Use higher-resolution dashboards for investigation (15-second or 1-minute intervals). Keep 1-hour and 24-hour views for trends, but switch to fine-grained views when investigating incidents.
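One way to make a coarse panel spike-proof is a PromQL subquery that keeps the worst short-window rate inside each bucket (the metric name is assumed):

```promql
# Highest 1-minute error rate observed over the last 5 minutes,
# re-evaluated every 15 seconds. A 45-second burst that averages away
# in rate(errors_total[5m]) still shows up near its true height.
max_over_time(rate(errors_total[1m])[5m:15s])
```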
War Story: A team's dashboard showed "no error spike" during a reported outage. Their Grafana panel used `rate(errors[5m])` with a 5-minute resolution. The actual outage was a 45-second burst of 100% errors. Averaged over 5 minutes, the burst became a 15% blip that fell below their alert threshold. 45 seconds of total outage, invisible in the dashboard.
Building Trustworthy Dashboards¶
The RED Method (for services)¶
| Metric | What it measures | Alert on |
|---|---|---|
| Rate | Requests per second | Sudden drops (traffic fell off a cliff) |
| Errors | Error rate (%) | > SLO threshold (e.g., > 0.1% for 5 minutes) |
| Duration | Latency (p50, p95, p99) | p99 > SLO target |
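The three RED rows translate into alerts roughly like this (metric names, thresholds, and the 1-hour traffic baseline are all illustrative):

```yaml
# rules.yml -- RED alerts sketch
groups:
  - name: red
    rules:
      - alert: TrafficDropped        # Rate: traffic fell off a cliff
        expr: rate(http_requests_total[5m]) < 0.5 * rate(http_requests_total[5m] offset 1h)
        for: 5m
      - alert: ErrorRateHigh         # Errors: above a 0.1% SLO threshold
        expr: rate(http_requests_errors_total[5m]) / rate(http_requests_total[5m]) > 0.001
        for: 5m
      - alert: LatencySLOBreach      # Duration: p99 above a 500ms target
        expr: histogram_quantile(0.99, sum by (le) (rate(http_request_duration_seconds_bucket[5m]))) > 0.5
        for: 5m
```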
The USE Method (for infrastructure)¶
| For each resource | What to check |
|---|---|
| Utilization | How busy is it? (CPU %, memory %, disk %) |
| Saturation | Is work queuing? (run queue, connection pool wait) |
| Errors | Are operations failing? (disk errors, network drops) |
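With node_exporter, the USE checks map onto standard metrics roughly like this (a sketch; aggregation labels may differ per setup):

```promql
# Utilization: fraction of CPU time spent non-idle, per node
1 - avg by (instance) (rate(node_cpu_seconds_total{mode="idle"}[5m]))

# Saturation: 1-minute run queue relative to core count
node_load1 / count by (instance) (node_cpu_seconds_total{mode="idle"})

# Errors: NIC receive errors per second
rate(node_network_receive_errs_total[5m])
```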
Dashboard layout¶
Row 1: The Golden Signals (RED)
- Request rate (is traffic normal?)
- Error rate (are things failing?)
- Latency (p50, p95, p99)
Row 2: Dependencies
- Database latency + connection pool usage
- Redis hit rate + memory
- External API latency
Row 3: Infrastructure (USE)
- CPU utilization per node
- Memory available (not "used" — available)
- Disk I/O latency
Row 4: Business metrics
- Orders per minute
- Revenue per minute
- Active users
Flashcard Check¶
Q1: Error rate shows 0% but users see errors. What's happening?
The service crashed and stopped emitting metrics. With no data, the error-rate calculation returns nothing (not zero). Add `absent()` alerts.
Q2: Average latency is 150ms. Should you be concerned?
Maybe. Average hides outliers. If 1% of requests take 10 seconds, 600 users/minute have terrible experience. Check p99 and p99.9, not average.
Q3: You average p99 latency across 10 instances. Is this accurate?
No. Average of percentiles ≠ percentile of the average. Use `histogram_quantile()` on aggregated buckets for the true cross-instance percentile.
Q4: A 45-second outage doesn't appear in the dashboard. Why?
Dashboard resolution is 5 minutes. The 45-second spike is averaged into a 5-minute bucket and becomes a small blip. Use finer-grained resolution for investigation.
Takeaways¶
- Metrics have lag. 15s scrape + 5m `for:` + 30s refresh = 7 minutes minimum detection time. Design for this delay.
- `absent()` catches silent failures. A crashed service emits nothing. No data ≠ no errors. Add `absent()` for every critical service.
- Never average percentiles. Use `histogram_quantile()` on aggregated buckets. The average of p99s is mathematically meaningless.
- Never alert on averages. Use p95 and p99. Averages hide the worst user experiences.
- Dashboard resolution hides spikes. 5-minute resolution averages away 30-second outages. Use fine-grained views for investigation.
Related Lessons¶
- Prometheus and the Art of Not Alerting — building alerts that work
- The Mysterious Latency Spike — investigating what the dashboard found
- How Incident Response Actually Works — acting on monitoring signals