The Monitoring That Lied
- lesson
- metric-lag
- counter-resets
- percentile-math
- stale-scrapes
- dashboard-design
- l2

Topics: metric lag, counter resets, percentile math, stale scrapes, dashboard design
Level: L2 (Operations)
Time: 45–60 minutes
Prerequisites: Basic Prometheus/monitoring familiarity helpful
The Mission¶
The dashboard is green. CPU at 30%. Memory at 60%. Error rate 0.1%. p99 latency 150ms. Everything looks fine.
But users are complaining. Orders are failing. The support queue is growing. You stare at the dashboard. It stares back. All green.
The monitoring is lying. Not maliciously — but the metrics you're looking at don't represent what users are experiencing. This lesson covers the ways monitoring deceives you and how to build dashboards you can actually trust.
Lie 1: Metric Lag¶
Your Prometheus scrapes every 15 seconds. Your alert requires the condition to hold for `for: 5m`. Your dashboard auto-refreshes every 30 seconds.
Problem starts: 14:00:00
First scrape catches: 14:00:15 (15s lag)
Alert condition met: 14:05:15 (5m for: duration)
Alert fires: 14:05:15
You see it: 14:05:30 (next dashboard refresh)
You investigate: 14:07:00 (open laptop, read alert)
Total: 7 minutes from problem to human awareness
In those 7 minutes, if your error rate was 10%, you served 42,000 errors (at 1,000 req/sec, that's 420,000 requests, 10% of them failing). Users experienced the outage for 7 minutes while your monitoring showed green for the first 5.
Fix: Reduce `for:` on critical alerts (3m instead of 5m). Use shorter scrape intervals for critical services (5s instead of 15s). Accept more false positives in exchange for faster detection. Or: use real-time alerting (Datadog, Honeycomb) where lag is under 30 seconds.
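These settings can be sketched as Prometheus configuration. The job names, metric names, and thresholds below are illustrative assumptions, not from this lesson, and the two fragments live in separate files (combined here for brevity):

```yaml
# prometheus.yml -- tighter scrape interval for a critical job (name assumed)
scrape_configs:
  - job_name: checkout
    scrape_interval: 5s          # default is often 15s; 5s cuts detection lag
    static_configs:
      - targets: ['checkout:9090']

# rules.yml -- shorter for: duration trades false positives for speed
groups:
  - name: critical
    rules:
      - alert: HighErrorRate
        expr: rate(http_requests_errors_total[2m]) / rate(http_requests_total[2m]) > 0.01
        for: 3m                  # was 5m; roughly two minutes faster to fire
```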
Lie 2: Rate() Over Counter Resets¶
Prometheus counters reset to zero when a process restarts. rate() handles this correctly
— it detects resets and calculates the rate across them. But:
Counter at 14:00: 50,000
Process restarts at 14:01
Counter at 14:02: 100 (reset to 0, then 100 new requests)
rate() detects the reset and counts only the 100 post-restart increments, so the computed rate over the window plunges. The dashboard shows the error rate DROPPING, and any increments between the last successful scrape and the restart are lost entirely. For a few scrape intervals after a restart, rate() produces misleading numbers: a service that restarted because of errors looks like it suddenly got better.
Fix: Alert on `changes(up[15m]) > 0` to detect restarts. Don't trust rate calculations for 2–3 scrape intervals after a restart.
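A minimal restart-detection rule along these lines (the job label and annotation text are assumptions):

```yaml
# rules.yml -- flag recent restarts so rate()-based panels are read with suspicion.
# Note: changes(up[15m]) only catches restarts that span at least one failed scrape.
groups:
  - name: restarts
    rules:
      - alert: RecentRestart
        expr: changes(up{job="checkout"}[15m]) > 0
        labels:
          severity: info
        annotations:
          summary: "Target restarted; rate() panels unreliable for 2-3 scrape intervals"
```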
Lie 3: Percentile Aggregation¶
You have 10 backend instances. Each reports p99 latency. Your dashboard averages them:
Instance 1 p99: 120ms
Instance 2 p99: 130ms
...
Instance 10 p99: 140ms
Dashboard "average p99": 130ms
Actual p99 (all requests combined): 450ms
The average of percentiles is NOT the percentile of the average. If one instance handles heavy requests (slow) and nine handle light requests (fast), averaging their percentiles hides the slow instance.
Fix: Use `histogram_quantile()` on aggregated histogram buckets, not `avg()` on pre-computed percentiles. This computes the true percentile across all instances:
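For example, assuming a standard `http_request_duration_seconds` histogram (the metric and recording-rule names are illustrative):

```promql
# WRONG: averaging pre-computed per-instance p99s hides slow instances
avg(http_request_duration_seconds:p99)

# RIGHT: aggregate the buckets across instances first, then take the
# quantile of the combined distribution
histogram_quantile(
  0.99,
  sum by (le) (rate(http_request_duration_seconds_bucket[5m]))
)
```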
Lie 4: The Dashboard Shows Green Because Nothing Is Emitting¶
Your service crashed. It stopped emitting metrics. Prometheus scrapes the target — no
response. The up metric goes to 0.
But your error rate alert calculates rate(errors[5m]) / rate(total[5m]). With no data,
both rates are undefined. The alert evaluates to "no data" — and many alert configurations
treat "no data" as "not firing."
The service is completely down. The error rate dashboard shows... nothing. Not red. Not green. Just empty. Nobody notices.
Fix: Always add an `absent()` alert for critical services:
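A sketch of such a rule, assuming a `checkout` job exporting `http_requests_total` (both names illustrative):

```yaml
# rules.yml -- fire when a critical service stops reporting entirely
groups:
  - name: liveness
    rules:
      - alert: CheckoutMetricsAbsent
        expr: absent(http_requests_total{job="checkout"})
        for: 2m
        labels:
          severity: critical
        annotations:
          summary: "No metrics from checkout: down, not healthy"
```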
Lie 5: Averages Hide Outliers¶
Average response time: 150ms (looks great!)
Distribution:
90% of requests: 50ms
9% of requests: 200ms
1% of requests: 10,000ms (10 seconds!)
At 1,000 req/sec, that 1% = 10 users per second waiting 10 seconds.
600 users per minute with terrible experience.
Average says: 150ms. Fine!
Fix: Never alert on averages. Use percentiles: p50 (median), p95, p99. And check the distribution, not just one percentile. A system with p99=200ms and p99.9=30s has a different problem than p99=200ms and p99.9=210ms.
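Sketched as PromQL against an assumed `http_request_duration_seconds` histogram, checking several points on the distribution:

```promql
# p50, p99, and p99.9 from the same aggregated histogram (metric name assumed)
histogram_quantile(0.50,  sum by (le) (rate(http_request_duration_seconds_bucket[5m])))
histogram_quantile(0.99,  sum by (le) (rate(http_request_duration_seconds_bucket[5m])))
histogram_quantile(0.999, sum by (le) (rate(http_request_duration_seconds_bucket[5m])))
```

If p99.9 sits orders of magnitude above p99, you have a long-tail problem that p99 alone will never show.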
Lie 6: Dashboard Time Range Hides Spikes¶
Your dashboard shows the last 1 hour with 5-minute resolution. A 30-second spike gets averaged into a 5-minute bucket and disappears:
5-minute bucket: 14:00 - 14:05
14:00-14:01: 50ms (normal)
14:01-14:02: 5000ms (spike!)
14:02-14:05: 50ms (normal)
Average for bucket: 1040ms (shows as a small bump)
Reality: users experienced 5-second latency for a full minute
Fix: Use higher-resolution dashboards for investigation (15-second or 1-minute intervals). Keep 1-hour and 24-hour views for trends, but switch to fine-grained views when investigating incidents.
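One way to make a coarse panel spike-proof is a PromQL subquery that keeps the worst short-window rate inside each bucket (the metric name is assumed):

```promql
# Highest 1-minute error rate observed over the last 5 minutes,
# re-evaluated every 15 seconds. A 45-second burst that averages away
# in rate(errors_total[5m]) still shows up near its true height.
max_over_time(rate(errors_total[1m])[5m:15s])
```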
War Story: A team's dashboard showed "no error spike" during a reported outage. Their Grafana panel used `rate(errors[5m])` with a 5-minute resolution. The actual outage was a 45-second burst of 100% errors. Averaged over 5 minutes, the burst became a 15% blip that fell below their alert threshold. 45 seconds of total outage, invisible in the dashboard.
Building Trustworthy Dashboards¶
The RED Method (for services)¶
| Metric | What it measures | Alert on |
|---|---|---|
| Rate | Requests per second | Sudden drops (traffic fell off a cliff) |
| Errors | Error rate (%) | > SLO threshold (e.g., > 0.1% for 5 minutes) |
| Duration | Latency (p50, p95, p99) | p99 > SLO target |
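The three RED rows translate into alerts roughly like this (metric names, thresholds, and the 1-hour traffic baseline are all illustrative):

```yaml
# rules.yml -- RED alerts sketch
groups:
  - name: red
    rules:
      - alert: TrafficDropped        # Rate: traffic fell off a cliff
        expr: rate(http_requests_total[5m]) < 0.5 * rate(http_requests_total[5m] offset 1h)
        for: 5m
      - alert: ErrorRateHigh         # Errors: above a 0.1% SLO threshold
        expr: rate(http_requests_errors_total[5m]) / rate(http_requests_total[5m]) > 0.001
        for: 5m
      - alert: LatencySLOBreach      # Duration: p99 above a 500ms target
        expr: histogram_quantile(0.99, sum by (le) (rate(http_request_duration_seconds_bucket[5m]))) > 0.5
        for: 5m
```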
The USE Method (for infrastructure)¶
| For each resource | What to check |
|---|---|
| Utilization | How busy is it? (CPU %, memory %, disk %) |
| Saturation | Is work queuing? (run queue, connection pool wait) |
| Errors | Are operations failing? (disk errors, network drops) |
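With node_exporter, the USE checks map onto standard metrics roughly like this (a sketch; aggregation labels may differ per setup):

```promql
# Utilization: fraction of CPU time spent non-idle, per node
1 - avg by (instance) (rate(node_cpu_seconds_total{mode="idle"}[5m]))

# Saturation: 1-minute run queue relative to core count
node_load1 / count by (instance) (node_cpu_seconds_total{mode="idle"})

# Errors: NIC receive errors per second
rate(node_network_receive_errs_total[5m])
```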
Dashboard layout¶
Row 1: The Golden Signals (RED)
- Request rate (is traffic normal?)
- Error rate (are things failing?)
- Latency (p50, p95, p99)
Row 2: Dependencies
- Database latency + connection pool usage
- Redis hit rate + memory
- External API latency
Row 3: Infrastructure (USE)
- CPU utilization per node
- Memory available (not "used" — available)
- Disk I/O latency
Row 4: Business metrics
- Orders per minute
- Revenue per minute
- Active users
Flashcard Check¶
Q1: Error rate shows 0% but users see errors. What's happening?
The service crashed and stopped emitting metrics. With no data, the error-rate calculation returns nothing (not zero). Add `absent()` alerts.
Q2: Average latency is 150ms. Should you be concerned?
Maybe. Average hides outliers. If 1% of requests take 10 seconds, 600 users/minute have terrible experience. Check p99 and p99.9, not average.
Q3: You average p99 latency across 10 instances. Is this accurate?
No. Average of percentiles ≠ percentile of the average. Use `histogram_quantile()` on aggregated buckets for the true cross-instance percentile.
Q4: A 45-second outage doesn't appear in the dashboard. Why?
Dashboard resolution is 5 minutes. The 45-second spike is averaged into a 5-minute bucket and becomes a small blip. Use finer-grained resolution for investigation.
Takeaways¶
- Metrics have lag. 15s scrape + 5m `for:` + 30s refresh = 7 minutes minimum detection time. Design for this delay.
- `absent()` catches silent failures. A crashed service emits nothing. No data ≠ no errors. Add `absent()` for every critical service.
- Never average percentiles. Use `histogram_quantile()` on aggregated buckets. The average of p99s is mathematically meaningless.
- Never alert on averages. Use p95 and p99. Averages hide the worst user experiences.
- Dashboard resolution hides spikes. 5-minute resolution averages away 30-second outages. Use fine-grained views for investigation.
Related Lessons¶
- Prometheus and the Art of Not Alerting — building alerts that work
- The Mysterious Latency Spike — investigating what the dashboard found
- How Incident Response Actually Works — acting on monitoring signals