# Pattern: Percentile Blindness

- **ID:** FP-043
- **Family:** Observability Gap
- **Frequency:** Very Common
- **Blast Radius:** Single Service (monitoring blind spot)
- **Detection Difficulty:** Actively Misleading
## The Shape
Dashboards and alerts that use mean (average) latency hide the true user experience. A mean of 50ms can coexist with a p99 of 5 seconds if 99% of requests are fast and 1% are very slow. The 1% "tail" affects real users and is often caused by a specific code path, a slow database query, or a garbage collection pause. Average-based monitoring declares the service healthy while 1% of users experience 5-second waits.
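A quick way to see the arithmetic is to simulate it. The sketch below (all numbers hypothetical) generates 10,000 request latencies where 99% are fast and 1% are slow, then compares the mean against nearest-rank percentiles:

```python
import random

random.seed(42)

# Simulate 10,000 requests: 99% fast (~30 ms), 1% slow (~5 s tail).
latencies_ms = (
    [random.gauss(30, 5) for _ in range(9900)]
    + [random.gauss(5000, 500) for _ in range(100)]
)

def percentile(values, p):
    """Nearest-rank percentile: the value at rank p*n in sorted order."""
    s = sorted(values)
    return s[min(len(s) - 1, int(p * len(s)))]

mean = sum(latencies_ms) / len(latencies_ms)
print(f"mean = {mean:.0f} ms")  # under 100 ms: looks healthy
print(f"p50  = {percentile(latencies_ms, 0.50):.0f} ms")
print(f"p99  = {percentile(latencies_ms, 0.99):.0f} ms")  # seconds, not ms
```

The mean lands around 80 ms because the 9,900 fast requests dominate the sum, while the p99 lands inside the slow cluster: exactly the "healthy dashboard, unhappy users" split described above.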
## How You'll See It

### In Kubernetes
Grafana dashboard shows `avg(rate(http_request_duration_seconds_sum[5m]) / rate(http_request_duration_seconds_count[5m]))` = 45ms. The SLO says "p99 latency < 500ms," but nobody has set up p99 monitoring. The average looks fine, yet users are complaining about "slow" responses. Investigation: `histogram_quantile(0.99, ...)` = 4.2s.
### In Linux/Infrastructure
Application log analysis using average response time shows 120ms/request. This is the average across all requests. Investigation using percentile analysis (sort + tail): p99 = 3.8s, p999 = 12s. A slow DB query affects 1 in 1,000 requests; it's invisible in the average but very real to users hitting it.
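That "sort + tail" analysis can be scripted directly. This sketch uses synthetic per-request times standing in for values parsed out of access logs; the traffic mix (fractions and timings) is illustrative:

```python
import random

random.seed(7)

# Synthetic per-request times (ms), standing in for parsed log values:
# most requests are fast, ~1% hit a slow DB query, ~0.1% hit the worst case.
times_ms = (
    [random.uniform(80, 160) for _ in range(9890)]       # normal path
    + [random.uniform(3000, 4500) for _ in range(100)]   # slow DB query
    + [random.uniform(10000, 14000) for _ in range(10)]  # worst case
)

avg = sum(times_ms) / len(times_ms)
s = sorted(times_ms)  # the "sort" in sort + tail
p99 = s[int(0.99 * len(s))]
p999 = s[int(0.999 * len(s))]

print(f"avg  = {avg:.0f} ms")   # the tail barely moves the mean
print(f"p99  = {p99:.0f} ms")
print(f"p999 = {p999:.0f} ms")
```

The average stays well under 200 ms even though the p99 is measured in seconds: the rare-but-real slow path is invisible in the mean and obvious in the tail.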
### In CI/CD
Build time monitoring shows "average build time: 8 minutes." The SLA is "builds under 15 minutes." CI appears healthy. A flaky test that occasionally takes 25 minutes affects the p99 but not the average. Engineers don't notice until a user complains their PR has been building for 40 minutes.
## The Tell
Average latency is low and stable; p99 is high or variable. Users complain about "occasional slowness" while dashboards show green.
`histogram_quantile(0.99, ...)` reveals high tail latency hidden by `avg(...)`.
## Common Misdiagnosis
| Looks Like | But Actually | How to Tell the Difference |
|---|---|---|
| Occasional user complaints (dismiss as "isolated") | Systematic tail latency | histogram_quantile(0.99) confirms the tail is consistent, not isolated |
| Healthy service based on average | Tail latency problem | Average is misleading; p99 tells the true story for affected users |
| Random network issues | Slow code path on specific conditions | p99 latency is consistently high for specific endpoints or time windows |
## The Fix (Generic)
- **Immediate:** Calculate p50, p95, p99 from existing Prometheus histograms: `histogram_quantile(0.99, sum(rate(http_request_duration_seconds_bucket[5m])) by (le))`.
- **Short-term:** Replace all "average latency" panels with p50/p95/p99 panels; set SLOs on p99, not on the mean.
- **Long-term:** Use histogram metrics (`_bucket`, `_sum`, `_count`) in applications, not simple counters; define SLOs as error budgets on p99 latency and error rate.
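For intuition about what `histogram_quantile()` does with those `_bucket` series, here is a minimal re-implementation of its linear-interpolation estimate. The bucket boundaries and counts are illustrative, and this mirrors the documented approach rather than Prometheus's exact code:

```python
# Cumulative histogram buckets as Prometheus exports them:
# (upper bound "le" in seconds, cumulative request count).
buckets = [
    (0.05, 9000),
    (0.1, 9500),
    (0.5, 9900),
    (1.0, 9950),
    (5.0, 9990),
    (float("inf"), 10000),
]

def histogram_quantile(q, buckets):
    """Estimate the q-quantile by linear interpolation inside the
    bucket containing rank q * total, as histogram_quantile() does."""
    total = buckets[-1][1]
    rank = q * total
    prev_le, prev_count = 0.0, 0
    for le, count in buckets:
        if count >= rank:
            if le == float("inf"):
                return prev_le  # fall back to the last finite bound
            return prev_le + (le - prev_le) * (rank - prev_count) / (count - prev_count)
        prev_le, prev_count = le, count

print(f"p50 = {histogram_quantile(0.50, buckets):.3f} s")  # tens of ms
print(f"p99 = {histogram_quantile(0.99, buckets):.3f} s")  # half a second
```

Note that the estimate's accuracy depends entirely on bucket boundaries: if the buckets don't bracket your SLO threshold (e.g. 500ms), the interpolated p99 can be badly off, which is another reason to choose bucket bounds around the SLO.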
## Real-World Examples
- Example 1: Search service: avg latency 40ms, p99 6.5s. The average was pulled down by the 99% of fast, in-cache results; the 1% of queries that missed the cache hit the database and were slow. 6,000 users/minute were experiencing 6.5s waits while the dashboard showed "healthy."
- Example 2: Payment API: avg 200ms, p99 12s. 1% of payments triggered a synchronous fraud-check call that was slow. The average-based SLO was green; a p99-based SLO would have been severely breached. Discovered only after a compliance audit of payment processing times.
## War Story
Our SRE team had 30 Grafana dashboards. Every one showed average latency. Our SLO: "average response time < 200ms." We were green. Users were complaining. We added p99 as an experiment: 4.2s. We'd been meeting our average-based SLO while 1% of our users experienced 4-second waits for months. The root cause: database queries without indexes. The slow queries affected a small percentage of requests (those that triggered a particular code path). Average: invisible. p99: obvious. We replaced every avg latency panel with p50/p99. Median was 45ms; p99 was 4.2s. Fixed the indexes; p99 dropped to 180ms.
## Cross-References
- Topic Packs: observability-deep-dive, alerting-rules
- Footguns: observability-deep-dive/footguns.md — "Dashboard showing averages not percentiles"
- Case Studies: ops-archaeology/09-monitoring-gap/
- Related Patterns: FP-041 (alerting on restart — another monitoring quality issue), FP-024 (health check lying — service appears healthy when it isn't)