
Pattern: Percentile Blindness

ID: FP-043
Family: Observability Gap
Frequency: Very Common
Blast Radius: Single Service (monitoring blind spot)
Detection Difficulty: Actively Misleading

The Shape

Dashboards and alerts that use mean (average) latency hide the true user experience. A mean of 50ms can coexist with a p99 of 5 seconds if 99% of requests are fast and 1% are very slow. The 1% "tail" affects real users and is often caused by a specific code path, a slow database query, or a garbage collection pause. Average-based monitoring declares the service healthy while 1% of users experience 5-second waits.
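The arithmetic is easy to verify with a minimal sketch (the sample is hypothetical, built to match the 99%/1% split above): 9,899 requests at 5ms and 101 at 5s keep the mean near 55ms while the nearest-rank p99 lands squarely in the slow mode.

```python
import statistics

# Hypothetical sample matching the split described above:
# 9,899 fast requests (5 ms) and 101 slow ones (5 s).
latencies_ms = [5.0] * 9_899 + [5_000.0] * 101

def percentile(values, p):
    """Nearest-rank percentile: the value at rank ceil(p/100 * N)."""
    ordered = sorted(values)
    k = -(-p * len(ordered) // 100)  # integer ceiling division
    return ordered[int(k) - 1]

mean = statistics.mean(latencies_ms)
print(f"mean = {mean:.1f} ms")                          # mean = 55.4 ms
print(f"p99  = {percentile(latencies_ms, 99):.0f} ms")  # p99  = 5000 ms
```

One percent of requests at 5 seconds barely moves the mean, but any percentile at or above the 99th sits inside the slow mode.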

How You'll See It

In Kubernetes

Grafana dashboard shows avg(rate(http_request_duration_seconds_sum[5m]) / rate(http_request_duration_seconds_count[5m])) = 45ms. The SLO says "p99 latency < 500ms." Nobody has set up p99 monitoring. The avg looks fine. Users are complaining about "slow" responses. Investigation: histogram_quantile(0.99, ...) = 4.2s.

In Linux/Infrastructure

Application log analysis using average response time shows 120ms/request. This is the average across all requests. Investigation using percentile analysis (sort + tail): p99 = 3.8s, p999 = 12s. A slow DB query affects 1 in 1,000 requests; it's invisible in the average but very real to users hitting it.
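The sort-based percentile analysis needs nothing beyond standard tools. A sketch, assuming one response time in milliseconds per line (the filename and values here are made up for illustration):

```shell
# Build a hypothetical sample: 98 fast requests (100 ms), 2 slow (3800 ms).
{ for i in $(seq 1 98); do echo 100; done; echo 3800; echo 3800; } > latencies.txt

# Nearest-rank p99: sort numerically, pick the ceil(0.99 * N)-th line.
sort -n latencies.txt | awk '
  { v[NR] = $1 }
  END {
    k = int(NR * 0.99); if (NR * 0.99 > k) k++;
    print "p99 =", v[k], "ms"
  }'
```

The same pipeline with 0.999 gives p999; the average of this sample is 174ms, which says nothing about the two 3.8-second outliers.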

In CI/CD

Build time monitoring shows "average build time: 8 minutes." The SLA is "builds under 15 minutes." CI appears healthy. A flaky test that occasionally takes 25 minutes affects the p99 but not the average. Engineers don't notice until a user complains their PR has been building for 40 minutes.

The Tell

Average latency is low and stable; p99 is high or variable. Users complain about "occasional slowness" while dashboards show green. histogram_quantile(0.99, ...) reveals high tail latency hidden by avg(...).

Common Misdiagnosis

| Looks Like | But Actually | How to Tell the Difference |
| --- | --- | --- |
| Occasional user complaints (dismissed as "isolated") | Systematic tail latency | histogram_quantile(0.99) confirms the tail is consistent, not isolated |
| Healthy service based on the average | Tail latency problem | The average is misleading; p99 tells the true story for affected users |
| Random network issues | A slow code path triggered under specific conditions | p99 latency is consistently high for specific endpoints or time windows |

The Fix (Generic)

  1. Immediate: Calculate p50, p95, p99 from existing Prometheus histograms: histogram_quantile(0.99, sum(rate(http_request_duration_seconds_bucket[5m])) by (le)).
  2. Short-term: Replace all "average latency" panels with p50/p95/p99 panels; set SLOs on p99, not on mean.
  3. Long-term: Use histogram metrics (_bucket, _sum, _count) in applications, not simple counters; define SLOs as error budgets on p99 latency and error rate.
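To see why the _bucket series matter, here is a sketch of the within-bucket linear interpolation that Prometheus's histogram_quantile() performs; the bucket boundaries and counts are hypothetical.

```python
def histogram_quantile(q, buckets):
    """Estimate the q-quantile from cumulative histogram buckets,
    interpolating linearly inside the bucket the rank falls into
    (the approach histogram_quantile() uses).

    buckets: list of (upper_bound, cumulative_count), sorted by bound,
    ending with (float('inf'), total_count).
    """
    total = buckets[-1][1]
    rank = q * total
    prev_bound, prev_count = 0.0, 0
    for bound, count in buckets:
        if count >= rank:
            if bound == float('inf'):
                # Rank falls in the +Inf bucket: return the last finite bound.
                return prev_bound
            frac = (rank - prev_count) / (count - prev_count)
            return prev_bound + (bound - prev_bound) * frac
        prev_bound, prev_count = bound, count
    return prev_bound

# Hypothetical http_request_duration_seconds_bucket counts:
# 9,800 requests under 0.1 s, 200 between 1 s and 10 s.
buckets = [(0.1, 9_800), (1.0, 9_800), (10.0, 10_000), (float('inf'), 10_000)]
print(histogram_quantile(0.99, buckets))  # 5.5
```

With only a _sum and _count pair, the best you can compute is the mean; the buckets are what make tail percentiles recoverable. Note the estimate's accuracy depends on bucket boundaries: a 1s-to-10s bucket can only locate the p99 somewhere inside that range.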

Real-World Examples

  • Example 1: Search service: avg latency 40ms, p99 6.5s. The average was dragged down by the 99% of fast (in-cache) results; the 1% of queries that missed the cache hit the database and were slow. That 1% meant 6,000 users/minute experiencing 6.5s waits while the dashboard showed "healthy."
  • Example 2: Payment API: avg 200ms, p99 12s. 1% of payments triggered a synchronous fraud-check call that was slow. Average-based SLO: green. p99-based SLO: severely breached. Discovered only after a compliance audit of payment processing times.

War Story

Our SRE team had 30 Grafana dashboards. Every one showed average latency. Our SLO: "average response time < 200ms." We were green. Users were complaining. We added p99 as an experiment: 4.2s. We'd been meeting our average-based SLO while 1% of our users experienced 4-second waits for months. The root cause: database queries without indexes. The slow queries affected a small percentage of requests (those that triggered a particular code path). Average: invisible. p99: obvious. We replaced every avg latency panel with p50/p99. Median was 45ms; p99 was 4.2s. Fixed the indexes; p99 dropped to 180ms.

Cross-References