The Metrics That Lied
Category: The Mystery
Domains: monitoring, observability
Read time: ~5 min
Setting the Scene
We had beautiful dashboards. Seriously -- our Grafana setup was the envy of the engineering org. CPU utilization, memory, request rates, error rates, all laid out in color-coded panels with nice thresholds. When the product team reported that the recommendation engine was "feeling slow," I pulled up the dashboard and saw CPU at 18%, memory at 42%, error rate at 0.03%. Everything looked perfect.
But users were seeing 3-5 second response times on a call that should take 200ms. Something was very wrong, and our dashboards were telling us everything was fine.
What Happened
I started where the metrics told me to look: nowhere. The dashboard said the system was healthy, so I suspected the client. Maybe the mobile app had a rendering issue. Maybe a CDN edge was slow. I asked the mobile team to add client-side timing. They came back a day later: the server response itself was taking 3-4 seconds. Not the client.
Okay, so maybe it was the network between the CDN and our origin. I ran curl -w "@curl-format.txt" from various locations. Response times were fine from my test machine. I couldn't reproduce it.
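The format file itself isn't shown in this story, but a typical curl-format.txt for this kind of per-phase timing breakdown looks roughly like the following. The `%{...}` names are standard curl --write-out variables; the alignment is just a readable convention:

```
    time_namelookup:  %{time_namelookup}s\n
       time_connect:  %{time_connect}s\n
    time_appconnect:  %{time_appconnect}s\n
   time_pretransfer:  %{time_pretransfer}s\n
 time_starttransfer:  %{time_starttransfer}s\n
                      ----------\n
         time_total:  %{time_total}s\n
```

Invoked as curl -w "@curl-format.txt" -o /dev/null -s <url>, it prints how long DNS, connect, TLS, and time-to-first-byte each took, which is exactly what you want when deciding whether the network or the origin is slow.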
Then I tried something I should have done first: I hit the actual production endpoint 100 times in a loop using ab -n 100 -c 1. Average response time: 280ms. But the max was 4,200ms. I ran it again. Average: 310ms. Max: 3,800ms.
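That shape -- a healthy average with a pathological max -- is the whole story in miniature. A minimal Python sketch of the same arithmetic (the sample values are illustrative, loosely echoing the numbers above, not the actual ab output):

```python
from statistics import mean

# Illustrative latencies in ms: 97 fast requests plus 3 that hit a CPU spike.
latencies = [280] * 97 + [3900, 4200, 3800]

avg = mean(latencies)
worst = max(latencies)

# The mean lands in the hundreds of ms while the max is over 4 seconds;
# a mean-only panel shows the first number and hides the second.
print(f"avg: {avg:.0f} ms, max: {worst} ms")
```

Three slow requests out of a hundred barely move the mean, which is why a single "average latency" panel can look healthy while 3% of users wait 4 seconds.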
I went back to Grafana and looked at the Prometheus query behind the CPU panel: rate(node_cpu_seconds_total{mode!="idle"}[5m]). Five-minute average. The recommendation engine was a batch-style workload -- it would spike to 100% CPU for 2-3 seconds while computing embeddings, then idle. Averaged over 5 minutes, those spikes disappeared into an 18% average.
I changed the query window to [15s] and the dashboard lit up like a Christmas tree. CPU was oscillating between 5% and 100% every few seconds. The 100% spikes were causing request queuing, which showed up as the latency users were complaining about.
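The effect of the window size is easy to simulate. A rough sketch with hypothetical 1-second CPU samples (a 3-second spike to 100% every 20 seconds, which is not our real trace but has the same character), comparing what a 5-minute and a 15-second averaging window would each plot:

```python
from statistics import mean

# Hypothetical 1-second CPU samples: 3 s pegged at 100%, then 17 s near idle.
cycle = [100.0] * 3 + [5.0] * 17
samples = cycle * 30  # 600 s (10 minutes) of data

def window_avgs(series, window):
    """Average over every contiguous `window`-sample slice -- a crude
    stand-in for what a rate(...)[window] panel plots over time."""
    return [mean(series[i:i + window]) for i in range(len(series) - window + 1)]

five_min = window_avgs(samples, 300)
fifteen_s = window_avgs(samples, 15)

# The 5-minute window is a flat line; the 15-second window oscillates.
print(f"[5m]  min {min(five_min):.1f}%  max {max(five_min):.1f}%")
print(f"[15s] min {min(fifteen_s):.1f}%  max {max(fifteen_s):.1f}%")
```

The 5-minute window spans many full spike/idle cycles, so every sample of it is the same flat ~19%; the 15-second window is short enough to swing between spike-heavy and idle slices, so the oscillation finally shows up on the graph.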
The Moment of Truth
I ran mpstat -P ALL 1 10 on the host and watched individual CPU cores hit 100% while the overall average stayed calm. The service was single-threaded for the embedding computation -- one core pegged, seven cores idle. The 5-minute average across all cores turned a crisis into a flat line on the dashboard.
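The arithmetic behind that flat line is worth spelling out. With one core pegged and seven near idle (illustrative numbers, not the actual mpstat output):

```python
from statistics import mean

# One single-threaded embedding worker pegs a core; the other seven sit idle.
per_core = [100.0, 2.0, 2.0, 2.0, 2.0, 2.0, 2.0, 2.0]

print(f"max core:         {max(per_core):.0f}%")   # what mpstat -P ALL reveals
print(f"all-core average: {mean(per_core):.1f}%")  # what the dashboard averaged
```

A fully saturated core disappears into a ~14% all-core average, and requests queued behind that one core still wait the full spike duration.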
The Aftermath
We redesigned the dashboard: p50, p95, and p99 latency replaced the average panel. CPU was shown at both 5-minute and 15-second granularity. We added per-core CPU panels and a "max core utilization" metric. Then we parallelized the embedding computation across 4 cores, which dropped the spike duration from 3 seconds to 700ms. User complaints stopped the same day.
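The parallelization was the straightforward part. A minimal sketch of the shape of that change using Python's process pool -- the embed function and the chunk size here are hypothetical stand-ins for the real embedding code, not our implementation:

```python
from concurrent.futures import ProcessPoolExecutor

def embed(item):
    # Hypothetical stand-in for the CPU-bound embedding computation.
    return item * item

def embed_all(items, workers=4):
    # Fan the CPU-bound work out across `workers` cores instead of pegging one.
    with ProcessPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(embed, items, chunksize=64))

if __name__ == "__main__":
    print(embed_all(range(10)))
```

Processes rather than threads matter here: for CPU-bound work in CPython, a thread pool still serializes on the GIL, while a process pool actually uses four cores and shortens each spike accordingly.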
The Lessons
- Averages hide reality: A 5-minute average CPU metric can hide 100% spikes that cause real user impact. Always pair averages with percentiles or max values.
- Use percentiles for latency: p99 latency tells you what your unhappiest users experience. The average tells you almost nothing useful.
- Look at the raw data: When the dashboard says "everything's fine" but users say otherwise, bypass the dashboard. Use mpstat, pidstat, perf top, or raw Prometheus queries with shorter windows.
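The percentile lesson is cheap to apply with nothing but the standard library. A sketch with illustrative latency samples (again, not our production data):

```python
from statistics import mean, quantiles

# Illustrative latencies in ms: mostly fast, with a slow tail.
latencies = [280] * 95 + [900, 1500, 3800, 4000, 4200]

cuts = quantiles(latencies, n=100)  # cut points p1 .. p99
p50, p95, p99 = cuts[49], cuts[94], cuts[98]

# The mean sits near the fast requests; p99 exposes the 4-second tail.
print(f"mean {mean(latencies):.0f} ms | p50 {p50:.0f} | p95 {p95:.0f} | p99 {p99:.0f}")
```

Here the mean is around 400 ms while p99 is near 4,200 ms -- the same gap our users felt and our average-latency panel hid.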
What I'd Do Differently
Every new dashboard panel gets a mandatory review: "What does this metric hide?" Default to 30-second windows for CPU and add p99 latency as a first-class SLI from day one. Never trust a single aggregation window.
The Quote
"Our dashboards were so beautiful they convinced us nothing was wrong, while users waited 4 seconds for a 200ms call."
Cross-References
- Topic Packs: Monitoring Fundamentals, Observability Deep Dive, Prometheus Deep Dive, Linux Performance