Grafana: Dashboards That Don't Lie
- lesson
- grafana-dashboard-design
- promql-for-dashboards
- panel-types
- variable-templates
- alerting
- dashboard-as-code
- loki/logql
- tempo-integration
- observability-anti-patterns
- l2

---

# Grafana — Dashboards That Don't Lie

Topics: Grafana dashboard design, PromQL for dashboards, panel types, variable templates, alerting, dashboard-as-code, Loki/LogQL, Tempo integration, observability anti-patterns
Level: L2 (Operations)
Time: 50–70 minutes
Prerequisites: None required (Prometheus basics explained inline)
The Mission¶
It's 2:47 AM. PagerDuty fires. Customers are reporting failed checkouts. You open the on-call dashboard — the one your team built six months ago with 47 panels.
Everything is green. CPU at 22%. Memory at 58%. Error rate 0.3%. Latency 120ms.
You flip to the Slack channel. Fifteen customers posted screenshots of 500 errors in the last ten minutes. Support says the checkout API is "completely broken."
You stare at the dashboard. It stares back. All green.
The dashboard is lying. Not because someone configured it wrong on purpose — but because the panels are answering the wrong questions, averaging away the signal, and hiding the outage behind comfortable numbers.
Your mission: understand why dashboards lie, then build ones that don't.
Part 1: Why That Dashboard Lied¶
Before we build anything, let's autopsy the dashboard that showed green during an outage.
The checkout service runs on 8 pods. Seven are healthy. One is in a crash loop, returning 500 errors on every request. Kubernetes keeps restarting it, so it's technically "up" most of the time.
Here's what the dashboard showed and why it was wrong:
| Panel | Showed | Reality | Why it lied |
|---|---|---|---|
| CPU | 22% avg | 22% avg | Correct but irrelevant — users don't care about CPU |
| Memory | 58% | 58% | Same — a resource metric, not a user-experience metric |
| Error rate | 0.3% | 12.5% global — one of 8 pods failing every request | Averaged across all 8 pods, the broken pod's errors vanished |
| Latency | 120ms | p99 was 8 seconds | Panel showed average, not percentiles |
War Story: This exact pattern — averaging hides the spike — is one of the most common dashboard failures in production. A team at a fintech company reported "dashboards showed green" during a 23-minute outage that cost them $180K in failed transactions. The root cause: their error rate panel used `avg(rate(http_errors_total[5m]))` across 12 instances. One instance had a 100% error rate; the other 11 were at 0%. The average: 8.3%, which their alert threshold of 10% didn't catch. After the incident, they switched to `sum(rate(errors[5m])) / sum(rate(total[5m]))` — the global error rate, not the average of per-instance rates.
The fix isn't "add more panels." It's asking better questions.
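The arithmetic behind the lie is easy to check. A toy sketch (the per-pod traffic split is invented for illustration) shows how an unweighted average of per-pod error ratios can diverge wildly from the traffic-weighted truth:

```python
# Ten pods: nine clean, one broken at 50% errors.
# The broken pod handles only 1 req/s; healthy pods handle 100 req/s each.
traffic = [100.0] * 9 + [1.0]       # requests/sec per pod (invented)
error_ratio = [0.0] * 9 + [0.5]     # per-pod error fraction

# What avg() of per-pod ratios reports: blind to traffic volume
avg_of_ratios = sum(error_ratio) / len(error_ratio)

# What sum(errors)/sum(total) reports: weighted by actual traffic
global_ratio = sum(t * e for t, e in zip(traffic, error_ratio)) / sum(traffic)

print(f"avg of per-pod ratios: {avg_of_ratios:.2%}")   # 5.00%
print(f"global error rate:     {global_ratio:.3%}")    # 0.055%
```

Flip the traffic split (put most of the load on the broken pod) and the average understates instead of overstates. Either way, it answers the wrong question.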
Part 2: The Frameworks — USE and RED¶
Two frameworks tell you which questions to ask. You need both.
RED Method — For Services¶
Rate, Errors, Duration. Created by Tom Wilkie (then at Weaveworks, later of Grafana Labs). Answers: "How is the service doing from the user's perspective?"
# Rate: requests per second
sum(rate(http_requests_total{service="checkout"}[5m]))
# Errors: error rate as a percentage
sum(rate(http_requests_total{service="checkout", status=~"5.."}[5m]))
/
sum(rate(http_requests_total{service="checkout"}[5m]))
# Duration: p99 latency
histogram_quantile(0.99,
sum(rate(http_request_duration_seconds_bucket{service="checkout"}[5m])) by (le)
)
Every service should have a RED dashboard. If a service doesn't have one, you're flying blind.
USE Method — For Resources¶
Utilization, Saturation, Errors. Created by Brendan Gregg (then at Joyent, later known for performance work at Netflix). Answers: "Is the infrastructure keeping up?"
# Utilization: CPU usage percentage
1 - avg(rate(node_cpu_seconds_total{mode="idle"}[5m])) by (instance)
# Saturation: CPU run queue length (load average)
node_load1
# Errors: disk I/O errors
rate(node_disk_io_errors_total[5m])
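Memory follows the same Utilization pattern (metric names as exposed by node_exporter):

```promql
# Utilization: memory in use as a fraction of total
1 - (node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes)
```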
Remember: RED for services (request-oriented), USE for infrastructure (resource-oriented). Mnemonic: RED lights for apps, USE tools for hardware.
The Four Golden Signals¶
Google's SRE book defines four golden signals: latency, traffic, errors, saturation. RED covers the first three for services. USE covers utilization, saturation, and errors for resources. Together, they're comprehensive.
| Framework | Scope | Signals | Creator |
|---|---|---|---|
| RED | Services | Rate, Errors, Duration | Tom Wilkie (Weaveworks, later Grafana Labs) |
| USE | Resources | Utilization, Saturation, Errors | Brendan Gregg (Joyent, later Netflix) |
| 4 Golden Signals | Both | Latency, Traffic, Errors, Saturation | Google SRE book |
Flashcard Check¶
| Question | Answer |
|---|---|
| What does RED stand for? | Rate, Errors, Duration — for services |
| What does USE stand for? | Utilization, Saturation, Errors — for resources |
| Who created the RED method? | Tom Wilkie, while at Weaveworks (later of Grafana Labs) |
| When do you use RED vs USE? | RED for request-driven services, USE for infrastructure resources (CPU, disk, network) |
Part 3: Panel Types — Choosing the Right Visualization¶
Grafana has many panel types. Using the wrong one is like using a screwdriver as a hammer — you can, but you shouldn't.
The Decision Table¶
| Panel Type | Use When | Example | Don't Use When |
|---|---|---|---|
| Time series | Showing trends over time | Request rate, latency percentiles | Displaying a single current value |
| Stat | One number matters right now | Total requests today, current uptime | You need to see trends |
| Gauge | Value against a known range | CPU at 73%, disk 85% full | The max value is unknown or unbounded |
| Table | Comparing multiple items | Top 10 endpoints by error rate | You need to see time trends |
| Heatmap | Distribution over time | Latency distribution (where are requests clustering?) | Fewer than ~100 data points |
| Logs | Correlating events with metrics | Error logs alongside latency spikes | High-volume log streams without filtering |
Time Series: The Workhorse¶
Most panels on most dashboards are time series. A few guidelines:
- Show p50, p95, and p99 on the same panel. Three lines, one glance.
- Use the right unit. Grafana can auto-format seconds, bytes, percentages — set it.
- Use `$__rate_interval` instead of hardcoding `[5m]`. It auto-adjusts to the dashboard's time range and Prometheus scrape interval.
# p50, p95, p99 on one panel — three queries, aliased
# Query A (alias: p50):
histogram_quantile(0.50, sum(rate(http_request_duration_seconds_bucket[$__rate_interval])) by (le))
# Query B (alias: p95):
histogram_quantile(0.95, sum(rate(http_request_duration_seconds_bucket[$__rate_interval])) by (le))
# Query C (alias: p99):
histogram_quantile(0.99, sum(rate(http_request_duration_seconds_bucket[$__rate_interval])) by (le))
Gotcha: `$__rate_interval` was introduced in Grafana 7.2. It calculates the minimum safe range for `rate()` based on your scrape interval and resolution. Before this, people hardcoded `[5m]` and got either noisy graphs (too short) or smoothed-away spikes (too long). If you're using an older Grafana, `$__interval` is the next best thing, but `$__rate_interval` is preferred.
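Grafana documents the rule as `max(panel interval + scrape interval, 4 × scrape interval)`. A quick sketch of that formula in Python (not Grafana's actual code, just the documented arithmetic):

```python
def rate_interval(panel_interval_s: float, scrape_interval_s: float) -> float:
    """Grafana's documented rule for $__rate_interval:
    max(panel interval + scrape interval, 4 * scrape interval)."""
    return max(panel_interval_s + scrape_interval_s, 4 * scrape_interval_s)

# 15s scrape, zoomed in (30s panel interval): the 4x-scrape floor wins
print(rate_interval(30, 15))   # 60
# Zoomed out (5m panel interval): the window grows with the zoom level
print(rate_interval(300, 15))  # 315
```

The floor guarantees every window contains enough samples for `rate()` to be meaningful, no matter how far the user zooms in.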
Heatmaps: Seeing the Distribution¶
A time series panel showing p99 latency tells you one number. A heatmap shows you the entire distribution — where most requests cluster, and whether the tail is a thin spike or a wide plateau.
# Heatmap query for latency distribution
sum(increase(http_request_duration_seconds_bucket[$__rate_interval])) by (le)
Set the panel to "Heatmap" format, Y-axis to the bucket boundaries, and the color scheme to something where hot spots jump out. When you see a bimodal distribution (two bright bands), it usually means two different code paths are serving the same endpoint.
Trivia: Grafana was created by Torkel Odegaard in 2014 as a fork of Kibana 3's dashboard panel. He wanted better visualization for Graphite metrics. The name "Grafana" is a portmanteau — he originally misspelled "Graphite" + "Kibana" and the name stuck. By 2024, Grafana had over 20 million users and Grafana Labs was valued at $6 billion.
Part 4: PromQL for Dashboards That Tell the Truth¶
PromQL is where dashboards get their honesty — or their lies. Here are the queries that matter, with explanations of what each piece does.
rate() — The Foundation¶
`rate()` calculates the per-second increase of a counter over a time window. Take `rate(http_requests_total{service="checkout"}[5m])` piece by piece:

| Piece | What it does |
|---|---|
| `http_requests_total` | Counter metric (only goes up, resets on restart) |
| `{service="checkout"}` | Label filter — only the checkout service |
| `[5m]` | Look back 5 minutes for data points |
| `rate(...)` | Per-second increase, averaged over the window |
Under the Hood: `rate()` handles counter resets. When Prometheus detects a counter value decreasing (process restarted), it assumes a reset and compensates. This is why you never alert on raw counter values — they drop to zero on restart, making `rate()` briefly unreliable. Use `rate()` over at least 4x your scrape interval (for a 15s scrape, use `[1m]` minimum, `[5m]` for stable alerting).
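A simplified sketch of the reset compensation (real `rate()` also extrapolates to the window boundaries, which this toy version skips):

```python
def increase_with_resets(samples):
    """Sum the increases between consecutive counter samples.
    A drop in value is treated as a restart from zero, which is
    how Prometheus's rate()/increase() compensate for resets."""
    total = 0.0
    for prev, cur in zip(samples, samples[1:]):
        # On a reset the counter restarted at 0, so the entire
        # current value counts as new increase.
        total += cur - prev if cur >= prev else cur
    return total

# Counter restarts between the 3rd and 4th scrape (230 -> 10)
print(increase_with_resets([100, 150, 230, 10, 60]))  # 190.0
```

Without the reset handling, the naive difference `60 - 100` would report a negative increase.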
histogram_quantile() — Percentiles Done Right¶
The query that replaces misleading averages, piece by piece:

| Piece | What it does |
|---|---|
| `http_request_duration_seconds_bucket` | Histogram buckets (each `le` label is a boundary) |
| `rate(...[5m])` | Per-second rate of observations falling into each bucket |
| `sum(...) by (le)` | Aggregate across all instances, keeping bucket boundaries |
| `histogram_quantile(0.99, ...)` | Estimate the value at the 99th percentile |
Gotcha: `histogram_quantile` interpolates linearly between bucket boundaries. If your SLO is "99% of requests under 200ms" but your buckets jump from 100ms to 250ms, the p99 calculation is an approximation that could be significantly off. Always add bucket boundaries at your SLO thresholds.
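To see how far off coarse buckets can be, here is a toy reimplementation of the interpolation (simplified; the real `histogram_quantile` also special-cases the lowest and highest buckets):

```python
def histogram_quantile(q, buckets):
    """Toy version of PromQL's linear interpolation.
    buckets: sorted list of (upper_bound, cumulative_count)."""
    total = buckets[-1][1]
    rank = q * total
    prev_bound, prev_count = 0.0, 0
    for bound, count in buckets:
        if count >= rank:
            # Interpolate linearly inside this bucket
            frac = (rank - prev_count) / (count - prev_count)
            return prev_bound + (bound - prev_bound) * frac
        prev_bound, prev_count = bound, count

# 1000 requests that all actually took ~110ms, observed with buckets
# that jump straight from 100ms to 250ms
coarse = [(0.1, 0), (0.25, 1000), (float("inf"), 1000)]
print(round(histogram_quantile(0.99, coarse), 4))  # 0.2485: reported p99 ~249ms
```

No request took longer than ~110ms, yet the reported p99 is ~249ms. With an extra boundary at 0.2 (the SLO threshold), the estimate would land safely inside the 100–200ms range and the SLO panel would tell the truth.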
increase() — Total Count Over a Window¶
Good for stat panels showing "137 errors in the last hour."
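A typical stat-panel query (metric names follow the checkout examples above):

```promql
# 5xx responses in the last hour, for a stat panel
sum(increase(http_requests_total{service="checkout", status=~"5.."}[1h]))
```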
predict_linear() — Seeing the Future¶
# Will this disk fill up in the next 4 hours?
predict_linear(node_filesystem_avail_bytes{mountpoint="/"}[6h], 4*3600) < 0
This takes 6 hours of historical data, fits a linear regression, and extrapolates 4 hours forward. If the predicted value is negative (disk full), fire an alert. This is the classic "disk filling up" alerting query — reactive monitoring notices when the disk is 90% full; predictive monitoring notices when it will be full on Tuesday.
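The mechanics are ordinary least squares. A toy sketch of the idea (not Prometheus's implementation; the sample data is invented):

```python
def predict_linear(points, seconds_ahead):
    """Least-squares line through (timestamp, value) samples,
    extrapolated seconds_ahead past the last sample — the idea
    behind PromQL's predict_linear()."""
    n = len(points)
    mean_t = sum(t for t, _ in points) / n
    mean_v = sum(v for _, v in points) / n
    slope = sum((t - mean_t) * (v - mean_v) for t, v in points) / sum(
        (t - mean_t) ** 2 for t, _ in points
    )
    intercept = mean_v - slope * mean_t
    target_t = points[-1][0] + seconds_ahead
    return intercept + slope * target_t

# 6h of hourly samples: disk losing 1 GB/hour, 2 GB free at the last sample
samples = [(h * 3600, 8e9 - h * 1e9) for h in range(7)]
print(predict_linear(samples, 4 * 3600) < 0)  # True: disk full within 4 hours
```

The linear fit is the whole trick: it smooths over noise in the window, then assumes the trend continues. Step changes (a log rotation, a cleanup job) break the assumption, which is why the lookback window should be several times longer than the prediction horizon.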
Flashcard Check¶
| Question | Answer |
|---|---|
Why use rate() instead of raw counter values? |
Counters reset on restart; rate() handles resets and gives a per-second rate |
What does by (le) do in histogram queries? |
Preserves bucket boundaries so histogram_quantile can compute percentiles |
| Why is averaging percentiles wrong? | The average of p99 values across instances is NOT the true p99 of all requests combined |
What does predict_linear() do? |
Fits a linear regression on historical data and extrapolates to a future timestamp |
Part 5: Variable Templates — One Dashboard, Every Environment¶
Hardcoding {namespace="production"} in every query means you need separate dashboards for
dev, staging, and production. Variables fix this.
Setting Up a Namespace Variable¶
In Dashboard Settings > Variables, create a query variable:
| Setting | Value |
|---|---|
| Name | namespace |
| Type | Query |
| Data source | Prometheus |
| Query | label_values(up, namespace) |
| Multi-value | Enabled |
| Include All | Enabled |
Now use $namespace in every query:
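For example (following the checkout metrics used earlier):

```promql
sum(rate(http_requests_total{namespace=~"$namespace", service="checkout"}[$__rate_interval]))
```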
The `=~` (regex match) operator handles multi-select: when the user picks "All," Grafana substitutes a regex matching everything.
Chaining Variables¶
Variables can depend on each other. A service variable that filters based on the selected
namespace:
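The chained variable's query references the first variable. Assuming your `up` series carries a `service` label (adjust to whatever label your metrics actually expose):

```promql
label_values(up{namespace=~"$namespace"}, service)
```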
Now the service dropdown only shows services in the selected namespace. This is how you build a single dashboard that works for 50 services across 3 environments.
Mental Model: Think of variables as parameterized queries. A dashboard without variables is a report about one thing. A dashboard with variables is a tool that works on anything. The best on-call dashboards are tools, not reports.
Part 6: The Three-Tier Dashboard Architecture¶
Not every dashboard is for the same audience or moment. Build three tiers:
Tier 1: Overview — "Is anything broken?"¶
One dashboard, every service, RED metrics only. This is what the on-call checks first.
Panels:

- Stat panels for each service: current error rate, colored red/green by threshold
- Time series: global request rate, global error rate, global p99 latency
- No per-pod detail. No infrastructure metrics. Just: are users happy?
Tier 2: Service — "What's broken in this service?"¶
One dashboard per service (using variables). RED metrics plus service-specific details.
Panels:

- Request rate by endpoint
- Error rate by endpoint and status code
- Latency percentiles (p50, p95, p99) by endpoint
- Recent deployments (annotation)
- Pod restart count
Tier 3: Debug — "Why is this specific thing broken?"¶
Detailed infrastructure and application internals. Only opened during an investigation.
Panels:

- Container CPU/memory per pod
- Go runtime metrics (goroutines, GC pause)
- Database connection pool utilization
- Loki log panel filtered to the service
- Tempo trace links via exemplars
Remember: Three-tier mnemonic: OSD — Overview, Service, Debug. Drill down from broad to narrow. During an incident, start at Tier 1, click through to Tier 2, then Tier 3. Each click narrows the investigation.
Part 7: Alerting — Waking Humans for the Right Reasons¶
Unified Alerting (Grafana 9+)¶
Grafana's unified alerting system replaced the old dashboard-bound alert rules. Now alerts are standalone objects with their own evaluation engine.
The key components:
| Component | What it does |
|---|---|
| Alert rule | A query + condition + evaluation interval |
| Contact point | Where notifications go (Slack, PagerDuty, email, webhook) |
| Notification policy | Routing tree: which alerts go to which contact points |
| Silence | Temporary mute during maintenance |
| Mute timing | Recurring schedule (e.g., no alerts on weekends for non-critical) |
Building an Alert Rule¶
For our checkout service, an alert that catches what the lying dashboard missed:
# Grafana alert rule (conceptual — created via UI or provisioning)
name: CheckoutHighErrorRate
condition: C
data:
  - refId: A
    # Total errors. Fixed [5m] window — alert rules evaluate outside any
    # dashboard, so dashboard variables like $__rate_interval don't exist here.
    expr: sum(rate(http_requests_total{service="checkout", status=~"5.."}[5m]))
  - refId: B
    # Total requests
    expr: sum(rate(http_requests_total{service="checkout"}[5m]))
  - refId: C
    # Error percentage
    expr: $A / $B
    condition: gt
    threshold: 0.01  # 1% error rate
evaluation_interval: 1m
pending_period: 3m  # must be true for 3 minutes before firing
labels:
  severity: critical
  team: platform
annotations:
  summary: "Checkout error rate is {{ $values.C | humanizePercentage }}"
  runbook_url: "https://wiki.internal/runbooks/checkout-errors"
Notice the critical difference from the lying dashboard: this computes the global error
rate (sum(errors) / sum(total)), not the average of per-instance rates. One pod returning
100% errors out of 8 pods produces a 12.5% global error rate — well above the 1% threshold.
Notification Policies¶
Route alerts to the right people:
# Notification policy tree
policies:
  - receiver: slack-general
    group_by: [alertname, namespace]
    group_wait: 30s
    routes:
      - match:
          severity: critical
        receiver: pagerduty-oncall
        continue: true  # also send to Slack
      - match:
          severity: critical
        receiver: slack-critical
      - match:
          severity: warning
        receiver: slack-warnings
        group_interval: 15m
Gotcha: The `continue: true` flag means "keep matching subsequent routes after this one." Without it, the first match wins and routing stops. Use `continue: true` when you want an alert to hit multiple receivers (PagerDuty AND Slack). Omit it for mutually exclusive routing. Test your routing with `amtool config routes test` if you're using Alertmanager directly.
Silences — When You Need Quiet¶
Planned maintenance at 3 AM? Silence the disk alerts before you start:
# Alertmanager CLI
amtool silence add alertname="DiskFillingUp" instance="node3:9100" \
  --comment="Disk replacement on node3" --duration=4h

# Or via Grafana UI: Alerting > Silences > New Silence
Part 8: Dashboard-as-Code — Stop Clicking, Start Committing¶
A dashboard configured by hand in the Grafana UI has no history, no review process, and no way to recover if someone accidentally deletes it at 2 AM.
Provisioning with YAML + JSON¶
Grafana reads provisioning files on startup from /etc/grafana/provisioning/.
Data source provisioning:
# /etc/grafana/provisioning/datasources/prometheus.yaml
apiVersion: 1
datasources:
  - name: Prometheus
    type: prometheus
    access: proxy
    url: http://prometheus:9090
    isDefault: true
    jsonData:
      timeInterval: "15s"  # Match your scrape interval
      exemplarTraceIdDestinations:
        - name: traceID
          datasourceUid: tempo
Dashboard provisioning:
# /etc/grafana/provisioning/dashboards/default.yaml
apiVersion: 1
providers:
  - name: default
    type: file
    options:
      path: /var/lib/grafana/dashboards
      foldersFromFilesStructure: true
Then place dashboard JSON files in /var/lib/grafana/dashboards/. Here is a minimal but
complete dashboard JSON for a RED method overview:
{
  "dashboard": {
    "title": "Checkout Service — RED",
    "uid": "checkout-red-v1",
    "tags": ["service", "checkout", "red"],
    "timezone": "utc",
    "panels": [
      {
        "title": "Request Rate",
        "type": "timeseries",
        "gridPos": {"h": 8, "w": 8, "x": 0, "y": 0},
        "targets": [
          {
            "expr": "sum(rate(http_requests_total{service=\"checkout\"}[$__rate_interval]))",
            "legendFormat": "req/sec"
          }
        ],
        "fieldConfig": {
          "defaults": {"unit": "reqps"}
        }
      },
      {
        "title": "Error Rate",
        "type": "timeseries",
        "gridPos": {"h": 8, "w": 8, "x": 8, "y": 0},
        "targets": [
          {
            "expr": "sum(rate(http_requests_total{service=\"checkout\", status=~\"5..\"}[$__rate_interval])) / sum(rate(http_requests_total{service=\"checkout\"}[$__rate_interval]))",
            "legendFormat": "error %"
          }
        ],
        "fieldConfig": {
          "defaults": {
            "unit": "percentunit",
            "max": 1,
            "thresholds": {
              "steps": [
                {"color": "green", "value": null},
                {"color": "yellow", "value": 0.01},
                {"color": "red", "value": 0.05}
              ]
            }
          }
        }
      },
      {
        "title": "Latency Percentiles",
        "type": "timeseries",
        "gridPos": {"h": 8, "w": 8, "x": 16, "y": 0},
        "targets": [
          {
            "expr": "histogram_quantile(0.50, sum(rate(http_request_duration_seconds_bucket{service=\"checkout\"}[$__rate_interval])) by (le))",
            "legendFormat": "p50"
          },
          {
            "expr": "histogram_quantile(0.95, sum(rate(http_request_duration_seconds_bucket{service=\"checkout\"}[$__rate_interval])) by (le))",
            "legendFormat": "p95"
          },
          {
            "expr": "histogram_quantile(0.99, sum(rate(http_request_duration_seconds_bucket{service=\"checkout\"}[$__rate_interval])) by (le))",
            "legendFormat": "p99"
          }
        ],
        "fieldConfig": {
          "defaults": {"unit": "s"}
        }
      }
    ],
    "templating": {
      "list": [
        {
          "name": "namespace",
          "type": "query",
          "query": "label_values(up, namespace)",
          "multi": true,
          "includeAll": true
        }
      ]
    },
    "time": {"from": "now-1h", "to": "now"},
    "refresh": "30s"
  }
}
Store this in Git. Deploy via CI. Never hand-edit in the UI again.
Grafonnet / Jsonnet¶
For teams managing dozens of dashboards, JSON gets unwieldy. Grafonnet is a Jsonnet library for generating Grafana dashboard JSON programmatically:
// checkout-dashboard.jsonnet
local grafana = import 'grafonnet/grafana.libsonnet';
local dashboard = grafana.dashboard;
local prometheus = grafana.prometheus;
local graphPanel = grafana.graphPanel;

dashboard.new(
  'Checkout Service — RED',
  tags=['service', 'checkout', 'red'],
  time_from='now-1h',
  refresh='30s',
)
.addPanel(
  graphPanel.new(
    'Request Rate',
    datasource='Prometheus',
  ).addTarget(
    prometheus.target(
      'sum(rate(http_requests_total{service="checkout"}[$__rate_interval]))',
      legendFormat='req/sec',
    )
  ),
  gridPos={h: 8, w: 8, x: 0, y: 0}
)
Compile with jsonnet -J vendor checkout-dashboard.jsonnet > checkout-dashboard.json. The
generated JSON gets provisioned as before.
Terraform Provider¶
For Grafana instances managed as infrastructure:
# grafana.tf
resource "grafana_dashboard" "checkout_red" {
  config_json = file("${path.module}/dashboards/checkout-red.json")
  folder      = grafana_folder.services.id
  overwrite   = true
}

resource "grafana_data_source" "prometheus" {
  type = "prometheus"
  name = "Prometheus"
  url  = "http://prometheus:9090"

  json_data_encoded = jsonencode({
    timeInterval = "15s"
  })
}

resource "grafana_contact_point" "pagerduty" {
  name = "pagerduty-oncall"

  pagerduty {
    integration_key = var.pagerduty_key
    severity        = "critical"
  }
}
Mental Model: Dashboard-as-code has three levels of maturity:

1. Manual — Click in the UI. No history. Fragile.
2. Provisioned — JSON in Git, loaded on startup. Reviewable, recoverable.
3. Generated — Jsonnet/Terraform generates JSON from templates. Consistent, scalable.
Most teams should aim for level 2. Level 3 pays off when you have 20+ dashboards with shared patterns.
Part 9: Loki and LogQL — When Metrics Aren't Enough¶
Metrics tell you what is wrong. Logs tell you why. Grafana bridges both.
LogQL in 60 Seconds¶
# Select a log stream by labels
{namespace="production", app="checkout"}
# Filter by content
{app="checkout"} |= "error" # contains "error"
{app="checkout"} != "healthcheck" # exclude healthchecks
{app="checkout"} |~ "status=(4|5).." # regex match
# Parse JSON and filter on fields
{app="checkout"} | json | status_code >= 500
# Metrics from logs — count errors per minute
sum(rate({app="checkout"} |= "error" [5m]))
# Top error endpoints from structured logs
sum by (path) (count_over_time({app="checkout"} | json | status_code >= 500 [1h]))
Name Origin: Loki is named after the Norse trickster god — fitting because it's deceptively lightweight. Unlike Elasticsearch, which indexes every word in every log line, Loki only indexes the labels (namespace, app, pod). The log content is stored as compressed chunks and only searched when you query. This makes Loki dramatically cheaper to operate — often 10x less infrastructure cost than an equivalent Elasticsearch cluster.
The Power Move: Metrics-to-Logs Drill-Down¶
In Grafana, you can click a spike on a Prometheus time series panel and split the view to show Loki logs from the exact same time range. Configure it with a "derived field" or by setting up data links between Prometheus and Loki panels.
This is the killer workflow during incidents: 1. Tier 1 dashboard: notice error rate spike (Prometheus) 2. Click the spike: see error logs from that time window (Loki) 3. Spot a stack trace with a trace ID 4. Click the trace ID: see the full distributed trace (Tempo)
Three data sources, one investigation flow, seconds instead of minutes.
Part 10: Tempo — Following a Request Across Services¶
Grafana Tempo stores distributed traces in object storage (S3, GCS) without requiring a separate indexing database. This makes it cheaper than Jaeger with Elasticsearch at scale.
Connecting Metrics to Traces with Exemplars¶
Exemplars attach a trace ID to specific metric observations. When you see a latency spike in a histogram, exemplars let you click through to the exact trace that caused it.
Enable exemplars in your Prometheus data source configuration:
# In the data source provisioning YAML
jsonData:
  exemplarTraceIdDestinations:
    - name: traceID
      datasourceUid: tempo
Now, on a time series panel showing latency, small diamonds appear at outlier data points. Click one, and Grafana opens the trace in Tempo. No more guessing which request was slow.
TraceQL — Querying Traces¶
# Find slow checkout requests with errors
{ span.http.target = "/api/checkout" && status = error && duration > 2s }
Trivia: Prometheus, Loki, and Tempo together form Grafana Labs' answer to the commercial observability platforms. All three use the same label-based data model, making correlation across metrics, logs, and traces seamless. Grafana Labs markets its hosted stack as "LGTM" (Loki, Grafana, Tempo, Mimir), an intentional pun on code-review approvals.
Part 11: Anti-Patterns — Dashboards That Actively Hurt You¶
Too Many Panels¶
A dashboard with 50 panels takes 30 seconds to load. During an incident, you scroll past 47 irrelevant panels to find the 3 that matter. By the time you find them, it's been 5 minutes.
Rule of thumb: 12 panels maximum on an overview dashboard. If you need more, you need another tier, not another row.
Percentile Misuse¶
# WRONG: average of per-instance p99 values
avg(histogram_quantile(0.99, rate(http_request_duration_seconds_bucket[5m])))
# RIGHT: p99 of all requests across all instances
histogram_quantile(0.99,
  sum(rate(http_request_duration_seconds_bucket[5m])) by (le)
)
The wrong query averages p99 values. Two instances with p99 of 100ms and 900ms do not average to a meaningful global p99. Aggregate the buckets first, then compute the percentile.
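A toy demonstration using a naive nearest-rank percentile (the latency samples are invented):

```python
def p99(samples):
    """Naive nearest-rank 99th percentile."""
    return sorted(samples)[int(0.99 * len(samples)) - 1]

fast = [100] * 99 + [120]          # instance A: p99 = 100ms
slow = [100] * 50 + [900] * 50     # instance B: p99 = 900ms

avg_of_p99s = (p99(fast) + p99(slow)) / 2
true_p99 = p99(fast + slow)

print(avg_of_p99s)  # 500.0  (what avg(per-instance p99) reports)
print(true_p99)     # 900    (what users in the slow tail actually see)
```

The average halves the tail because instance A's good p99 "dilutes" instance B's bad one, even though every slow request really happened.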
Counter Resets¶
A service restarts. The counter drops from 150,000 to 0. rate() handles this correctly in
steady state — but for 1-2 scrape intervals after a restart, the rate calculation can produce
brief artifacts. A dashboard showing irate() (instant rate, last two samples only) will
show a spike at the restart boundary.
Fix: Use rate() (averaged over a window) for dashboards and alerting, not irate().
irate() is for interactive exploration where you want maximum responsiveness.
Missing absent() Alerts¶
Your service crashes. It stops emitting metrics. Your error rate alert evaluates to "no data" — and most alert configurations treat "no data" as "not firing." The service is down, and nobody knows.
# Always pair RED alerts with an absent() check
- alert: CheckoutMetricsMissing
  expr: absent(up{job="checkout"} == 1)
  for: 2m
  labels:
    severity: critical
  annotations:
    summary: "Checkout service is not being scraped — service may be down"
Flashcard Check¶
| Question | Answer |
|---|---|
| Why is `avg(per-instance p99)` wrong? | You can't average percentiles — aggregate the histogram buckets first, then compute the quantile |
| What does `absent()` detect? | When a metric has completely vanished (not just zero — gone) |
| Why use `rate()` over `irate()` for alerting? | `rate()` averages over a window for stability; `irate()` uses only the last two samples, making it noisy |
| What happens to alerts when a target stops emitting? | Most expressions evaluate to "no data," which doesn't fire the alert — leading to silent failures |
Exercises¶
Exercise 1: Spot the Lie (2 minutes)¶
A dashboard shows `avg(rate(http_requests_total{status="500"}[5m]) / rate(http_requests_total[5m]))` — the average of per-pod error ratios — as the error rate.
There are 10 pods. Nine have zero errors. One has a 50% error rate.
What does the dashboard show? What is the actual global error rate?
Answer
The dashboard shows 5% (the average of nine 0% values and one 50% value). But if the broken pod handles 10% of traffic, the global error rate is 5%. If it handles 1% of traffic, the global error rate is 0.5%. The `avg()` query doesn't weight by traffic volume — it treats a pod handling 1 request/sec the same as a pod handling 1,000 requests/sec. The correct query, `sum(rate(errors[5m])) / sum(rate(total[5m]))`, weights by actual traffic.

Exercise 2: Build a RED Panel Set (10 minutes)¶
Create three Grafana panels for a service called payment-api:
1. Request rate (requests per second)
2. Error rate (percentage of 5xx responses)
3. Latency percentiles (p50, p95, p99)
Write the PromQL for each. Use $__rate_interval and a $namespace variable.
Solution
# Request rate
sum(rate(http_requests_total{service="payment-api", namespace=~"$namespace"}[$__rate_interval]))
# Error rate
sum(rate(http_requests_total{service="payment-api", namespace=~"$namespace", status=~"5.."}[$__rate_interval]))
/
sum(rate(http_requests_total{service="payment-api", namespace=~"$namespace"}[$__rate_interval]))
# p50
histogram_quantile(0.50,
sum(rate(http_request_duration_seconds_bucket{service="payment-api", namespace=~"$namespace"}[$__rate_interval])) by (le))
# p95
histogram_quantile(0.95,
sum(rate(http_request_duration_seconds_bucket{service="payment-api", namespace=~"$namespace"}[$__rate_interval])) by (le))
# p99
histogram_quantile(0.99,
sum(rate(http_request_duration_seconds_bucket{service="payment-api", namespace=~"$namespace"}[$__rate_interval])) by (le))
Exercise 3: Dashboard Autopsy (15 minutes)¶
Your on-call dashboard has these panels:

- CPU usage per node (gauge)
- Memory usage per node (gauge)
- Network bytes in/out (time series)
- Disk IOPS (time series)
- Pod count by namespace (stat)
- 47 more infrastructure metrics
A user reports 504 Gateway Timeout errors. How long would it take you to diagnose the issue using this dashboard? What panels would you add, and what would you remove?
Hint
This is a USE-only dashboard with no RED metrics. It tells you about infrastructure health but nothing about user experience. A user could be getting 504s while every infrastructure panel is green (the problem might be in application logic, not resource exhaustion).

Solution approach
Add RED panels at the top: request rate, error rate (especially 504s), latency percentiles. Add a table showing error rate by service to identify which service is producing 504s. Move infrastructure panels to a separate Tier 3 debug dashboard. The redesigned dashboard should answer "is anything broken for users?" in under 5 seconds. The current dashboard cannot answer that question at all.

Cheat Sheet¶
PromQL Quick Reference¶
| Query | What it does |
|---|---|
| `rate(counter[5m])` | Per-second rate of increase over 5 minutes |
| `increase(counter[1h])` | Total increase over 1 hour |
| `histogram_quantile(0.99, sum(rate(buckets[5m])) by (le))` | Global p99 latency |
| `sum(rate(errors[5m])) / sum(rate(total[5m]))` | Global error rate (weighted by traffic) |
| `predict_linear(gauge[6h], 4*3600) < 0` | Will this value hit zero in 4 hours? |
| `absent(up{job="x"} == 1)` | Is this target completely gone? |
| `topk(5, sum by (handler) (rate(total[5m])))` | Top 5 endpoints by request rate |
Panel Type Selection¶
| You want to show... | Use this panel |
|---|---|
| A trend over time | Time series |
| One important number | Stat |
| A value against a known range | Gauge |
| Ranking or comparison of items | Table |
| Distribution of values over time | Heatmap |
| Event context alongside metrics | Logs |
Dashboard Design Rules¶
| Rule | Why |
|---|---|
| RED for services, USE for resources | Covers both user experience and infrastructure |
| `sum/sum`, not `avg`, for error rates | Averages hide broken instances |
| Percentiles, not averages, for latency | Averages hide tail latency |
| 12 panels max per overview dashboard | More panels = slower load = slower incident response |
| Always add `absent()` alerts | Detect silent failures when metrics vanish |
| Variables for namespace/service | One dashboard works everywhere |
| Store dashboard JSON in Git | History, review, recovery |
LogQL Quick Reference¶
| Pattern | What it does |
|---|---|
| `{app="x"}` | Select log stream by label |
| `\|= "error"` | Filter: line contains "error" |
| `!= "healthcheck"` | Exclude lines containing "healthcheck" |
| `\|~ "status=(4\|5).."` | Regex filter |
| `\| json \| status >= 500` | Parse JSON, filter on field |
| `count_over_time({app="x"} \|= "error" [5m])` | Count matching lines over 5 minutes |
Takeaways¶
- Averages lie. Use `sum(errors)/sum(total)` for error rates, `histogram_quantile` for latency. Never average percentiles across instances.
- RED for services, USE for resources. Every service needs Rate, Errors, Duration panels. Infrastructure metrics belong on separate debug dashboards, not the on-call view.
- Three-tier dashboards: Overview, Service, Debug. Start broad, drill down. The on-call dashboard should answer "is anything broken?" in under 5 seconds.
- Always alert on `absent()`. A dead service emits no metrics. "No data" is not "no problem."
- Dashboard-as-code is not optional. If your dashboards live only in the Grafana UI, they have no history, no review, and no disaster recovery. Store JSON in Git.
- Metrics, logs, traces are not three tools — they're one workflow. Prometheus spike leads to Loki logs leads to Tempo traces. Configure the drill-down links. The seconds you save during an incident justify the hours of setup.
Related Lessons¶
- The Monitoring That Lied — Deep dive on all the ways metrics deceive you
- Prometheus and the Art of Not Alerting — Alert design philosophy
- SLOs: When Good Enough Is a Number — Error budgets and SLO-based alerting
- Log Pipelines: From Printf to Dashboard — The full logging stack
- OpenTelemetry: Following a Request Across Services — Distributed tracing deep dive