Grafana: Dashboards That Don't Lie
- lesson
- grafana-dashboard-design
- promql-for-dashboards
- panel-types
- variable-templates
- alerting
- dashboard-as-code
- loki/logql
- tempo-integration
- observability-anti-patterns
- l2

---

# Grafana — Dashboards That Don't Lie

Topics: Grafana dashboard design, PromQL for dashboards, panel types, variable templates, alerting, dashboard-as-code, Loki/LogQL, Tempo integration, observability anti-patterns
Level: L2 (Operations)
Time: 50–70 minutes
Prerequisites: None required (Prometheus basics explained inline)
The Mission¶
It's 2:47 AM. PagerDuty fires. Customers are reporting failed checkouts. You open the on-call dashboard — the one your team built six months ago with 47 panels.
Everything is green. CPU at 22%. Memory at 58%. Error rate 0.3%. Latency 120ms.
You flip to the Slack channel. Fifteen customers posted screenshots of 500 errors in the last ten minutes. Support says the checkout API is "completely broken."
You stare at the dashboard. It stares back. All green.
The dashboard is lying. Not because someone configured it wrong on purpose — but because the panels are answering the wrong questions, averaging away the signal, and hiding the outage behind comfortable numbers.
Your mission: understand why dashboards lie, then build ones that don't.
Part 1: Why That Dashboard Lied¶
Before we build anything, let's autopsy the dashboard that showed green during an outage.
The checkout service runs on 8 pods. Seven are healthy. One is in a crash loop, returning 500 errors on every request. Kubernetes keeps restarting it, so it's technically "up" most of the time.
Here's what the dashboard showed and why it was wrong:
| Panel | Showed | Reality | Why it lied |
|---|---|---|---|
| CPU | 22% avg | 22% avg | Correct but irrelevant — users don't care about CPU |
| Memory | 58% | 58% | Same — a resource metric, not a user-experience metric |
| Error rate | 0.3% | 12.5% global — one of 8 pods failing every request | Averaged across all 8 pods, the broken pod's errors vanished |
| Latency | 120ms | p99 was 8 seconds | Panel showed average, not percentiles |
War Story: This exact pattern — averaging hides the spike — is one of the most common dashboard failures in production. A team at a fintech company reported "dashboards showed green" during a 23-minute outage that cost them $180K in failed transactions. The root cause: their error rate panel used `avg(rate(http_errors_total[5m]))` across 12 instances. One instance had a 100% error rate; the other 11 were at 0%. The average: 8.3%, which their alert threshold of 10% didn't catch. After the incident, they switched to `sum(rate(errors[5m])) / sum(rate(total[5m]))` — the global error rate, not the average of per-instance rates.
The fix isn't "add more panels." It's asking better questions.
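The arithmetic behind the lie is easy to check. A toy sketch (the per-pod traffic split is invented for illustration) shows how an unweighted average of per-pod error ratios can diverge wildly from the traffic-weighted truth:

```python
# Ten pods: nine clean, one broken at 50% errors.
# The broken pod handles only 1 req/s; healthy pods handle 100 req/s each.
traffic = [100.0] * 9 + [1.0]       # requests/sec per pod (invented)
error_ratio = [0.0] * 9 + [0.5]     # per-pod error fraction

# What avg() of per-pod ratios reports: blind to traffic volume
avg_of_ratios = sum(error_ratio) / len(error_ratio)

# What sum(errors)/sum(total) reports: weighted by actual traffic
global_ratio = sum(t * e for t, e in zip(traffic, error_ratio)) / sum(traffic)

print(f"avg of per-pod ratios: {avg_of_ratios:.2%}")   # 5.00%
print(f"global error rate:     {global_ratio:.3%}")    # 0.055%
```

Flip the traffic split (put most of the load on the broken pod) and the average understates instead of overstates. Either way, it answers the wrong question.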
Part 2: The Frameworks — USE and RED¶
Two frameworks tell you which questions to ask. You need both.
RED Method — For Services¶
Rate, Errors, Duration. Created by Tom Wilkie (then at Weaveworks, later of Grafana Labs). Answers: "How is the service doing from the user's perspective?"
# Rate: requests per second
sum(rate(http_requests_total{service="checkout"}[5m]))
# Errors: error rate as a percentage
sum(rate(http_requests_total{service="checkout", status=~"5.."}[5m]))
/
sum(rate(http_requests_total{service="checkout"}[5m]))
# Duration: p99 latency
histogram_quantile(0.99,
sum(rate(http_request_duration_seconds_bucket{service="checkout"}[5m])) by (le)
)
Every service should have a RED dashboard. If a service doesn't have one, you're flying blind.
USE Method — For Resources¶
Utilization, Saturation, Errors. Created by Brendan Gregg (then at Joyent, later known for performance work at Netflix). Answers: "Is the infrastructure keeping up?"
# Utilization: CPU usage percentage
1 - avg(rate(node_cpu_seconds_total{mode="idle"}[5m])) by (instance)
# Saturation: CPU run queue length (load average)
node_load1
# Errors: disk I/O errors
rate(node_disk_io_errors_total[5m])
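Memory follows the same Utilization pattern (metric names as exposed by node_exporter):

```promql
# Utilization: memory in use as a fraction of total
1 - (node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes)
```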
Remember: RED for services (request-oriented), USE for infrastructure (resource-oriented). Mnemonic: RED lights for apps, USE tools for hardware.
The Four Golden Signals¶
Google's SRE book defines four golden signals: latency, traffic, errors, saturation. RED covers the first three for services. USE covers utilization, saturation, and errors for resources. Together, they're comprehensive.
| Framework | Scope | Signals | Creator |
|---|---|---|---|
| RED | Services | Rate, Errors, Duration | Tom Wilkie (Weaveworks, later Grafana Labs) |
| USE | Resources | Utilization, Saturation, Errors | Brendan Gregg (Joyent, later Netflix) |
| 4 Golden Signals | Both | Latency, Traffic, Errors, Saturation | Google SRE book |
Flashcard Check¶
| Question | Answer |
|---|---|
| What does RED stand for? | Rate, Errors, Duration — for services |
| What does USE stand for? | Utilization, Saturation, Errors — for resources |
| Who created the RED method? | Tom Wilkie, while at Weaveworks (later of Grafana Labs) |
| When do you use RED vs USE? | RED for request-driven services, USE for infrastructure resources (CPU, disk, network) |
Part 3: Panel Types — Choosing the Right Visualization¶
Grafana has many panel types. Using the wrong one is like using a screwdriver as a hammer — you can, but you shouldn't.
The Decision Table¶
| Panel Type | Use When | Example | Don't Use When |
|---|---|---|---|
| Time series | Showing trends over time | Request rate, latency percentiles | Displaying a single current value |
| Stat | One number matters right now | Total requests today, current uptime | You need to see trends |
| Gauge | Value against a known range | CPU at 73%, disk 85% full | The max value is unknown or unbounded |
| Table | Comparing multiple items | Top 10 endpoints by error rate | You need to see time trends |
| Heatmap | Distribution over time | Latency distribution (where are requests clustering?) | Fewer than ~100 data points |
| Logs | Correlating events with metrics | Error logs alongside latency spikes | High-volume log streams without filtering |
Time Series: The Workhorse¶
Most panels on most dashboards are time series. A few guidelines:
- Show p50, p95, and p99 on the same panel. Three lines, one glance.
- Use the right unit. Grafana can auto-format seconds, bytes, percentages — set it.
- Use `$__rate_interval` instead of hardcoding `[5m]`. It auto-adjusts to the dashboard's time range and Prometheus scrape interval.
# p50, p95, p99 on one panel — three queries, aliased
# Query A (alias: p50):
histogram_quantile(0.50, sum(rate(http_request_duration_seconds_bucket[$__rate_interval])) by (le))
# Query B (alias: p95):
histogram_quantile(0.95, sum(rate(http_request_duration_seconds_bucket[$__rate_interval])) by (le))
# Query C (alias: p99):
histogram_quantile(0.99, sum(rate(http_request_duration_seconds_bucket[$__rate_interval])) by (le))
Gotcha: `$__rate_interval` was introduced in Grafana 7.2. It calculates the minimum safe range for `rate()` based on your scrape interval and resolution. Before this, people hardcoded `[5m]` and got either noisy graphs (too short) or smoothed-away spikes (too long). If you're using an older Grafana, `$__interval` is the next best thing, but `$__rate_interval` is preferred.
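Grafana documents the rule as `max(panel interval + scrape interval, 4 × scrape interval)`. A quick sketch of that formula in Python (not Grafana's actual code, just the documented arithmetic):

```python
def rate_interval(panel_interval_s: float, scrape_interval_s: float) -> float:
    """Grafana's documented rule for $__rate_interval:
    max(panel interval + scrape interval, 4 * scrape interval)."""
    return max(panel_interval_s + scrape_interval_s, 4 * scrape_interval_s)

# 15s scrape, zoomed in (30s panel interval): the 4x-scrape floor wins
print(rate_interval(30, 15))   # 60
# Zoomed out (5m panel interval): the window grows with the zoom level
print(rate_interval(300, 15))  # 315
```

The floor guarantees every window contains enough samples for `rate()` to be meaningful, no matter how far the user zooms in.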
Heatmaps: Seeing the Distribution¶
A time series panel showing p99 latency tells you one number. A heatmap shows you the entire distribution — where most requests cluster, and whether the tail is a thin spike or a wide plateau.
# Heatmap query for latency distribution
sum(increase(http_request_duration_seconds_bucket[$__rate_interval])) by (le)
Set the panel to "Heatmap" format, Y-axis to the bucket boundaries, and the color scheme to something where hot spots jump out. When you see a bimodal distribution (two bright bands), it usually means two different code paths are serving the same endpoint.
Trivia: Grafana was created by Torkel Odegaard in 2014 as a fork of Kibana 3's dashboard panel. He wanted better visualization for Graphite metrics. The name "Grafana" is a portmanteau — he originally misspelled "Graphite" + "Kibana" and the name stuck. By 2024, Grafana had over 20 million users and Grafana Labs was valued at $6 billion.
Part 4: PromQL for Dashboards That Tell the Truth¶
PromQL is where dashboards get their honesty — or their lies. Here are the queries that matter, with explanations of what each piece does.
rate() — The Foundation¶
`rate()` calculates the per-second increase of a counter over a time window. Take `rate(http_requests_total{service="checkout"}[5m])` piece by piece:

| Piece | What it does |
|---|---|
| `http_requests_total` | Counter metric (only goes up, resets on restart) |
| `{service="checkout"}` | Label filter — only the checkout service |
| `[5m]` | Look back 5 minutes for data points |
| `rate(...)` | Per-second increase, averaged over the window |
Under the Hood: `rate()` handles counter resets. When Prometheus detects a counter value decreasing (process restarted), it assumes a reset and compensates. This is why you never alert on raw counter values — they drop to zero on restart, making `rate()` briefly unreliable. Use `rate()` over at least 4x your scrape interval (for a 15s scrape, use `[1m]` minimum, `[5m]` for stable alerting).
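A simplified sketch of the reset compensation (real `rate()` also extrapolates to the window boundaries, which this toy version skips):

```python
def increase_with_resets(samples):
    """Sum the increases between consecutive counter samples.
    A drop in value is treated as a restart from zero, which is
    how Prometheus's rate()/increase() compensate for resets."""
    total = 0.0
    for prev, cur in zip(samples, samples[1:]):
        # On a reset the counter restarted at 0, so the entire
        # current value counts as new increase.
        total += cur - prev if cur >= prev else cur
    return total

# Counter restarts between the 3rd and 4th scrape (230 -> 10)
print(increase_with_resets([100, 150, 230, 10, 60]))  # 190.0
```

Without the reset handling, the naive difference `60 - 100` would report a negative increase.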
histogram_quantile() — Percentiles Done Right¶
The query that replaces misleading averages, piece by piece:

| Piece | What it does |
|---|---|
| `http_request_duration_seconds_bucket` | Histogram buckets (each `le` label is a boundary) |
| `rate(...[5m])` | Per-second rate of observations falling into each bucket |
| `sum(...) by (le)` | Aggregate across all instances, keeping bucket boundaries |
| `histogram_quantile(0.99, ...)` | Estimate the value at the 99th percentile |
Gotcha: `histogram_quantile` interpolates linearly between bucket boundaries. If your SLO is "99% of requests under 200ms" but your buckets jump from 100ms to 250ms, the p99 calculation is an approximation that could be significantly off. Always add bucket boundaries at your SLO thresholds.
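To see how far off coarse buckets can be, here is a toy reimplementation of the interpolation (simplified; the real `histogram_quantile` also special-cases the lowest and highest buckets):

```python
def histogram_quantile(q, buckets):
    """Toy version of PromQL's linear interpolation.
    buckets: sorted list of (upper_bound, cumulative_count)."""
    total = buckets[-1][1]
    rank = q * total
    prev_bound, prev_count = 0.0, 0
    for bound, count in buckets:
        if count >= rank:
            # Interpolate linearly inside this bucket
            frac = (rank - prev_count) / (count - prev_count)
            return prev_bound + (bound - prev_bound) * frac
        prev_bound, prev_count = bound, count

# 1000 requests that all actually took ~110ms, observed with buckets
# that jump straight from 100ms to 250ms
coarse = [(0.1, 0), (0.25, 1000), (float("inf"), 1000)]
print(round(histogram_quantile(0.99, coarse), 4))  # 0.2485: reported p99 ~249ms
```

No request took longer than ~110ms, yet the reported p99 is ~249ms. With an extra boundary at 0.2 (the SLO threshold), the estimate would land safely inside the 100–200ms range and the SLO panel would tell the truth.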
increase() — Total Count Over a Window¶
Good for stat panels showing "137 errors in the last hour."
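A typical stat-panel query (metric names follow the checkout examples above):

```promql
# 5xx responses in the last hour, for a stat panel
sum(increase(http_requests_total{service="checkout", status=~"5.."}[1h]))
```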
predict_linear() — Seeing the Future¶
# Will this disk fill up in the next 4 hours?
predict_linear(node_filesystem_avail_bytes{mountpoint="/"}[6h], 4*3600) < 0
This takes 6 hours of historical data, fits a linear regression, and extrapolates 4 hours forward. If the predicted value is negative (disk full), fire an alert. This is the classic "disk filling up" alerting query — reactive monitoring notices when the disk is 90% full; predictive monitoring notices when it will be full on Tuesday.
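The mechanics are ordinary least squares. A toy sketch of the idea (not Prometheus's implementation; the sample data is invented):

```python
def predict_linear(points, seconds_ahead):
    """Least-squares line through (timestamp, value) samples,
    extrapolated seconds_ahead past the last sample — the idea
    behind PromQL's predict_linear()."""
    n = len(points)
    mean_t = sum(t for t, _ in points) / n
    mean_v = sum(v for _, v in points) / n
    slope = sum((t - mean_t) * (v - mean_v) for t, v in points) / sum(
        (t - mean_t) ** 2 for t, _ in points
    )
    intercept = mean_v - slope * mean_t
    target_t = points[-1][0] + seconds_ahead
    return intercept + slope * target_t

# 6h of hourly samples: disk losing 1 GB/hour, 2 GB free at the last sample
samples = [(h * 3600, 8e9 - h * 1e9) for h in range(7)]
print(predict_linear(samples, 4 * 3600) < 0)  # True: disk full within 4 hours
```

The linear fit is the whole trick: it smooths over noise in the window, then assumes the trend continues. Step changes (a log rotation, a cleanup job) break the assumption, which is why the lookback window should be several times longer than the prediction horizon.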
Flashcard Check¶
| Question | Answer |
|---|---|
Why use rate() instead of raw counter values? |
Counters reset on restart; rate() handles resets and gives a per-second rate |
What does by (le) do in histogram queries? |
Preserves bucket boundaries so histogram_quantile can compute percentiles |
| Why is averaging percentiles wrong? | The average of p99 values across instances is NOT the true p99 of all requests combined |
What does predict_linear() do? |
Fits a linear regression on historical data and extrapolates to a future timestamp |
Part 5: Variable Templates — One Dashboard, Every Environment¶
Hardcoding {namespace="production"} in every query means you need separate dashboards for
dev, staging, and production. Variables fix this.
Setting Up a Namespace Variable¶
In Dashboard Settings > Variables, create a query variable:
| Setting | Value |
|---|---|
| Name | namespace |
| Type | Query |
| Data source | Prometheus |
| Query | label_values(up, namespace) |
| Multi-value | Enabled |
| Include All | Enabled |
Now use $namespace in every query:
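For example (following the checkout metrics used earlier):

```promql
sum(rate(http_requests_total{namespace=~"$namespace", service="checkout"}[$__rate_interval]))
```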
The `=~` (regex match) operator handles multi-select: when the user picks "All," Grafana substitutes a regex matching everything.
Chaining Variables¶
Variables can depend on each other. A service variable that filters based on the selected
namespace:
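The chained variable's query references the first variable. Assuming your `up` series carries a `service` label (adjust to whatever label your metrics actually expose):

```promql
label_values(up{namespace=~"$namespace"}, service)
```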
Now the service dropdown only shows services in the selected namespace. This is how you build a single dashboard that works for 50 services across 3 environments.
Mental Model: Think of variables as parameterized queries. A dashboard without variables is a report about one thing. A dashboard with variables is a tool that works on anything. The best on-call dashboards are tools, not reports.
Part 6: The Three-Tier Dashboard Architecture¶
Not every dashboard is for the same audience or moment. Build three tiers:
Tier 1: Overview — "Is anything broken?"¶
One dashboard, every service, RED metrics only. This is what the on-call checks first.
Panels:

- Stat panels for each service: current error rate, colored red/green by threshold
- Time series: global request rate, global error rate, global p99 latency
- No per-pod detail. No infrastructure metrics. Just: are users happy?
Tier 2: Service — "What's broken in this service?"¶
One dashboard per service (using variables). RED metrics plus service-specific details.
Panels:

- Request rate by endpoint
- Error rate by endpoint and status code
- Latency percentiles (p50, p95, p99) by endpoint
- Recent deployments (annotation)
- Pod restart count
Tier 3: Debug — "Why is this specific thing broken?"¶
Detailed infrastructure and application internals. Only opened during an investigation.
Panels:

- Container CPU/memory per pod
- Go runtime metrics (goroutines, GC pause)
- Database connection pool utilization
- Loki log panel filtered to the service
- Tempo trace links via exemplars
Remember: Three-tier mnemonic: OSD — Overview, Service, Debug. Drill down from broad to narrow. During an incident, start at Tier 1, click through to Tier 2, then Tier 3. Each click narrows the investigation.
Part 7: Alerting — Waking Humans for the Right Reasons¶
Unified Alerting (Grafana 9+)¶
Grafana's unified alerting system replaced the old dashboard-bound alert rules. Now alerts are standalone objects with their own evaluation engine.
The key components:
| Component | What it does |
|---|---|
| Alert rule | A query + condition + evaluation interval |
| Contact point | Where notifications go (Slack, PagerDuty, email, webhook) |
| Notification policy | Routing tree: which alerts go to which contact points |
| Silence | Temporary mute during maintenance |
| Mute timing | Recurring schedule (e.g., no alerts on weekends for non-critical) |
Building an Alert Rule¶
For our checkout service, an alert that catches what the lying dashboard missed:
# Grafana alert rule (conceptual — created via UI or provisioning)
name: CheckoutHighErrorRate
condition: C
data:
  - refId: A
    # Total errors. Fixed [5m] window — alert rules evaluate outside any
    # dashboard, so dashboard variables like $__rate_interval don't exist here.
    expr: sum(rate(http_requests_total{service="checkout", status=~"5.."}[5m]))
  - refId: B
    # Total requests
    expr: sum(rate(http_requests_total{service="checkout"}[5m]))
  - refId: C
    # Error percentage
    expr: $A / $B
    condition: gt
    threshold: 0.01  # 1% error rate
evaluation_interval: 1m
pending_period: 3m  # must be true for 3 minutes before firing
labels:
  severity: critical
  team: platform
annotations:
  summary: "Checkout error rate is {{ $values.C | humanizePercentage }}"
  runbook_url: "https://wiki.internal/runbooks/checkout-errors"
Notice the critical difference from the lying dashboard: this computes the global error
rate (sum(errors) / sum(total)), not the average of per-instance rates. One pod returning
100% errors out of 8 pods produces a 12.5% global error rate — well above the 1% threshold.
Notification Policies¶
Route alerts to the right people:
# Notification policy tree
policies:
  - receiver: slack-general
    group_by: [alertname, namespace]
    group_wait: 30s
    routes:
      - match:
          severity: critical
        receiver: pagerduty-oncall
        continue: true  # also send to Slack
      - match:
          severity: critical
        receiver: slack-critical
      - match:
          severity: warning
        receiver: slack-warnings
        group_interval: 15m
Gotcha: The `continue: true` flag means "keep matching subsequent routes after this one." Without it, the first match wins and routing stops. Use `continue: true` when you want an alert to hit multiple receivers (PagerDuty AND Slack). Omit it for mutually exclusive routing. Test your routing with `amtool config routes test` if you're using Alertmanager directly.
Silences — When You Need Quiet¶
Planned maintenance at 3 AM? Silence the disk alerts before you start:
# Alertmanager CLI
amtool silence add alertname="DiskFillingUp" instance="node3:9100" \
  --comment="Disk replacement on node3" --duration=4h

# Or via Grafana UI: Alerting > Silences > New Silence
Part 8: Dashboard-as-Code — Stop Clicking, Start Committing¶
A dashboard configured by hand in the Grafana UI has no history, no review process, and no way to recover if someone accidentally deletes it at 2 AM.
Provisioning with YAML + JSON¶
Grafana reads provisioning files on startup from /etc/grafana/provisioning/.
Data source provisioning:
# /etc/grafana/provisioning/datasources/prometheus.yaml
apiVersion: 1
datasources:
  - name: Prometheus
    type: prometheus
    access: proxy
    url: http://prometheus:9090
    isDefault: true
    jsonData:
      timeInterval: "15s"  # Match your scrape interval
      exemplarTraceIdDestinations:
        - name: traceID
          datasourceUid: tempo
Dashboard provisioning:
# /etc/grafana/provisioning/dashboards/default.yaml
apiVersion: 1
providers:
  - name: default
    type: file
    options:
      path: /var/lib/grafana/dashboards
      foldersFromFilesStructure: true
Then place dashboard JSON files in /var/lib/grafana/dashboards/. Here is a minimal but
complete dashboard JSON for a RED method overview:
{
  "dashboard": {
    "title": "Checkout Service — RED",
    "uid": "checkout-red-v1",
    "tags": ["service", "checkout", "red"],
    "timezone": "utc",
    "panels": [
      {
        "title": "Request Rate",
        "type": "timeseries",
        "gridPos": {"h": 8, "w": 8, "x": 0, "y": 0},
        "targets": [
          {
            "expr": "sum(rate(http_requests_total{service=\"checkout\"}[$__rate_interval]))",
            "legendFormat": "req/sec"
          }
        ],
        "fieldConfig": {
          "defaults": {"unit": "reqps"}
        }
      },
      {
        "title": "Error Rate",
        "type": "timeseries",
        "gridPos": {"h": 8, "w": 8, "x": 8, "y": 0},
        "targets": [
          {
            "expr": "sum(rate(http_requests_total{service=\"checkout\", status=~\"5..\"}[$__rate_interval])) / sum(rate(http_requests_total{service=\"checkout\"}[$__rate_interval]))",
            "legendFormat": "error %"
          }
        ],
        "fieldConfig": {
          "defaults": {
            "unit": "percentunit",
            "max": 1,
            "thresholds": {
              "steps": [
                {"color": "green", "value": null},
                {"color": "yellow", "value": 0.01},
                {"color": "red", "value": 0.05}
              ]
            }
          }
        }
      },
      {
        "title": "Latency Percentiles",
        "type": "timeseries",
        "gridPos": {"h": 8, "w": 8, "x": 16, "y": 0},
        "targets": [
          {
            "expr": "histogram_quantile(0.50, sum(rate(http_request_duration_seconds_bucket{service=\"checkout\"}[$__rate_interval])) by (le))",
            "legendFormat": "p50"
          },
          {
            "expr": "histogram_quantile(0.95, sum(rate(http_request_duration_seconds_bucket{service=\"checkout\"}[$__rate_interval])) by (le))",
            "legendFormat": "p95"
          },
          {
            "expr": "histogram_quantile(0.99, sum(rate(http_request_duration_seconds_bucket{service=\"checkout\"}[$__rate_interval])) by (le))",
            "legendFormat": "p99"
          }
        ],
        "fieldConfig": {
          "defaults": {"unit": "s"}
        }
      }
    ],
    "templating": {
      "list": [
        {
          "name": "namespace",
          "type": "query",
          "query": "label_values(up, namespace)",
          "multi": true,
          "includeAll": true
        }
      ]
    },
    "time": {"from": "now-1h", "to": "now"},
    "refresh": "30s"
  }
}
Store this in Git. Deploy via CI. Never hand-edit in the UI again.
Grafonnet / Jsonnet¶
For teams managing dozens of dashboards, JSON gets unwieldy. Grafonnet is a Jsonnet library for generating Grafana dashboard JSON programmatically:
// checkout-dashboard.jsonnet
local grafana = import 'grafonnet/grafana.libsonnet';
local dashboard = grafana.dashboard;
local prometheus = grafana.prometheus;
local graphPanel = grafana.graphPanel;

dashboard.new(
  'Checkout Service — RED',
  tags=['service', 'checkout', 'red'],
  time_from='now-1h',
  refresh='30s',
)
.addPanel(
  graphPanel.new(
    'Request Rate',
    datasource='Prometheus',
  ).addTarget(
    prometheus.target(
      'sum(rate(http_requests_total{service="checkout"}[$__rate_interval]))',
      legendFormat='req/sec',
    )
  ),
  gridPos={h: 8, w: 8, x: 0, y: 0}
)
Compile with jsonnet -J vendor checkout-dashboard.jsonnet > checkout-dashboard.json. The
generated JSON gets provisioned as before.
Terraform Provider¶
For Grafana instances managed as infrastructure:
# grafana.tf
resource "grafana_dashboard" "checkout_red" {
  config_json = file("${path.module}/dashboards/checkout-red.json")
  folder      = grafana_folder.services.id
  overwrite   = true
}

resource "grafana_data_source" "prometheus" {
  type = "prometheus"
  name = "Prometheus"
  url  = "http://prometheus:9090"

  json_data_encoded = jsonencode({
    timeInterval = "15s"
  })
}

resource "grafana_contact_point" "pagerduty" {
  name = "pagerduty-oncall"

  pagerduty {
    integration_key = var.pagerduty_key
    severity        = "critical"
  }
}
Mental Model: Dashboard-as-code has three levels of maturity:

1. Manual — Click in the UI. No history. Fragile.
2. Provisioned — JSON in Git, loaded on startup. Reviewable, recoverable.
3. Generated — Jsonnet/Terraform generates JSON from templates. Consistent, scalable.
Most teams should aim for level 2. Level 3 pays off when you have 20+ dashboards with shared patterns.
Part 9: Loki and LogQL — When Metrics Aren't Enough¶
Metrics tell you what is wrong. Logs tell you why. Grafana bridges both.
LogQL in 60 Seconds¶
# Select a log stream by labels
{namespace="production", app="checkout"}
# Filter by content
{app="checkout"} |= "error" # contains "error"
{app="checkout"} != "healthcheck" # exclude healthchecks
{app="checkout"} |~ "status=(4|5).." # regex match
# Parse JSON and filter on fields
{app="checkout"} | json | status_code >= 500
# Metrics from logs — count errors per minute
sum(rate({app="checkout"} |= "error" [5m]))
# Top error endpoints from structured logs
sum by (path) (count_over_time({app="checkout"} | json | status_code >= 500 [1h]))
Name Origin: Loki is named after the Norse trickster god — fitting because it's deceptively lightweight. Unlike Elasticsearch, which indexes every word in every log line, Loki only indexes the labels (namespace, app, pod). The log content is stored as compressed chunks and only searched when you query. This makes Loki dramatically cheaper to operate — often 10x less infrastructure cost than an equivalent Elasticsearch cluster.
The Power Move: Metrics-to-Logs Drill-Down¶
In Grafana, you can click a spike on a Prometheus time series panel and split the view to show Loki logs from the exact same time range. Configure it with a "derived field" or by setting up data links between Prometheus and Loki panels.
This is the killer workflow during incidents: 1. Tier 1 dashboard: notice error rate spike (Prometheus) 2. Click the spike: see error logs from that time window (Loki) 3. Spot a stack trace with a trace ID 4. Click the trace ID: see the full distributed trace (Tempo)
Three data sources, one investigation flow, seconds instead of minutes.
Part 10: Tempo — Following a Request Across Services¶
Grafana Tempo stores distributed traces in object storage (S3, GCS) without requiring a separate indexing database. This makes it cheaper than Jaeger with Elasticsearch at scale.
Connecting Metrics to Traces with Exemplars¶
Exemplars attach a trace ID to specific metric observations. When you see a latency spike in a histogram, exemplars let you click through to the exact trace that caused it.
Enable exemplars in your Prometheus data source configuration:
# In the data source provisioning YAML
jsonData:
  exemplarTraceIdDestinations:
    - name: traceID
      datasourceUid: tempo
Now, on a time series panel showing latency, small diamonds appear at outlier data points. Click one, and Grafana opens the trace in Tempo. No more guessing which request was slow.
TraceQL — Querying Traces¶
# Find slow checkout requests with errors
{ span.http.target = "/api/checkout" && status = error && duration > 2s }
Trivia: Prometheus, Loki, and Tempo together form Grafana Labs' answer to the commercial observability platforms. All three use the same label-based data model, making correlation across metrics, logs, and traces seamless. Grafana Labs markets its hosted stack as "LGTM" (Loki, Grafana, Tempo, Mimir), an intentional pun on code-review approvals.
Part 11: Anti-Patterns — Dashboards That Actively Hurt You¶
Too Many Panels¶
A dashboard with 50 panels takes 30 seconds to load. During an incident, you scroll past 47 irrelevant panels to find the 3 that matter. By the time you find them, it's been 5 minutes.
Rule of thumb: 12 panels maximum on an overview dashboard. If you need more, you need another tier, not another row.
Percentile Misuse¶
# WRONG: average of per-instance p99 values
avg(histogram_quantile(0.99, rate(http_request_duration_seconds_bucket[5m])))
# RIGHT: p99 of all requests across all instances
histogram_quantile(0.99,
  sum(rate(http_request_duration_seconds_bucket[5m])) by (le)
)
The wrong query averages p99 values. Two instances with p99 of 100ms and 900ms do not average to a meaningful global p99. Aggregate the buckets first, then compute the percentile.
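A toy demonstration using a naive nearest-rank percentile (the latency samples are invented):

```python
def p99(samples):
    """Naive nearest-rank 99th percentile."""
    return sorted(samples)[int(0.99 * len(samples)) - 1]

fast = [100] * 99 + [120]          # instance A: p99 = 100ms
slow = [100] * 50 + [900] * 50     # instance B: p99 = 900ms

avg_of_p99s = (p99(fast) + p99(slow)) / 2
true_p99 = p99(fast + slow)

print(avg_of_p99s)  # 500.0  (what avg(per-instance p99) reports)
print(true_p99)     # 900    (what users in the slow tail actually see)
```

The average halves the tail because instance A's good p99 "dilutes" instance B's bad one, even though every slow request really happened.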
Counter Resets¶
A service restarts. The counter drops from 150,000 to 0. rate() handles this correctly in
steady state — but for 1-2 scrape intervals after a restart, the rate calculation can produce
brief artifacts. A dashboard showing irate() (instant rate, last two samples only) will
show a spike at the restart boundary.
Fix: Use rate() (averaged over a window) for dashboards and alerting, not irate().
irate() is for interactive exploration where you want maximum responsiveness.
Missing absent() Alerts¶
Your service crashes. It stops emitting metrics. Your error rate alert evaluates to "no data" — and most alert configurations treat "no data" as "not firing." The service is down, and nobody knows.
# Always pair RED alerts with an absent() check
- alert: CheckoutMetricsMissing
  expr: absent(up{job="checkout"} == 1)
  for: 2m
  labels:
    severity: critical
  annotations:
    summary: "Checkout service is not being scraped — service may be down"
Flashcard Check¶
| Question | Answer |
|---|---|
| Why is `avg(per-instance p99)` wrong? | You can't average percentiles — aggregate the histogram buckets first, then compute the quantile |
| What does `absent()` detect? | When a metric has completely vanished (not just zero — gone) |
| Why use `rate()` over `irate()` for alerting? | `rate()` averages over a window for stability; `irate()` uses only the last two samples, making it noisy |
| What happens to alerts when a target stops emitting? | Most expressions evaluate to "no data," which doesn't fire the alert — leading to silent failures |
Exercises¶
Exercise 1: Spot the Lie (2 minutes)¶
A dashboard shows `avg(rate(http_requests_total{status="500"}[5m]) / rate(http_requests_total[5m]))` — the average of per-pod error ratios — as the error rate.
There are 10 pods. Nine have zero errors. One has a 50% error rate.
What does the dashboard show? What is the actual global error rate?
Answer
The dashboard shows 5% (the average of nine 0% values and one 50% value). But if the broken pod handles 10% of traffic, the global error rate is 5%. If it handles 1% of traffic, the global error rate is 0.5%. The `avg()` query doesn't weight by traffic volume — it treats a pod handling 1 request/sec the same as a pod handling 1,000 requests/sec. The correct query, `sum(rate(errors[5m])) / sum(rate(total[5m]))`, weights by actual traffic.

Exercise 2: Build a RED Panel Set (10 minutes)¶
Create three Grafana panels for a service called payment-api:
1. Request rate (requests per second)
2. Error rate (percentage of 5xx responses)
3. Latency percentiles (p50, p95, p99)
Write the PromQL for each. Use $__rate_interval and a $namespace variable.
Solution
# Request rate
sum(rate(http_requests_total{service="payment-api", namespace=~"$namespace"}[$__rate_interval]))
# Error rate
sum(rate(http_requests_total{service="payment-api", namespace=~"$namespace", status=~"5.."}[$__rate_interval]))
/
sum(rate(http_requests_total{service="payment-api", namespace=~"$namespace"}[$__rate_interval]))
# p50
histogram_quantile(0.50,
sum(rate(http_request_duration_seconds_bucket{service="payment-api", namespace=~"$namespace"}[$__rate_interval])) by (le))
# p95
histogram_quantile(0.95,
sum(rate(http_request_duration_seconds_bucket{service="payment-api", namespace=~"$namespace"}[$__rate_interval])) by (le))
# p99
histogram_quantile(0.99,
sum(rate(http_request_duration_seconds_bucket{service="payment-api", namespace=~"$namespace"}[$__rate_interval])) by (le))
Exercise 3: Dashboard Autopsy (15 minutes)¶
Your on-call dashboard has these panels:

- CPU usage per node (gauge)
- Memory usage per node (gauge)
- Network bytes in/out (time series)
- Disk IOPS (time series)
- Pod count by namespace (stat)
- 47 more infrastructure metrics
A user reports 504 Gateway Timeout errors. How long would it take you to diagnose the issue using this dashboard? What panels would you add, and what would you remove?
Hint
This is a USE-only dashboard with no RED metrics. It tells you about infrastructure health but nothing about user experience. A user could be getting 504s while every infrastructure panel is green (the problem might be in application logic, not resource exhaustion).

Solution approach
Add RED panels at the top: request rate, error rate (especially 504s), latency percentiles. Add a table showing error rate by service to identify which service is producing 504s. Move infrastructure panels to a separate Tier 3 debug dashboard. The redesigned dashboard should answer "is anything broken for users?" in under 5 seconds. The current dashboard cannot answer that question at all.

Cheat Sheet¶
PromQL Quick Reference¶
| Query | What it does |
|---|---|
| `rate(counter[5m])` | Per-second rate of increase over 5 minutes |
| `increase(counter[1h])` | Total increase over 1 hour |
| `histogram_quantile(0.99, sum(rate(buckets[5m])) by (le))` | Global p99 latency |
| `sum(rate(errors[5m])) / sum(rate(total[5m]))` | Global error rate (weighted by traffic) |
| `predict_linear(gauge[6h], 4*3600) < 0` | Will this value hit zero in 4 hours? |
| `absent(up{job="x"} == 1)` | Is this target completely gone? |
| `topk(5, sum by (handler) (rate(total[5m])))` | Top 5 endpoints by request rate |
Panel Type Selection¶
| You want to show... | Use this panel |
|---|---|
| A trend over time | Time series |
| One important number | Stat |
| A value against a known range | Gauge |
| Ranking or comparison of items | Table |
| Distribution of values over time | Heatmap |
| Event context alongside metrics | Logs |
Dashboard Design Rules¶
| Rule | Why |
|---|---|
| RED for services, USE for resources | Covers both user experience and infrastructure |
| `sum/sum`, not `avg`, for error rates | Averages hide broken instances |
| Percentiles, not averages, for latency | Averages hide tail latency |
| 12 panels max per overview dashboard | More panels = slower load = slower incident response |
| Always add `absent()` alerts | Detect silent failures when metrics vanish |
| Variables for namespace/service | One dashboard works everywhere |
| Store dashboard JSON in Git | History, review, recovery |
LogQL Quick Reference¶
| Pattern | What it does |
|---|---|
| `{app="x"}` | Select log stream by label |
| `\|= "error"` | Filter: line contains "error" |
| `!= "healthcheck"` | Exclude lines containing "healthcheck" |
| `\|~ "status=(4\|5).."` | Regex filter |
| `\| json \| status >= 500` | Parse JSON, filter on field |
| `count_over_time({app="x"} \|= "error" [5m])` | Count matching lines over 5 minutes |
Takeaways¶
- Averages lie. Use `sum(errors)/sum(total)` for error rates, `histogram_quantile` for latency. Never average percentiles across instances.
- RED for services, USE for resources. Every service needs Rate, Errors, Duration panels. Infrastructure metrics belong on separate debug dashboards, not the on-call view.
- Three-tier dashboards: Overview, Service, Debug. Start broad, drill down. The on-call dashboard should answer "is anything broken?" in under 5 seconds.
- Always alert on `absent()`. A dead service emits no metrics. "No data" is not "no problem."
- Dashboard-as-code is not optional. If your dashboards live only in the Grafana UI, they have no history, no review, and no disaster recovery. Store JSON in Git.
- Metrics, logs, traces are not three tools — they're one workflow. Prometheus spike leads to Loki logs leads to Tempo traces. Configure the drill-down links. The seconds you save during an incident justify the hours of setup.
Related Lessons¶
- The Monitoring That Lied — Deep dive on all the ways metrics deceive you
- Prometheus and the Art of Not Alerting — Alert design philosophy
- SLOs: When Good Enough Is a Number — Error budgets and SLO-based alerting
- Log Pipelines: From Printf to Dashboard — The full logging stack
- OpenTelemetry: Following a Request Across Services — Distributed tracing deep dive