
Grafana: Dashboards That Don't Lie


Topics: Grafana dashboard design, PromQL for dashboards, panel types, variable templates, alerting, dashboard-as-code, Loki/LogQL, Tempo integration, observability anti-patterns
Level: L2 (Operations)
Time: 50–70 minutes
Prerequisites: None required (Prometheus basics explained inline)


The Mission

It's 2:47 AM. PagerDuty fires. Customers are reporting failed checkouts. You open the on-call dashboard — the one your team built six months ago with 47 panels.

Everything is green. CPU at 22%. Memory at 58%. Error rate 0.3%. Latency 120ms.

You flip to the Slack channel. Fifteen customers posted screenshots of 500 errors in the last ten minutes. Support says the checkout API is "completely broken."

You stare at the dashboard. It stares back. All green.

The dashboard is lying. Not because someone configured it wrong on purpose — but because the panels are answering the wrong questions, averaging away the signal, and hiding the outage behind comfortable numbers.

Your mission: understand why dashboards lie, then build ones that don't.


Part 1: Why That Dashboard Lied

Before we build anything, let's autopsy the dashboard that showed green during an outage.

The checkout service runs on 8 pods. Seven are healthy. One is in a crash loop, returning 500 errors on every request. Kubernetes keeps restarting it, so it's technically "up" most of the time.

Here's what the dashboard showed and why it was wrong:

| Panel | Showed | Reality | Why it lied |
|---|---|---|---|
| CPU | 22% avg | 22% avg | Correct but irrelevant — users don't care about CPU |
| Memory | 58% | 58% | Same — a resource metric, not a user-experience metric |
| Error rate | 0.3% | 12.5% on the broken pod | Averaged across all 8 pods, the broken pod's errors vanished |
| Latency | 120ms | p99 was 8 seconds | Panel showed average, not percentiles |

War Story: This exact pattern — averaging hides the spike — is one of the most common dashboard failures in production. A team at a fintech company reported "dashboards showed green" during a 23-minute outage that cost them $180K in failed transactions. The root cause: their error rate panel used avg(rate(http_errors_total[5m])) across 12 instances. One instance had a 100% error rate; the other 11 were at 0%. The average: 8.3%, which their alert threshold of 10% didn't catch. After the incident, they switched to sum(rate(errors[5m])) / sum(rate(total[5m])) — the global error rate, not the average of per-instance rates.
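To make the arithmetic concrete, here is a small Python sketch comparing the two queries: averaging per-pod error ratios versus dividing summed errors by summed requests. The per-pod numbers are invented for illustration (not from the incident) and show that avg() is blind to how much traffic each pod actually serves.

```python
# Each tuple is one pod: (errors/sec, requests/sec). Toy numbers.

def avg_of_ratios(pods):
    """What avg() of per-pod error percentages computes."""
    return sum(e / r for e, r in pods) / len(pods)

def global_ratio(pods):
    """What sum(errors) / sum(requests) computes: traffic-weighted."""
    return sum(e for e, _ in pods) / sum(r for _, r in pods)

# Case 1: the broken pod is crash-looping and serves almost no traffic
cold = [(0.0, 100.0)] * 11 + [(10.0, 10.0)]
# Case 2: the broken pod is a hot shard serving 5x normal traffic
hot = [(0.0, 100.0)] * 11 + [(500.0, 500.0)]

for name, pods in [("cold", cold), ("hot", hot)]:
    print(name, f"avg={avg_of_ratios(pods):.1%}", f"global={global_ratio(pods):.1%}")
# cold avg=8.3% global=0.9%
# hot  avg=8.3% global=31.2%
```

The average reads 8.3% in both cases; the true global rate is 0.9% in one and 31.2% in the other. Only sum/sum reflects what users experience.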

The fix isn't "add more panels." It's asking better questions.


Part 2: The Frameworks — USE and RED

Two frameworks tell you which questions to ask. You need both.

RED Method — For Services

Rate, Errors, Duration. Coined by Tom Wilkie (then at Weaveworks, now Grafana Labs). Answers: "How is the service doing from the user's perspective?"

# Rate: requests per second
sum(rate(http_requests_total{service="checkout"}[5m]))

# Errors: error rate as a percentage
sum(rate(http_requests_total{service="checkout", status=~"5.."}[5m]))
/
sum(rate(http_requests_total{service="checkout"}[5m]))

# Duration: p99 latency
histogram_quantile(0.99,
  sum(rate(http_request_duration_seconds_bucket{service="checkout"}[5m])) by (le)
)

Every service should have a RED dashboard. If a service doesn't have one, you're flying blind.

USE Method — For Resources

Utilization, Saturation, Errors. Created by performance engineer Brendan Gregg (while at Joyent, before his Netflix years). Answers: "Is the infrastructure keeping up?"

# Utilization: CPU usage percentage
1 - avg(rate(node_cpu_seconds_total{mode="idle"}[5m])) by (instance)

# Saturation: CPU run queue length (load average)
node_load1

# Errors: disk I/O errors
rate(node_disk_io_errors_total[5m])

Remember: RED for services (request-oriented), USE for infrastructure (resource-oriented). Mnemonic: RED lights for apps, USE tools for hardware.

The Four Golden Signals

Google's SRE book defines four golden signals: latency, traffic, errors, saturation. RED covers the first three for services. USE covers utilization, saturation, and errors for resources. Together, they're comprehensive.

| Framework | Scope | Signals | Creator |
|---|---|---|---|
| RED | Services | Rate, Errors, Duration | Tom Wilkie |
| USE | Resources | Utilization, Saturation, Errors | Brendan Gregg |
| 4 Golden Signals | Both | Latency, Traffic, Errors, Saturation | Google SRE book |

Flashcard Check

| Question | Answer |
|---|---|
| What does RED stand for? | Rate, Errors, Duration — for services |
| What does USE stand for? | Utilization, Saturation, Errors — for resources |
| Who created the RED method? | Tom Wilkie |
| When do you use RED vs USE? | RED for request-driven services, USE for infrastructure resources (CPU, disk, network) |

Part 3: Panel Types — Choosing the Right Visualization

Grafana has many panel types. Using the wrong one is like using a screwdriver as a hammer — you can, but you shouldn't.

The Decision Table

| Panel Type | Use When | Example | Don't Use When |
|---|---|---|---|
| Time series | Showing trends over time | Request rate, latency percentiles | Displaying a single current value |
| Stat | One number matters right now | Total requests today, current uptime | You need to see trends |
| Gauge | Value against a known range | CPU at 73%, disk 85% full | The max value is unknown or unbounded |
| Table | Comparing multiple items | Top 10 endpoints by error rate | You need to see time trends |
| Heatmap | Distribution over time | Latency distribution (where are requests clustering?) | Fewer than ~100 data points |
| Logs | Correlating events with metrics | Error logs alongside latency spikes | High-volume log streams without filtering |

Time Series: The Workhorse

Most panels on most dashboards are time series. A few guidelines:

  • Show p50, p95, and p99 on the same panel. Three lines, one glance.
  • Use the right unit. Grafana can auto-format seconds, bytes, percentages — set it.
  • Use $__rate_interval instead of hardcoding [5m]. It auto-adjusts to the dashboard's time range and Prometheus scrape interval.
# p50, p95, p99 on one panel — three queries, aliased
# Query A (alias: p50):
histogram_quantile(0.50, sum(rate(http_request_duration_seconds_bucket[$__rate_interval])) by (le))

# Query B (alias: p95):
histogram_quantile(0.95, sum(rate(http_request_duration_seconds_bucket[$__rate_interval])) by (le))

# Query C (alias: p99):
histogram_quantile(0.99, sum(rate(http_request_duration_seconds_bucket[$__rate_interval])) by (le))

Gotcha: $__rate_interval was introduced in Grafana 7.2. It calculates the minimum safe range for rate() based on your scrape interval and resolution. Before this, people hardcoded [5m] and got either noisy graphs (too short) or smoothed-away spikes (too long). If you're using an older Grafana, $__interval is the next best thing, but $__rate_interval is preferred.

Heatmaps: Seeing the Distribution

A time series panel showing p99 latency tells you one number. A heatmap shows you the entire distribution — where most requests cluster, and whether the tail is a thin spike or a wide plateau.

# Heatmap query for latency distribution
sum(increase(http_request_duration_seconds_bucket[$__rate_interval])) by (le)

Set the panel to "Heatmap" format, Y-axis to the bucket boundaries, and the color scheme to something where hot spots jump out. When you see a bimodal distribution (two bright bands), it usually means two different code paths are serving the same endpoint.

Trivia: Grafana was created by Torkel Ödegaard in 2014 as a fork of Kibana 3's dashboard panel; he wanted better visualization for Graphite metrics. By 2024, Grafana had over 20 million users and Grafana Labs was valued at $6 billion.


Part 4: PromQL for Dashboards That Tell the Truth

PromQL is where dashboards get their honesty — or their lies. Here are the queries that matter, with explanations of what each piece does.

rate() — The Foundation

rate() calculates the per-second increase of a counter over a time window.

rate(http_requests_total{service="checkout"}[5m])

| Piece | What it does |
|---|---|
| http_requests_total | Counter metric (only goes up, resets on restart) |
| {service="checkout"} | Label filter — only the checkout service |
| [5m] | Look back 5 minutes for data points |
| rate(...) | Per-second increase, averaged over the window |

Under the Hood: rate() handles counter resets. When Prometheus detects a counter value decreasing (process restarted), it assumes a reset and compensates. This is why you never alert on raw counter values — they drop to zero on restart, making rate() briefly unreliable. Use rate() over at least 4x your scrape interval (for 15s scrape, use [1m] minimum, [5m] for stable alerting).
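A minimal Python sketch of the reset handling described above (this is illustrative, not the Prometheus implementation, which additionally extrapolates to the edges of the range window):

```python
# Reset-aware increase: when a counter sample is lower than its predecessor,
# assume the process restarted and the counter began again from zero, so the
# whole new value counts as increase.

def counter_increase(samples):
    total = 0.0
    for prev, cur in zip(samples, samples[1:]):
        total += cur - prev if cur >= prev else cur  # reset: count cur from 0
    return total

def per_second_rate(samples, window_seconds):
    return counter_increase(samples) / window_seconds

# Counter reaches 150_000, the process restarts, then it climbs again:
samples = [149_000, 149_500, 150_000, 100, 200, 300]
print(counter_increase(samples))     # 1300.0 (1000 before + 300 after the reset)
print(per_second_rate(samples, 75))  # ~17.3 per second
```

Without the reset branch, the naive difference `300 - 149_000` would report a huge negative rate at the restart boundary.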

histogram_quantile() — Percentiles Done Right

The query that replaces misleading averages.

histogram_quantile(0.99,
  sum(rate(http_request_duration_seconds_bucket[5m])) by (le)
)

| Piece | What it does |
|---|---|
| http_request_duration_seconds_bucket | Histogram buckets (each le label is a boundary) |
| rate(...[5m]) | Per-second rate of observations falling into each bucket |
| sum(...) by (le) | Aggregate across all instances, keeping bucket boundaries |
| histogram_quantile(0.99, ...) | Estimate the value at the 99th percentile |

Gotcha: histogram_quantile interpolates linearly between bucket boundaries. If your SLO is "99% of requests under 200ms" but your buckets jump from 100ms to 250ms, the p99 calculation is an approximation that could be significantly off. Always add bucket boundaries at your SLO thresholds:

// Go prometheus client — custom buckets
prometheus.HistogramOpts{
    Name:    "http_request_duration_seconds",
    Buckets: []float64{.005, .01, .025, .05, .1, .15, .2, .25, .3, .5, 1, 2.5, 5, 10},
}
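To see how much the bucket layout matters, here is a much-simplified Python re-implementation of the interpolation, using hypothetical cumulative counts (it ignores Prometheus edge cases such as the +Inf bucket and NaN handling):

```python
# Simplified histogram_quantile: walk cumulative buckets, then linearly
# interpolate inside the bucket where the target rank falls.

def hist_quantile(q, buckets):
    """buckets: sorted (upper_bound_seconds, cumulative_count) pairs."""
    total = buckets[-1][1]
    rank = q * total
    lo_bound, lo_count = 0.0, 0.0
    for hi_bound, hi_count in buckets:
        if rank <= hi_count:
            frac = (rank - lo_count) / (hi_count - lo_count)
            return lo_bound + (hi_bound - lo_bound) * frac
        lo_bound, lo_count = hi_bound, hi_count

# Coarse buckets that jump straight past a 200ms SLO boundary:
coarse = [(0.1, 900), (0.25, 990), (0.5, 1000)]
p95 = hist_quantile(0.95, coarse)
print(f"{p95 * 1000:.0f}ms")  # 183ms, but this is pure interpolation: the
                              # real p95 could be anywhere in (100ms, 250ms]
```

All 90 observations in the 100–250ms bucket could in reality sit at 110ms or at 240ms; the estimate is just a straight line between the boundaries, which is exactly why a bucket edge at your SLO threshold matters.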

increase() — Total Count Over a Window

# How many 5xx errors in the last hour?
increase(http_requests_total{status=~"5.."}[1h])

Good for stat panels showing "137 errors in the last hour."

predict_linear() — Seeing the Future

# Will this disk fill up in the next 4 hours?
predict_linear(node_filesystem_avail_bytes{mountpoint="/"}[6h], 4*3600) < 0

This takes 6 hours of historical data, fits a linear regression, and extrapolates 4 hours forward. If the predicted value is negative (disk full), fire an alert. This is the classic "disk filling up" alerting query — reactive monitoring notices when the disk is 90% full; predictive monitoring notices when it will be full on Tuesday.
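Under the hood this is ordinary least-squares extrapolation. A rough Python equivalent, run against synthetic disk-space samples (the data and helper are illustrative, not a Prometheus API):

```python
# Least-squares line through (timestamp, value) samples, evaluated at
# now + seconds_ahead. Mirrors what predict_linear() does conceptually.

def predict_linear(samples, seconds_ahead):
    """samples: list of (unix_ts, value). Returns the extrapolated value."""
    n = len(samples)
    mean_t = sum(t for t, _ in samples) / n
    mean_v = sum(v for _, v in samples) / n
    slope = (sum((t - mean_t) * (v - mean_v) for t, v in samples)
             / sum((t - mean_t) ** 2 for t, _ in samples))
    intercept = mean_v - slope * mean_t
    now = samples[-1][0]
    return slope * (now + seconds_ahead) + intercept

# Free disk space dropping 1 GiB per hour, sampled hourly for 6 hours:
gib = 2**30
samples = [(h * 3600, (10 - h) * gib) for h in range(7)]  # 10 GiB -> 4 GiB

# 4 GiB left, losing 1 GiB/hour: extrapolates to ~0 bytes free in 4 hours,
# which is exactly when the PromQL expression above would fire.
print(predict_linear(samples, 4 * 3600) / gib)
```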

Flashcard Check

| Question | Answer |
|---|---|
| Why use rate() instead of raw counter values? | Counters reset on restart; rate() handles resets and gives a per-second rate |
| What does by (le) do in histogram queries? | Preserves bucket boundaries so histogram_quantile can compute percentiles |
| Why is averaging percentiles wrong? | The average of p99 values across instances is NOT the true p99 of all requests combined |
| What does predict_linear() do? | Fits a linear regression on historical data and extrapolates to a future timestamp |

Part 5: Variable Templates — One Dashboard, Every Environment

Hardcoding {namespace="production"} in every query means you need separate dashboards for dev, staging, and production. Variables fix this.

Setting Up a Namespace Variable

In Dashboard Settings > Variables, create a query variable:

| Setting | Value |
|---|---|
| Name | namespace |
| Type | Query |
| Data source | Prometheus |
| Query | label_values(up, namespace) |
| Multi-value | Enabled |
| Include All | Enabled |

Now use $namespace in every query:

sum(rate(http_requests_total{namespace=~"$namespace"}[5m])) by (service)

The =~ (regex match) handles multi-select. When the user picks "All," Grafana substitutes a regex matching everything.
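A sketch of why the matcher must be =~ rather than =: multi-select becomes a pipe-joined regex. This is simplified; real Grafana also regex-escapes the values, and the "All" substitution is configurable (often a custom pattern such as .+).

```python
import re

def to_regex(selected, all_values):
    """Mimic Grafana's multi-value substitution, much simplified."""
    values = all_values if selected == "All" else selected
    return "|".join(values)

namespaces = ["dev", "staging", "production"]
pattern = to_regex(["staging", "production"], namespaces)
print(pattern)  # staging|production

# PromQL's =~ is an anchored, full-string regex match:
print(bool(re.fullmatch(pattern, "staging")))  # True
print(bool(re.fullmatch(pattern, "dev")))      # False
```

An exact-match `namespace="staging|production"` would match nothing, because no namespace literally contains a pipe character.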

Chaining Variables

Variables can depend on each other. A service variable that filters based on the selected namespace:

label_values(up{namespace=~"$namespace"}, service)

Now the service dropdown only shows services in the selected namespace. This is how you build a single dashboard that works for 50 services across 3 environments.

Mental Model: Think of variables as parameterized queries. A dashboard without variables is a report about one thing. A dashboard with variables is a tool that works on anything. The best on-call dashboards are tools, not reports.


Part 6: The Three-Tier Dashboard Architecture

Not every dashboard is for the same audience or moment. Build three tiers:

Tier 1: Overview — "Is anything broken?"

One dashboard, every service, RED metrics only. This is what the on-call checks first.

Panels:

  • Stat panels for each service: current error rate, colored red/green by threshold
  • Time series: global request rate, global error rate, global p99 latency
  • No per-pod detail. No infrastructure metrics. Just: are users happy?

Tier 2: Service — "What's broken in this service?"

One dashboard per service (using variables). RED metrics plus service-specific details.

Panels:

  • Request rate by endpoint
  • Error rate by endpoint and status code
  • Latency percentiles (p50, p95, p99) by endpoint
  • Recent deployments (annotation)
  • Pod restart count

Tier 3: Debug — "Why is this specific thing broken?"

Detailed infrastructure and application internals. Only opened during an investigation.

Panels:

  • Container CPU/memory per pod
  • Go runtime metrics (goroutines, GC pause)
  • Database connection pool utilization
  • Loki log panel filtered to the service
  • Tempo trace links via exemplars

Remember: Three-tier mnemonic: OSD — Overview, Service, Debug. Drill down from broad to narrow. During an incident, start at Tier 1, click through to Tier 2, then Tier 3. Each click narrows the investigation.


Part 7: Alerting — Waking Humans for the Right Reasons

Unified Alerting (Grafana 9+)

Grafana's unified alerting system replaced the old dashboard-bound alert rules. Now alerts are standalone objects with their own evaluation engine.

The key components:

| Component | What it does |
|---|---|
| Alert rule | A query + condition + evaluation interval |
| Contact point | Where notifications go (Slack, PagerDuty, email, webhook) |
| Notification policy | Routing tree: which alerts go to which contact points |
| Silence | Temporary mute during maintenance |
| Mute timing | Recurring schedule (e.g., no alerts on weekends for non-critical) |

Building an Alert Rule

For our checkout service, an alert that catches what the lying dashboard missed:

# Grafana alert rule (conceptual — created via UI or provisioning)
name: CheckoutHighErrorRate
condition: C
data:
  - refId: A
    # Total errors
    expr: sum(rate(http_requests_total{service="checkout", status=~"5.."}[$__rate_interval]))
  - refId: B
    # Total requests
    expr: sum(rate(http_requests_total{service="checkout"}[$__rate_interval]))
  - refId: C
    # Error percentage
    expr: $A / $B
    condition: gt
    threshold: 0.01  # 1% error rate
evaluation_interval: 1m
pending_period: 3m  # must be true for 3 minutes
labels:
  severity: critical
  team: platform
annotations:
  summary: "Checkout error rate is {{ $values.C | humanizePercentage }}"
  runbook_url: "https://wiki.internal/runbooks/checkout-errors"

Notice the critical difference from the lying dashboard: this computes the global error rate (sum(errors) / sum(total)), not the average of per-instance rates. With traffic balanced evenly, one pod returning 100% errors out of 8 produces a 12.5% global error rate — well above the 1% threshold.

Notification Policies

Route alerts to the right people:

# Notification policy tree
policies:
  - receiver: slack-general
    group_by: [alertname, namespace]
    group_wait: 30s
    routes:
      - match:
          severity: critical
        receiver: pagerduty-oncall
        continue: true  # also send to Slack
      - match:
          severity: critical
        receiver: slack-critical
      - match:
          severity: warning
        receiver: slack-warnings
        group_interval: 15m

Gotcha: The continue: true flag means "keep matching subsequent routes after this one." Without it, the first match wins and routing stops. Use continue: true when you want an alert to hit multiple receivers (PagerDuty AND Slack). Omit it for mutually exclusive routing. Test your routing with amtool config routes test if you're using Alertmanager directly.
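The routing semantics are easy to misread, so here is a toy Python model of a flat (non-nested) policy list under the rules just described. The structures mirror the YAML above, but the code is illustrative, not Alertmanager's:

```python
# First-match-wins routing with an opt-in "continue" flag, plus a fallback
# default receiver when nothing matches. Real policies can nest; this is flat.

def route(alert_labels, routes, default_receiver):
    receivers = []
    for r in routes:
        if all(alert_labels.get(k) == v for k, v in r["match"].items()):
            receivers.append(r["receiver"])
            if not r.get("continue", False):
                return receivers  # first match wins unless continue is set
    return receivers or [default_receiver]

routes = [
    {"match": {"severity": "critical"}, "receiver": "pagerduty-oncall", "continue": True},
    {"match": {"severity": "critical"}, "receiver": "slack-critical"},
    {"match": {"severity": "warning"}, "receiver": "slack-warnings"},
]

print(route({"severity": "critical"}, routes, "slack-general"))
# ['pagerduty-oncall', 'slack-critical']  <- both fire, thanks to continue
print(route({"severity": "info"}, routes, "slack-general"))
# ['slack-general']  <- no match falls through to the default receiver
```

Delete the `"continue": True` and the critical alert stops at PagerDuty, never reaching the Slack channel; that is the silent misconfiguration the gotcha warns about.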

Silences — When You Need Quiet

Planned maintenance at 3 AM? Silence the disk alerts before you start:

# Alertmanager CLI
amtool silence add alertname="DiskFillingUp" instance="node3:9100" \
  --comment="Disk replacement on node3" --duration=4h

# Or via Grafana UI: Alerting > Silences > New Silence

Part 8: Dashboard-as-Code — Stop Clicking, Start Committing

A dashboard configured by hand in the Grafana UI has no history, no review process, and no way to recover if someone accidentally deletes it at 2 AM.

Provisioning with YAML + JSON

Grafana reads provisioning files on startup from /etc/grafana/provisioning/.

Data source provisioning:

# /etc/grafana/provisioning/datasources/prometheus.yaml
apiVersion: 1
datasources:
  - name: Prometheus
    type: prometheus
    access: proxy
    url: http://prometheus:9090
    isDefault: true
    jsonData:
      timeInterval: "15s"   # Match your scrape interval
      exemplarTraceIdDestinations:
        - name: traceID
          datasourceUid: tempo

Dashboard provisioning:

# /etc/grafana/provisioning/dashboards/default.yaml
apiVersion: 1
providers:
  - name: default
    type: file
    options:
      path: /var/lib/grafana/dashboards
      foldersFromFilesStructure: true

Then place dashboard JSON files in /var/lib/grafana/dashboards/. Here is a minimal but complete dashboard JSON for a RED method overview:

{
  "dashboard": {
    "title": "Checkout Service — RED",
    "uid": "checkout-red-v1",
    "tags": ["service", "checkout", "red"],
    "timezone": "utc",
    "panels": [
      {
        "title": "Request Rate",
        "type": "timeseries",
        "gridPos": {"h": 8, "w": 8, "x": 0, "y": 0},
        "targets": [
          {
            "expr": "sum(rate(http_requests_total{service=\"checkout\"}[$__rate_interval]))",
            "legendFormat": "req/sec"
          }
        ],
        "fieldConfig": {
          "defaults": {"unit": "reqps"}
        }
      },
      {
        "title": "Error Rate",
        "type": "timeseries",
        "gridPos": {"h": 8, "w": 8, "x": 8, "y": 0},
        "targets": [
          {
            "expr": "sum(rate(http_requests_total{service=\"checkout\", status=~\"5..\"}[$__rate_interval])) / sum(rate(http_requests_total{service=\"checkout\"}[$__rate_interval]))",
            "legendFormat": "error %"
          }
        ],
        "fieldConfig": {
          "defaults": {"unit": "percentunit", "max": 1, "thresholds": {
            "steps": [
              {"color": "green", "value": null},
              {"color": "yellow", "value": 0.01},
              {"color": "red", "value": 0.05}
            ]
          }}
        }
      },
      {
        "title": "Latency Percentiles",
        "type": "timeseries",
        "gridPos": {"h": 8, "w": 8, "x": 16, "y": 0},
        "targets": [
          {
            "expr": "histogram_quantile(0.50, sum(rate(http_request_duration_seconds_bucket{service=\"checkout\"}[$__rate_interval])) by (le))",
            "legendFormat": "p50"
          },
          {
            "expr": "histogram_quantile(0.95, sum(rate(http_request_duration_seconds_bucket{service=\"checkout\"}[$__rate_interval])) by (le))",
            "legendFormat": "p95"
          },
          {
            "expr": "histogram_quantile(0.99, sum(rate(http_request_duration_seconds_bucket{service=\"checkout\"}[$__rate_interval])) by (le))",
            "legendFormat": "p99"
          }
        ],
        "fieldConfig": {
          "defaults": {"unit": "s"}
        }
      }
    ],
    "templating": {
      "list": [
        {
          "name": "namespace",
          "type": "query",
          "query": "label_values(up, namespace)",
          "multi": true,
          "includeAll": true
        }
      ]
    },
    "time": {"from": "now-1h", "to": "now"},
    "refresh": "30s"
  }
}

Store this in Git. Deploy via CI. Never hand-edit in the UI again.

Grafonnet / Jsonnet

For teams managing dozens of dashboards, JSON gets unwieldy. Grafonnet is a Jsonnet library for generating Grafana dashboard JSON programmatically:

// checkout-dashboard.jsonnet
local grafana = import 'grafonnet/grafana.libsonnet';
local dashboard = grafana.dashboard;
local prometheus = grafana.prometheus;
local graphPanel = grafana.graphPanel;

dashboard.new(
  'Checkout Service — RED',
  tags=['service', 'checkout', 'red'],
  time_from='now-1h',
  refresh='30s',
)
.addPanel(
  graphPanel.new(
    'Request Rate',
    datasource='Prometheus',
  ).addTarget(
    prometheus.target(
      'sum(rate(http_requests_total{service="checkout"}[$__rate_interval]))',
      legendFormat='req/sec',
    )
  ), gridPos={h: 8, w: 8, x: 0, y: 0}
)

Compile with jsonnet -J vendor checkout-dashboard.jsonnet > checkout-dashboard.json. The generated JSON gets provisioned as before.

Terraform Provider

For Grafana instances managed as infrastructure:

# grafana.tf
resource "grafana_dashboard" "checkout_red" {
  config_json = file("${path.module}/dashboards/checkout-red.json")
  folder      = grafana_folder.services.id
  overwrite   = true
}

resource "grafana_data_source" "prometheus" {
  type = "prometheus"
  name = "Prometheus"
  url  = "http://prometheus:9090"

  json_data_encoded = jsonencode({
    timeInterval = "15s"
  })
}

resource "grafana_contact_point" "pagerduty" {
  name = "pagerduty-oncall"
  pagerduty {
    integration_key = var.pagerduty_key
    severity        = "critical"
  }
}

Mental Model: Dashboard-as-code has three levels of maturity:

  1. Manual — Click in the UI. No history. Fragile.
  2. Provisioned — JSON in Git, loaded on startup. Reviewable, recoverable.
  3. Generated — Jsonnet/Terraform generates JSON from templates. Consistent, scalable.

Most teams should aim for level 2. Level 3 pays off when you have 20+ dashboards with shared patterns.


Part 9: Loki and LogQL — When Metrics Aren't Enough

Metrics tell you what is wrong. Logs tell you why. Grafana bridges both.

LogQL in 60 Seconds

# Select a log stream by labels
{namespace="production", app="checkout"}

# Filter by content
{app="checkout"} |= "error"              # contains "error"
{app="checkout"} != "healthcheck"         # exclude healthchecks
{app="checkout"} |~ "status=(4|5).."      # regex match

# Parse JSON and filter on fields
{app="checkout"} | json | status_code >= 500

# Metrics from logs — count errors per minute
sum(rate({app="checkout"} |= "error" [5m]))

# Top error endpoints from structured logs
sum by (path) (count_over_time({app="checkout"} | json | status_code >= 500 [1h]))

Name Origin: Loki is named after the Norse trickster god — fitting because it's deceptively lightweight. Unlike Elasticsearch, which indexes every word in every log line, Loki only indexes the labels (namespace, app, pod). The log content is stored as compressed chunks and only searched when you query. This makes Loki dramatically cheaper to operate — often 10x less infrastructure cost than an equivalent Elasticsearch cluster.

The Power Move: Metrics-to-Logs Drill-Down

In Grafana, you can click a spike on a Prometheus time series panel and split the view to show Loki logs from the exact same time range. Configure it with a "derived field" or by setting up data links between Prometheus and Loki panels.

This is the killer workflow during incidents:

  1. Tier 1 dashboard: notice error rate spike (Prometheus)
  2. Click the spike: see error logs from that time window (Loki)
  3. Spot a stack trace with a trace ID
  4. Click the trace ID: see the full distributed trace (Tempo)

Three data sources, one investigation flow, seconds instead of minutes.


Part 10: Tempo — Following a Request Across Services

Grafana Tempo stores distributed traces in object storage (S3, GCS) without requiring a separate indexing database. This makes it cheaper than Jaeger with Elasticsearch at scale.

Connecting Metrics to Traces with Exemplars

Exemplars attach a trace ID to specific metric observations. When you see a latency spike in a histogram, exemplars let you click through to the exact trace that caused it.

Enable exemplars in your Prometheus data source configuration:

# In the data source provisioning YAML
jsonData:
  exemplarTraceIdDestinations:
    - name: traceID
      datasourceUid: tempo

Now, on a time series panel showing latency, small diamonds appear at outlier data points. Click one, and Grafana opens the trace in Tempo. No more guessing which request was slow.

TraceQL — Querying Traces

# Find slow checkout requests with errors
{ span.http.target = "/api/checkout" && status = error && duration > 2s }

Trivia: The Prometheus–Loki–Tempo combination is Grafana Labs' answer to the commercial observability platforms. All three use the same label-based data model, making correlation across metrics, logs, and traces seamless. The stack is marketed as "LGTM" (Loki, Grafana, Tempo, Mimir) — an intentional pun on code review approvals.


Part 11: Anti-Patterns — Dashboards That Actively Hurt You

Too Many Panels

A dashboard with 50 panels takes 30 seconds to load. During an incident, you scroll past 47 irrelevant panels to find the 3 that matter. By the time you find them, it's been 5 minutes.

Rule of thumb: 12 panels maximum on an overview dashboard. If you need more, you need another tier, not another row.

Percentile Misuse

# WRONG: average of per-instance p99 values
avg(histogram_quantile(0.99, rate(http_request_duration_seconds_bucket[5m])))

# RIGHT: p99 of all requests across all instances
histogram_quantile(0.99,
  sum(rate(http_request_duration_seconds_bucket[5m])) by (le)
)

The wrong query averages p99 values. Two instances with p99 of 100ms and 900ms do not average to a meaningful global p99. Aggregate the buckets first, then compute the percentile.
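A quick numeric demonstration with invented latencies (two instances, one slow), showing how far an averaged p99 can sit from the true global p99. The quantile here is nearest-rank, for simplicity:

```python
# Toy per-instance latency samples, in seconds.

def quantile(values, q):
    """Nearest-rank quantile, simplified (no interpolation)."""
    vals = sorted(values)
    idx = min(int(q * len(vals)), len(vals) - 1)
    return vals[idx]

fast = [0.1] * 99 + [0.15]          # healthy instance: p99 = 0.15s
slow = [0.1] * 50 + [0.9] * 50      # degraded instance: p99 = 0.9s

avg_of_p99 = (quantile(fast, 0.99) + quantile(slow, 0.99)) / 2
true_p99 = quantile(fast + slow, 0.99)

print(f"avg of per-instance p99: {avg_of_p99:.3f}s")  # 0.525s
print(f"true global p99:         {true_p99:.3f}s")    # 0.900s
```

The averaged number (0.525s) describes no request that actually happened; 1% of all traffic really did take 0.9s, and only the bucket-first aggregation sees it.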

Counter Resets

A service restarts. The counter drops from 150,000 to 0. rate() handles this correctly in steady state — but for 1-2 scrape intervals after a restart, the rate calculation can produce brief artifacts. A dashboard showing irate() (instant rate, last two samples only) will show a spike at the restart boundary.

Fix: Use rate() (averaged over a window) for dashboards and alerting, not irate(). irate() is for interactive exploration where you want maximum responsiveness.

Missing absent() Alerts

Your service crashes. It stops emitting metrics. Your error rate alert evaluates to "no data" — and most alert configurations treat "no data" as "not firing." The service is down, and nobody knows.

# Always pair RED alerts with an absent() check
- alert: CheckoutMetricsMissing
  expr: absent(up{job="checkout"} == 1)
  for: 2m
  labels:
    severity: critical
  annotations:
    summary: "Checkout service is not being scraped; the service may be down"

Flashcard Check

| Question | Answer |
|---|---|
| Why is avg(per-instance p99) wrong? | You can't average percentiles — aggregate the histogram buckets first, then compute the quantile |
| What does absent() detect? | When a metric has completely vanished (not just zero — gone) |
| Why use rate() over irate() for alerting? | rate() averages over a window for stability; irate() uses only the last two samples, making it noisy |
| What happens to alerts when a target stops emitting? | Most expressions evaluate to "no data," which doesn't fire the alert — leading to silent failures |

Exercises

Exercise 1: Spot the Lie (2 minutes)

A dashboard computes its error-rate panel by averaging each pod's error ratio: avg(sum by (pod) (rate(http_requests_total{status="500"}[5m])) / sum by (pod) (rate(http_requests_total[5m]))). There are 10 pods. Nine have zero errors. One has a 50% error rate.

What does the dashboard show? What is the actual global error rate?

Answer: The dashboard shows 5% (the average of nine 0% values and one 50% value). But if the broken pod handles 10% of traffic, the global error rate is 5%. If it handles 1% of traffic, the global error rate is 0.5%. The `avg()` query doesn't weight by traffic volume — it treats a pod handling 1 request/sec the same as a pod handling 1,000 requests/sec. The correct query: `sum(rate(errors[5m])) / sum(rate(total[5m]))` weights by actual traffic.

Exercise 2: Build a RED Panel Set (10 minutes)

Create three Grafana panels for a service called payment-api:

  1. Request rate (requests per second)
  2. Error rate (percentage of 5xx responses)
  3. Latency percentiles (p50, p95, p99)

Write the PromQL for each. Use $__rate_interval and a $namespace variable.

Solution
# Request rate
sum(rate(http_requests_total{service="payment-api", namespace=~"$namespace"}[$__rate_interval]))

# Error rate
sum(rate(http_requests_total{service="payment-api", namespace=~"$namespace", status=~"5.."}[$__rate_interval]))
/
sum(rate(http_requests_total{service="payment-api", namespace=~"$namespace"}[$__rate_interval]))

# p50
histogram_quantile(0.50,
  sum(rate(http_request_duration_seconds_bucket{service="payment-api", namespace=~"$namespace"}[$__rate_interval])) by (le))

# p95
histogram_quantile(0.95,
  sum(rate(http_request_duration_seconds_bucket{service="payment-api", namespace=~"$namespace"}[$__rate_interval])) by (le))

# p99
histogram_quantile(0.99,
  sum(rate(http_request_duration_seconds_bucket{service="payment-api", namespace=~"$namespace"}[$__rate_interval])) by (le))

Exercise 3: Dashboard Autopsy (15 minutes)

Your on-call dashboard has these panels:

  • CPU usage per node (gauge)
  • Memory usage per node (gauge)
  • Network bytes in/out (time series)
  • Disk IOPS (time series)
  • Pod count by namespace (stat)
  • 47 more infrastructure metrics

A user reports 504 Gateway Timeout errors. How long would it take you to diagnose the issue using this dashboard? What panels would you add, and what would you remove?

Hint: This is a USE-only dashboard with no RED metrics. It tells you about infrastructure health but nothing about user experience. A user could be getting 504s while every infrastructure panel is green (the problem might be in application logic, not resource exhaustion).
Solution approach: Add RED panels at the top: request rate, error rate (especially 504s), latency percentiles. Add a table showing error rate by service to identify which service is producing 504s. Move infrastructure panels to a separate Tier 3 debug dashboard. The redesigned dashboard should answer "is anything broken for users?" in under 5 seconds. The current dashboard cannot answer that question at all.

Cheat Sheet

PromQL Quick Reference

| Query | What it does |
|---|---|
| rate(counter[5m]) | Per-second rate of increase over 5 minutes |
| increase(counter[1h]) | Total increase over 1 hour |
| histogram_quantile(0.99, sum(rate(buckets[5m])) by (le)) | Global p99 latency |
| sum(rate(errors[5m])) / sum(rate(total[5m])) | Global error rate (weighted by traffic) |
| predict_linear(gauge[6h], 4*3600) < 0 | Will this value hit zero in 4 hours? |
| absent(up{job="x"} == 1) | Is this target completely gone? |
| topk(5, sum by (handler) (rate(total[5m]))) | Top 5 endpoints by request rate |

Panel Type Selection

| You want to show... | Use this panel |
|---|---|
| A trend over time | Time series |
| One important number | Stat |
| A value against a known range | Gauge |
| Ranking or comparison of items | Table |
| Distribution of values over time | Heatmap |
| Event context alongside metrics | Logs |

Dashboard Design Rules

| Rule | Why |
|---|---|
| RED for services, USE for resources | Covers both user experience and infrastructure |
| sum/sum not avg for error rates | Averages hide broken instances |
| Percentiles not averages for latency | Averages hide tail latency |
| 12 panels max per overview dashboard | More panels = slower load = slower incident response |
| Always add absent() alerts | Detect silent failures when metrics vanish |
| Variables for namespace/service | One dashboard works everywhere |
| Store dashboard JSON in Git | History, review, recovery |

LogQL Quick Reference

| Pattern | What it does |
|---|---|
| {app="x"} | Select log stream by label |
| \|= "error" | Filter: line contains "error" |
| != "healthcheck" | Exclude: line contains "healthcheck" |
| \|~ "status=(4\|5).." | Regex filter |
| \| json \| status >= 500 | Parse JSON, filter on field |
| count_over_time({app="x"} \|= "error" [5m]) | Count matching lines over 5 minutes |

Takeaways

  1. Averages lie. Use sum(errors)/sum(total) for error rates, histogram_quantile for latency. Never average percentiles across instances.

  2. RED for services, USE for resources. Every service needs Rate, Errors, Duration panels. Infrastructure metrics belong on separate debug dashboards, not the on-call view.

  3. Three-tier dashboards: Overview, Service, Debug. Start broad, drill down. The on-call dashboard should answer "is anything broken?" in under 5 seconds.

  4. Always alert on absent(). A dead service emits no metrics. "No data" is not "no problem."

  5. Dashboard-as-code is not optional. If your dashboards live only in the Grafana UI, they have no history, no review, and no disaster recovery. Store JSON in Git.

  6. Metrics, logs, traces are not three tools — they're one workflow. Prometheus spike leads to Loki logs leads to Tempo traces. Configure the drill-down links. The seconds you save during an incident justify the hours of setup.