Grafana

20 cards: 🟢 5 easy | 🟡 9 medium | 🔴 6 hard

🟢 Easy (5)

1. What is a Grafana dashboard?

A collection of panels that visualize metrics from data sources like Prometheus, Loki, and others.

Who made it: Torkel Ödegaard created Grafana in 2014 as a fork of Kibana. It is now maintained by Grafana Labs.

Name origin: Grafana is a portmanteau of graph and Kibana (its ancestor).

Remember: Grafana is the visualization layer. It does not store data — it queries data sources like Prometheus, Loki, Tempo.

2. What are the main panel types available in Grafana?

Time series graphs, gauges, tables, heatmaps, stat panels, and more. Each panel is an individual visualization within a dashboard.

Remember: Choose panel types intentionally — stat for single values, gauge for thresholds, time series for trends, table for raw data, heatmap for distributions.

Gotcha: Too many panel types on one dashboard creates cognitive overload. Stick to 2-3 types per dashboard.

3. What is a data source in Grafana?

A backend that Grafana queries for data, such as Prometheus (metrics), Loki (logs), Tempo (traces), Elasticsearch, or CloudWatch.

Example: PLT stack — Prometheus (metrics), Loki (logs), Tempo (traces). One Grafana UI for correlated investigation.

4. How do you create a dashboard variable that auto-populates from a Prometheus label?

In Dashboard Settings > Variables, add a query variable with data source Prometheus and query label_values(up, namespace). This populates a dropdown with all namespace values. Use $namespace in panel queries to filter. Multi-value and "All" options let users select multiple values or everything at once.
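Example (sketch): the same variable can be expressed in the dashboard JSON model. The snippet below builds one entry for the dashboard's templating list and a panel query that uses it. The data source UID and the metric name http_requests_total are hypothetical, and the exact field layout can differ between Grafana versions, so treat this as an illustration rather than a canonical export.

```python
import json


def namespace_variable(datasource_uid: str) -> dict:
    """Build a query variable entry for the dashboard's templating list."""
    return {
        "name": "namespace",
        "type": "query",
        "datasource": {"type": "prometheus", "uid": datasource_uid},
        "query": "label_values(up, namespace)",
        "multi": True,       # allow selecting several namespaces
        "includeAll": True,  # add the "All" option
        "refresh": 1,        # assumed here: re-run the query on dashboard load
    }


variable = namespace_variable("prometheus-main")  # hypothetical UID

# With multi-value variables, a regex matcher (=~) lets one query cover
# every selected value.
panel_expr = 'sum by (pod) (rate(http_requests_total{namespace=~"$namespace"}[5m]))'

print(json.dumps({"templating": {"list": [variable]}}, indent=2))
print(panel_expr)
```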

5. When should you use a stat panel vs a gauge vs a time series graph in Grafana?

Stat panel: single current value with optional sparkline (e.g., total requests, uptime percentage).
Gauge: value against a min/max range with color thresholds (e.g., CPU at 73%, disk 85% full).
Time series graph: values over time for trend analysis (e.g., request rate, latency percentiles).

Use the right panel type to make dashboards scannable at a glance.
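Example (sketch): the rule of thumb above can be written as a tiny decision helper. The function name and its two boolean inputs are made up for illustration; the returned strings are the panel type identifiers Grafana uses in dashboard JSON.

```python
def choose_panel_type(single_value: bool, bounded_range: bool) -> str:
    """Pick a panel type from the shape of the data being shown."""
    if not single_value:
        return "timeseries"  # values over time, for trends
    # A single value with a meaningful min/max reads best as a gauge.
    return "gauge" if bounded_range else "stat"


print(choose_panel_type(single_value=True, bounded_range=False))   # total requests
print(choose_panel_type(single_value=True, bounded_range=True))    # CPU at 73%
print(choose_panel_type(single_value=False, bounded_range=False))  # request rate
```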

🟡 Medium (9)

1. What are Grafana variables and why are they useful?

Variables are dropdown parameters (e.g., $namespace, $pod) that make dashboards reusable across environments, so one dashboard works for dev, staging, and production.

Example: variable from label_values(up, namespace). Use $namespace in queries. One dashboard serves all environments.

2. What are the three tiers of dashboard design recommended for observability?

1) Overview dashboards for high-level health (RED/USE method), 2) Service dashboards for per-service metrics, 3) Debugging dashboards for detailed troubleshooting.

Remember: Three-tier dashboard design mnemonic: OSD — Overview, Service, Debug. Drill down from broad to narrow.

Example: Tier 1 shows all services with RED metrics. Click a service for Tier 2. Click an anomaly for Tier 3 pod-level detail.

3. How does Grafana handle alerting in version 8 and later?

Grafana v8+ can evaluate alert rules directly within Grafana, in addition to using an external Alertmanager. Alert rules are defined against data source queries, and notification policies route firing alerts to contact points.

Background: Legacy alerting tied each alert rule to a single dashboard panel. Unified alerting arrived as an opt-in feature in v8 and became the default in v9, with rules managed separately from dashboards.

Gotcha: Migrating from legacy alerting to unified alerting can break existing notification channels. Test in staging first.

4. What are the RED and USE methods for dashboard design?

RED (Rate, Errors, Duration) is for monitoring services. USE (Utilization, Saturation, Errors) is for monitoring resources like CPU, memory, and disk.

Remember: RED for services (request-oriented), USE for infrastructure (resource-oriented). Mnemonic: RED lights for apps, USE tools for hardware.

Who made it: Tom Wilkie (now at Grafana Labs) coined RED while at Weaveworks. Brendan Gregg created USE before he joined Netflix.
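Example (sketch): one possible set of PromQL queries for each method, collected in a Python dict so they can be reused when generating panels. The HTTP metric names are assumptions about how a service is instrumented, and the USE queries assume node_exporter metrics; adjust both to your own setup.

```python
# RED: request-oriented signals for a service (metric names are assumed).
RED_QUERIES = {
    "rate": "sum(rate(http_requests_total[5m]))",
    "errors": 'sum(rate(http_requests_total{status=~"5.."}[5m]))',
    "duration": (
        "histogram_quantile(0.95, "
        "sum by (le) (rate(http_request_duration_seconds_bucket[5m])))"
    ),
}

# USE: resource-oriented signals for a host (assumes node_exporter).
USE_QUERIES = {
    "utilization": '1 - avg(rate(node_cpu_seconds_total{mode="idle"}[5m]))',
    "saturation": "node_load1",
    "errors": "rate(node_network_receive_errs_total[5m])",
}

for method, queries in (("RED", RED_QUERIES), ("USE", USE_QUERIES)):
    for signal, expr in queries.items():
        print(f"{method} {signal}: {expr}")
```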

5. How do you create a Grafana alert rule that fires when error rate exceeds 5%?

Create an alert rule with two queries: A = sum(rate(http_errors_total[5m])) and B = sum(rate(http_requests_total[5m])). Add a math expression C = $A / $B. Set the condition to C > 0.05 with an evaluation interval of 1m and a pending period of 5m. Configure a contact point (Slack, PagerDuty) and notification policy for routing.
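Example (sketch): a simplified model of how the condition and pending period interact. It is not Grafana's evaluator; it only mimics the idea that the rule is Pending while C > 0.05 and becomes Alerting once the breach has lasted for the pending period. Treating zero traffic as a ratio of 0 is an assumption made here to keep the sketch short.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class ErrorRateRule:
    threshold: float = 0.05     # condition: C > 0.05
    pending_seconds: int = 300  # pending period of 5m
    pending_since: Optional[int] = None

    def evaluate(self, now: int, errors_per_sec: float, requests_per_sec: float) -> str:
        """Evaluate once; `now` is a timestamp in seconds."""
        # C = $A / $B (assumption: no traffic counts as a ratio of 0)
        ratio = errors_per_sec / requests_per_sec if requests_per_sec > 0 else 0.0
        if ratio <= self.threshold:
            self.pending_since = None
            return "Normal"
        if self.pending_since is None:
            self.pending_since = now
        if now - self.pending_since >= self.pending_seconds:
            return "Alerting"
        return "Pending"


rule = ErrorRateRule()
# Evaluate every 1m with an 8% error rate.
states = [rule.evaluate(minute * 60, 8.0, 100.0) for minute in range(7)]
print(states)
```

In this model the first five evaluations report Pending and the rule switches to Alerting at the five-minute mark; a single healthy evaluation resets the timer.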

6. What are the key settings when configuring Prometheus as a Grafana data source?

URL (http://prometheus:9090), access mode (server = Grafana backend proxies, browser = direct from user), scrape interval (match your Prometheus config, typically 15s), and optional auth (basic auth or TLS client certs). Set the scrape interval correctly so $__rate_interval works properly in queries. Test with the "Save & Test" button.
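Example (sketch): why the scrape interval setting matters. Grafana's documentation describes $__rate_interval as the larger of ($__interval + scrape interval) and (4 x scrape interval); the helper below reproduces that arithmetic in seconds. The function name is made up for illustration.

```python
def rate_interval_seconds(interval_s: int, scrape_interval_s: int) -> int:
    """Approximate $__rate_interval from $__interval and the scrape interval."""
    return max(interval_s + scrape_interval_s, 4 * scrape_interval_s)


# Data source configured with the real 15s scrape interval.
print(rate_interval_seconds(interval_s=15, scrape_interval_s=15))   # 60
print(rate_interval_seconds(interval_s=120, scrape_interval_s=15))  # 135

# Same panel, but the data source wrongly claims a 60s scrape interval:
# the rate window grows to 240s and short spikes get smoothed away.
print(rate_interval_seconds(interval_s=15, scrape_interval_s=60))   # 240
```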

7. How do you provision Grafana dashboards as code for GitOps workflows?

Place YAML provider config files in /etc/grafana/provisioning/dashboards/, each pointing at a folder path that contains JSON dashboard files. On startup, Grafana loads these dashboards, and the file provider rescans the folder on the interval set by updateIntervalSeconds, reloading changed files. Store dashboard JSON in Git and deploy it via CI/CD. This eliminates manual dashboard creation and ensures consistency across environments.
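Example (sketch): a small Python helper that renders a dashboard provider file, which a CI/CD job could write into the provisioning directory. The provider name, folder name, and dashboard path are placeholders, and the 30-second rescan interval is an arbitrary choice.

```python
from textwrap import dedent


def dashboard_provider_yaml(name: str, folder: str, path: str,
                            update_interval_s: int = 30) -> str:
    """Render a file-based dashboard provider config as YAML text."""
    return dedent(f"""\
        apiVersion: 1
        providers:
          - name: {name}
            folder: {folder}
            type: file
            disableDeletion: false
            updateIntervalSeconds: {update_interval_s}
            allowUiUpdates: false
            options:
              path: {path}
        """)


# Hypothetical names and paths.
provider_yaml = dashboard_provider_yaml(
    name="gitops-dashboards",
    folder="Platform",
    path="/var/lib/grafana/dashboards",
)
print(provider_yaml)
```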

8. How should you organize Grafana dashboards and control access?

Use folders to group by team or service (e.g., Platform/Infra, Backend/API, Frontend). Assign folder-level permissions: Viewer for stakeholders, Editor for owning team, Admin for platform team. This prevents accidental edits to shared dashboards while letting teams manage their own. Use dashboard tags for cross-cutting organization (e.g., tag "SLO" across multiple folders).
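Example (sketch): folder permissions can also be set through Grafana's HTTP API, where permission levels are numeric (1 = View, 2 = Edit, 4 = Admin). The helper below only builds the request body for a folder's permissions endpoint; the team IDs are hypothetical and no request is sent.

```python
import json

VIEW, EDIT, ADMIN = 1, 2, 4  # permission levels used by the folder permissions API


def folder_permissions(owning_team_id: int, platform_team_id: int) -> dict:
    """Build a permissions payload: viewers read, owners edit, platform admins."""
    return {
        "items": [
            {"role": "Viewer", "permission": VIEW},
            {"teamId": owning_team_id, "permission": EDIT},
            {"teamId": platform_team_id, "permission": ADMIN},
        ]
    }


payload = folder_permissions(owning_team_id=7, platform_team_id=1)  # hypothetical IDs
print(json.dumps(payload, indent=2))
```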

9. What changes when migrating from Grafana legacy alerting to unified alerting?

Unified alerting (opt-in from Grafana 8, the default from Grafana 9) replaces dashboard-bound alert rules with a dedicated Alerting UI. Rules become standalone, support multi-datasource evaluation, and use a built-in Alertmanager. Migration converts existing rules, and legacy notification channels become contact points and notification policies, some of which may need manual reconfiguration.

🔴 Hard (6)

1. How can Grafana dashboards and data sources be provisioned as code?

Using YAML provisioning files placed in Grafana's provisioning directory (e.g., /etc/grafana/provisioning/). Data sources and dashboards can be defined in YAML, enabling GitOps workflows where dashboards are version-controlled and deployed automatically.
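Example (sketch): rendering a data source provisioning file from Python, for instance as part of a deployment script. The URL and the 15s scrape interval are example values; the file would typically live under the datasources subdirectory of the provisioning directory.

```python
from textwrap import dedent


def prometheus_datasource_yaml(url: str = "http://prometheus:9090",
                               scrape_interval: str = "15s") -> str:
    """Render a Prometheus data source definition as provisioning YAML."""
    return dedent(f"""\
        apiVersion: 1
        datasources:
          - name: Prometheus
            type: prometheus
            access: proxy
            url: {url}
            isDefault: true
            jsonData:
              timeInterval: {scrape_interval}
        """)


datasource_yaml = prometheus_datasource_yaml()
print(datasource_yaml)
```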

2. Why is label cardinality important when designing Grafana dashboards backed by Prometheus?

High-cardinality labels (e.g., user_id for millions of users) create massive numbers of time series that overwhelm Prometheus storage and slow Grafana queries. Labels should be kept bounded (e.g., method, status_code, namespace).
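Example (sketch): a back-of-the-envelope estimate of series count for a single metric. The upper bound is the product of the number of distinct values per label; the label counts below are invented, and real series counts are lower when not every combination occurs.

```python
from math import prod
from typing import Mapping


def max_series(label_cardinalities: Mapping[str, int]) -> int:
    """Upper bound on time series for one metric: product of label value counts."""
    return prod(label_cardinalities.values())


bounded = {"method": 5, "status_code": 10, "namespace": 20}
print(max_series(bounded))    # 1000

# Adding an unbounded label multiplies everything.
unbounded = {**bounded, "user_id": 1_000_000}
print(max_series(unbounded))  # 1000000000
```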

3. What is dashboard rot and how do you prevent it?

Dashboard rot occurs when dashboards accumulate and become stale or unused. Prevent it by deleting dashboards nobody looks at, consolidating personal dashboards into shared ones, and auditing dashboards regularly.

Gotcha: Dashboard rot is one of the most common observability anti-patterns. Teams create dashboards during incidents and never delete them.

Remember: Quarterly dashboard audits: delete what nobody viewed in 90 days. Track views with Grafana's usage insights where your edition provides them.
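Example (sketch): the 90-day rule as a filter over last-viewed timestamps. The input mapping is a hypothetical shape; how you obtain last-viewed data depends on your Grafana edition and setup.

```python
from datetime import datetime, timedelta, timezone
from typing import Mapping, Optional


def stale_dashboards(last_viewed: Mapping[str, Optional[datetime]],
                     now: datetime, max_age_days: int = 90) -> list:
    """Return titles never viewed, or not viewed within max_age_days."""
    cutoff = now - timedelta(days=max_age_days)
    return sorted(
        title for title, seen in last_viewed.items()
        if seen is None or seen < cutoff
    )


now = datetime(2024, 6, 1, tzinfo=timezone.utc)
views = {  # hypothetical audit data
    "API overview": datetime(2024, 5, 20, tzinfo=timezone.utc),
    "Incident 4521 scratch": datetime(2024, 1, 10, tzinfo=timezone.utc),
    "Old canary test": None,
}
candidates = stale_dashboards(views, now)
print(candidates)
```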

4. How do annotations work in Grafana and why are they valuable for debugging?

Annotations overlay event markers on time series graphs — deployment timestamps, config changes, incident starts. Create manually or auto-populate from queries (e.g., query Loki for deployment log lines). When latency spikes, annotations immediately show "deployed v2.3.1 at 14:32" correlating the change with the impact. Native annotations use Grafana's built-in store; query annotations pull from any data source.
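Example (sketch): a deploy pipeline could create a native annotation by posting to Grafana's annotations HTTP API, which expects the time in epoch milliseconds. The helper below only builds the request body; the tag names are arbitrary, and sending the request (URL, token) is left out.

```python
from datetime import datetime, timezone


def deployment_annotation(version: str, deployed_at: datetime,
                          tags=("deployment",)) -> dict:
    """Build an annotation payload marking a deployment."""
    return {
        "time": int(deployed_at.timestamp() * 1000),  # epoch milliseconds
        "tags": list(tags),
        "text": f"deployed {version}",
    }


annotation = deployment_annotation(
    "v2.3.1",
    datetime(2024, 1, 15, 14, 32, tzinfo=timezone.utc),  # example timestamp
)
print(annotation)
```

Leaving out a dashboard reference, as here, makes the annotation available across dashboards, where it can be filtered by tag.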

5. How does Grafana integrate with Loki for log exploration alongside metrics?

Add Loki as a data source, then use the Explore view with LogQL queries. A log query such as {namespace="prod"} |= "error" | json returns matching lines with parsed fields, and a metric query such as rate({namespace="prod"} |= "error" [5m]) turns them into a per-second rate. The key power is correlation: click a spike on a Prometheus graph, and Grafana can split the view to show Loki logs from the same time range and labels. This metrics-to-logs drill-down drastically reduces investigation time during incidents.
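Example (sketch): two helpers that assemble LogQL strings, one for browsing log lines and one for graphing their rate. In LogQL the range selector sits inside rate(...), after the stream selector and filters. The function names are made up, and the inputs are assumed to be trusted strings with no escaping applied.

```python
def log_query(namespace: str, needle: str) -> str:
    """LogQL log query: lines containing `needle`, parsed as JSON."""
    return f'{{namespace="{namespace}"}} |= "{needle}" | json'


def error_rate_query(namespace: str, needle: str, window: str = "5m") -> str:
    """LogQL metric query: per-second rate of matching lines over `window`."""
    return f'sum(rate({{namespace="{namespace}"}} |= "{needle}" [{window}]))'


print(log_query("prod", "error"))
print(error_rate_query("prod", "error"))
```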

6. How do you version-control Grafana dashboards as code and detect drift?

Export dashboards as JSON via the HTTP API, or keep them as provisioning files, and store them in Git. Generate dashboards deterministically with Jsonnet and Grafonnet so the same source always yields the same JSON. Detect drift by comparing API-exported JSON against the committed version, ignoring fields that change on every save such as id and version. File-provisioned dashboards cannot be saved from the UI unless allowUiUpdates is enabled, which prevents manual drift.
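Example (sketch): a minimal drift check that normalizes two dashboard JSON documents before comparing them. Which fields count as volatile is an assumption here (top-level id and version); fetching the exported JSON from the API is left out.

```python
import json

# Assumed to change on save or import without reflecting a real edit.
VOLATILE_KEYS = {"id", "version"}


def normalize(dashboard: dict) -> str:
    """Drop volatile top-level keys and serialize with stable key order."""
    cleaned = {k: v for k, v in dashboard.items() if k not in VOLATILE_KEYS}
    return json.dumps(cleaned, sort_keys=True, separators=(",", ":"))


def has_drifted(committed: dict, exported: dict) -> bool:
    """True when the live dashboard differs from the committed one."""
    return normalize(committed) != normalize(exported)


committed = {"uid": "svc-api", "title": "API", "version": 3, "panels": []}
resaved = {"id": 42, "uid": "svc-api", "title": "API", "version": 4, "panels": []}
edited = {**resaved, "title": "API (edited in UI)"}

print(has_drifted(committed, resaved))  # False
print(has_drifted(committed, edited))   # True
```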