
Quiz: Prometheus Deep Dive


6 questions

L1 (4 questions)

1. You add a request_id label (unique per request) to your http_requests_total counter. Within hours, Prometheus memory spikes and queries slow to a crawl. What happened and how do you fix it?

You caused a cardinality explosion. Every unique label combination creates a new time series; with a unique request_id per request, you created millions of series, each with its own index entry, TSDB chunks, and memory overhead. Fix:
1. Remove the request_id label from the instrumentation immediately.
2. Wait for the old series to fall outside the retention window, or delete them with the TSDB admin API (/api/v1/admin/tsdb/delete_series, which requires --web.enable-admin-api).
3. Add a metric_relabel_configs rule to drop high-cardinality labels at scrape time as a safety net. Prevention: never use unbounded values (IDs, timestamps, emails, raw URLs) as label values. Keep label cardinality in the hundreds, not thousands.
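A safety-net rule for step 3 might look like this in prometheus.yml (job name and target are illustrative):

```yaml
scrape_configs:
  - job_name: "web-app"             # illustrative job name
    static_configs:
      - targets: ["app-host:8080"]  # illustrative target
    metric_relabel_configs:
      # Strip the offending label from every scraped sample so no
      # new per-request series are created.
      - action: labeldrop
        regex: request_id
```

labeldrop runs after the scrape but before ingestion, so the samples are kept; only the label, and with it the series explosion, is removed.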

2. A critical dashboard shows 'No Data' for a metric that was present yesterday. The scrape target shows UP. What are the most likely causes?

1. Metric renamed or labels changed — the PromQL query references old metric/label names. Check /metrics on the target directly.
2. Relabel config dropped it — a new relabel_configs or metric_relabel_configs rule accidentally filters the metric out.
3. Target restart reset the counter — rate() returns no result when the window contains fewer than two samples, which can happen briefly after a restart.
4. Recording rule changed — if the dashboard queries a recording rule, check whether the rule was renamed or modified.
5. Retention expired — if you reduced retention, older data has been compacted away. Debug: query the raw metric name in the Prometheus UI, check the /targets page for scrape errors, and review recent config changes.
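Cause 2 is easy to introduce by accident: a drop rule with a too-broad regex silently discards whole series. A hypothetical example:

```yaml
scrape_configs:
  - job_name: "web-app"             # illustrative job name
    static_configs:
      - targets: ["app-host:8080"]
    metric_relabel_configs:
      # Intended to drop only internal debug metrics, but this matches
      # ANY metric whose name contains "debug" — review drop regexes
      # whenever a metric disappears.
      - action: drop
        source_labels: [__name__]
        regex: ".*debug.*"
```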

3. A counter metric resets to 0 (service restart). Your alert on rate(errors_total[5m]) > 0.1 fires briefly then clears. Should you tune the alert to ignore resets?

rate() already handles counter resets — it detects the decrease and treats the post-reset samples as a continuation from the last known value, not a negative spike. However, if the window contains fewer than two samples after the restart, rate() returns no result, and with only a few samples its extrapolation can briefly over- or under-estimate. The brief firing is most likely that extrapolation artifact. Fix:
1. Add 'for: 5m' to the alert so the threshold must be breached continuously before it fires.
2. Keep the rate window at least 4x the scrape interval so it always spans several samples.
3. Do NOT switch to increase() or resets() to compensate — rate() already accounts for resets correctly (increase() is just rate() multiplied by the window).
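Point 1 as a rules-file sketch (the metric and threshold come from the question; the group and alert names are illustrative):

```yaml
groups:
  - name: error-rate-alerts         # illustrative group name
    rules:
      - alert: HighErrorRate
        expr: rate(errors_total[5m]) > 0.1
        for: 5m                     # must hold for 5 minutes before firing
        labels:
          severity: page
        annotations:
          summary: "Error rate above 0.1/s for 5 minutes"
```

With `for: 5m`, a transient blip around a restart raises the alert to pending but never to firing.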

4. You have a PromQL query used in a critical alert that takes 30 seconds to execute across millions of time series. How do recording rules help, and how would you set one up?

Recording rules precompute PromQL expressions at a regular interval and store the result as a new time series; the alert then references the precomputed metric instead of running the expensive query on every evaluation. Setup: in a rules file (referenced from prometheus.yml via rule_files), define a group with an evaluation interval and a record rule — e.g. record job:http_errors:rate5m with expr sum by (job) (rate(http_errors_total[5m])) — then alert on job:http_errors:rate5m > threshold. Benefits: alert evaluation drops from 30s to milliseconds, Grafana dashboards using the same expression also benefit, and load on Prometheus during peak query times is reduced. Trade-offs: one extra time series per recording rule, and up to one evaluation interval of staleness.
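The setup described above, written out as a rules file (the file name is illustrative):

```yaml
# rules/http_errors.yml — referenced from prometheus.yml:
#   rule_files:
#     - "rules/http_errors.yml"
groups:
  - name: precomputed
    interval: 1m                    # evaluate (and record) every minute
    rules:
      - record: job:http_errors:rate5m
        expr: sum by (job) (rate(http_errors_total[5m]))
```

The record name follows the level:metric:operations convention, which makes precomputed series easy to recognize in queries.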

L2 (2 questions)

1. histogram_quantile(0.99, rate(http_request_duration_seconds_bucket[5m])) returns 0.45s, but your application logs show p99 is 0.62s. Why might the Prometheus value be inaccurate?

histogram_quantile interpolates linearly within the bucket that contains the target rank. If your buckets are coarse (e.g., 0.1, 0.25, 0.5, 1.0), the estimate can land anywhere inside a wide bucket, so with a true p99 of 0.62s the interpolated value can be far off. Additionally:
1. Client-side quantile computation in application logs uses exact observed values; Prometheus only sees bucket counts.
2. The rate() window matters — a 5m window smooths out spikes that instantaneous measurements capture, and the logs and the query may not cover exactly the same set of requests. Fix: add finer-grained buckets around your SLO target (e.g., 0.3, 0.4, 0.5, 0.6, 0.7 for a 500ms SLO). More buckets = more series = more cost, so be selective.
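The interpolation error is easy to demonstrate. Below is a toy reimplementation of the same math histogram_quantile uses (not the Prometheus source), run against made-up latencies whose exact p99 is 0.62s — the coarse buckets land far from the truth, the finer ones much closer:

```python
def bucket_quantile(q, bounds, cumulative):
    """Estimate quantile q from cumulative bucket counts the way
    histogram_quantile does: linear interpolation inside the bucket
    that contains the target rank."""
    total = cumulative[-1]
    rank = q * total
    prev_bound, prev_count = 0.0, 0.0
    for bound, count in zip(bounds, cumulative):
        if count >= rank:
            # Interpolate linearly between the bucket boundaries.
            return prev_bound + (bound - prev_bound) * (rank - prev_count) / (count - prev_count)
        prev_bound, prev_count = bound, count
    return bounds[-1]

# 1000 synthetic latencies; the true p99 is 0.62s.
samples = [0.05] * 700 + [0.2] * 250 + [0.55] * 30 + [0.62] * 15 + [0.9] * 5
exact_p99 = sorted(samples)[int(0.99 * len(samples)) - 1]

def cumulative_counts(bounds):
    # What a Prometheus histogram would report: cumulative count per "le" bound.
    return [sum(1 for s in samples if s <= b) for b in bounds]

coarse = [0.1, 0.25, 0.5, 1.0]
fine = [0.1, 0.25, 0.5, 0.6, 0.7, 1.0]

print(round(exact_p99, 3))                                        # 0.62
print(round(bucket_quantile(0.99, coarse, cumulative_counts(coarse)), 3))  # 0.9
print(round(bucket_quantile(0.99, fine, cumulative_counts(fine)), 3))      # 0.667
```

With the coarse buckets the p99 rank falls inside the wide 0.5–1.0 bucket and the linear guess is 0.9s; with a 0.6 and 0.7 bound added, the estimate tightens to ~0.667s against a true 0.62s.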

2. You deploy two Prometheus instances with identical configs for HA. Both fire the same alerts. How does AlertManager handle deduplication, and what external_labels should you set?

Alertmanager deduplicates alerts by their full label set: if both Prometheus instances send alerts with identical names and labels, it treats them as one alert and sends one notification. You should still set external_labels to tell the instances' data apart — e.g. replica: 'prometheus-1' and replica: 'prometheus-2' — but a differing replica label would break deduplication. The standard fix is alert_relabel_configs in prometheus.yml, which rewrites alert labels before they are sent to Alertmanager: drop the replica label there so both instances emit identical alerts while the label remains on the stored data. Alternatives such as Thanos or Cortex ruler deduplicate at the rule-evaluation layer instead. In practice: keep external_labels for data identification and strip the replica label from outgoing alerts.
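A sketch of both halves in prometheus.yml (per instance; only the replica value differs, and the cluster label and targets are illustrative):

```yaml
global:
  external_labels:
    cluster: "prod"                 # illustrative
    replica: "prometheus-1"         # "prometheus-2" on the other instance

alerting:
  alertmanagers:
    - static_configs:
        - targets: ["alertmanager:9093"]  # illustrative target
  # Drop the replica label from outgoing alerts so both instances
  # produce identical alerts and Alertmanager deduplicates them.
  alert_relabel_configs:
    - action: labeldrop
      regex: replica
```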