Portal | Level: L2: Operations | Topics: Prometheus | Domain: Observability
PromQL Drills
25 drills for Prometheus query language muscle memory. Each should take 1-5 minutes.
Difficulty: [E] Easy = single function | [I] Intermediate = multiple functions or by-clause | [H] Hard = multi-step or nested
> **Remember:** The PromQL function hierarchy: `rate()` for counters (per-second average), `increase()` for counters (total increase over the range), `avg`/`sum`/`max`/`min` for aggregation, `histogram_quantile()` for percentiles. Mnemonic: "Rate for speed, Increase for total, Histogram for percentiles." Never use `rate()` on a gauge; it is only for monotonically increasing counters.

> **Gotcha:** `rate()` requires a range vector (e.g., `[5m]`). The range should be at least 4x the scrape interval. With a 15s scrape interval, use `rate(metric[1m])` at minimum. Too-short ranges produce noisy or empty results because `rate()` needs at least two samples inside the range.
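To make the counter semantics concrete, here is a simplified Python sketch of how a `rate()`-style calculation treats the samples inside a range, including a counter reset. It is an illustration only: the real Prometheus implementation also extrapolates toward the range boundaries, which this sketch omits. The names `simple_increase` and `simple_rate` and the sample data are hypothetical.

```python
from typing import List, Tuple

Sample = Tuple[float, float]  # (timestamp in seconds, counter value)


def simple_increase(samples: List[Sample]) -> float:
    """Sum the increases between consecutive samples, treating any drop as a reset."""
    if len(samples) < 2:
        raise ValueError("need at least two samples in the range")
    total = 0.0
    for (_, prev), (_, cur) in zip(samples, samples[1:]):
        # A drop means the counter restarted from zero, so the whole
        # current value counts as new increase.
        total += cur - prev if cur >= prev else cur
    return total


def simple_rate(samples: List[Sample]) -> float:
    """Per-second average increase over the time covered by the samples."""
    increase = simple_increase(samples)
    elapsed = samples[-1][0] - samples[0][0]
    return increase / elapsed


# Hypothetical 15s scrapes of a counter that resets after the third sample.
window = [(0, 100), (15, 130), (30, 160), (45, 20), (60, 50)]
print(simple_increase(window))  # 30 + 30 + 20 + 30 = 110.0
print(simple_rate(window))      # 110 / 60 seconds, roughly 1.83 per second
```

With only one sample in the range there is nothing to subtract, which is why a range shorter than two scrape intervals yields no result.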
Drill 1: Is the target up? [E]
Question: Check if the grokdevops target is being scraped successfully.
Answer
`up{job="grokdevops"}`
> **Under the hood:** `up` is a synthetic metric Prometheus creates for every scrape target. `1` = last scrape succeeded, `0` = last scrape failed. It is the simplest health check available and the first thing to verify when a target seems silent.

Drill 2: Total requests in last hour [E]
Question: How many total HTTP requests has grokdevops handled in the last hour?
Answer
`increase(http_requests_total{job="grokdevops"}[1h])`
> **Gotcha:** `increase()` is syntactic sugar for `rate()` multiplied by the number of seconds in the range. It handles counter resets (e.g., a pod restart) automatically, but increments that happen between the last scrape and a reset are never observed, so several resets within the range can undercount. Because the result is extrapolated to the range boundaries, it can also be a non-integer.

Drill 3: Request rate per second [E]
Question: What is the current request rate (RPS) for grokdevops?
Answer
`sum(rate(http_requests_total{job="grokdevops"}[5m]))`
Drill 4: Request rate by status code [I]
Question: Break down the request rate by HTTP status code.
Answer
`sum(rate(http_requests_total{job="grokdevops"}[5m])) by (status)`
Drill 5: Error rate percentage [I]
Question: What percentage of requests are returning 5xx errors?
Answer
`sum(rate(http_requests_total{job="grokdevops",status=~"5.."}[5m])) / sum(rate(http_requests_total{job="grokdevops"}[5m])) * 100`
Drill 6: p99 latency [I]
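Background for this drill: `histogram_quantile()` does not see individual observations. It estimates a quantile from cumulative bucket counts by interpolating linearly inside the bucket where the target rank falls. The Python sketch below illustrates that idea under simplifying assumptions (classic buckets, lowest bucket starting at 0, errors raised instead of NaN); `estimate_quantile` and the bucket data are hypothetical, not Prometheus code.

```python
from typing import List, Tuple

Bucket = Tuple[float, float]  # (upper bound "le", cumulative count)


def estimate_quantile(q: float, buckets: List[Bucket]) -> float:
    """Estimate the q-quantile from cumulative buckets by linear interpolation."""
    buckets = sorted(buckets)
    total = buckets[-1][1]  # the +Inf bucket holds the total count
    if not 0 < q <= 1 or total <= 0:
        raise ValueError("q must be in (0, 1] and the histogram must be non-empty")
    rank = q * total  # position of the target observation
    lower_bound, lower_count = 0.0, 0.0
    for upper_bound, count in buckets:
        if count >= rank:
            if upper_bound == float("inf"):
                # Cannot interpolate into +Inf; report the highest finite bound.
                return lower_bound
            fraction = (rank - lower_count) / (count - lower_count)
            return lower_bound + (upper_bound - lower_bound) * fraction
        lower_bound, lower_count = upper_bound, count
    return lower_bound


# Hypothetical latency buckets in seconds: 100 observations in total.
latency = [(0.1, 50), (0.5, 90), (1.0, 98), (float("inf"), 100)]
print(estimate_quantile(0.95, latency))  # falls in the 0.5-1.0 bucket, about 0.81
print(estimate_quantile(0.99, latency))  # falls in +Inf, so capped at 1.0
```

The second call shows why bucket boundaries matter: once the target rank lands in the `+Inf` bucket, the estimate cannot exceed the highest finite `le` value.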
Question: What is the 99th percentile response time?
Answer
`histogram_quantile(0.99, sum(rate(http_request_duration_seconds_bucket{job="grokdevops"}[5m])) by (le))`
Drill 7: p50 latency [I]
Question: What is the median (50th percentile) response time?
Answer
`histogram_quantile(0.50, sum(rate(http_request_duration_seconds_bucket{job="grokdevops"}[5m])) by (le))`
Drill 8: CPU usage by pod [I]
Question: Show CPU usage (in cores) for each pod in the grokdevops namespace.
Answer
`sum(rate(container_cpu_usage_seconds_total{namespace="grokdevops"}[5m])) by (pod)`
Drill 9: Memory usage in MB [E]
Question: Show memory usage in megabytes for grokdevops pods.
Answer
`sum(container_memory_working_set_bytes{namespace="grokdevops"}) by (pod) / 1024 / 1024`
Drill 10: Pod restart count [E]
Question: How many times has each pod restarted in the last hour?
Answer
`increase(kube_pod_container_status_restarts_total{namespace="grokdevops"}[1h])`
Drill 11: Target disappeared [I]
Question: Write a query that fires when the grokdevops target stops being scraped entirely.
Answer
`absent(up{job="grokdevops"})`
Drill 12: Top 5 CPU consumers [I]
Question: Show the top 5 pods by CPU usage across all namespaces.
Answer
`topk(5, sum(rate(container_cpu_usage_seconds_total[5m])) by (namespace, pod))`
Drill 13: Disk usage percentage [I]
Question: What percentage of PVC capacity is used for grokdevops?
Answer
`kubelet_volume_stats_used_bytes{namespace="grokdevops"} / kubelet_volume_stats_capacity_bytes{namespace="grokdevops"} * 100`
Drill 14: Node memory available [E]
Question: How much memory (GB) is available on each node?
Answer
`node_memory_MemAvailable_bytes / 1024 / 1024 / 1024`
Drill 15: Deployment replica mismatch [I]
Question: Find deployments where desired replicas don't match ready replicas.
Answer
`kube_deployment_spec_replicas{namespace="grokdevops"} != kube_deployment_status_ready_replicas{namespace="grokdevops"}`
Drill 16: CPU throttling [H]
Question: Which pods are being CPU throttled?
Answer
`sum(rate(container_cpu_cfs_throttled_seconds_total{namespace="grokdevops"}[5m])) by (pod) > 0`
Drill 17: Request rate change [H]
Question: Show how much the request rate has changed compared to 1 hour ago (as a ratio).
Answer
`sum(rate(http_requests_total{job="grokdevops"}[5m])) / sum(rate(http_requests_total{job="grokdevops"}[5m] offset 1h))`
Drill 18: Saturation - CPU requests vs capacity [H]
Question: What percentage of a pod's CPU request is actually being used?
Answer
`sum(rate(container_cpu_usage_seconds_total{namespace="grokdevops"}[5m])) by (pod) / sum(kube_pod_container_resource_requests{namespace="grokdevops",resource="cpu"}) by (pod) * 100`
Drill 19: HPA current vs desired [I]
Question: Compare HPA current replicas to desired replicas.
Answer
`kube_horizontalpodautoscaler_status_current_replicas{namespace="grokdevops"} != kube_horizontalpodautoscaler_status_desired_replicas{namespace="grokdevops"}`
Drill 20: Container OOMKills [I]
Question: Count OOMKill events in the last 24 hours.
Answer
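`increase(kube_pod_container_status_restarts_total{namespace="grokdevops"}[24h]) and on (namespace, pod, container) kube_pod_container_status_last_terminated_reason{reason="OOMKilled",namespace="grokdevops"} == 1`
> **Gotcha:** `kube_pod_container_status_last_terminated_reason` is a gauge (value `1` while the reason applies), so `increase()` on it is not meaningful. The query above is an approximation: it counts restarts in the last 24 hours for containers whose most recent termination was an OOMKill.

Drill 21: Network receive rate [I]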
Question: Show network bytes received per second for grokdevops pods.
Answer
`sum(rate(container_network_receive_bytes_total{namespace="grokdevops"}[5m])) by (pod)`
Drill 22: Slow endpoints [H]
Question: Which endpoints have p99 latency above 500ms?
Answer
`histogram_quantile(0.99, sum(rate(http_request_duration_seconds_bucket{job="grokdevops"}[5m])) by (le, handler)) > 0.5`
Drill 23: Error budget burn rate [H]
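Background for this drill: burn rate is the observed error ratio divided by the error budget, where the budget is 1 minus the SLO target. The Python sketch below works through that arithmetic with made-up numbers. The function names and the 30-day SLO window are assumptions for illustration; the drill itself does not specify a window.

```python
def burn_rate(error_ratio: float, slo_target: float) -> float:
    """How many times faster than sustainable the error budget is being spent."""
    error_budget = 1.0 - slo_target  # e.g. a 99.9% SLO leaves a 0.1% budget
    return error_ratio / error_budget


def days_until_exhausted(rate: float, window_days: float = 30.0) -> float:
    """Days a full budget lasts at a constant burn rate (window length is assumed)."""
    return window_days / rate


# Hypothetical reading: 0.5% of requests failing against a 99.9% SLO.
rate = burn_rate(error_ratio=0.005, slo_target=0.999)
print(round(rate, 2))                        # about 5x the sustainable pace
print(round(days_until_exhausted(rate), 1))  # about 6 days of a 30-day budget
```

A burn rate of 1 means the budget runs out exactly at the end of the window; anything above 1 exhausts it early.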
Question: For a 99.9% SLO, what is the current error budget burn rate?
Answer
`(sum(rate(http_requests_total{job="grokdevops",status=~"5.."}[1h])) / sum(rate(http_requests_total{job="grokdevops"}[1h]))) / 0.001`
Drill 24: Predict disk full [H]
Question: When will a PVC run out of space at the current fill rate?
Answer
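`kubelet_volume_stats_available_bytes{namespace="grokdevops"} / deriv(kubelet_volume_stats_used_bytes{namespace="grokdevops"}[1h]) / 3600`
> **Gotcha:** `deriv()` returns bytes per second, so the result is the number of hours until the volume is full. It is negative when usage is shrinking and `+Inf` when usage is flat.

Drill 25: Multi-metric alert condition [H]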
Question: Write a query that fires when error rate is above 5% AND request rate is above 10 RPS (to avoid false positives on low traffic).
Answer
`(sum(rate(http_requests_total{job="grokdevops",status=~"5.."}[5m])) / sum(rate(http_requests_total{job="grokdevops"}[5m])) > 0.05) and (sum(rate(http_requests_total{job="grokdevops"}[5m])) > 10)`
Wiki Navigation
Prerequisites
- Observability Deep Dive (Topic Pack, L2)
Related Content
- Adversarial Interview Gauntlet (30 sequences) (Scenario, L2) — Prometheus
- Alerting Rules (Topic Pack, L2) — Prometheus
- Alerting Rules Drills (Drill, L2) — Prometheus
- Capacity Planning (Topic Pack, L2) — Prometheus
- Case Study: Disk Full — Runaway Logs, Fix Is Loki Retention (Case Study, L2) — Prometheus
- Case Study: Grafana Dashboard Empty — Prometheus Blocked by NetworkPolicy (Case Study, L2) — Prometheus
- Datadog Flashcards (CLI) (flashcard_deck, L1) — Prometheus
- Incident Simulator (18 scenarios) (CLI) (Exercise Set, L2) — Prometheus
- Interview: Prometheus Target Down (Scenario, L2) — Prometheus
- Lab: Prometheus Target Down (CLI) (Lab, L2) — Prometheus