Portal | Level: L2: Operations | Topics: Prometheus | Domain: Observability

PromQL Drills

25 drills for Prometheus query language muscle memory. Each should take 1-5 minutes.

Difficulty: [E] Easy = single function | [I] Intermediate = multiple functions or by-clause | [H] Hard = multi-step or nested

Remember: The PromQL function hierarchy: rate() for counters (per-second average), increase() for counters (total increase over range), avg/sum/max/min for aggregation, histogram_quantile() for percentiles. Mnemonic: "Rate for speed, Increase for total, Histogram for percentiles." Never use rate() on a gauge; it is only for monotonically increasing counters. For gauges, use delta() or deriv() instead.

Gotcha: rate() requires a range vector (e.g., [5m]). The range should be at least 4x the scrape interval, so with a 15s scrape interval use rate(metric[1m]) at minimum. rate() needs at least two samples inside the range: a shorter range returns no data whenever a scrape is missed, and even when it does return data the result is noisy because it rests on only a couple of samples.
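To make the two-sample requirement concrete, here is a minimal Python sketch of a simplified rate calculation. It assumes evenly spaced scrapes and deliberately ignores the extrapolation and counter-reset handling that Prometheus applies; `samples_in_window` and `simple_rate` are hypothetical helpers for illustration, not Prometheus APIs.

```python
def samples_in_window(samples, now, window):
    """Return the (timestamp, value) samples with now - window <= timestamp <= now."""
    return [(t, v) for t, v in samples if now - window <= t <= now]


def simple_rate(samples):
    """Per-second rate from the first to the last sample; None with fewer than two."""
    if len(samples) < 2:
        return None  # not enough data points to compute a slope
    (t0, v0), (t1, v1) = samples[0], samples[-1]
    return (v1 - v0) / (t1 - t0)


# A counter scraped every 15s that grows by 2 per second.
scrapes = [(t, 2 * t) for t in range(0, 121, 15)]

# A 10s window at now=125 contains only the scrape at t=120.
print(simple_rate(samples_in_window(scrapes, now=125, window=10)))

# A 60s window (4x the scrape interval) contains four scrapes.
print(simple_rate(samples_in_window(scrapes, now=125, window=60)))
```

The first call yields no result and the second yields 2.0, which is why a range of several scrape intervals is the safer default.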

Drill 1: Is the target up? [E]

Question: Check if the grokdevops target is being scraped successfully.

# Your query here

Answer

`up{job="grokdevops"}`

Under the hood: `up` is a synthetic metric Prometheus creates for every scrape target. `1` = last scrape succeeded, `0` = last scrape failed. It is the simplest health check available and the first thing to verify when a target seems silent.

Drill 2: Total requests in last hour [E]

Question: How many total HTTP requests has grokdevops handled in the last hour?

# Your query here

Answer

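`sum(increase(http_requests_total{job="grokdevops"}[1h]))`

Gotcha: `increase()` is syntactic sugar for `rate()` multiplied by the number of seconds in the range, so the result is extrapolated and may not be a whole number. Counter resets (e.g., a pod restart) are handled automatically, but increments that happened between the last scrape and the reset are never observed and are lost. Without the outer `sum()`, the query returns one result per label combination rather than a single total.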
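The reset behaviour can be sketched in a few lines of Python. This is a simplified model that assumes any drop in value is a reset and skips the extrapolation Prometheus performs; `increase_with_resets` is a hypothetical name used only for this illustration.

```python
def increase_with_resets(values):
    """Sum the increases between consecutive samples, treating any drop as a reset."""
    total = 0.0
    for prev, cur in zip(values, values[1:]):
        if cur < prev:
            # The counter restarted from zero, so everything seen since counts.
            total += cur
        else:
            total += cur - prev
    return total


# Scraped values; the process restarted between the third and fourth scrape.
scraped = [100, 110, 120, 5, 15]
print(increase_with_resets(scraped))  # 10 + 10 + 5 + 10
```

If the process had served 8 more requests after the scrape at 120 and before restarting, those 8 never appear in any sample, so the computed 35 would undercount the true 43.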

Drill 3: Request rate per second [E]

Question: What is the current request rate (RPS) for grokdevops?

# Your query here

Answer

`sum(rate(http_requests_total{job="grokdevops"}[5m]))`

Drill 4: Request rate by status code [I]

Question: Break down the request rate by HTTP status code.

# Your query here

Answer

`sum(rate(http_requests_total{job="grokdevops"}[5m])) by (status)`

Drill 5: Error rate percentage [I]

Question: What percentage of requests are returning 5xx errors?

# Your query here

Answer

`sum(rate(http_requests_total{job="grokdevops",status=~"5.."}[5m])) / sum(rate(http_requests_total{job="grokdevops"}[5m])) * 100`

Drill 6: p99 latency [I]

Question: What is the 99th percentile response time?

# Your query here

Answer

`histogram_quantile(0.99, sum(rate(http_request_duration_seconds_bucket{job="grokdevops"}[5m])) by (le))`
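For intuition about what `histogram_quantile()` does with the `le` buckets, the following Python sketch reproduces the core idea: find the bucket containing the target rank, then interpolate linearly inside it. It is a simplification that assumes 0 < q <= 1, sorted cumulative buckets with positive bounds ending in +Inf, and a non-empty histogram; the bucket values are made up for the example.

```python
import math


def histogram_quantile(q, buckets):
    """Estimate quantile q from cumulative (upper_bound, count) buckets."""
    total = buckets[-1][1]          # the +Inf bucket holds the total count
    rank = q * total
    for i, (upper, count) in enumerate(buckets):
        if count >= rank:
            break
    if math.isinf(upper):
        # The quantile falls in the +Inf bucket: report the highest finite bound.
        return buckets[-2][0]
    lower = buckets[i - 1][0] if i > 0 else 0.0
    below = buckets[i - 1][1] if i > 0 else 0.0
    # Linear interpolation: assume observations are spread evenly in the bucket.
    return lower + (upper - lower) * (rank - below) / (count - below)


# Hypothetical per-second bucket rates, cumulative by upper bound in seconds.
buckets = [(0.1, 50), (0.25, 80), (0.5, 95), (1.0, 99), (math.inf, 100)]
p99 = histogram_quantile(0.99, buckets)
p90 = histogram_quantile(0.90, buckets)
print(p99, p90)
```

Here p99 lands at the upper edge of the 0.5 to 1.0 bucket, and p90 is interpolated inside the 0.25 to 0.5 bucket. The estimate can never be more precise than the bucket boundaries allow, which is why bucket layout matters for latency SLOs.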

Drill 7: p50 latency [I]

Question: What is the median (50th percentile) response time?

# Your query here

Answer

`histogram_quantile(0.50, sum(rate(http_request_duration_seconds_bucket{job="grokdevops"}[5m])) by (le))`

Drill 8: CPU usage by pod [I]

Question: Show CPU usage (in cores) for each pod in the grokdevops namespace.

# Your query here

Answer

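`sum(rate(container_cpu_usage_seconds_total{namespace="grokdevops",container!=""}[5m])) by (pod)`

Gotcha: cAdvisor also exports a pod-level series with an empty `container` label. The `container!=""` matcher keeps that aggregate from being counted on top of the per-container series.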

Drill 9: Memory usage in MB [E]

Question: Show memory usage in megabytes for grokdevops pods.

# Your query here

Answer

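`sum(container_memory_working_set_bytes{namespace="grokdevops",container!=""}) by (pod) / 1024 / 1024`

Gotcha: dividing by 1024 twice yields mebibytes (MiB); divide by 1e6 for decimal megabytes. The `container!=""` matcher excludes the pod-level cgroup series, which would otherwise be added to the per-container values.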

Drill 10: Pod restart count [E]

Question: How many times has each pod restarted in the last hour?

# Your query here

Answer

`increase(kube_pod_container_status_restarts_total{namespace="grokdevops"}[1h])`

Drill 11: Target disappeared [I]

Question: Write a query that fires when the grokdevops target stops being scraped entirely.

# Your query here

Answer

`absent(up{job="grokdevops"})`

Drill 12: Top 5 CPU consumers [I]

Question: Show the top 5 pods by CPU usage across all namespaces.

# Your query here

Answer

`topk(5, sum(rate(container_cpu_usage_seconds_total{container!=""}[5m])) by (namespace, pod))`

Drill 13: Disk usage percentage [I]

Question: What percentage of PVC capacity is used for grokdevops?

# Your query here

Answer

`kubelet_volume_stats_used_bytes{namespace="grokdevops"} / kubelet_volume_stats_capacity_bytes{namespace="grokdevops"} * 100`

Drill 14: Node memory available [E]

Question: How much memory (GB) is available on each node?

# Your query here

Answer

`node_memory_MemAvailable_bytes / 1024 / 1024 / 1024`

Drill 15: Deployment replica mismatch [I]

Question: Find deployments where desired replicas don't match ready replicas.

# Your query here

Answer

`kube_deployment_spec_replicas{namespace="grokdevops"} != kube_deployment_status_replicas_ready{namespace="grokdevops"}`

Drill 16: CPU throttling [H]

Question: Which pods are being CPU throttled?

# Your query here

Answer

`sum(rate(container_cpu_cfs_throttled_seconds_total{namespace="grokdevops"}[5m])) by (pod) > 0`

Drill 17: Request rate change [H]

Question: Show how much the request rate has changed compared to 1 hour ago (as a ratio).

# Your query here

Answer

`sum(rate(http_requests_total{job="grokdevops"}[5m])) / sum(rate(http_requests_total{job="grokdevops"}[5m] offset 1h))`

Drill 18: Saturation - CPU requests vs capacity [H]

Question: What percentage of a pod's CPU request is actually being used?

# Your query here

Answer

`sum(rate(container_cpu_usage_seconds_total{namespace="grokdevops",container!=""}[5m])) by (pod) / sum(kube_pod_container_resource_requests{namespace="grokdevops",resource="cpu"}) by (pod) * 100`

Drill 19: HPA current vs desired [I]

Question: Compare HPA current replicas to desired replicas.

# Your query here

Answer

`kube_horizontalpodautoscaler_status_current_replicas{namespace="grokdevops"} != kube_horizontalpodautoscaler_status_desired_replicas{namespace="grokdevops"}`

Drill 20: Container OOMKills [I]

Question: Count OOMKill events in the last 24 hours.

# Your query here

Answer

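`increase(kube_pod_container_status_restarts_total{namespace="grokdevops"}[24h]) and on (namespace, pod, container) (kube_pod_container_status_last_terminated_reason{reason="OOMKilled",namespace="grokdevops"} == 1)`

Gotcha: `kube_pod_container_status_last_terminated_reason` is a gauge (1 while the most recent termination had that reason), so `increase()` on it is not meaningful. This query approximates the count as restarts in the last 24 hours for containers whose latest termination was an OOMKill; restarts for other reasons in the same window are included.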

Drill 21: Network receive rate [I]

Question: Show network bytes received per second for grokdevops pods.

# Your query here

Answer

`sum(rate(container_network_receive_bytes_total{namespace="grokdevops"}[5m])) by (pod)`

Drill 22: Slow endpoints [H]

Question: Which endpoints have p99 latency above 500ms?

# Your query here

Answer

`histogram_quantile(0.99, sum(rate(http_request_duration_seconds_bucket{job="grokdevops"}[5m])) by (le, handler)) > 0.5`

Drill 23: Error budget burn rate [H]

Question: For a 99.9% SLO, what is the current error budget burn rate?

# Your query here

Answer

`(sum(rate(http_requests_total{job="grokdevops",status=~"5.."}[1h])) / sum(rate(http_requests_total{job="grokdevops"}[1h]))) / 0.001`
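The division by 0.001 is the error budget for a 99.9% SLO (1 - 0.999). A short Python sketch of the same arithmetic follows; the 30-day SLO window in `days_to_exhaustion` is an assumption made for illustration and is not stated in the drill.

```python
def burn_rate(error_ratio, slo):
    """How many times faster than sustainable the error budget is being spent."""
    budget = 1.0 - slo  # a 99.9% SLO allows an error ratio of 0.001
    return error_ratio / budget


def days_to_exhaustion(rate, window_days=30):
    """Days until the budget is gone if this burn rate persists (assumed 30-day window)."""
    return window_days / rate


# 0.5% of requests failing against a 99.9% SLO.
rate = burn_rate(error_ratio=0.005, slo=0.999)
print(round(rate, 2), round(days_to_exhaustion(rate), 2))
```

A burn rate of 1 means the budget lasts exactly the SLO window; at roughly 5, as in this example, it would be gone in about 6 days.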

Drill 24: Predict disk full [H]

Question: When will a PVC run out of space at the current fill rate?

# Your query here

Answer

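`kubelet_volume_stats_available_bytes{namespace="grokdevops"} / deriv(kubelet_volume_stats_used_bytes{namespace="grokdevops"}[1h]) / 3600`

Under the hood: `deriv()` returns the growth in bytes per second, so available bytes divided by it gives seconds until full, and dividing by 3600 converts that to hours. A flat volume yields +Inf and a shrinking one yields a negative value.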
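`deriv()` estimates the slope with simple linear regression over the samples in the range. The Python sketch below mirrors the query under that model; the helper names and the sample data are hypothetical, and the `None` return for non-growing volumes is a choice made for this example rather than PromQL behaviour.

```python
def least_squares_slope(samples):
    """Slope (value units per second) of a least-squares line through (t, v) samples."""
    n = len(samples)
    mean_t = sum(t for t, _ in samples) / n
    mean_v = sum(v for _, v in samples) / n
    cov = sum((t - mean_t) * (v - mean_v) for t, v in samples)
    var = sum((t - mean_t) ** 2 for t, _ in samples)
    return cov / var


def hours_until_full(available_bytes, used_samples):
    """Hours until the volume fills at the fitted growth rate; None if not growing."""
    slope = least_squares_slope(used_samples)
    if slope <= 0:
        return None  # flat or shrinking usage never fills the volume
    return available_bytes / slope / 3600


MIB = 1024 * 1024
# Used bytes growing by 1 MiB per minute, sampled every 60s for one hour.
used = [(t, 10_000 * MIB + (t // 60) * MIB) for t in range(0, 3601, 60)]
eta = hours_until_full(6_000 * MIB, used)
print(round(eta, 2))
```

With 6000 MiB free and growth of 1 MiB per minute, the estimate is 100 hours. Because the fit uses only the last hour, a short burst of writes can make the prediction swing sharply.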

Drill 25: Multi-metric alert condition [H]

Question: Write a query that fires when error rate is above 5% AND request rate is above 10 RPS (to avoid false positives on low traffic).

# Your query here

Answer

`(sum(rate(http_requests_total{job="grokdevops",status=~"5.."}[5m])) / sum(rate(http_requests_total{job="grokdevops"}[5m])) > 0.05) and (sum(rate(http_requests_total{job="grokdevops"}[5m])) > 10)`
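The `and` operator keeps the left-hand result only when the right-hand side also returns something, so both conditions must hold. The same gating logic, sketched in Python with a hypothetical `should_fire` helper and the thresholds from the question:

```python
def should_fire(error_rps, total_rps, max_error_ratio=0.05, min_rps=10.0):
    """Fire only when the error ratio is above threshold AND traffic is high enough."""
    if total_rps <= min_rps:
        return False  # traffic condition fails, so the alert stays silent
    return error_rps / total_rps > max_error_ratio


print(should_fire(error_rps=1.0, total_rps=2.0))    # 50% errors, but only 2 RPS
print(should_fire(error_rps=6.0, total_rps=100.0))  # 6% errors at 100 RPS
print(should_fire(error_rps=4.0, total_rps=100.0))  # 4% errors at 100 RPS
```

Only the second case fires. The first shows the false positive the traffic floor is meant to suppress: one failing request out of two looks like a 50% error rate.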

Prerequisites

Observability Deep Dive (Topic Pack, L2)

Adversarial Interview Gauntlet (30 sequences) (Scenario, L2) — Prometheus
Alerting Rules (Topic Pack, L2) — Prometheus
Alerting Rules Drills (Drill, L2) — Prometheus
Capacity Planning (Topic Pack, L2) — Prometheus
Case Study: Disk Full — Runaway Logs, Fix Is Loki Retention (Case Study, L2) — Prometheus
Case Study: Grafana Dashboard Empty — Prometheus Blocked by NetworkPolicy (Case Study, L2) — Prometheus
Datadog Flashcards (CLI) (flashcard_deck, L1) — Prometheus
Incident Simulator (18 scenarios) (CLI) (Exercise Set, L2) — Prometheus
Interview: Prometheus Target Down (Scenario, L2) — Prometheus
Lab: Prometheus Target Down (CLI) (Lab, L2) — Prometheus
