PromQL Drills

25 drills for Prometheus query language muscle memory. Each should take 1-5 minutes.

Difficulty: [E] Easy = single function | [I] Intermediate = multiple functions or by-clause | [H] Hard = multi-step or nested

Remember: The PromQL function hierarchy: rate() for counters (per-second average), increase() for counters (total increase over range), avg/sum/max/min for aggregation, histogram_quantile() for percentiles. Mnemonic: "Rate for speed, Increase for total, Histogram for percentiles." Never use rate() on a gauge — it is only for monotonically increasing counters.

Gotcha: rate() requires a range vector (e.g., [5m]). It needs at least two samples inside the range; with fewer, it returns no data. Make the range at least 4x the scrape interval so that a missed scrape still leaves enough samples: with a 15s scrape interval, use rate(metric[1m]) as the minimum. Ranges close to the scrape interval produce gappy, noisy results.
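To make the "at least two samples" rule concrete, here is a minimal Python sketch that counts how many scrape timestamps fall inside a lookback window. The helper names `samples_in_window` and `can_compute_rate` are hypothetical, and the half-open window is a simplification; this is not Prometheus's own selection logic.

```python
from typing import List, Sequence


def samples_in_window(scrape_times: Sequence[float], eval_time: float, range_s: float) -> List[float]:
    """Return the scrape timestamps inside the lookback window (eval_time - range_s, eval_time].

    Simplification: real range selection also deals with staleness and jitter.
    """
    return [t for t in scrape_times if eval_time - range_s < t <= eval_time]


def can_compute_rate(scrape_times: Sequence[float], eval_time: float, range_s: float) -> bool:
    """A rate needs at least two samples in the window."""
    return len(samples_in_window(scrape_times, eval_time, range_s)) >= 2


# Scrapes every 15 seconds.
regular = [0, 15, 30, 45, 60]
# Same target, but the scrape at t=45 failed.
with_gap = [0, 15, 30, 60]
```

With a 20s range, `can_compute_rate(regular, 58, 20)` is false because only the sample at t=45 is in the window. With a 60s range (4x the interval), `can_compute_rate(with_gap, 62, 60)` is still true despite the missed scrape.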


Drill 1: Is the target up? [E]

Question: Check if the grokdevops target is being scraped successfully.

# Your query here

Answer `up{job="grokdevops"}`

Under the hood: `up` is a synthetic metric Prometheus creates for every scrape target. `1` = last scrape succeeded, `0` = last scrape failed. It is the simplest health check available and the first thing to verify when a target seems silent.

Drill 2: Total requests in last hour [E]

Question: How many total HTTP requests has grokdevops handled in the last hour?

# Your query here

Answer `sum(increase(http_requests_total{job="grokdevops"}[1h]))`

Gotcha: `increase()` is syntactic sugar for `rate()` multiplied by the number of seconds in the range, so it extrapolates and can return non-integer values. Without `sum()`, the query returns one value per series rather than a single total. Counter resets (e.g., a pod restart) are handled automatically, but any increments between the last scrape before a reset and the reset itself were never observed and are lost.
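The reset handling can be sketched in a few lines of Python. This is a simplified model under stated assumptions: it works on raw counter values only, performs no extrapolation to the window edges, and the function name `increase_with_resets` is hypothetical.

```python
from typing import Sequence


def increase_with_resets(values: Sequence[float]) -> float:
    """Sum the increase across consecutive counter samples.

    Any drop between two samples is treated as a reset to zero, so the
    new value itself is counted as the increase since the reset.
    """
    total = 0.0
    for prev, cur in zip(values, values[1:]):
        if cur < prev:
            total += cur          # counter restarted from 0
        else:
            total += cur - prev   # normal monotonic growth
    return total


# Counter samples around a pod restart: 100 -> 110 -> (restart) -> 5 -> 20
observed = [100, 110, 5, 20]
```

`increase_with_resets(observed)` gives 30.0 (10 before the restart, 5 and 15 after). If the counter had actually reached 118 before the restart, those 8 extra requests fall between scrapes and cannot be recovered.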

Drill 3: Request rate per second [E]

Question: What is the current request rate (RPS) for grokdevops?

# Your query here

Answer `sum(rate(http_requests_total{job="grokdevops"}[5m]))`

Drill 4: Request rate by status code [I]

Question: Break down the request rate by HTTP status code.

# Your query here

Answer `sum(rate(http_requests_total{job="grokdevops"}[5m])) by (status)`

Drill 5: Error rate percentage [I]

Question: What percentage of requests are returning 5xx errors?

# Your query here

Answer `sum(rate(http_requests_total{job="grokdevops",status=~"5.."}[5m])) / sum(rate(http_requests_total{job="grokdevops"}[5m])) * 100`

Drill 6: p99 latency [I]

Question: What is the 99th percentile response time?

# Your query here

Answer `histogram_quantile(0.99, sum(rate(http_request_duration_seconds_bucket{job="grokdevops"}[5m])) by (le))`
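For intuition about what `histogram_quantile()` does with the `le` buckets, here is a simplified Python sketch of linear interpolation inside cumulative buckets. It assumes classic histograms with positive bucket bounds and a lower bound of 0 for the first bucket; the function name and the sample data are illustrative, not Prometheus source code.

```python
import math
from typing import List, Tuple


def histogram_quantile(q: float, buckets: List[Tuple[float, float]]) -> float:
    """Estimate a quantile from (upper_bound, cumulative_count) buckets.

    The last bucket is expected to be the +Inf bucket holding the total count.
    """
    if not 0 < q <= 1:
        raise ValueError("q must be in (0, 1]")
    buckets = sorted(buckets)
    total = buckets[-1][1]
    if total == 0:
        return math.nan
    rank = q * total
    prev_bound, prev_count = 0.0, 0.0
    for upper, count in buckets:
        if count >= rank:
            if math.isinf(upper):
                # Quantile lands in the +Inf bucket: report the highest finite bound.
                return prev_bound
            # Assume observations are spread evenly inside the bucket.
            fraction = (rank - prev_count) / (count - prev_count)
            return prev_bound + (upper - prev_bound) * fraction
        prev_bound, prev_count = upper, count
    return math.nan


# Hypothetical per-second bucket rates, keyed by `le` (seconds).
example = [(0.1, 20.0), (0.25, 60.0), (0.5, 90.0), (1.0, 100.0), (math.inf, 100.0)]
```

`histogram_quantile(0.5, example)` lands in the 0.1 to 0.25 bucket and interpolates to about 0.2125 seconds. The estimate is only as precise as the bucket boundaries, which is why bucket layout matters for p99 accuracy.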

Drill 7: p50 latency [I]

Question: What is the median (50th percentile) response time?

# Your query here

Answer `histogram_quantile(0.50, sum(rate(http_request_duration_seconds_bucket{job="grokdevops"}[5m])) by (le))`

Drill 8: CPU usage by pod [I]

Question: Show CPU usage (in cores) for each pod in the grokdevops namespace.

# Your query here

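Answer `sum(rate(container_cpu_usage_seconds_total{namespace="grokdevops",container!=""}[5m])) by (pod)`

Gotcha: cAdvisor typically also exports pod-level aggregate series with an empty `container` label. Without the `container!=""` filter, `sum ... by (pod)` can count the same CPU time twice.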

Drill 9: Memory usage in MB [E]

Question: Show memory usage in megabytes for grokdevops pods.

# Your query here

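Answer `sum(container_memory_working_set_bytes{namespace="grokdevops",container!=""}) by (pod) / 1024 / 1024`

Gotcha: Dividing by 1024 twice yields mebibytes (MiB); divide by 1e6 instead for decimal megabytes. The `container!=""` filter excludes pod-level aggregate series that would otherwise be double-counted in the sum.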

Drill 10: Pod restart count [E]

Question: How many times has each pod restarted in the last hour?

# Your query here

Answer `increase(kube_pod_container_status_restarts_total{namespace="grokdevops"}[1h])`

Drill 11: Target disappeared [I]

Question: Write a query that fires when the grokdevops target stops being scraped entirely.

# Your query here

Answer `absent(up{job="grokdevops"})`

Drill 12: Top 5 CPU consumers [I]

Question: Show the top 5 pods by CPU usage across all namespaces.

# Your query here

Answer `topk(5, sum(rate(container_cpu_usage_seconds_total{container!=""}[5m])) by (namespace, pod))`

Drill 13: Disk usage percentage [I]

Question: What percentage of PVC capacity is used for grokdevops?

# Your query here

Answer `kubelet_volume_stats_used_bytes{namespace="grokdevops"} / kubelet_volume_stats_capacity_bytes{namespace="grokdevops"} * 100`

Drill 14: Node memory available [E]

Question: How much memory (GB) is available on each node?

# Your query here

Answer `node_memory_MemAvailable_bytes / 1024 / 1024 / 1024`

Drill 15: Deployment replica mismatch [I]

Question: Find deployments where desired replicas don't match ready replicas.

# Your query here

Answer `kube_deployment_spec_replicas{namespace="grokdevops"} != kube_deployment_status_ready_replicas{namespace="grokdevops"}`

Drill 16: CPU throttling [H]

Question: Which pods are being CPU throttled?

# Your query here

Answer `sum(rate(container_cpu_cfs_throttled_seconds_total{namespace="grokdevops"}[5m])) by (pod) > 0`

Drill 17: Request rate change [H]

Question: Show how much the request rate has changed compared to 1 hour ago (as a ratio).

# Your query here

Answer `sum(rate(http_requests_total{job="grokdevops"}[5m])) / sum(rate(http_requests_total{job="grokdevops"}[5m] offset 1h))`

Drill 18: Saturation - CPU requests vs capacity [H]

Question: What percentage of a pod's CPU request is actually being used?

# Your query here

Answer `sum(rate(container_cpu_usage_seconds_total{namespace="grokdevops",container!=""}[5m])) by (pod) / sum(kube_pod_container_resource_requests{namespace="grokdevops",resource="cpu"}) by (pod) * 100`

Drill 19: HPA current vs desired [I]

Question: Compare HPA current replicas to desired replicas.

# Your query here

Answer `kube_horizontalpodautoscaler_status_current_replicas{namespace="grokdevops"} != kube_horizontalpodautoscaler_status_desired_replicas{namespace="grokdevops"}`

Drill 20: Container OOMKills [I]

Question: Count OOMKill events in the last 24 hours.

# Your query here

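Answer `increase(kube_pod_container_status_restarts_total{namespace="grokdevops"}[24h]) and on (namespace, pod, container) kube_pod_container_status_last_terminated_reason{reason="OOMKilled",namespace="grokdevops"} == 1`

Gotcha: `kube_pod_container_status_last_terminated_reason` is a gauge (1 while that reason is the most recent one), so `increase()` on it is not meaningful. The query above counts restarts in the last 24 hours for containers whose most recent termination was an OOMKill. It is an approximation: restarts for other reasons in the same window are included.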

Drill 21: Network receive rate [I]

Question: Show network bytes received per second for grokdevops pods.

# Your query here

Answer `sum(rate(container_network_receive_bytes_total{namespace="grokdevops"}[5m])) by (pod)`

Drill 22: Slow endpoints [H]

Question: Which endpoints have p99 latency above 500ms?

# Your query here

Answer `histogram_quantile(0.99, sum(rate(http_request_duration_seconds_bucket{job="grokdevops"}[5m])) by (le, handler)) > 0.5`

Drill 23: Error budget burn rate [H]

Question: For a 99.9% SLO, what is the current error budget burn rate?

# Your query here

Answer `(sum(rate(http_requests_total{job="grokdevops",status=~"5.."}[1h])) / sum(rate(http_requests_total{job="grokdevops"}[1h]))) / 0.001`
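The arithmetic behind the burn rate is short enough to sketch in Python. The 30-day SLO window and the function names below are assumptions added for illustration; the drill itself only fixes the 99.9% target.

```python
def burn_rate(error_ratio: float, slo: float) -> float:
    """Observed error ratio divided by the error budget (1 - SLO)."""
    budget = 1.0 - slo
    return error_ratio / budget


def hours_to_exhaustion(rate: float, slo_window_hours: float) -> float:
    """Hours until the whole budget is spent if the burn rate stays constant."""
    return slo_window_hours / rate


# Assumed example: 0.5% of requests failing against a 99.9% SLO
# measured over a 30-day (720 hour) window.
rate_now = burn_rate(error_ratio=0.005, slo=0.999)
remaining = hours_to_exhaustion(rate_now, slo_window_hours=720)
```

Here the burn rate is about 5, meaning the budget is being consumed five times faster than the SLO allows, and a full 30-day budget would last roughly 144 hours. A burn rate of 1 spends the budget exactly at the end of the window.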

Drill 24: Predict disk full [H]

Question: When will a PVC run out of space at the current fill rate?

# Your query here

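Answer `kubelet_volume_stats_available_bytes{namespace="grokdevops"} / deriv(kubelet_volume_stats_used_bytes{namespace="grokdevops"}[1h]) / 3600`

Under the hood: `deriv()` returns the fill rate in bytes per second, so available bytes divided by it gives seconds, and dividing by 3600 converts the result to hours. A zero or negative slope (usage flat or shrinking) yields infinite or negative values, which mean the volume is not currently filling.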
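`deriv()` fits a simple linear regression to the samples in the range. The Python sketch below shows the same idea on a handful of points; the function names and the sample data are hypothetical.

```python
from typing import Optional, Sequence, Tuple

Sample = Tuple[float, float]  # (timestamp in seconds, used bytes)


def least_squares_slope(samples: Sequence[Sample]) -> float:
    """Slope of the best-fit line through the samples, in bytes per second."""
    n = len(samples)
    mean_t = sum(t for t, _ in samples) / n
    mean_v = sum(v for _, v in samples) / n
    cov = sum((t - mean_t) * (v - mean_v) for t, v in samples)
    var = sum((t - mean_t) ** 2 for t, _ in samples)
    return cov / var


def hours_until_full(available_bytes: float, used_samples: Sequence[Sample]) -> Optional[float]:
    """Project hours until the volume is full; None if usage is not growing."""
    slope = least_squares_slope(used_samples)
    if slope <= 0:
        return None
    return available_bytes / slope / 3600


# Usage growing by 1000 bytes per second, with 7.2 MB still free.
growing = [(0.0, 0.0), (60.0, 60_000.0), (120.0, 120_000.0)]
flat = [(0.0, 500.0), (60.0, 500.0), (120.0, 500.0)]
```

`hours_until_full(7_200_000, growing)` returns 2.0, and the flat series returns `None`. The projection assumes the recent trend continues, so a short burst of writes inside the range can make it overly pessimistic.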

Drill 25: Multi-metric alert condition [H]

Question: Write a query that fires when error rate is above 5% AND request rate is above 10 RPS (to avoid false positives on low traffic).

# Your query here

Answer `(sum(rate(http_requests_total{job="grokdevops",status=~"5.."}[5m])) / sum(rate(http_requests_total{job="grokdevops"}[5m])) > 0.05) and (sum(rate(http_requests_total{job="grokdevops"}[5m])) > 10)`
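The reason for the second condition is easiest to see with numbers. This small Python sketch mirrors the two thresholds from the query; the function name `should_fire` and its parameters are illustrative only.

```python
def should_fire(error_rps: float, total_rps: float,
                max_error_ratio: float = 0.05, min_rps: float = 10.0) -> bool:
    """True only when both the error ratio and the traffic level exceed their thresholds."""
    if total_rps <= 0:
        return False  # no traffic, nothing to alert on
    error_ratio = error_rps / total_rps
    return error_ratio > max_error_ratio and total_rps > min_rps
```

At 2 RPS with 1 failing request per second the error ratio is 50%, yet `should_fire(1, 2)` is false because the traffic is too low to be meaningful. At 20 RPS with 2 errors per second (10%), `should_fire(2, 20)` is true.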
