Portal | Level: L2: Operations | Topics: Prometheus, Loki | Domain: Observability
Observability Drills¶
Remember: The observability debugging flow: Alert fires -> check Grafana dashboard (what metric is off?) -> check Prometheus (what changed?) -> check Loki logs (why did it change?) -> check Tempo traces (where in the request path?). Each tool answers a different question. Jumping straight to logs without checking metrics first wastes time on red herrings.
Gotcha: If a Prometheus target shows as DOWN, the problem is usually a label selector mismatch between the ServiceMonitor and the actual Service. Run kubectl get servicemonitor -o yaml and compare its selector.matchLabels to the Service's metadata.labels: every key/value pair in the selector must appear on the Service exactly as written. A single typo keeps the target permanently DOWN with no obvious error message.
15 drills for Prometheus, Loki, Grafana, and Tempo operations. Each takes 1-5 minutes.
Difficulty: [E] Easy (recall) | [I] Intermediate (combine flags/tools) | [H] Hard (multi-step debugging)
Drill 1: Check Prometheus targets [I]¶
Question: List all Prometheus scrape targets and find which ones are DOWN.
Relevant runbook: training/library/runbooks/prometheus_target_down.md
Answer: answers/obs_answers.md
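As a complement to the kubectl-based answer, the same check can be done against the Prometheus HTTP API, whose /api/v1/targets endpoint reports a health field per active target. The sketch below only filters an already-parsed response. The function name and the sample job and instance values are hypothetical, and fetching the JSON (for example through a port-forward) is left to you.

```python
def down_targets(targets_response: dict) -> list[dict]:
    """Return job, instance, and last error for every active target not reporting 'up'."""
    active = targets_response.get("data", {}).get("activeTargets", [])
    return [
        {
            "job": target.get("labels", {}).get("job"),
            "instance": target.get("labels", {}).get("instance"),
            "lastError": target.get("lastError", ""),
        }
        for target in active
        if target.get("health") != "up"
    ]


# Hypothetical response trimmed to the fields used above.
sample = {
    "status": "success",
    "data": {
        "activeTargets": [
            {"labels": {"job": "node-exporter", "instance": "10.0.0.5:9100"},
             "health": "up", "lastError": ""},
            {"labels": {"job": "grokdevops", "instance": "10.0.0.9:8000"},
             "health": "down", "lastError": "connection refused"},
        ]
    },
}

for target in down_targets(sample):
    print(target["job"], target["instance"], target["lastError"])
```

Targets with health "unknown" are reported too, since they have not yet been scraped successfully.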
Drill 2: Check if metrics-server is running [E]¶
Question: Verify that the metrics-server pod is running and the metrics API is available.
Relevant runbook: training/library/runbooks/kubernetes/hpa_not_scaling.md
Answer: answers/obs_answers.md
Drill 3: Port-forward to Grafana [E]¶
Question: Forward local port 3000 to the Grafana service in the monitoring namespace.
Answer: answers/obs_answers.md
Drill 4: Check Promtail pods [E]¶
Question: List all Promtail pods. Verify they are running on every node.
Relevant runbook: training/library/runbooks/observability/loki_no_logs.md
Answer: answers/obs_answers.md
Drill 5: Query Prometheus directly [I]¶
Question: Port-forward to Prometheus and query the up metric to see which targets are healthy.
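One way to read the result once the port-forward is up: the instant-query endpoint /api/v1/query returns one series per target for up, with the value "1" for a successful scrape and "0" otherwise. The sketch below is an illustration only. The function names are hypothetical, and the base URL http://localhost:9090 in the comment assumes the port-forward maps Prometheus to local port 9090.

```python
import json
import urllib.parse
import urllib.request


def query_instant(base_url: str, promql: str, timeout: float = 5.0) -> dict:
    """Run an instant query against the Prometheus HTTP API and return the parsed JSON."""
    url = f"{base_url}/api/v1/query?" + urllib.parse.urlencode({"query": promql})
    with urllib.request.urlopen(url, timeout=timeout) as resp:
        return json.load(resp)


def split_by_up(response: dict) -> tuple[list[str], list[str]]:
    """Split the series of an `up` query into (healthy, unhealthy) 'job/instance' names."""
    healthy: list[str] = []
    unhealthy: list[str] = []
    for series in response["data"]["result"]:
        labels = series["metric"]
        name = f'{labels.get("job", "?")}/{labels.get("instance", "?")}'
        _timestamp, value = series["value"]  # the sample value arrives as a string
        (healthy if value == "1" else unhealthy).append(name)
    return healthy, unhealthy


# With a port-forward running (assumed local port 9090):
#   healthy, unhealthy = split_by_up(query_instant("http://localhost:9090", "up"))
```

Keeping the network call separate from the parsing makes the parsing easy to check against a saved response.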
Drill 6: Check ServiceMonitor labels [I]¶
Question: Verify that the grokdevops ServiceMonitor's selector matches the actual service labels.
Relevant runbook: training/library/runbooks/prometheus_target_down.md
Answer: answers/obs_answers.md
Drill 7: Find Loki data source [I]¶
Question: Check if Loki is configured as a data source in Grafana (via CLI or API).
Answer: answers/obs_answers.md
Drill 8: Check Promtail config [I]¶
Question: View the Promtail configuration to see which log paths it's scraping.
Answer: answers/obs_answers.md
Drill 9: Check Tempo pods [E]¶
Question: Verify that Tempo is running and accessible in the monitoring namespace.
Relevant runbook: training/library/runbooks/observability/tempo_no_traces.md
Answer: answers/obs_answers.md
Drill 10: Identify why a target is down [H]¶
Question: A Prometheus target shows as DOWN. Find the ServiceMonitor and compare its selector to the service's labels.
Relevant lab: training/interactive/runtime-labs/lab-runtime-03-observability-target-down/
Answer: answers/obs_answers.md
Drill 11: Check Prometheus rules [E]¶
Question: List all PrometheusRule resources in the monitoring namespace.
Answer: answers/obs_answers.md
Drill 12: View Prometheus config [I]¶
Question: Check the Prometheus configuration to see scrape intervals and rule files.
Answer: answers/obs_answers.md
Drill 13: Check metrics endpoint [E]¶
Question: Verify that the grokdevops app exposes a /metrics endpoint.
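A 200 response alone does not prove the endpoint serves metrics, because an error page or an HTML landing page can also return 200. A rough sanity check, sketched below under the assumption that the app uses the Prometheus text exposition format, is to pull the metric names out of the response body and confirm the set is not empty. The function name and the sample body are hypothetical, and the parser is deliberately loose rather than a full implementation of the format.

```python
import re

# Metric name, optional {labels}, then the sample value token.
_SAMPLE_LINE = re.compile(r"^([a-zA-Z_:][a-zA-Z0-9_:]*)(?:\{.*\})?\s+(\S+)")


def metric_names(body: str) -> set[str]:
    """Return the metric names found in a text-exposition response body."""
    names: set[str] = set()
    for raw_line in body.splitlines():
        line = raw_line.strip()
        if not line or line.startswith("#"):  # skip blanks, HELP/TYPE lines, comments
            continue
        match = _SAMPLE_LINE.match(line)
        if not match:
            continue
        try:
            float(match.group(2))  # values are floats, including NaN and +Inf
        except ValueError:
            continue
        names.add(match.group(1))
    return names


# Hypothetical body, as it might be returned by the app's /metrics endpoint.
body = """\
# HELP http_requests_total Total HTTP requests.
# TYPE http_requests_total counter
http_requests_total{method="get",code="200"} 1027
process_cpu_seconds_total 12.5
"""
print(sorted(metric_names(body)))
```

An empty result for a body such as "Not Found" or an HTML page suggests the path or port is wrong, even if the HTTP request itself succeeded.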
Drill 14: View Grafana dashboards list [I]¶
Question: List available Grafana dashboards via the API.
Answer: answers/obs_answers.md
Drill 15: Full observability health check [H]¶
Question: In one sequence, verify that Prometheus, Loki, Promtail, Tempo, and Grafana are all running.
Answer: answers/obs_answers.md
Wiki Navigation¶
Related Content¶
- Incident Simulator (18 scenarios) (CLI) (Exercise Set, L2) — Loki, Prometheus
- Observability Architecture (Reference, L2) — Loki, Prometheus
- Observability Deep Dive (Topic Pack, L2) — Loki, Prometheus
- Skillcheck: Observability (Assessment, L2) — Loki, Prometheus
- Track: Observability (Reference, L2) — Loki, Prometheus
- Adversarial Interview Gauntlet (30 sequences) (Scenario, L2) — Prometheus
- Alerting Rules (Topic Pack, L2) — Prometheus
- Alerting Rules Drills (Drill, L2) — Prometheus
- Capacity Planning (Topic Pack, L2) — Prometheus
- Case Study: Disk Full — Runaway Logs, Fix Is Loki Retention (Case Study, L2) — Prometheus