Portal | Level: L2: Operations | Topics: Prometheus | Domain: Observability
Scenario: Prometheus Says Target Down¶
The Prompt¶
"Our Grafana dashboards suddenly show 'No data' for application metrics. Prometheus targets page shows our app target is missing entirely. The app is running fine — users can access it. What happened?"
Initial Report¶
Developer Slack message: "All our Grafana dashboards are blank since about 10:30 AM. The app is fine — users can log in — but we have zero visibility into metrics. We're flying blind."
Constraints¶
- Time pressure: You have 15 minutes before the next escalation. Without metrics, the team cannot detect further issues.
- Limited access: You have read access to the monitoring namespace but cannot restart Prometheus directly. Port-forwarding is available.
Observable Evidence¶
- Dashboard: All application panels in Grafana show "No data". Infrastructure panels (node CPU, etc.) still work.
- Prometheus /targets: The application scrape target is missing entirely from the targets list.
- Logs: Prometheus logs show no errors related to scraping — the target simply is not in its configuration.
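The "target missing entirely" symptom can also be confirmed from the Prometheus HTTP API rather than the UI. A minimal sketch, with a canned JSON sample standing in for what `curl -s http://localhost:9090/api/v1/targets` would return over the port-forward (the pool name shown is illustrative):

```shell
# List the scrape pools Prometheus has actually built from its config.
# In practice the JSON would come from: curl -s http://localhost:9090/api/v1/targets
sample='{"status":"success","data":{"activeTargets":[{"scrapePool":"serviceMonitor/monitoring/node-exporter/0","health":"up"}]}}'
echo "$sample" | grep -o '"scrapePool":"[^"]*"'
# → "scrapePool":"serviceMonitor/monitoring/node-exporter/0"
# If the app's pool is absent from this list, Prometheus never generated a
# scrape config for it: a service-discovery problem, not a failed scrape.
```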
Expected Investigation Path¶
# 1. Confirm the app is running
kubectl get pods -n grokdevops
kubectl port-forward svc/grokdevops -n grokdevops 8000:80 &
curl http://localhost:8000/metrics
# 2. Check ServiceMonitor
kubectl get servicemonitor -n grokdevops
kubectl get servicemonitor grokdevops -n grokdevops -o yaml
# 3. Compare ServiceMonitor selector with service labels
kubectl get svc grokdevops -n grokdevops --show-labels
# 4. Check Prometheus config
kubectl port-forward -n monitoring svc/kube-prometheus-stack-prometheus 9090:9090 &
# → open http://localhost:9090/targets and http://localhost:9090/config
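Step 3 above is where the discovery chain usually breaks. A hypothetical sketch of the two objects involved, using the scenario's `grokdevops` names with illustrative label values: the ServiceMonitor's `spec.selector.matchLabels` must match the Service's labels exactly for a scrape config to be generated.

```yaml
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: grokdevops
  namespace: grokdevops
spec:
  selector:
    matchLabels:
      app: grokdevops        # must equal the Service's 'app' label
  endpoints:
    - port: http             # must name a port defined on the Service
      interval: 60s
---
apiVersion: v1
kind: Service
metadata:
  name: grokdevops
  namespace: grokdevops
  labels:
    app: grokdevops          # renaming this (e.g. during a Helm upgrade) silently drops the target
spec:
  ports:
    - name: http
      port: 80
```

If the `app` label on the Service changes while the ServiceMonitor keeps the old value, Prometheus simply stops generating the scrape job, with no error logged, which matches the evidence above.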
Strong Answer¶
"If the app is healthy but Prometheus lost the target, the issue is in the service discovery chain, not the app itself. I'd check three things in order: First, the ServiceMonitor — does it still exist and does its selector.matchLabels match the service's labels? A label change during a Helm upgrade or manual edit could break the match. Second, I'd verify the service has endpoints — if pods aren't ready, the service has no IPs for Prometheus to scrape. Third, I'd check that Prometheus is configured to watch the app's namespace — with serviceMonitorSelectorNilUsesHelmValues=false, it watches all namespaces, but if that changed, it might not see our ServiceMonitor."
Common Traps¶
- Assuming Prometheus is broken — the app works fine, it's the scrape config that's wrong
- Not understanding ServiceMonitor — it's the bridge between Service and Prometheus
- Forgetting about label selectors — the most common cause is a label mismatch
- Ignoring the ~60s scrape interval — after a fix, the target takes up to a scrape cycle to reappear
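The label-mismatch trap can be checked with a quick comparison. This is a sketch with hard-coded hypothetical values standing in for the two `kubectl` outputs from the investigation steps:

```shell
# Compare the ServiceMonitor's selector with the Service's actual labels.
# In a live cluster these values would come from:
#   kubectl get servicemonitor grokdevops -n grokdevops -o jsonpath='{.spec.selector.matchLabels.app}'
#   kubectl get svc grokdevops -n grokdevops -o jsonpath='{.metadata.labels.app}'
selector_app="grokdevops"       # what the ServiceMonitor selects (hypothetical)
service_app="grokdevops-web"    # what the Service actually carries (hypothetical)
if [ "$selector_app" = "$service_app" ]; then
  echo "labels match: discovery should pick the target up within a scrape cycle"
else
  echo "label mismatch: selector wants app=$selector_app, Service has app=$service_app"
fi
```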
Practice and Links¶
- Lab: training/interactive/runtime-labs/lab-runtime-03-observability-target-down/
- Runbook: training/library/runbooks/prometheus_target_down.md
- Quest: training/interactive/exercises/levels/level-50/k8s-monitoring/
Wiki Navigation¶
Related Content¶
- Adversarial Interview Gauntlet (30 sequences) (Scenario, L2) — Prometheus
- Alerting Rules (Topic Pack, L2) — Prometheus
- Alerting Rules Drills (Drill, L2) — Prometheus
- Capacity Planning (Topic Pack, L2) — Prometheus
- Case Study: Disk Full — Runaway Logs, Fix Is Loki Retention (Case Study, L2) — Prometheus
- Case Study: Grafana Dashboard Empty — Prometheus Blocked by NetworkPolicy (Case Study, L2) — Prometheus
- Datadog Flashcards (CLI) (flashcard_deck, L1) — Prometheus
- Incident Simulator (18 scenarios) (CLI) (Exercise Set, L2) — Prometheus
- Lab: Prometheus Target Down (CLI) (Lab, L2) — Prometheus
- Monitoring Flashcards (CLI) (flashcard_deck, L1) — Prometheus
Pages that link here¶
- Alerting Rules - Skill Check
- Capacity Planning - Primer
- Interview Gauntlet: Alerts Firing but System Seems Fine
- Interview Gauntlet: Monitoring Stack from Scratch
- Interview Gauntlet: eBPF for Observability
- Interview Scenarios
- Level 5: SRE & Incident Response
- Log Analysis & Alerting Rules (PromQL / LogQL) - Primer
- Monitoring Migration (Legacy to Modern)
- Observability
- OpenTelemetry - Primer
- PromQL Drills
- Prometheus Deep Dive - Primer
- Runbook: Alert Storm (Flapping / Too Many Alerts)
- Runbook: Prometheus Target Down