Portal | Level: L2: Operations | Topics: Prometheus | Domain: Observability

Scenario: Prometheus Says Target Down

The Prompt

"Our Grafana dashboards suddenly show 'No data' for application metrics. Prometheus targets page shows our app target is missing entirely. The app is running fine — users can access it. What happened?"

Initial Report

Developer Slack message: "All our Grafana dashboards are blank since about 10:30 AM. The app is fine — users can log in — but we have zero visibility into metrics. We're flying blind."

Constraints

  • Time pressure: You have 15 minutes before the next escalation. Without metrics, the team cannot detect further issues.
  • Limited access: You have read access to the monitoring namespace but cannot restart Prometheus directly. Port-forwarding is available.

Observable Evidence

  • Dashboard: All application panels in Grafana show "No data". Infrastructure panels (node CPU, etc.) still work.
  • Prometheus /targets: The application scrape target is missing entirely from the targets list.
  • Logs: Prometheus logs show no errors related to scraping — the target simply is not in its configuration.
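The evidence above can be confirmed programmatically from the Prometheus HTTP API rather than by eyeballing the /targets page. A minimal sketch, assuming a port-forward to Prometheus on localhost:9090 — the JSON here is a trimmed, hand-written sample of an `/api/v1/targets` response, not real output:

```shell
# Trimmed, assumed sample of `curl -s http://localhost:9090/api/v1/targets`:
# only infrastructure targets remain, the grokdevops job is absent entirely.
targets_json='{"data":{"activeTargets":[{"labels":{"job":"node-exporter"},"health":"up"}]}}'

if printf '%s' "$targets_json" | grep -q '"job":"grokdevops"'; then
  echo "app target present"
else
  echo "app target missing from service discovery"
fi
```

A missing target (as here) points at service discovery; a present-but-down target would instead show `"health":"down"` with a scrape error, which matches a different failure mode.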

Expected Investigation Path

# 1. Confirm the app is running
kubectl get pods -n grokdevops
kubectl port-forward svc/grokdevops -n grokdevops 8000:80 &
curl http://localhost:8000/metrics

# 2. Check ServiceMonitor
kubectl get servicemonitor -n grokdevops
kubectl get servicemonitor grokdevops -n grokdevops -o yaml

# 3. Compare ServiceMonitor selector with service labels
kubectl get svc grokdevops -n grokdevops --show-labels

# 4. Check Prometheus config
kubectl port-forward -n monitoring svc/kube-prometheus-stack-prometheus 9090:9090 &
# → open http://localhost:9090/targets and http://localhost:9090/config
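Step 3 is where the failure usually hides: the ServiceMonitor's `selector.matchLabels` must match the Service's labels, and each endpoint must name a real Service port. A hypothetical matching pair for illustration — names, labels, and ports are assumptions, not pulled from the cluster:

```yaml
# Illustrative only: if selector.matchLabels stops matching the Service's
# metadata.labels (e.g. after a Helm upgrade renames a label), Prometheus
# silently drops the target from discovery — no error is logged.
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: grokdevops
  namespace: grokdevops
spec:
  selector:
    matchLabels:
      app: grokdevops        # must match the Service's labels below
  endpoints:
    - port: http             # must name a port defined in the Service spec
      path: /metrics
---
apiVersion: v1
kind: Service
metadata:
  name: grokdevops
  namespace: grokdevops
  labels:
    app: grokdevops          # renaming this breaks the match above
spec:
  ports:
    - name: http
      port: 80
      targetPort: 8000
```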

Strong Answer

"If the app is healthy but Prometheus lost the target, the issue is in the service discovery chain, not the app itself. I'd check three things in order: First, the ServiceMonitor — does it still exist and does its selector.matchLabels match the service's labels? A label change during a Helm upgrade or manual edit could break the match. Second, I'd verify the service has endpoints — if pods aren't ready, the service has no IPs for Prometheus to scrape. Third, I'd check that Prometheus is configured to watch the app's namespace — with serviceMonitorSelectorNilUsesHelmValues=false, it watches all namespaces, but if that changed, it might not see our ServiceMonitor."
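The first check in the answer above — selector versus service labels — can be sketched as a quick comparison. The label values here are hypothetical stand-ins for real `kubectl` output, and the check is simplified to string equality (actual matching is key-by-key: matchLabels must be a subset of the Service's labels):

```shell
# Assumed values standing in for:
#   kubectl get servicemonitor grokdevops -n grokdevops -o jsonpath='{.spec.selector.matchLabels}'
#   kubectl get svc grokdevops -n grokdevops -o jsonpath='{.metadata.labels}'
sm_selector='app=grokdevops'
svc_labels='app.kubernetes.io/name=grokdevops'   # assumed rename from a Helm upgrade

if [ "$sm_selector" = "$svc_labels" ]; then
  echo "selector matches service labels"
else
  echo "MISMATCH: ServiceMonitor selects '$sm_selector' but Service carries '$svc_labels'"
fi
```

In this assumed scenario the chart upgrade switched `app` to the `app.kubernetes.io/name` convention, so the ServiceMonitor selects nothing and the target vanishes without a single error in the Prometheus logs.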

Common Traps

  • Assuming Prometheus is broken — the app works fine, it's the scrape config that's wrong
  • Not understanding ServiceMonitor — it's the bridge between Service and Prometheus
  • Forgetting about label selectors — the most common cause is a label mismatch
  • Ignoring the ~60s scrape interval — changes take a scrape cycle to reflect

Related Resources

  • Lab: training/interactive/runtime-labs/lab-runtime-03-observability-target-down/
  • Runbook: training/library/runbooks/prometheus_target_down.md
  • Quest: training/interactive/exercises/levels/level-50/k8s-monitoring/
