Solution: Lab Runtime 03 -- Observability Target Down¶
SPOILER WARNING: Try to solve it yourself first. Use hints progressively.
Hint Ladder¶
Hint 1: Prometheus discovers targets through ServiceMonitor resources. If a target is down, the ServiceMonitor may be misconfigured.
Hint 2: Check the ServiceMonitor's label selector. Compare it to the actual labels on the grokdevops Service.
Hint 3: Run kubectl get servicemonitor -n monitoring -o yaml | grep -A5 selector and kubectl get svc grokdevops -n grokdevops --show-labels. Do the labels match?
Hint 4: The break script modified the ServiceMonitor selector to use non-matching labels. Restore the correct selector or run ./fix.sh.
Minimal Solution¶
# Identify the mismatch
kubectl get servicemonitor -n monitoring -o jsonpath='{range .items[*]}{.metadata.name}: {.spec.selector.matchLabels}{"\n"}{end}'
kubectl get svc grokdevops -n grokdevops --show-labels
# Fix by restoring the correct selector (or run ./fix.sh)
./fix.sh
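If you'd rather repair it by hand than run the script, a patch of this shape restores the selector. The ServiceMonitor name and the app: grokdevops label are assumptions here; substitute the name from the jsonpath output and the labels that --show-labels actually reports:

# Manual alternative -- assumed ServiceMonitor name and label; use what the commands above report
kubectl patch servicemonitor grokdevops -n monitoring --type merge \
  -p '{"spec":{"selector":{"matchLabels":{"app":"grokdevops"}}}}'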
Explain¶
Symptom: Prometheus targets page shows the grokdevops target as DOWN or missing entirely.
Evidence: The ServiceMonitor's spec.selector.matchLabels doesn't match the labels on the grokdevops Service. The Prometheus Operator uses that selector to decide which Services to generate scrape configs for.
Root cause: The Prometheus Operator discovers scrape targets through ServiceMonitor resources. Each ServiceMonitor carries a label selector that must match a Service; when the labels don't match, no scrape config is ever generated for that target, so it appears as missing or down.
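As a minimal sketch of that relationship, assuming the Service carries app: grokdevops (take the real key and value from --show-labels), the two objects line up like this:

# Service in the grokdevops namespace -- these labels are what the ServiceMonitor selects on
apiVersion: v1
kind: Service
metadata:
  name: grokdevops
  namespace: grokdevops
  labels:
    app: grokdevops            # assumed label; verify with --show-labels
spec:
  ports:
    - name: http               # port name referenced by the ServiceMonitor endpoint
      port: 8080
---
# ServiceMonitor in the monitoring namespace -- its selector must match the Service labels above
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: grokdevops
  namespace: monitoring
  labels:
    release: kube-prometheus-stack   # often required by the Prometheus CR's serviceMonitorSelector (step 1 below)
spec:
  namespaceSelector:
    matchNames:
      - grokdevops
  selector:
    matchLabels:
      app: grokdevops          # must equal the Service's labels, or no scrape config is generated
  endpoints:
    - port: http               # must name an existing port on the Service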
Key insight: There are two label matching steps: (1) Prometheus operator must find the ServiceMonitor (via serviceMonitorSelector on the Prometheus CR), and (2) the ServiceMonitor must find the Service (via spec.selector.matchLabels). Either can fail.
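The first step lives on the Prometheus custom resource itself; in a typical kube-prometheus-stack install it selects ServiceMonitors by a release label, sketched here with assumed names and values:

# Prometheus CR (managed by the operator) -- step (1): which ServiceMonitors get picked up at all
apiVersion: monitoring.coreos.com/v1
kind: Prometheus
metadata:
  name: k8s
  namespace: monitoring
spec:
  serviceMonitorSelector:
    matchLabels:
      release: kube-prometheus-stack   # ServiceMonitors without this label are silently ignored
  serviceMonitorNamespaceSelector: {}  # empty selector = look in all namespaces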
Prevent¶
- Use helm template to verify ServiceMonitor selectors before deploying
- Add an alerting rule for when expected targets disappear (see the sketch after this list)
- Keep the ServiceMonitor in the same Helm chart as the app so labels stay in sync