---
tags:
  - observability
  - l1
  - runbook
  - prometheus
---

Portal | Level: L1: Foundations | Topics: Prometheus | Domain: Observability
Runbook: Prometheus Target Down¶
| Field | Value |
|---|---|
| Domain | Observability |
| Alert | up == 0 for any scrape target for >2 minutes |
| Severity | P2 |
| Est. Resolution Time | 15-30 minutes |
| Escalation Timeout | 30 minutes — page if not resolved |
| Last Tested | 2026-03-19 |
| Prerequisites | kubectl access, Prometheus UI access (port-forward if needed), access to the scraped service's namespace |
Quick Assessment (30 seconds)¶
```shell
# Run this first — it tells you the scope of the problem
kubectl port-forward -n monitoring svc/prometheus-operated 9090:9090 &
# Then open http://localhost:9090/targets in your browser
```
If the output shows:

- One target DOWN with a specific error message → Continue to Step 1 with that target in focus
- All or many targets DOWN simultaneously → Likely a Prometheus configuration or RBAC problem; skip to Step 5
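The same scope check can be done without the UI by counting down targets per job (a sketch; assumes the port-forward above is running):

```shell
# Count down targets broken out by job: one job affected vs. many is the key triage signal
curl -s -G 'http://localhost:9090/api/v1/query' \
  --data-urlencode 'query=count by (job) (up == 0)' \
  | python3 -m json.tool
```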
Step 1: Identify Which Targets Are Down and Read the Error¶
Why: Prometheus shows the exact scrape error for each down target. Reading it carefully will often tell you the cause immediately — "connection refused", "context deadline exceeded", and "no route to host" each point to different root causes.
```shell
# Check the targets page for error messages
# In the Prometheus UI: Status > Targets (http://localhost:9090/targets)
# Filter by "unhealthy" to see only failing targets

# Alternatively, query Prometheus directly for down targets
curl -s 'http://localhost:9090/api/v1/query?query=up==0' | python3 -m json.tool | grep -A5 '"metric"'
```

Expected output (one entry per down target):

```json
{
  "metric": {
    "__name__": "up",
    "instance": "10.0.1.45:8080",
    "job": "myapp",
    "namespace": "production"
  },
  "value": [1710000000, "0"]
}
```
Note the instance (host:port), job, and namespace labels — you will need these in subsequent steps.
If this fails: Port-forward may have died. Re-run kubectl port-forward -n monitoring svc/prometheus-operated 9090:9090 and try again.
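The per-target error strings can also be pulled straight from the targets API instead of the UI (a sketch; assumes jq is installed and the port-forward is still running):

```shell
# Print job, scrape URL, and last scrape error for every down target
curl -s 'http://localhost:9090/api/v1/targets' \
  | jq -r '.data.activeTargets[]
           | select(.health == "down")
           | "\(.labels.job)  \(.scrapeUrl)  ->  \(.lastError)"'
```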
Step 2: Check If the Target Pod/Service Still Exists¶
Why: The most common cause of a target going down is that the pod was deleted or restarted. If the pod is gone, Prometheus cannot scrape it regardless of any other configuration.
```shell
# Replace <NAMESPACE> and <TARGET_LABELS> with values from Step 1
kubectl get pod,svc -n <NAMESPACE> -l <TARGET_LABELS>

# If you are not sure of the labels, describe the ServiceMonitor or check the Prometheus target URL
kubectl get pods -n <NAMESPACE> -o wide | grep <TARGET_HOST_FROM_STEP1>
```

Expected output:

```
NAME                        READY   STATUS    RESTARTS   AGE
pod/myapp-6d8b9c7f4-xkp2m   1/1     Running   0          3h

NAME            TYPE        CLUSTER-IP   EXTERNAL-IP   PORT(S)    AGE
service/myapp   ClusterIP   10.96.1.50   <none>        8080/TCP   3h
```

If the pod is in CrashLoopBackOff or Error state, the problem is the application itself — see crashloopbackoff.md.
Step 3: Test Whether the Metrics Endpoint Is Actually Reachable¶
Why: Even if the pod is running, the /metrics endpoint might be unreachable due to a port misconfiguration, a failed exporter, or the application not having started its metrics server yet.
```shell
# Port-forward directly to the target pod and test the endpoint
kubectl port-forward -n <NAMESPACE> pod/<POD_NAME> <LOCAL_PORT>:<METRICS_PORT> &
curl -s http://localhost:<LOCAL_PORT>/metrics | head -10

# Alternatively, test from inside the cluster using a debug pod
kubectl run debug-curl --image=curlimages/curl:latest --rm -it --restart=Never \
  -- curl -s http://<TARGET_HOST>:<METRICS_PORT>/metrics | head -10
```

Expected output:

```
# HELP go_gc_duration_seconds A summary of the pause duration of garbage collection cycles.
# TYPE go_gc_duration_seconds summary
go_gc_duration_seconds{quantile="0"} 4.9351e-05
...
```

If curl returns connection refused, the application's metrics server is not listening on the expected port. Check the application configuration and the containerPort defined in the pod spec.
Step 4: Check the ServiceMonitor or PodMonitor Selector Configuration¶
Why: Prometheus discovers targets through ServiceMonitor or PodMonitor resources. If the label selectors on the monitor do not match the labels on the service or pod, Prometheus never discovers the target and it silently disappears from the scrape list.
```shell
# Find the ServiceMonitor for this job
kubectl get servicemonitor -n <NAMESPACE>

# Inspect the selector — does it match the labels on the service?
kubectl get servicemonitor -n <NAMESPACE> <MONITOR_NAME> -o yaml

# Compare against the actual service labels
kubectl get svc -n <NAMESPACE> <SERVICE_NAME> --show-labels
```

What to compare:

```
# ServiceMonitor spec.selector.matchLabels:
#   app: myapp
# Service labels:
#   app: myapp   ← must match
```
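The two label sets can be put side by side without scrolling through full YAML (a sketch using the same placeholders as above; note that the port name in the ServiceMonitor's `spec.endpoints[].port` must also match a named port on the Service):

```shell
# ServiceMonitor selector vs. actual Service labels: these maps must intersect
kubectl get servicemonitor -n <NAMESPACE> <MONITOR_NAME> \
  -o jsonpath='{.spec.selector.matchLabels}{"\n"}'
kubectl get svc -n <NAMESPACE> <SERVICE_NAME> \
  -o jsonpath='{.metadata.labels}{"\n"}'
```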
Step 5: Check Prometheus RBAC — Can It Access the Target Namespace?¶
Why: Prometheus uses a ServiceAccount to list pods and services. If the bound ClusterRole (or a namespace-scoped Role) does not grant those permissions for the target namespace, Prometheus cannot discover or scrape targets there.
```shell
# Check what ClusterRoleBindings exist for Prometheus
kubectl get clusterrolebinding | grep prometheus

# Inspect the ClusterRole to see what permissions are granted
kubectl describe clusterrole <PROMETHEUS_CLUSTERROLE_NAME>

# Verify the ServiceAccount Prometheus uses
kubectl get pod -n monitoring -l app=prometheus -o jsonpath='{.items[0].spec.serviceAccountName}'
```

Expected output (abridged):

```
Name:          prometheus-k8s
PolicyRule:
  Resources      Non-Resource URLs  Resource Names  Verbs
  ---------      -----------------  --------------  -----
  endpoints/...  []                 []              [get list watch]
  nodes/...      []                 []              [get list watch]
  pods/...       []                 []              [get list watch]
  services/...   []                 []              [get list watch]
```

If the ClusterRole is missing get/list/watch on endpoints, pods, or services, Prometheus cannot discover scrape targets. File a change request to update the RBAC — do not modify ClusterRoles without change management.
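Whether the ServiceAccount actually has discovery permissions in the target namespace can be checked directly with impersonation (a sketch; substitute the ServiceAccount name found above):

```shell
# Each line should print "yes"; a "no" confirms the RBAC gap
for r in pods services endpoints; do
  kubectl auth can-i list "$r" -n <NAMESPACE> \
    --as=system:serviceaccount:monitoring:<PROMETHEUS_SA_NAME>
done
```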
Step 6: Check Network Policies Blocking the Scrape Port¶
Why: If a NetworkPolicy restricts ingress to the target pod, Prometheus cannot reach it even if discovery works correctly. This is a common oversight when hardening namespaces.
```shell
# List NetworkPolicies in the target namespace
kubectl get networkpolicy -n <NAMESPACE>

# Check if any policy restricts ingress on the metrics port
kubectl describe networkpolicy -n <NAMESPACE>

# Specifically, check if traffic from the monitoring namespace is allowed
kubectl get networkpolicy -n <NAMESPACE> -o yaml | grep -A20 'ingress'
```

A policy that permits scraping looks like this:

```yaml
ingress:
  - from:
      - namespaceSelector:
          matchLabels:
            kubernetes.io/metadata.name: monitoring
    ports:
      - port: 8080
        protocol: TCP
```

If no policy allows traffic from the monitoring namespace, add one. Coordinate with the security team to ensure the change is reviewed.
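Whether a policy actually blocks Prometheus can be confirmed empirically by curling the target from a throwaway pod in the monitoring namespace, which is subject to the same policies as Prometheus (a sketch; reuses the Step 3 placeholders):

```shell
# Runs in the monitoring namespace, so it hits the same NetworkPolicies as the Prometheus scrape
kubectl run np-test -n monitoring --image=curlimages/curl:latest --rm -it --restart=Never \
  -- curl -s -m 5 http://<TARGET_HOST>:<METRICS_PORT>/metrics | head -5
# A timeout here, while Step 3's in-cluster curl succeeds from the target namespace,
# points at a NetworkPolicy blocking cross-namespace ingress
```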
Verification¶
```shell
# Confirm the target is back up in Prometheus
curl -s 'http://localhost:9090/api/v1/query?query=up%7Bjob%3D"<JOB_NAME>"%7D' | python3 -m json.tool | grep '"value"'
```

Expected: "value": [..., "1"], i.e. up == 1.

If still broken: Escalate — see below.
Escalation¶
| Condition | Who to Page | What to Say |
|---|---|---|
| Not resolved in 30 min | Platform / Observability team | "Prometheus target |
| Data loss suspected | Observability lead | "Metrics gap for |
| Scope expanding | Platform team | "Multiple Prometheus targets down across namespaces — possible Prometheus RBAC revocation or network-level event" |
Post-Incident¶
- Update monitoring if alert was noisy or missing
- File postmortem if P1/P2
- Update this runbook if steps were wrong or incomplete
- Add a `PrometheusTargetMissing` alert (distinct from `up == 0`) to catch targets that disappear entirely from the scrape list
- Validate ServiceMonitor selectors in CI/CD — a label typo should fail before deployment
- Document which namespaces require NetworkPolicy adjustments for monitoring access
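A sketch of the suggested `PrometheusTargetMissing` rule (the group name, threshold, and labels here are illustrative, not an existing rule):

```yaml
groups:
  - name: target-presence
    rules:
      - alert: PrometheusTargetMissing
        # Fires when a job that had scrape targets an hour ago now has none at all.
        # up == 0 never fires for these, because the series disappears with the target.
        expr: count by (job) (up offset 1h) unless count by (job) (up)
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: "Job {{ $labels.job }} has no scrape targets (it had some 1h ago)"
```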
Common Mistakes¶
- Checking the wrong namespace: Prometheus may be in `monitoring` but the target is in `production`. Always check the namespace shown in the alert/target label, not where Prometheus lives.
- Label selectors on the ServiceMonitor not matching pod labels: The selector must match the Service labels, not the Pod labels directly (for ServiceMonitor). Check both the Service and the ServiceMonitor to confirm alignment.
- Ignoring the scrape error message in the Prometheus targets page: The error message usually tells you exactly what is wrong ("connection refused" = app not listening, "context deadline exceeded" = network/firewall issue, "403 Forbidden" = RBAC issue). Read it before investigating anything else.
- Assuming the target is down when it may have just been replaced: If a pod was restarted, its IP address changes. Prometheus updates automatically within the next scrape interval. Wait 30-60 seconds before declaring it a problem.
Cross-References¶
- Topic Pack: Prometheus and ServiceMonitor Configuration (deep background on service discovery, scrape configs, and RBAC)
- Related Runbook: grafana-blank.md — missing metrics surface in Grafana as blank panels
- Related Runbook: alert-storm.md — many targets going down simultaneously triggers an alert storm
- Troubleshooting Guide: `training/library/guides/troubleshooting.md` (ServiceMonitor section)
- Observability Guide: `training/library/guides/observability.md`
- Lab: `training/interactive/runtime-labs/lab-runtime-03-observability-target-down/`
- Interview Scenario: `training/interview-scenarios/03-prometheus-target-down.md`
- Incident Scenarios: `training/interactive/incidents/scenarios/prometheus-target-down.sh`, `training/interactive/incidents/scenarios/obs-target-down.sh`