---
tags:
  - observability
  - l1
  - runbook
  - prometheus
---

Portal | Level: L1: Foundations | Topics: Prometheus | Domain: Observability
Runbook: Prometheus Target Down¶
| Field | Value |
|---|---|
| Domain | Observability |
| Alert | up == 0 for any scrape target for >2 minutes |
| Severity | P2 |
| Est. Resolution Time | 15-30 minutes |
| Escalation Timeout | 30 minutes — page if not resolved |
| Last Tested | 2026-03-19 |
| Prerequisites | kubectl access, Prometheus UI access (port-forward if needed), access to the scraped service's namespace |
Quick Assessment (30 seconds)¶
```shell
# Run this first — it tells you the scope of the problem
kubectl port-forward -n monitoring svc/prometheus-operated 9090:9090 &
# Then open http://localhost:9090/targets in your browser
```
If the output shows:

- One target DOWN with a specific error message → Continue to Step 1 with that target in focus
- All or many targets DOWN simultaneously → Likely a Prometheus configuration or RBAC problem; skip to Step 5
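The same scope check can be done without the UI by counting down targets per job (a sketch; assumes the port-forward above is running):

```shell
# Count down targets broken out by job: one job affected vs. many is the key triage signal
curl -s -G 'http://localhost:9090/api/v1/query' \
  --data-urlencode 'query=count by (job) (up == 0)' \
  | python3 -m json.tool
```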
Step 1: Identify Which Targets Are Down and Read the Error¶
Why: Prometheus shows the exact scrape error for each down target. Reading it carefully will often tell you the cause immediately — "connection refused", "context deadline exceeded", and "no route to host" each point to different root causes.
```shell
# Check the targets page for error messages
# In the Prometheus UI: Status > Targets (http://localhost:9090/targets)
# Filter by "unhealthy" to see only failing targets

# Alternatively, query Prometheus directly for down targets
curl -s 'http://localhost:9090/api/v1/query?query=up==0' | python3 -m json.tool | grep -A5 '"metric"'
```

Expected output (one entry per down target):

```json
{
  "metric": {
    "__name__": "up",
    "instance": "10.0.1.45:8080",
    "job": "myapp",
    "namespace": "production"
  },
  "value": [1710000000, "0"]
}
```
Note the instance (host:port), job, and namespace labels — you will need these in subsequent steps.
If this fails: Port-forward may have died. Re-run kubectl port-forward -n monitoring svc/prometheus-operated 9090:9090 and try again.
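The per-target error strings can also be pulled straight from the targets API instead of the UI (a sketch; assumes jq is installed and the port-forward is still running):

```shell
# Print job, scrape URL, and last scrape error for every down target
curl -s 'http://localhost:9090/api/v1/targets' \
  | jq -r '.data.activeTargets[]
           | select(.health == "down")
           | "\(.labels.job)  \(.scrapeUrl)  ->  \(.lastError)"'
```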
Step 2: Check If the Target Pod/Service Still Exists¶
Why: The most common cause of a target going down is that the pod was deleted or restarted. If the pod is gone, Prometheus cannot scrape it regardless of any other configuration.
```shell
# Replace <NAMESPACE> and <TARGET_LABELS> with values from Step 1
kubectl get pod,svc -n <NAMESPACE> -l <TARGET_LABELS>

# If you are not sure of the labels, describe the ServiceMonitor or check the Prometheus target URL
kubectl get pods -n <NAMESPACE> -o wide | grep <TARGET_HOST_FROM_STEP1>
```

Expected output:

```
NAME                        READY   STATUS    RESTARTS   AGE
pod/myapp-6d8b9c7f4-xkp2m   1/1     Running   0          3h

NAME            TYPE        CLUSTER-IP   EXTERNAL-IP   PORT(S)    AGE
service/myapp   ClusterIP   10.96.1.50   <none>        8080/TCP   3h
```

If the pod is in CrashLoopBackOff or Error state, the problem is the application itself — see crashloopbackoff.md.
Step 3: Test Whether the Metrics Endpoint Is Actually Reachable¶
Why: Even if the pod is running, the /metrics endpoint might be unreachable due to a port misconfiguration, a failed exporter, or the application not having started its metrics server yet.
```shell
# Port-forward directly to the target pod and test the endpoint
kubectl port-forward -n <NAMESPACE> pod/<POD_NAME> <LOCAL_PORT>:<METRICS_PORT> &
curl -s http://localhost:<LOCAL_PORT>/metrics | head -10

# Alternatively, test from inside the cluster using a debug pod
kubectl run debug-curl --image=curlimages/curl:latest --rm -it --restart=Never \
  -- curl -s http://<TARGET_HOST>:<METRICS_PORT>/metrics | head -10
```

Expected output:

```
# HELP go_gc_duration_seconds A summary of the pause duration of garbage collection cycles.
# TYPE go_gc_duration_seconds summary
go_gc_duration_seconds{quantile="0"} 4.9351e-05
...
```

If curl returns connection refused, the application's metrics server is not listening on the expected port. Check the application configuration and the containerPort defined in the pod spec.
Step 4: Check the ServiceMonitor or PodMonitor Selector Configuration¶
Why: Prometheus discovers targets through ServiceMonitor or PodMonitor resources. If the label selectors on the monitor do not match the labels on the service or pod, Prometheus never discovers the target and it silently disappears from the scrape list.
```shell
# Find the ServiceMonitor for this job
kubectl get servicemonitor -n <NAMESPACE>

# Inspect the selector — does it match the labels on the service?
kubectl get servicemonitor -n <NAMESPACE> <MONITOR_NAME> -o yaml

# Compare against the actual service labels
kubectl get svc -n <NAMESPACE> <SERVICE_NAME> --show-labels
```

What to compare:

```
# ServiceMonitor spec.selector.matchLabels:
#   app: myapp
# Service labels:
#   app: myapp   ← must match
```
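The two label sets can be put side by side without scrolling through full YAML (a sketch using the same placeholders as above; note that the port name in the ServiceMonitor's `spec.endpoints[].port` must also match a named port on the Service):

```shell
# ServiceMonitor selector vs. actual Service labels: these maps must intersect
kubectl get servicemonitor -n <NAMESPACE> <MONITOR_NAME> \
  -o jsonpath='{.spec.selector.matchLabels}{"\n"}'
kubectl get svc -n <NAMESPACE> <SERVICE_NAME> \
  -o jsonpath='{.metadata.labels}{"\n"}'
```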
Step 5: Check Prometheus RBAC — Can It Access the Target Namespace?¶
Why: Prometheus uses a ServiceAccount to list pods and services. If the bound ClusterRole (or a namespace-scoped Role) does not grant those permissions for the target namespace, Prometheus cannot discover or scrape targets there.
```shell
# Check what ClusterRoleBindings exist for Prometheus
kubectl get clusterrolebinding | grep prometheus

# Inspect the ClusterRole to see what permissions are granted
kubectl describe clusterrole <PROMETHEUS_CLUSTERROLE_NAME>

# Verify the ServiceAccount Prometheus uses
kubectl get pod -n monitoring -l app=prometheus -o jsonpath='{.items[0].spec.serviceAccountName}'
```

Expected output (abridged):

```
Name:          prometheus-k8s
PolicyRule:
  Resources      Non-Resource URLs  Resource Names  Verbs
  ---------      -----------------  --------------  -----
  endpoints/...  []                 []              [get list watch]
  nodes/...      []                 []              [get list watch]
  pods/...       []                 []              [get list watch]
  services/...   []                 []              [get list watch]
```

If the ClusterRole is missing get/list/watch on endpoints, pods, or services, Prometheus cannot discover scrape targets. File a change request to update the RBAC — do not modify ClusterRoles without change management.
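Whether the ServiceAccount actually has discovery permissions in the target namespace can be checked directly with impersonation (a sketch; substitute the ServiceAccount name found above):

```shell
# Each line should print "yes"; a "no" confirms the RBAC gap
for r in pods services endpoints; do
  kubectl auth can-i list "$r" -n <NAMESPACE> \
    --as=system:serviceaccount:monitoring:<PROMETHEUS_SA_NAME>
done
```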
Step 6: Check Network Policies Blocking the Scrape Port¶
Why: If a NetworkPolicy restricts ingress to the target pod, Prometheus cannot reach it even if discovery works correctly. This is a common oversight when hardening namespaces.
```shell
# List NetworkPolicies in the target namespace
kubectl get networkpolicy -n <NAMESPACE>

# Check if any policy restricts ingress on the metrics port
kubectl describe networkpolicy -n <NAMESPACE>

# Specifically, check if traffic from the monitoring namespace is allowed
kubectl get networkpolicy -n <NAMESPACE> -o yaml | grep -A20 'ingress'
```

A policy that permits scraping looks like this:

```yaml
ingress:
  - from:
      - namespaceSelector:
          matchLabels:
            kubernetes.io/metadata.name: monitoring
    ports:
      - port: 8080
        protocol: TCP
```

If no policy allows traffic from the monitoring namespace, add one. Coordinate with the security team to ensure the change is reviewed.
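Whether a policy actually blocks Prometheus can be confirmed empirically by curling the target from a throwaway pod in the monitoring namespace, which is subject to the same policies as Prometheus (a sketch; reuses the Step 3 placeholders):

```shell
# Runs in the monitoring namespace, so it hits the same NetworkPolicies as the Prometheus scrape
kubectl run np-test -n monitoring --image=curlimages/curl:latest --rm -it --restart=Never \
  -- curl -s -m 5 http://<TARGET_HOST>:<METRICS_PORT>/metrics | head -5
# A timeout here, while Step 3's in-cluster curl succeeds from the target namespace,
# points at a NetworkPolicy blocking cross-namespace ingress
```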
Verification¶
```shell
# Confirm the target is back up in Prometheus
curl -s 'http://localhost:9090/api/v1/query?query=up%7Bjob%3D"<JOB_NAME>"%7D' | python3 -m json.tool | grep '"value"'
```

Expected: "value": [..., "1"], i.e. up == 1.

If still broken: Escalate — see below.
Escalation¶
| Condition | Who to Page | What to Say |
|---|---|---|
| Not resolved in 30 min | Platform / Observability team | "Prometheus target |
| Data loss suspected | Observability lead | "Metrics gap for |
| Scope expanding | Platform team | "Multiple Prometheus targets down across namespaces — possible Prometheus RBAC revocation or network-level event" |
Post-Incident¶
- Update monitoring if alert was noisy or missing
- File postmortem if P1/P2
- Update this runbook if steps were wrong or incomplete
- Add a `PrometheusTargetMissing` alert (distinct from `up == 0`) to catch targets that disappear entirely from the scrape list
- Validate ServiceMonitor selectors in CI/CD — a label typo should fail before deployment
- Document which namespaces require NetworkPolicy adjustments for monitoring access
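A sketch of the suggested `PrometheusTargetMissing` rule (the group name, threshold, and labels here are illustrative, not an existing rule):

```yaml
groups:
  - name: target-presence
    rules:
      - alert: PrometheusTargetMissing
        # Fires when a job that had scrape targets an hour ago now has none at all.
        # up == 0 never fires for these, because the series disappears with the target.
        expr: count by (job) (up offset 1h) unless count by (job) (up)
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: "Job {{ $labels.job }} has no scrape targets (it had some 1h ago)"
```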
Common Mistakes¶
- Checking the wrong namespace: Prometheus may be in `monitoring` but the target is in `production`. Always check the namespace shown in the alert/target label, not where Prometheus lives.
- Label selectors on the ServiceMonitor not matching pod labels: The selector must match the Service labels, not the Pod labels directly (for ServiceMonitor). Check both the Service and the ServiceMonitor to confirm alignment.
- Ignoring the scrape error message in the Prometheus targets page: The error message usually tells you exactly what is wrong ("connection refused" = app not listening, "context deadline exceeded" = network/firewall issue, "403 Forbidden" = RBAC issue). Read it before investigating anything else.
- Assuming the target is down when it may have just been replaced: If a pod was restarted, its IP address changes. Prometheus updates automatically within the next scrape interval. Wait 30-60 seconds before declaring it a problem.
Cross-References¶
- Topic Pack: Prometheus and ServiceMonitor Configuration (deep background on service discovery, scrape configs, and RBAC)
- Related Runbook: grafana-blank.md — missing metrics surface in Grafana as blank panels
- Related Runbook: alert-storm.md — many targets going down simultaneously triggers an alert storm
- Troubleshooting Guide: `training/library/guides/troubleshooting.md` (ServiceMonitor section)
- Observability Guide: `training/library/guides/observability.md`
- Lab: `training/interactive/runtime-labs/lab-runtime-03-observability-target-down/`
- Interview Scenario: `training/interview-scenarios/03-prometheus-target-down.md`
- Incident Scenarios: `training/interactive/incidents/scenarios/prometheus-target-down.sh`, `training/interactive/incidents/scenarios/obs-target-down.sh`