
On-Call Survival: Observability

Print this. Pin it. Read it at 3 AM.


Alert: Prometheus Scrape Target Down

Severity: P2 (single target) / P1 (all targets)

First command:

# Open Prometheus UI → Status → Targets  OR:
kubectl port-forward -n monitoring svc/prometheus 9090:9090
# Then visit http://localhost:9090/targets
What you're looking for: Which targets are DOWN; the error message next to them (connection refused, 404, timeout).

Decision tree:

Is the target pod running?
├── No → Fix the pod first (see Kubernetes guide).
└── Yes → Is it "connection refused"?
    ├── Yes → Is the metrics port actually open?
             kubectl exec -n <ns> <pod> -- wget -O- http://localhost:<port>/metrics | head -5
             Not open? → App misconfigured; escalate to app team.
    └── No → Is it 403/401?
        ├── Yes → Authentication config mismatch in Prometheus scrape config.
                 kubectl get secret prometheus-scrape-token -n monitoring (or equivalent)
        └── No → Is it a network policy blocking Prometheus?
                 kubectl get netpol -n <ns> | grep -i prom
                 Escalate: "Prometheus cannot reach target <pod>, error: <paste>"
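If you'd rather triage from the terminal than the UI, a small helper can pull the same list the Targets page shows. A minimal Python sketch (the function name `down_targets` is ours, not part of any tool; field names follow Prometheus's documented /api/v1/targets response, and it assumes the port-forward above is running):

```python
import json
from urllib.request import urlopen

def down_targets(api_response):
    """Return (scrapeUrl, lastError) pairs for every target whose health is 'down'."""
    return [
        (t["scrapeUrl"], t["lastError"])
        for t in api_response["data"]["activeTargets"]
        if t["health"] == "down"
    ]

# With the port-forward running:
#   with urlopen("http://localhost:9090/api/v1/targets") as resp:
#       for url, err in down_targets(json.load(resp)):
#           print("DOWN", url, err)
```

The `lastError` string is exactly what the UI shows next to a DOWN target (connection refused, 404, timeout), so it feeds straight into the decision tree above.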

Escalation trigger: All targets down (Prometheus itself broken); more than 50% of targets down; SLO breach due to missing metrics.

Safe actions: Check target status in UI, kubectl exec to test metrics endpoint — read-only.

Dangerous actions: Edit Prometheus scrape config (may silence all alerts), restart Prometheus (gap in data).


Alert: Alert Storm (> 20 alerts firing simultaneously)

Severity: P1

First command:

# Alertmanager UI → Alerts (grouped by alertname and severity)
kubectl port-forward -n monitoring svc/alertmanager 9093:9093
# Visit http://localhost:9093
What you're looking for: Which alertname is generating the most instances. Is it one rule with many labels, or many distinct problems?

Decision tree:

Are all alerts from a single alertname (e.g., InstanceDown)?
├── Yes → Single rule, multiple instances. Is it a real outage or a flapping check?
         Check if instances are actually down: kubectl get pods -A | grep -v Running
         Real outage → Treat as cluster-level incident; escalate to on-call lead.
└── No → Are they all from one service's metrics being missing?
    ├── Yes → Scrape target down? See "Prometheus Scrape Target Down" above.
    └── No → Cascade of real alerts from a shared dependency?
             (e.g., database down → all services failing)
             Identify the root-cause service, silence dependent alerts temporarily:
             amtool silence add --alertmanager.url=http://localhost:9093 \
               alertname=~"DependentAlert" --comment="Investigating root cause" --duration=30m
             Escalate: "Alert storm from <root-cause>; silenced dependents for 30m"
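The "which alertname dominates?" question can be answered with a few lines of Python over the JSON list Alertmanager serves at /api/v2/alerts. A sketch (the helper name `storm_profile` is ours; the `labels.alertname` field is the documented alert shape):

```python
from collections import Counter

def storm_profile(alerts):
    """Count firing alerts per alertname, largest group first.

    `alerts` is the JSON list from Alertmanager's /api/v2/alerts endpoint.
    """
    counts = Counter(a["labels"].get("alertname", "<missing>") for a in alerts)
    return counts.most_common()
```

One alertname with dozens of instances points at the single-rule branch of the tree; a long tail of distinct names points at a cascade.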

Escalation trigger: Alert storm is masking a real P1 outage; Alertmanager itself down; PagerDuty flood preventing real escalation.

Safe actions: View alerts in UI, identify patterns, check scrape targets.

Dangerous actions: Silence alerts (amtool silence — may hide real problems), inhibit rules, restart Alertmanager.


Alert: Grafana Dashboard Showing No Data / Blank Panels

Severity: P2

First command:

kubectl get pods -n monitoring | grep grafana
What you're looking for: Grafana pod status. If running, the issue is data source or query.

Decision tree:

Is Grafana pod running?
├── No → kubectl describe pod <grafana-pod> -n monitoring; restart if crashed.
└── Yes → Is the Prometheus data source reachable?
         (Grafana UI → Configuration → Data Sources → Test)
    ├── Data source error → Is Prometheus running?
       kubectl get pods -n monitoring | grep prometheus
       Prometheus down? → kubectl rollout restart deploy/prometheus -n monitoring
    └── Data source OK → Is the time range correct?
        ├── Dashboard showing future or wrong time? → Fix browser timezone / Grafana time picker.
        └── Query returning no data → Check PromQL in panel: is the metric name correct?
                                       Prometheus UI → Graph → run the query manually.
                                       Escalate if valid metric returns no data: "Metric <name> missing since <time>"
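When you run the panel's query manually, the distinction that matters is "query succeeded but returned nothing" versus "query failed". A tiny Python check over the documented /api/v1/query response shape makes that explicit (`query_has_data` is our name, not a Prometheus API):

```python
def query_has_data(api_response):
    """True if a Prometheus /api/v1/query response contains at least one sample."""
    return (
        api_response.get("status") == "success"
        and bool(api_response.get("data", {}).get("result"))
    )
```

`status: success` with an empty `result` means the metric genuinely has no samples in range, i.e., the "metric missing" escalation case, not a broken query.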

Escalation trigger: Prometheus down and cannot be restarted; persistent storage (Grafana SQLite/Postgres) lost; all dashboards blank during active incident.

Safe actions: Check pod status, test data source connection, run PromQL queries manually.

Dangerous actions: Restart Prometheus (data gap), delete Grafana persistent data, change retention settings.


Alert: Loki / Log Pipeline Gap

Severity: P2

First command:

kubectl get pods -n monitoring | grep -E "loki|promtail|alloy|agent"
What you're looking for: Are Loki and log shippers (Promtail/Alloy) running?

Decision tree:

Is Loki running?
├── No → kubectl describe pod <loki-pod> -n monitoring; check for OOM or storage issues.
        kubectl rollout restart deploy/loki -n monitoring (if no data loss risk)
└── Yes → Is Promtail/Alloy running on each node?
    ├── No (DaemonSet pods missing) → kubectl get ds -n monitoring; kubectl rollout restart ds/promtail -n monitoring
    └── Yes → Are logs being ingested? (Loki UI → Explore → {job="<name>"} → last entry?)
              Recent gap? → Check Promtail logs: kubectl logs -n monitoring -l app=promtail --tail=50
              Permission error on /var/log? → Check DaemonSet volume mounts.
              Escalate: "No logs in Loki for <service> since <time>; Promtail logs: <paste>"
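To put a number on "recent gap", you can compute the age of the newest entry from a Loki query_range response. A sketch (`last_log_age_seconds` is our helper name; Loki's documented response carries entries as `[timestamp, line]` pairs with nanosecond-epoch string timestamps):

```python
from datetime import datetime, timezone

def last_log_age_seconds(loki_response, now_ts=None):
    """Seconds since the newest entry in a Loki query_range response."""
    newest_ns = max(
        int(ts)
        for stream in loki_response["data"]["result"]
        for ts, _line in stream["values"]
    )
    if now_ts is None:
        now_ts = datetime.now(timezone.utc).timestamp()
    return now_ts - newest_ns / 1e9
```

An age of minutes for a chatty service is your gap; use that timestamp in the escalation message.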

Escalation trigger: Loki storage full (disk or object store quota); log shipper crash-looping on all nodes; persistent log gap during active security incident.

Safe actions: Check pod status, read Promtail logs, query Loki for recent entries.

Dangerous actions: Restart Loki (brief query gap), compact Loki chunks, change retention.


Alert: Alertmanager Not Sending Notifications

Severity: P1 (people not getting paged)

First command:

kubectl logs -n monitoring -l app=alertmanager --tail=50
What you're looking for: Errors sending to PagerDuty/Slack — connection refused, 401 unauthorized, integration key invalid.

Decision tree:

Are there auth errors in logs (401, invalid key)?
├── Yes → Check the integration key/token in the Alertmanager secret:
         kubectl get secret alertmanager-config -n monitoring -o yaml | grep -i key
         Rotate the key if compromised, update the secret, then rollout-restart Alertmanager.
└── No → Is it a network error (cannot reach PagerDuty/Slack URL)?
    ├── Yes → Cluster egress blocked? Test: kubectl exec -n monitoring <pod> -- curl -I https://events.pagerduty.com
             Network policy? Proxy required? Escalate to infra.
    └── No → Is the config YAML malformed?
             amtool check-config alertmanager.yml (extract it from the alertmanager-config secret first)
             Invalid config? → Fix and rollout-restart Alertmanager.
             Escalate: "Alertmanager not paging, logs: <paste>"
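Skimming 50 log lines at 3 AM is error-prone; a rough bucketing script can do the auth-vs-network split for you. A sketch with illustrative patterns only (real notifier error strings vary by integration and Alertmanager version):

```python
import re

# Illustrative patterns -- adjust to what your notifiers actually log.
AUTH_ERR = re.compile(r"401|403|unauthorized|invalid.*key", re.IGNORECASE)
NET_ERR = re.compile(r"connection refused|timeout|no such host", re.IGNORECASE)

def classify_notifier_errors(log_lines):
    """Bucket Alertmanager log lines into auth, network, and other."""
    buckets = {"auth": [], "network": [], "other": []}
    for line in log_lines:
        if AUTH_ERR.search(line):
            buckets["auth"].append(line)
        elif NET_ERR.search(line):
            buckets["network"].append(line)
        else:
            buckets["other"].append(line)
    return buckets
```

Feed it the output of the `kubectl logs` command above; a non-empty `auth` bucket sends you down the key-rotation branch, `network` down the egress branch.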

Escalation trigger: Nobody is getting paged for active P1 incidents; Alertmanager crash-looping; cannot reach notification endpoints.

Safe actions: Read Alertmanager logs, validate config with amtool check-config.

Dangerous actions: Edit Alertmanager config (can silence all notifications), restart Alertmanager.


Quick Reference

Most Useful Commands

# Prometheus targets (check for DOWN)
kubectl port-forward -n monitoring svc/prometheus 9090:9090 &
# → http://localhost:9090/targets

# Alertmanager alerts
kubectl port-forward -n monitoring svc/alertmanager 9093:9093 &
# → http://localhost:9093

# All monitoring stack pods
kubectl get pods -n monitoring

# Alertmanager logs
kubectl logs -n monitoring -l app=alertmanager --tail=100

# Prometheus logs
kubectl logs -n monitoring -l app=prometheus --tail=100

# Promtail logs (log shipper)
kubectl logs -n monitoring -l app=promtail --tail=50

# Test a PromQL query
curl -G 'http://localhost:9090/api/v1/query' --data-urlencode 'query=up{job="grokdevops"}'

# Add a silence (30 min)
amtool silence add --alertmanager.url=http://localhost:9093 \
  alertname="<name>" --comment="investigating" --duration=30m

# List active silences
amtool silence query --alertmanager.url=http://localhost:9093

# Restart monitoring components (careful — brief gap)
kubectl rollout restart deploy/prometheus -n monitoring
kubectl rollout restart deploy/grafana -n monitoring
kubectl rollout restart deploy/alertmanager -n monitoring

Escalation Contacts

Situation                  Team                          Channel
Prometheus storage full    Infra / Platform              #infra-oncall
Nobody getting paged       On-call lead (immediately)    Direct page
Alert storm, P1 cascade    On-call lead                  #incidents
Loki data loss             Platform                      #infra-oncall

Safe vs Dangerous Actions

Safe (do without asking)          Dangerous (get approval)
Read logs from monitoring pods    Restart Prometheus
View targets, alerts in UI        Silence alerts
Run PromQL queries                Edit Alertmanager config
Test data source connection       Delete Grafana state
List active silences              Change retention settings

Shift Handoff Template

Status: [GREEN/YELLOW/RED]
Active incidents: [none / description]
Recent deploys: [list from last 24h]
Known flaky alerts: [list]
Things to watch: [anything unusual]