On-Call Survival: Observability¶
Print this. Pin it. Read it at 3 AM.
Alert: Prometheus Scrape Target Down¶
Severity: P2 (single target) / P1 (all targets)
First command:
# Open Prometheus UI → Status → Targets OR:
kubectl port-forward -n monitoring svc/prometheus 9090:9090
# Then visit http://localhost:9090/targets
What you're looking for: Targets marked DOWN; the error message next to them (connection refused, 404, timeout).
Decision tree:
Is the target pod running?
├── No → Fix the pod first (see Kubernetes guide).
└── Yes → Is it "connection refused"?
├── Yes → Is the metrics port actually open?
│ kubectl exec -n <ns> <pod> -- wget -O- http://localhost:<port>/metrics | head -5
│ Not open? → App misconfigured; escalate to app team.
└── No → Is it 403/401?
├── Yes → Authentication config mismatch in Prometheus scrape config.
│ kubectl get secret prometheus-scrape-token -n monitoring (or equivalent)
└── No → Is it a network policy blocking Prometheus?
kubectl get netpol -n <ns> | grep -i prom
Escalate: "Prometheus cannot reach target <pod>, error: <paste>"
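The DOWN check can also be run from the terminal instead of the UI. A minimal sketch, assuming the port-forward from the first command is running and python3 is available on your workstation; `down_targets` is an illustrative helper name, not a standard tool:

```shell
# down_targets: read a Prometheus /api/v1/query response for `up == 0`
# from stdin and print "job instance" for each down target.
down_targets() {
  python3 -c '
import json, sys
data = json.load(sys.stdin)
for r in data.get("data", {}).get("result", []):
    m = r.get("metric", {})
    print(m.get("job", "?"), m.get("instance", "?"))
'
}

# Usage (with the port-forward from the first command running):
# curl -sG "http://localhost:9090/api/v1/query" \
#   --data-urlencode "query=up == 0" | down_targets
```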
Escalation trigger: All targets down (Prometheus itself broken); more than 50% of targets down; SLO breach due to missing metrics.
Safe actions: Check target status in UI, kubectl exec to test metrics endpoint — read-only.
Dangerous actions: Edit Prometheus scrape config (may silence all alerts), restart Prometheus (gap in data).
Alert: Alert Storm (> 20 alerts firing simultaneously)¶
Severity: P1
First command:
# Alertmanager UI → Alerts (grouped by alertname and severity)
kubectl port-forward -n monitoring svc/alertmanager 9093:9093
# Visit http://localhost:9093
Decision tree:
Are all alerts from a single alertname (e.g., InstanceDown)?
├── Yes → Single rule, multiple instances. Is it a real outage or a flapping check?
│ Check if instances are actually down: kubectl get pods -A | grep -v Running
│ Real outage → Treat as cluster-level incident; escalate to on-call lead.
└── No → Are they all from one service's metrics being missing?
├── Yes → Scrape target down? See "Prometheus Scrape Target Down" above.
└── No → Cascade of real alerts from a shared dependency?
(e.g., database down → all services failing)
Identify root cause service, silence dependent alerts temporarily:
amtool silence add --alertmanager.url=http://localhost:9093 \
alertname=~"DependentAlert" --comment="Investigating root cause" --duration=30m
Escalate: "Alert storm from <root-cause>; silenced dependents for 30m"
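To see the shape of a storm at a glance, group firing alerts by name straight from the Alertmanager API. A sketch, assuming python3 on the workstation; `alert_counts` is an illustrative helper name:

```shell
# alert_counts: read an Alertmanager /api/v2/alerts response (a JSON array)
# from stdin and print "count alertname", most frequent first.
alert_counts() {
  python3 -c '
import json, sys
from collections import Counter
alerts = json.load(sys.stdin)
c = Counter(a.get("labels", {}).get("alertname", "?") for a in alerts)
for name, n in c.most_common():
    print(n, name)
'
}

# Usage (with the Alertmanager port-forward running):
# curl -s http://localhost:9093/api/v2/alerts | alert_counts
```

One dominant alertname points at a single flapping rule or cluster-level outage; many distinct names point at a cascade from a shared dependency.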
Escalation trigger: Alert storm is masking a real P1 outage; Alertmanager itself down; PagerDuty flood preventing real escalation.
Safe actions: View alerts in UI, identify patterns, check scrape targets.
Dangerous actions: Silence alerts (amtool silence — may hide real problems), inhibit rules, restart Alertmanager.
Alert: Grafana Dashboard Showing No Data / Blank Panels¶
Severity: P2
First command:
kubectl get pods -n monitoring | grep grafana
What you're looking for: Grafana pod status. If it is running, the issue is the data source or the query.
Decision tree:
Is Grafana pod running?
├── No → kubectl describe pod <grafana-pod> -n monitoring; restart if crashed.
└── Yes → Is the Prometheus data source reachable?
│ (Grafana UI → Configuration → Data Sources → Test)
├── Data source error → Is Prometheus running?
│ kubectl get pods -n monitoring | grep prometheus
│ Prometheus down? → kubectl rollout restart deploy/prometheus -n monitoring
└── Data source OK → Is the time range correct?
├── Dashboard showing future or wrong time? → Fix browser timezone / Grafana time picker.
└── Query returning no data → Check PromQL in panel: is the metric name correct?
Prometheus UI → Graph → run the query manually.
Escalate if valid metric returns no data: "Metric <name> missing since <time>"
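When a panel query returns nothing, it helps to check whether the metric exists in Prometheus at all, independent of Grafana, via the series API. A sketch, assuming python3 on the workstation; `series_count` is an illustrative helper and `http_requests_total` a stand-in metric name:

```shell
# series_count: read a Prometheus /api/v1/series response from stdin and
# print the number of matching series. 0 means the metric name is wrong
# or nothing exposes that metric anymore.
series_count() {
  python3 -c '
import json, sys
print(len(json.load(sys.stdin).get("data", [])))
'
}

# Usage (with the Prometheus port-forward running; swap in your metric):
# curl -sG "http://localhost:9090/api/v1/series" \
#   --data-urlencode "match[]=http_requests_total" | series_count
```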
Escalation trigger: Prometheus down and cannot be restarted; persistent storage (Grafana SQLite/Postgres) lost; all dashboards blank during active incident.
Safe actions: Check pod status, test data source connection, run PromQL queries manually.
Dangerous actions: Restart Prometheus (data gap), delete Grafana persistent data, change retention settings.
Alert: Loki / Log Pipeline Gap¶
Severity: P2
First command:
kubectl get pods -n monitoring | grep -E 'loki|promtail'
What you're looking for: Are Loki and the log shippers (Promtail/Alloy) running?
Decision tree:
Is Loki running?
├── No → kubectl describe pod <loki-pod> -n monitoring; check for OOM or storage issues.
│ kubectl rollout restart deploy/loki -n monitoring (if no data loss risk)
└── Yes → Is Promtail/Alloy running on each node?
├── No (DaemonSet pods missing) → kubectl get ds -n monitoring; kubectl rollout restart ds/promtail -n monitoring
└── Yes → Are logs being ingested? (Loki UI → Explore → {job="<name>"} — last entry?)
Recent gap? → Check Promtail logs: kubectl logs -n monitoring -l app=promtail --tail=50
Permission error on /var/log? → Check DaemonSet volume mounts.
Escalate: "No logs in Loki for <service> since <time>; Promtail logs: <paste>"
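The "last entry" check can be sized from the terminal by asking Loki's query_range API and measuring the age of the newest line. A sketch, assuming a port-forward to Loki on its default port 3100 and python3 on the workstation; `last_log_age` is an illustrative helper name:

```shell
# last_log_age: read a Loki query_range response from stdin and print the
# age in seconds of the newest log line (timestamps are in nanoseconds),
# so you can size the ingestion gap.
last_log_age() {
  python3 -c '
import json, sys, time
data = json.load(sys.stdin)
ts = [int(v[0]) for s in data.get("data", {}).get("result", [])
      for v in s.get("values", [])]
if not ts:
    print("no entries")
else:
    print(int(time.time() - max(ts) / 1e9))
'
}

# Usage:
# curl -sG "http://localhost:3100/loki/api/v1/query_range" \
#   --data-urlencode 'query={job="<name>"}' | last_log_age
```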
Escalation trigger: Loki storage full (disk or object store quota); log shipper crash-looping on all nodes; persistent log gap during active security incident.
Safe actions: Check pod status, read Promtail logs, query Loki for recent entries.
Dangerous actions: Restart Loki (brief query gap), compact Loki chunks, change retention.
Alert: Alertmanager Not Sending Notifications¶
Severity: P1 (people not getting paged)
First command:
kubectl logs -n monitoring -l app=alertmanager --tail=100
What you're looking for: Errors sending to PagerDuty/Slack — connection refused, 401 unauthorized, invalid integration key.
Decision tree:
Are there auth errors in logs (401, invalid key)?
├── Yes → Check the integration key/token in alertmanager secret:
│ kubectl get secret alertmanager-config -n monitoring -o yaml | grep -i key
│ Rotate key if compromised. Update secret. Rollout restart alertmanager.
└── No → Is it a network error (cannot reach PagerDuty/Slack URL)?
├── Yes → Cluster egress blocked? Test: kubectl exec -n monitoring <pod> -- curl -I https://events.pagerduty.com
│ Network policy? Proxy required? Escalate to infra.
└── No → Is the config YAML malformed?
amtool check-config alertmanager.yml (validates a local copy of the config file)
Invalid config? → Fix and rollout restart alertmanager.
Escalate: "Alertmanager not paging, logs: <paste>"
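Scanning the logs for notification failures can be scripted so you paste only the relevant lines into the escalation. A sketch; the grep pattern covers the common failure modes named above and is an assumption, not an exhaustive match on Alertmanager's log format:

```shell
# notify_errors: filter log lines for common notification failure patterns
# (auth errors, bad keys, network failures), case-insensitively.
notify_errors() {
  grep -iE 'notify.*(401|403|invalid|connection refused|timeout)'
}

# Usage:
# kubectl logs -n monitoring -l app=alertmanager --tail=200 | notify_errors
```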
Escalation trigger: Nobody is getting paged for active P1 incidents; Alertmanager crash-looping; cannot reach notification endpoints.
Safe actions: Read Alertmanager logs, validate config with amtool check-config.
Dangerous actions: Edit Alertmanager config (can silence all notifications), restart Alertmanager.
Quick Reference¶
Most Useful Commands¶
# Prometheus targets (check for DOWN)
kubectl port-forward -n monitoring svc/prometheus 9090:9090 &
# → http://localhost:9090/targets
# Alertmanager alerts
kubectl port-forward -n monitoring svc/alertmanager 9093:9093 &
# → http://localhost:9093
# All monitoring stack pods
kubectl get pods -n monitoring
# Alertmanager logs
kubectl logs -n monitoring -l app=alertmanager --tail=100
# Prometheus logs
kubectl logs -n monitoring -l app=prometheus --tail=100
# Promtail logs (log shipper)
kubectl logs -n monitoring -l app=promtail --tail=50
# Test a PromQL query
curl -G 'http://localhost:9090/api/v1/query' --data-urlencode 'query=up{job="grokdevops"}'
# Add a silence (30 min)
amtool silence add --alertmanager.url=http://localhost:9093 \
alertname="<name>" --comment="investigating" --duration=30m
# List active silences
amtool silence query --alertmanager.url=http://localhost:9093
# Restart monitoring components (careful — brief gap)
kubectl rollout restart deploy/prometheus -n monitoring
kubectl rollout restart deploy/grafana -n monitoring
kubectl rollout restart deploy/alertmanager -n monitoring
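A one-liner to spot unhealthy pods across the whole monitoring stack, handy as the very first triage step. `monitoring_health` is an illustrative helper name; it only reads `kubectl get pods` output:

```shell
# monitoring_health: from `kubectl get pods` output, print any pod that is
# not Running or Completed, along with its status column.
monitoring_health() {
  awk 'NR > 1 && $3 != "Running" && $3 != "Completed" { print $1, $3 }'
}

# Usage:
# kubectl get pods -n monitoring | monitoring_health
```

No output means every monitoring pod reports Running; anything printed is where to start.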
Escalation Contacts¶
| Situation | Team | Channel |
|---|---|---|
| Prometheus storage full | Infra / Platform | #infra-oncall |
| Nobody getting paged | On-call lead immediately | Direct page |
| Alert storm, P1 cascade | On-call lead | #incidents |
| Loki data loss | Platform | #infra-oncall |
Safe vs Dangerous Actions¶
| Safe (do without asking) | Dangerous (get approval) |
|---|---|
| Read logs from monitoring pods | Restart Prometheus |
| View targets, alerts in UI | Silence alerts |
| Run PromQL queries | Edit Alertmanager config |
| Test data source connection | Delete Grafana state |
| List active silences | Change retention settings |