
On-Call Survival: Observability

Print this. Pin it. Read it at 3 AM.


Alert: Prometheus Scrape Target Down

Severity: P2 (single target) / P1 (all targets)

First command:

# Open Prometheus UI → Status → Targets  OR:
kubectl port-forward -n monitoring svc/prometheus 9090:9090
# Then visit http://localhost:9090/targets
What you're looking for: Which targets are DOWN; the error message next to them (connection refused, 404, timeout).

Decision tree:

Is the target pod running?
├── No → Fix the pod first (see Kubernetes guide).
└── Yes → Is it "connection refused"?
    ├── Yes → Is the metrics port actually open?
             kubectl exec -n <ns> <pod> -- wget -O- http://localhost:<port>/metrics | head -5
             Not open? → App misconfigured; escalate to app team.
    └── No → Is it 403/401?
        ├── Yes → Authentication config mismatch in Prometheus scrape config.
                 kubectl get secret prometheus-scrape-token -n monitoring (or equivalent)
        └── No → Is it a network policy blocking Prometheus?
                 kubectl get netpol -n <ns> | grep -i prom
                 Escalate: "Prometheus cannot reach target <pod>, error: <paste>"
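If you'd rather triage from the terminal than the UI, a small helper can pull the same list the Targets page shows. A minimal Python sketch (the function name `down_targets` is ours, not part of any tool; field names follow Prometheus's documented /api/v1/targets response, and it assumes the port-forward above is running):

```python
import json
from urllib.request import urlopen

def down_targets(api_response):
    """Return (scrapeUrl, lastError) pairs for every target whose health is 'down'."""
    return [
        (t["scrapeUrl"], t["lastError"])
        for t in api_response["data"]["activeTargets"]
        if t["health"] == "down"
    ]

# With the port-forward running:
#   with urlopen("http://localhost:9090/api/v1/targets") as resp:
#       for url, err in down_targets(json.load(resp)):
#           print("DOWN", url, err)
```

The `lastError` string is exactly what the UI shows next to a DOWN target (connection refused, 404, timeout), so it feeds straight into the decision tree above.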

Escalation trigger: All targets down (Prometheus itself broken); more than 50% of targets down; SLO breach due to missing metrics.

Safe actions: Check target status in UI, kubectl exec to test metrics endpoint — read-only.

Dangerous actions: Edit Prometheus scrape config (may silence all alerts), restart Prometheus (gap in data).


Alert: Alert Storm (> 20 alerts firing simultaneously)

Severity: P1

First command:

# Alertmanager UI → Alerts (grouped by alertname and severity)
kubectl port-forward -n monitoring svc/alertmanager 9093:9093
# Visit http://localhost:9093
What you're looking for: Which alertname is generating the most instances. Is it one rule with many labels, or many distinct problems?

Decision tree:

Are all alerts from a single alertname (e.g., InstanceDown)?
├── Yes → Single rule, multiple instances. Is it a real outage or a flapping check?
         Check if instances are actually down: kubectl get pods -A | grep -v Running
         Real outage → Treat as cluster-level incident; escalate to on-call lead.
└── No → Are they all from one service's metrics being missing?
    ├── Yes → Scrape target down? See "Prometheus Scrape Target Down" above.
    └── No → Cascade of real alerts from a shared dependency?
             (e.g., database down → all services failing)
             Identify the root-cause service, silence dependent alerts temporarily:
             amtool silence add --alertmanager.url=http://localhost:9093 \
               alertname=~"DependentAlert" --comment="Investigating root cause" --duration=30m
             Escalate: "Alert storm from <root-cause>; silenced dependents for 30m"
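The "which alertname dominates?" question can be answered with a few lines of Python over the JSON list Alertmanager serves at /api/v2/alerts. A sketch (the helper name `storm_profile` is ours; the `labels.alertname` field is the documented alert shape):

```python
from collections import Counter

def storm_profile(alerts):
    """Count firing alerts per alertname, largest group first.

    `alerts` is the JSON list from Alertmanager's /api/v2/alerts endpoint.
    """
    counts = Counter(a["labels"].get("alertname", "<missing>") for a in alerts)
    return counts.most_common()
```

One alertname with dozens of instances points at the single-rule branch of the tree; a long tail of distinct names points at a cascade.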

Escalation trigger: Alert storm is masking a real P1 outage; Alertmanager itself down; PagerDuty flood preventing real escalation.

Safe actions: View alerts in UI, identify patterns, check scrape targets.

Dangerous actions: Silence alerts (amtool silence — may hide real problems), inhibit rules, restart Alertmanager.


Alert: Grafana Dashboard Showing No Data / Blank Panels

Severity: P2

First command:

kubectl get pods -n monitoring | grep grafana
What you're looking for: Grafana pod status. If running, the issue is data source or query.

Decision tree:

Is Grafana pod running?
├── No → kubectl describe pod <grafana-pod> -n monitoring; restart if crashed.
└── Yes → Is the Prometheus data source reachable?
         (Grafana UI → Configuration → Data Sources → Test)
    ├── Data source error → Is Prometheus running?
       kubectl get pods -n monitoring | grep prometheus
       Prometheus down? → kubectl rollout restart deploy/prometheus -n monitoring
    └── Data source OK → Is the time range correct?
        ├── Dashboard showing future or wrong time? → Fix browser timezone / Grafana time picker.
        └── Query returning no data → Check PromQL in panel: is the metric name correct?
                                       Prometheus UI → Graph → run the query manually.
                                       Escalate if valid metric returns no data: "Metric <name> missing since <time>"
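When you run the panel's query manually, the distinction that matters is "query succeeded but returned nothing" versus "query failed". A tiny Python check over the documented /api/v1/query response shape makes that explicit (`query_has_data` is our name, not a Prometheus API):

```python
def query_has_data(api_response):
    """True if a Prometheus /api/v1/query response contains at least one sample."""
    return (
        api_response.get("status") == "success"
        and bool(api_response.get("data", {}).get("result"))
    )
```

`status: success` with an empty `result` means the metric genuinely has no samples in range, i.e., the "metric missing" escalation case, not a broken query.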

Escalation trigger: Prometheus down and cannot be restarted; persistent storage (Grafana SQLite/Postgres) lost; all dashboards blank during active incident.

Safe actions: Check pod status, test data source connection, run PromQL queries manually.

Dangerous actions: Restart Prometheus (data gap), delete Grafana persistent data, change retention settings.


Alert: Loki / Log Pipeline Gap

Severity: P2

First command:

kubectl get pods -n monitoring | grep -E "loki|promtail|alloy|agent"
What you're looking for: Are Loki and log shippers (Promtail/Alloy) running?

Decision tree:

Is Loki running?
├── No → kubectl describe pod <loki-pod> -n monitoring; check for OOM or storage issues.
        kubectl rollout restart deploy/loki -n monitoring (if no data loss risk)
└── Yes → Is Promtail/Alloy running on each node?
    ├── No (DaemonSet pods missing) → kubectl get ds -n monitoring; kubectl rollout restart ds/promtail -n monitoring
    └── Yes → Are logs being ingested? (Loki UI → Explore → {job="<name>"} → last entry?)
              Recent gap? → Check Promtail logs: kubectl logs -n monitoring -l app=promtail --tail=50
              Permission error on /var/log? → Check DaemonSet volume mounts.
              Escalate: "No logs in Loki for <service> since <time>; Promtail logs: <paste>"
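To put a number on "recent gap", you can compute the age of the newest entry from a Loki query_range response. A sketch (`last_log_age_seconds` is our helper name; Loki's documented response carries entries as `[timestamp, line]` pairs with nanosecond-epoch string timestamps):

```python
from datetime import datetime, timezone

def last_log_age_seconds(loki_response, now_ts=None):
    """Seconds since the newest entry in a Loki query_range response."""
    newest_ns = max(
        int(ts)
        for stream in loki_response["data"]["result"]
        for ts, _line in stream["values"]
    )
    if now_ts is None:
        now_ts = datetime.now(timezone.utc).timestamp()
    return now_ts - newest_ns / 1e9
```

An age of minutes for a chatty service is your gap; use that timestamp in the escalation message.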

Escalation trigger: Loki storage full (disk or object store quota); log shipper crash-looping on all nodes; persistent log gap during active security incident.

Safe actions: Check pod status, read Promtail logs, query Loki for recent entries.

Dangerous actions: Restart Loki (brief query gap), compact Loki chunks, change retention.


Alert: Alertmanager Not Sending Notifications

Severity: P1 (people not getting paged)

First command:

kubectl logs -n monitoring -l app=alertmanager --tail=50
What you're looking for: Errors sending to PagerDuty/Slack — connection refused, 401 unauthorized, integration key invalid.

Decision tree:

Are there auth errors in logs (401, invalid key)?
├── Yes → Check the integration key/token in the Alertmanager secret:
         kubectl get secret alertmanager-config -n monitoring -o yaml | grep -i key
         Rotate the key if compromised, update the secret, then rollout-restart Alertmanager.
└── No → Is it a network error (cannot reach PagerDuty/Slack URL)?
    ├── Yes → Cluster egress blocked? Test: kubectl exec -n monitoring <pod> -- curl -I https://events.pagerduty.com
             Network policy? Proxy required? Escalate to infra.
    └── No → Is the config YAML malformed?
             amtool check-config alertmanager.yml (extract it from the alertmanager-config secret first)
             Invalid config? → Fix and rollout-restart Alertmanager.
             Escalate: "Alertmanager not paging, logs: <paste>"
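Skimming 50 log lines at 3 AM is error-prone; a rough bucketing script can do the auth-vs-network split for you. A sketch with illustrative patterns only (real notifier error strings vary by integration and Alertmanager version):

```python
import re

# Illustrative patterns -- adjust to what your notifiers actually log.
AUTH_ERR = re.compile(r"401|403|unauthorized|invalid.*key", re.IGNORECASE)
NET_ERR = re.compile(r"connection refused|timeout|no such host", re.IGNORECASE)

def classify_notifier_errors(log_lines):
    """Bucket Alertmanager log lines into auth, network, and other."""
    buckets = {"auth": [], "network": [], "other": []}
    for line in log_lines:
        if AUTH_ERR.search(line):
            buckets["auth"].append(line)
        elif NET_ERR.search(line):
            buckets["network"].append(line)
        else:
            buckets["other"].append(line)
    return buckets
```

Feed it the output of the `kubectl logs` command above; a non-empty `auth` bucket sends you down the key-rotation branch, `network` down the egress branch.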

Escalation trigger: Nobody is getting paged for active P1 incidents; Alertmanager crash-looping; cannot reach notification endpoints.

Safe actions: Read Alertmanager logs, validate config with amtool check-config.

Dangerous actions: Edit Alertmanager config (can silence all notifications), restart Alertmanager.


Quick Reference

Most Useful Commands

# Prometheus targets (check for DOWN)
kubectl port-forward -n monitoring svc/prometheus 9090:9090 &
# → http://localhost:9090/targets

# Alertmanager alerts
kubectl port-forward -n monitoring svc/alertmanager 9093:9093 &
# → http://localhost:9093

# All monitoring stack pods
kubectl get pods -n monitoring

# Alertmanager logs
kubectl logs -n monitoring -l app=alertmanager --tail=100

# Prometheus logs
kubectl logs -n monitoring -l app=prometheus --tail=100

# Promtail logs (log shipper)
kubectl logs -n monitoring -l app=promtail --tail=50

# Test a PromQL query
curl -G 'http://localhost:9090/api/v1/query' --data-urlencode 'query=up{job="grokdevops"}'

# Add a silence (30 min)
amtool silence add --alertmanager.url=http://localhost:9093 \
  alertname="<name>" --comment="investigating" --duration=30m

# List active silences
amtool silence query --alertmanager.url=http://localhost:9093

# Restart monitoring components (careful — brief gap)
kubectl rollout restart deploy/prometheus -n monitoring
kubectl rollout restart deploy/grafana -n monitoring
kubectl rollout restart deploy/alertmanager -n monitoring

Escalation Contacts

Situation                  Team                          Channel
Prometheus storage full    Infra / Platform              #infra-oncall
Nobody getting paged       On-call lead (immediately)    Direct page
Alert storm, P1 cascade    On-call lead                  #incidents
Loki data loss             Platform                      #infra-oncall

Safe vs Dangerous Actions

Safe (do without asking)          Dangerous (get approval)
Read logs from monitoring pods    Restart Prometheus
View targets, alerts in UI        Silence alerts
Run PromQL queries                Edit Alertmanager config
Test data source connection       Delete Grafana state
List active silences              Change retention settings

Shift Handoff Template

Status: [GREEN/YELLOW/RED]
Active incidents: [none / description]
Recent deploys: [list from last 24h]
Known flaky alerts: [list]
Things to watch: [anything unusual]