Decision Tree: Alert Fired — Is This Real?¶
Category: Incident Triage
Starting Question: "An alert fired — is this a real incident or noise?"
Estimated traversal: 2-4 minutes
Domains: observability, kubernetes, linux-performance
The Tree¶
An alert fired — is this a real incident or noise?
│
├── What is the alert severity?
│ │
│ ├── Critical / Page
│ │ └── Treat as real until proven otherwise.
│ │ Acknowledge within SLA, then validate → continue below
│ │
│ ├── Warning
│ │ └── Validate before escalating → continue below
│ │
│ └── Info / Watchdog
│ └── Informational — log it, no action needed unless part of a pattern
│
├── Is the metric value actually above threshold right now?
│ (Go directly to the metric in Prometheus/Grafana — do not trust the alert label alone)
│ │
│ ├── Look up the metric: copy the PromQL from the alert rule and run it
│ │ `kubectl exec -n monitoring deploy/prometheus -- \`
│ │ ` promtool query instant http://localhost:9090 '<promql-from-rule>'`
│ │ │
│ │ ├── Value is below threshold → alert is already recovering
│ │ │ │
│ │ │ ├── Was it a spike? Check 30-min graph
│ │ │ │ └── Yes, brief spike → ✅ ACTION: Snooze and Investigate Threshold
│ │ │ │
│ │ │ └── Value never rose (stale firing alert)?
│ │ │ `kubectl get prometheusrule -A | grep <alert-name>`
│ │ │ └── ✅ ACTION: Check Alertmanager for Stale Firing State
│ │ │
│ │ └── Value is above threshold → alert is real → continue
│ │
├── How long has the metric been above threshold?
│ (Check the Grafana time series — zoom out to last 2 hours)
│ │
│ ├── < 5 minutes (very recent)
│ │ └── Watch it for 2 more minutes before full incident declaration
│ │ `watch -n 15 'kubectl top pods -n <namespace>'`
│ │
│ ├── 5-30 minutes (sustained)
│ │ └── Real issue — proceed to impact assessment below
│ │
│ └── >30 minutes (long-duration)
│ └── Likely already user-impacting — escalate immediately, investigate in parallel
│ → ⚠️ ESCALATION: On-Call Engineer
│
├── Is this a systemic issue? (Are multiple alerts firing simultaneously?)
│ Check Alertmanager: `kubectl exec -n monitoring deploy/alertmanager -- \`
│ ` amtool alert query --alertmanager.url=http://localhost:9093`
│ │
│ ├── Yes — many alerts across multiple services/nodes
│ │ │
│ │ ├── Is there a recent change? (deployment, config, infra)
│ │ │ `kubectl get events --all-namespaces --sort-by=.lastTimestamp | tail -30`
│ │ │ └── Yes → likely one root cause — treat as major incident
│ │ │ → ⚠️ ESCALATION: Incident Commander
│ │ │
│ │ └── No recent change → external cause (upstream dependency, cloud provider)
│ │ Check cloud provider status page + your dependency status pages
│ │ → ⚠️ ESCALATION: Incident Commander + vendor contact
│ │
│ └── No — isolated single alert → likely a specific service issue
│ → continue to user impact check
│
├── Is the symptom user-visible?
│ │
│ ├── Check synthetic monitoring / uptime probes
│ │ `kubectl get probe -A` (Prometheus Blackbox)
│ │ or check your external uptime monitor dashboard
│ │ │
│ │ ├── External probes showing errors → user-impacting → declare incident
│ │ │ → ⚠️ ESCALATION: On-Call + Stakeholder Notification
│ │ │
│ │ └── External probes healthy → internal issue not yet user-facing
│ │ Log it, start investigating, set watchdog
│ │
│ └── No uptime monitoring available
│ Check error rate in metrics: `rate(http_requests_total{status=~"5.."}[5m])`
│ └── Error rate elevated → treat as user-impacting
│
├── Is this alert known to be flapping?
│ Look at alert history: Grafana → Alerting → Alert History for this rule
│ │
│ ├── Fired >3 times in 24h with short durations
│ │ │
│ │ ├── Threshold is too tight for normal variance
│ │ │ └── ✅ ACTION: Fix Alert Threshold / Add for duration
│ │ │
│ │ └── Underlying issue is real but intermittent
│ │ └── ✅ ACTION: Fix Root Cause / Add Pending Duration to Alert
│ │
│ └── First time firing → not a flapping issue
│
└── Is this during a maintenance window or scheduled event?
Check your change calendar / on-call handoff notes
│
├── Yes — alert is expected → ✅ ACTION: Silence Alert for Maintenance Window
│
└── No — not expected → investigate normally
Node Details¶
Check 1: Verify the metric value directly¶
Command: In the Prometheus UI: navigate to Graph, paste the PromQL from the alert rule, click Execute. Or via CLI: `kubectl exec -n monitoring deploy/prometheus -- wget -qO- 'http://localhost:9090/api/v1/query?query=<encoded-promql>' | jq '.data.result'`
What you're looking for: The actual current value, not just the alert label. An alert may still be "firing" after the metric recovered if Alertmanager hasn't received a resolution yet.
Common pitfall: Alertmanager continues to show an alert as firing for up to resolve_timeout (default: 5 min) after the metric drops below threshold. Do not assume the metric is still high just because the alert is still showing.
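The parsing step can be sketched locally. The sample response below mimics the shape of the Prometheus `/api/v1/query` API (the value and the 0.95 threshold are invented); in practice, pipe the output of the CLI command above through the same pipeline.

```shell
# Sample /api/v1/query response (shape per the Prometheus HTTP API; value invented)
response='{"status":"success","data":{"resultType":"vector","result":[{"metric":{"pod":"api-0"},"value":[1700000000,"0.97"]}]}}'

# Pull out the current sample value (second element of the "value" pair)
value=$(echo "$response" | sed -n 's/.*"value":\[[0-9.]*,"\([0-9.]*\)".*/\1/p')

# Compare against the alert threshold (0.95 here, purely illustrative)
awk -v v="$value" 'BEGIN { exit !(v > 0.95) }' \
  && echo "above threshold ($value) - alert is real" \
  || echo "below threshold ($value) - recovering or stale"
```

The comparison is what matters: decide from the current value, not from the alert label.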
Check 2: Multiple alerts firing simultaneously¶
Command: `kubectl exec -n monitoring deploy/alertmanager -- amtool alert query --alertmanager.url=http://localhost:9093 | head -40`, or use the Alertmanager UI directly.
What you're looking for: Alerts from multiple different services or nodes firing within the same 5-minute window. This pattern indicates a shared dependency or infrastructure event rather than an isolated service bug.
Common pitfall: Alert deduplication hides grouped alerts. In Alertmanager UI, expand all alert groups before concluding "only one alert is firing".
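A rough way to spot the multi-service pattern from raw alert JSON. The sample below mimics the shape of Alertmanager's `/api/v2/alerts` response (the alerts themselves are invented); real data comes from the amtool command above.

```shell
# Invented alert list mimicking Alertmanager's /api/v2/alerts shape
alerts='[{"labels":{"alertname":"HighCPU","job":"api"}},
         {"labels":{"alertname":"HighLatency","job":"api"}},
         {"labels":{"alertname":"DiskFull","job":"db"}}]'

# Count distinct services (job label); more than one suggests a systemic issue
services=$(echo "$alerts" | grep -o '"job":"[^"]*"' | sort -u | wc -l | tr -d ' ')
echo "distinct services alerting: $services"
```

Here two services (`api`, `db`) are alerting at once, the pattern that points at a shared dependency rather than an isolated bug.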
Check 3: Alert history / flapping¶
Command: In Grafana: Alerting → Alert History (search for the rule name). In Prometheus: `ALERTS{alertname="<name>"}` shows currently pending and firing alerts (see the alertstate label). For history, check your Alertmanager webhook receiver logs or long-term alerting storage.
What you're looking for: Patterns like "fires for 2 min, recovers, fires again" on a daily cycle. This is textbook threshold-too-tight behavior.
Common pitfall: An alert that fires briefly every morning at 9am is not random — it correlates with daily traffic peaks. Adjust the threshold or add for: 10m to require sustained breaches.
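The "fires, recovers, fires again" pattern can be made concrete with a toy count of firing episodes. The series below is invented (1 = firing, 0 = resolved, one sample per interval); real history comes from the sources above.

```shell
# Invented day of alert-state samples: 1 = firing, 0 = resolved
samples="0 0 1 1 0 0 1 0 0 1 1 1 0"

# Count rising edges (0 -> 1 transitions), i.e. distinct firing episodes
episodes=$(echo "$samples" | \
  awk '{n=0; prev=0; for(i=1;i<=NF;i++){if($i==1 && prev==0) n++; prev=$i} print n}')
echo "firing episodes in window: $episodes"
```

Three short episodes in one window, as here, is the flapping signature: tighten the threshold or add a for: duration rather than paging on each episode.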
Check 4: Is symptom user-visible?¶
Command: Blackbox exporter probes: `kubectl get servicemonitor -n monitoring | grep blackbox`. Check the probe metric: `probe_success{job="blackbox"}`. External check: `curl -w "%{http_code}" -o /dev/null -s https://<your-service>/healthz`.
What you're looking for: probe_success = 0 or HTTP status != 200 from external probes confirms user-facing impact.
Common pitfall: Internal healthcheck endpoints often pass even when the service is degraded (they don't test end-to-end functionality). A service returning 200 on /healthz can still be returning 500 on /api/v1/orders.
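The pitfall above suggests probing both the healthcheck and a real user-facing path, and treating a 5xx on either as impact. A minimal sketch of that classification (the curl invocation in the comment matches the external check above; the service URL is a placeholder):

```shell
# How this check treats a status code (sketch)
classify() {
  case "$1" in
    2??) echo "healthy" ;;
    5??) echo "user-impacting" ;;
    *)   echo "investigate" ;;
  esac
}

# In practice, feed in real codes from both endpoints, e.g.:
#   classify "$(curl -w '%{http_code}' -o /dev/null -s https://<your-service>/api/v1/orders)"
classify 200   # the /healthz case
classify 503   # the /api/v1/orders case from the pitfall above
```

A "healthy" healthz next to a "user-impacting" API path is exactly the gap the pitfall describes.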
Check 5: Recent changes¶
Command: `kubectl get events --all-namespaces --sort-by=.lastTimestamp | tail -20` and `kubectl rollout history deployment --all-namespaces | grep -v "<none>"`. Also check your CI/CD system for recent deploys.
What you're looking for: Deployments, ConfigMap changes, certificate rotations, HPA scaling events in the last 30-60 minutes that coincide with the alert onset.
Common pitfall: Helm upgrades that don't change the pod template (e.g. values-only changes) won't appear in `kubectl rollout history`. Check `helm history <release>` as well.
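The "last 30-60 minutes" correlation can be scripted. A local sketch, filtering invented epoch-timestamped events down to the window (real input would come from the kubectl events output above):

```shell
# Invented events: epoch-seconds timestamp, then a short description
now=1700003600
events="1700000000 deployment/api rollout
1700003000 configmap/api-config update
1699990000 certificate rotation"

# Keep only events within the last 60 minutes (3600 s) of "now"
echo "$events" | awk -v now="$now" 'now - $1 <= 3600 { $1=""; print "recent:" $0 }'
```

Two of the three invented events fall inside the window; those are the candidates to correlate with the alert's onset time.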
Terminal Actions¶
Action: Acknowledge and Investigate¶
Do:
1. Acknowledge the alert in PagerDuty / OpsGenie / Alertmanager to silence further pages
2. Open the alert's runbook link (every critical alert should have one)
3. Collect initial data: time of onset, affected services, current metric value
4. Post in incident channel: "Investigating [alert name] — [current status]"
Verify: Alert acknowledged. Investigation underway with initial findings posted.
Action: Snooze and Investigate Threshold¶
Do:
1. In Alertmanager UI: create a silence for this alert for 24h to prevent repeat pages
2. kubectl annotate --overwrite prometheusrule <rule-name> -n monitoring note="threshold under review"
3. Check if for: <duration> is set in the alert rule — adding for: 5m prevents firing on brief spikes
4. Review normal metric variance over 7 days to set a better threshold
Verify: Alert no longer fires for brief spikes. Document the change in your runbook.
Runbook: prometheus_target_down.md
Action: Fix Alert Threshold / Add for Duration¶
Do:
1. kubectl get prometheusrule <name> -n monitoring -o yaml > /tmp/rule-backup.yaml (keep as a rollback copy)
2. kubectl edit prometheusrule <name> -n monitoring — add or increase for: 5m (require sustained breach, not just a spike)
3. Or adjust the threshold expression: > 0.95 → > 0.98
4. If the change misbehaves, restore the original with kubectl apply -f /tmp/rule-backup.yaml
5. Monitor for 24h to confirm it doesn't fire spuriously
Verify: Alert does not fire on brief spikes. Still fires on sustained issues.
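The for: change from step 2 lands in the rule spec like this (a sketch: the names, expression, and threshold are placeholders; the field layout follows the prometheus-operator PrometheusRule resource):

```yaml
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: example-rules          # placeholder
  namespace: monitoring
spec:
  groups:
    - name: example.rules
      rules:
        - alert: HighErrorRate # placeholder
          expr: rate(http_requests_total{status=~"5.."}[5m]) > 0.05
          for: 5m              # require a sustained breach before firing
          labels:
            severity: warning
```

Without the for: field the alert fires on the first scrape above threshold; with it, the alert stays pending until the breach has lasted the full duration.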
Action: Fix Root Cause / Add Pending Duration¶
Do:
1. Investigate the underlying intermittent issue (check logs, events, metrics around each firing)
2. Fix the root cause (connection leak, retry storm, etc.) — this is the priority
3. As an interim measure, add for: 10m to the alert rule to require a sustained breach before paging
Verify: Alert stops flapping. Root cause no longer produces the metric spikes.
Action: Silence Alert for Maintenance Window¶
Do:
1. In Alertmanager UI: Silences → New Silence
2. Set matchers: alertname=<name>, and optionally scope to specific labels
3. Set start/end time to cover the maintenance window + 15 min buffer
4. Add comment: "Maintenance window — [change ticket ID]"
Verify: Alert fires in Prometheus but is silenced in Alertmanager. Remove silence after maintenance.
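The same silence can be created from the CLI. A sketch: the buffer arithmetic is runnable as-is, while the amtool command in the comment assumes Alertmanager is reachable as in the checks above (times and ticket ID are placeholders).

```shell
# Maintenance window end (placeholder, epoch seconds) plus the 15-minute buffer
maint_end=1700007200
silence_end=$((maint_end + 900))
echo "silence until: $silence_end"

# Then create the silence from the CLI instead of the UI, e.g.:
#   amtool silence add alertname=<name> \
#     --alertmanager.url=http://localhost:9093 \
#     --duration=2h15m --comment="Maintenance window — [change ticket ID]"
```

Scoping the matcher to alertname plus any narrowing labels keeps unrelated alerts flowing during the window.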
Action: Check Alertmanager for Stale Firing State¶
Do:
1. kubectl exec -n monitoring deploy/alertmanager -- amtool alert query --alertmanager.url=http://localhost:9093 --filter='alertname=<name>'
2. If the alert is listed but metric is below threshold, it will auto-resolve after resolve_timeout
3. Do not try to manually delete alerts — just wait for resolve_timeout to expire; if the alert persists beyond that, investigate the Prometheus → Alertmanager path
Verify: Alert clears in Alertmanager within 5 minutes of metric recovery.
Escalation: On-Call Engineer¶
When: Alert is sustained >10 min, metric is above threshold, and symptom may be user-facing.
Who: On-call SRE per the rotation in PagerDuty/OpsGenie
Include in page: Alert name, duration, current metric value, whether user-facing probes are affected, last deployment time
Escalation: Incident Commander¶
When: Multiple services alerting simultaneously, user impact confirmed, no obvious single root cause.
Who: Incident commander (senior on-call or on-call manager)
Include in page: List of all firing alerts, onset time, recent changes (deployments, infra), whether cloud provider status shows issues
Edge Cases¶
- Alert fires in test/staging but not production: Thresholds may be configured identically but traffic patterns differ. Lower environments may fire on trivial load. Scope alert labels to env.
- Alert fires every day at the same time: Traffic-correlated threshold. Either raise the threshold, switch to relative alerts (percent change), or use time-of-day based alerting inhibition.
- Prometheus target is down, but service is up: The scrape endpoint may be broken (wrong port, broken /metrics handler). The alert may be a monitoring gap, not a service failure. Check the up metric: `up{job="<svc>"}`.
- Alertmanager shows 0 alerts but Prometheus shows ALERTS firing: Alertmanager and Prometheus may be misconfigured to not communicate. Check `kubectl logs -n monitoring deploy/alertmanager` for "no Prometheus connected" errors.
- Alert inhibited by another rule: Alertmanager inhibition rules can hide related alerts during major incidents. Check `kubectl get secret alertmanager-config -n monitoring -o yaml` for inhibit_rules.
Cross-References¶
- Topic Packs: observability-deep-dive, prometheus-deep-dive, alerting-rules, k8s-ops
- Runbooks: prometheus_target_down.md, loki_no_logs.md