Decision Tree: Alert Fired — Is This Real?¶
Category: Incident Triage
Starting Question: "An alert fired — is this a real incident or noise?"
Estimated traversal: 2-4 minutes
Domains: observability, kubernetes, linux-performance
The Tree¶
An alert fired — is this a real incident or noise?
│
├── What is the alert severity?
│ │
│ ├── Critical / Page
│ │ └── Treat as real until proven otherwise.
│ │ Acknowledge within SLA, then validate → continue below
│ │
│ ├── Warning
│ │ └── Validate before escalating → continue below
│ │
│ └── Info / Watchdog
│ └── Informational — log it, no action needed unless part of a pattern
│
├── Is the metric value actually above threshold right now?
│ (Go directly to the metric in Prometheus/Grafana — do not trust the alert label alone)
│ │
│ ├── Look up the metric: copy the PromQL from the alert rule and run it
│ │ `kubectl exec -n monitoring deploy/prometheus -- \`
│ │ ` promtool query instant http://localhost:9090 '<promql-from-rule>'`
│ │ │
│ │ ├── Value is below threshold → alert is already recovering
│ │ │ │
│ │ │ ├── Was it a spike? Check 30-min graph
│ │ │ │ └── Yes, brief spike → ✅ ACTION: Snooze and Investigate Threshold
│ │ │ │
│ │ │ └── Value never rose (stale firing alert)?
│ │ │ `kubectl get prometheusrule -A | grep <alert-name>`
│ │ │ └── ✅ ACTION: Check Alertmanager for Stale Firing State
│ │ │
│ │ └── Value is above threshold → alert is real → continue
│ │
├── How long has the metric been above threshold?
│ (Check the Grafana time series — zoom out to last 2 hours)
│ │
│ ├── < 5 minutes (very recent)
│ │ └── Watch it for 2 more minutes before full incident declaration
│ │ `watch -n 15 'kubectl top pods -n <namespace>'`
│ │
│ ├── 5-30 minutes (sustained)
│ │ └── Real issue — proceed to impact assessment below
│ │
│ └── >30 minutes (long-duration)
│ └── Likely already user-impacting — escalate immediately, investigate in parallel
│ → ⚠️ ESCALATION: On-Call Engineer
│
├── Is this a systemic issue? (Are multiple alerts firing simultaneously?)
│ Check Alertmanager: `kubectl exec -n monitoring deploy/alertmanager -- \`
│ ` amtool alert query --alertmanager.url=http://localhost:9093`
│ │
│ ├── Yes — many alerts across multiple services/nodes
│ │ │
│ │ ├── Is there a recent change? (deployment, config, infra)
│ │ │ `kubectl get events --all-namespaces --sort-by=.lastTimestamp | tail -30`
│ │ │ └── Yes → likely one root cause — treat as major incident
│ │ │ → ⚠️ ESCALATION: Incident Commander
│ │ │
│ │ └── No recent change → external cause (upstream dependency, cloud provider)
│ │ Check cloud provider status page + your dependency status pages
│ │ → ⚠️ ESCALATION: Incident Commander + vendor contact
│ │
│ └── No — isolated single alert → likely a specific service issue
│ → continue to user impact check
│
├── Is the symptom user-visible?
│ │
│ ├── Check synthetic monitoring / uptime probes
│ │ `kubectl get probe -A` (Prometheus Blackbox)
│ │ or check your external uptime monitor dashboard
│ │ │
│ │ ├── External probes showing errors → user-impacting → declare incident
│ │ │ → ⚠️ ESCALATION: On-Call + Stakeholder Notification
│ │ │
│ │ └── External probes healthy → internal issue not yet user-facing
│ │ Log it, start investigating, set watchdog
│ │
│ └── No uptime monitoring available
│ Check error rate in metrics: `rate(http_requests_total{status=~"5.."}[5m])`
│ └── Error rate elevated → treat as user-impacting
│
├── Is this alert known to be flapping?
│ Look at alert history: Grafana → Alerting → Alert History for this rule
│ │
│ ├── Fired >3 times in 24h with short durations
│ │ │
│ │ ├── Threshold is too tight for normal variance
│ │ │ └── ✅ ACTION: Fix Alert Threshold / Add for duration
│ │ │
│ │ └── Underlying issue is real but intermittent
│ │ └── ✅ ACTION: Fix Root Cause / Add Pending Duration to Alert
│ │
│ └── First time firing → not a flapping issue
│
└── Is this during a maintenance window or scheduled event?
Check your change calendar / on-call handoff notes
│
├── Yes — alert is expected → ✅ ACTION: Silence Alert for Maintenance Window
│
└── No — not expected → investigate normally
Node Details¶
Check 1: Verify the metric value directly¶
Command: In the Prometheus UI: navigate to Graph, paste the PromQL from the alert rule, click Execute. Or via CLI: `kubectl exec -n monitoring deploy/prometheus -- wget -qO- 'http://localhost:9090/api/v1/query?query=<encoded-promql>' | jq '.data.result'`
What you're looking for: The actual current value, not just the alert label. An alert may still be "firing" after the metric recovered if Alertmanager hasn't received a resolution yet.
Common pitfall: Alertmanager continues to show an alert as firing for up to resolve_timeout (default: 5 min) after the metric drops below threshold. Do not assume the metric is still high just because the alert is still showing.
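The parsing step can be sketched locally. The sample response below mimics the shape of the Prometheus `/api/v1/query` API (the value and the 0.95 threshold are invented); in practice, pipe the output of the CLI command above through the same pipeline.

```shell
# Sample /api/v1/query response (shape per the Prometheus HTTP API; value invented)
response='{"status":"success","data":{"resultType":"vector","result":[{"metric":{"pod":"api-0"},"value":[1700000000,"0.97"]}]}}'

# Pull out the current sample value (second element of the "value" pair)
value=$(echo "$response" | sed -n 's/.*"value":\[[0-9.]*,"\([0-9.]*\)".*/\1/p')

# Compare against the alert threshold (0.95 here, purely illustrative)
awk -v v="$value" 'BEGIN { exit !(v > 0.95) }' \
  && echo "above threshold ($value) - alert is real" \
  || echo "below threshold ($value) - recovering or stale"
```

The comparison is what matters: decide from the current value, not from the alert label.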
Check 2: Multiple alerts firing simultaneously¶
Command: `kubectl exec -n monitoring deploy/alertmanager -- amtool alert query --alertmanager.url=http://localhost:9093 | head -40`, or use the Alertmanager UI directly.
What you're looking for: Alerts from multiple different services or nodes firing within the same 5-minute window. This pattern indicates a shared dependency or infrastructure event rather than an isolated service bug.
Common pitfall: Alert deduplication hides grouped alerts. In Alertmanager UI, expand all alert groups before concluding "only one alert is firing".
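A rough way to spot the multi-service pattern from raw alert JSON. The sample below mimics the shape of Alertmanager's `/api/v2/alerts` response (the alerts themselves are invented); real data comes from the amtool command above.

```shell
# Invented alert list mimicking Alertmanager's /api/v2/alerts shape
alerts='[{"labels":{"alertname":"HighCPU","job":"api"}},
         {"labels":{"alertname":"HighLatency","job":"api"}},
         {"labels":{"alertname":"DiskFull","job":"db"}}]'

# Count distinct services (job label); more than one suggests a systemic issue
services=$(echo "$alerts" | grep -o '"job":"[^"]*"' | sort -u | wc -l | tr -d ' ')
echo "distinct services alerting: $services"
```

Here two services (`api`, `db`) are alerting at once, the pattern that points at a shared dependency rather than an isolated bug.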
Check 3: Alert history / flapping¶
Command: In Grafana: Alerting → Alert History (search for the rule name). In Prometheus: `ALERTS{alertname="<name>"}` shows currently pending and firing alerts (see the alertstate label). For history, check your Alertmanager webhook receiver logs or long-term alerting storage.
What you're looking for: Patterns like "fires for 2 min, recovers, fires again" on a daily cycle. This is textbook threshold-too-tight behavior.
Common pitfall: An alert that fires briefly every morning at 9am is not random — it correlates with daily traffic peaks. Adjust the threshold or add for: 10m to require sustained breaches.
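The "fires, recovers, fires again" pattern can be made concrete with a toy count of firing episodes. The series below is invented (1 = firing, 0 = resolved, one sample per interval); real history comes from the sources above.

```shell
# Invented day of alert-state samples: 1 = firing, 0 = resolved
samples="0 0 1 1 0 0 1 0 0 1 1 1 0"

# Count rising edges (0 -> 1 transitions), i.e. distinct firing episodes
episodes=$(echo "$samples" | \
  awk '{n=0; prev=0; for(i=1;i<=NF;i++){if($i==1 && prev==0) n++; prev=$i} print n}')
echo "firing episodes in window: $episodes"
```

Three short episodes in one window, as here, is the flapping signature: tighten the threshold or add a for: duration rather than paging on each episode.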
Check 4: Is symptom user-visible?¶
Command: Blackbox exporter probes: `kubectl get servicemonitor -n monitoring | grep blackbox`. Check the probe metric: `probe_success{job="blackbox"}`. External check: `curl -w "%{http_code}" -o /dev/null -s https://<your-service>/healthz`.
What you're looking for: probe_success = 0 or HTTP status != 200 from external probes confirms user-facing impact.
Common pitfall: Internal healthcheck endpoints often pass even when the service is degraded (they don't test end-to-end functionality). A service returning 200 on /healthz can still be returning 500 on /api/v1/orders.
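The pitfall above suggests probing both the healthcheck and a real user-facing path, and treating a 5xx on either as impact. A minimal sketch of that classification (the curl invocation in the comment matches the external check above; the service URL is a placeholder):

```shell
# How this check treats a status code (sketch)
classify() {
  case "$1" in
    2??) echo "healthy" ;;
    5??) echo "user-impacting" ;;
    *)   echo "investigate" ;;
  esac
}

# In practice, feed in real codes from both endpoints, e.g.:
#   classify "$(curl -w '%{http_code}' -o /dev/null -s https://<your-service>/api/v1/orders)"
classify 200   # the /healthz case
classify 503   # the /api/v1/orders case from the pitfall above
```

A "healthy" healthz next to a "user-impacting" API path is exactly the gap the pitfall describes.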
Check 5: Recent changes¶
Command: `kubectl get events --all-namespaces --sort-by=.lastTimestamp | tail -20` and `kubectl rollout history deployment --all-namespaces | grep -v "<none>"`. Also check your CI/CD system for recent deploys.
What you're looking for: Deployments, ConfigMap changes, certificate rotations, HPA scaling events in the last 30-60 minutes that coincide with the alert onset.
Common pitfall: Helm upgrades that don't change the pod template (e.g. values-only changes) won't appear in `kubectl rollout history`. Check `helm history <release>` as well.
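The "last 30-60 minutes" correlation can be scripted. A local sketch, filtering invented epoch-timestamped events down to the window (real input would come from the kubectl events output above):

```shell
# Invented events: epoch-seconds timestamp, then a short description
now=1700003600
events="1700000000 deployment/api rollout
1700003000 configmap/api-config update
1699990000 certificate rotation"

# Keep only events within the last 60 minutes (3600 s) of "now"
echo "$events" | awk -v now="$now" 'now - $1 <= 3600 { $1=""; print "recent:" $0 }'
```

Two of the three invented events fall inside the window; those are the candidates to correlate with the alert's onset time.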
Terminal Actions¶
Action: Acknowledge and Investigate¶
Do:
1. Acknowledge the alert in PagerDuty / OpsGenie / Alertmanager to silence further pages
2. Open the alert's runbook link (every critical alert should have one)
3. Collect initial data: time of onset, affected services, current metric value
4. Post in incident channel: "Investigating [alert name] — [current status]"
Verify: Alert acknowledged. Investigation underway with initial findings posted.
Action: Snooze and Investigate Threshold¶
Do:
1. In Alertmanager UI: create a silence for this alert for 24h to prevent repeat pages
2. kubectl annotate --overwrite prometheusrule <rule-name> -n monitoring note="threshold under review"
3. Check if for: <duration> is set in the alert rule — adding for: 5m prevents firing on brief spikes
4. Review normal metric variance over 7 days to set a better threshold
Verify: Alert no longer fires for brief spikes. Document the change in your runbook.
Runbook: prometheus_target_down.md
Action: Fix Alert Threshold / Add for Duration¶
Do:
1. kubectl get prometheusrule <name> -n monitoring -o yaml > /tmp/rule-backup.yaml (keep as a rollback copy)
2. kubectl edit prometheusrule <name> -n monitoring — add or increase for: 5m (require sustained breach, not just a spike)
3. Or adjust the threshold expression: > 0.95 → > 0.98
4. If the change misbehaves, restore the original with kubectl apply -f /tmp/rule-backup.yaml
5. Monitor for 24h to confirm it doesn't fire spuriously
Verify: Alert does not fire on brief spikes. Still fires on sustained issues.
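The for: change from step 2 lands in the rule spec like this (a sketch: the names, expression, and threshold are placeholders; the field layout follows the prometheus-operator PrometheusRule resource):

```yaml
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: example-rules          # placeholder
  namespace: monitoring
spec:
  groups:
    - name: example.rules
      rules:
        - alert: HighErrorRate # placeholder
          expr: rate(http_requests_total{status=~"5.."}[5m]) > 0.05
          for: 5m              # require a sustained breach before firing
          labels:
            severity: warning
```

Without the for: field the alert fires on the first scrape above threshold; with it, the alert stays pending until the breach has lasted the full duration.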
Action: Fix Root Cause / Add Pending Duration¶
Do:
1. Investigate the underlying intermittent issue (check logs, events, metrics around each firing)
2. Fix the root cause (connection leak, retry storm, etc.) — this is the priority
3. As an interim measure, add for: 10m to the alert rule to require a sustained breach before paging
Verify: Alert stops flapping. Root cause no longer produces the metric spikes.
Action: Silence Alert for Maintenance Window¶
Do:
1. In Alertmanager UI: Silences → New Silence
2. Set matchers: alertname=<name>, and optionally scope to specific labels
3. Set start/end time to cover the maintenance window + 15 min buffer
4. Add comment: "Maintenance window — [change ticket ID]"
Verify: Alert fires in Prometheus but is silenced in Alertmanager. Remove silence after maintenance.
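The same silence can be created from the CLI. A sketch: the buffer arithmetic is runnable as-is, while the amtool command in the comment assumes Alertmanager is reachable as in the checks above (times and ticket ID are placeholders).

```shell
# Maintenance window end (placeholder, epoch seconds) plus the 15-minute buffer
maint_end=1700007200
silence_end=$((maint_end + 900))
echo "silence until: $silence_end"

# Then create the silence from the CLI instead of the UI, e.g.:
#   amtool silence add alertname=<name> \
#     --alertmanager.url=http://localhost:9093 \
#     --duration=2h15m --comment="Maintenance window — [change ticket ID]"
```

Scoping the matcher to alertname plus any narrowing labels keeps unrelated alerts flowing during the window.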
Action: Check Alertmanager for Stale Firing State¶
Do:
1. kubectl exec -n monitoring deploy/alertmanager -- amtool alert query --alertmanager.url=http://localhost:9093 --filter='alertname=<name>'
2. If the alert is listed but metric is below threshold, it will auto-resolve after resolve_timeout
3. Do not try to manually delete alerts — just wait for resolve_timeout to expire; if the alert persists beyond that, investigate the Prometheus → Alertmanager path
Verify: Alert clears in Alertmanager within 5 minutes of metric recovery.
Escalation: On-Call Engineer¶
When: Alert is sustained >10 min, metric is above threshold, and symptom may be user-facing.
Who: On-call SRE per the rotation in PagerDuty/OpsGenie
Include in page: Alert name, duration, current metric value, whether user-facing probes are affected, last deployment time
Escalation: Incident Commander¶
When: Multiple services alerting simultaneously, user impact confirmed, no obvious single root cause.
Who: Incident commander (senior on-call or on-call manager)
Include in page: List of all firing alerts, onset time, recent changes (deployments, infra), whether cloud provider status shows issues
Edge Cases¶
- Alert fires in test/staging but not production: Thresholds may be configured identically but traffic patterns differ. Lower environments may fire on trivial load. Scope alert labels to env.
- Alert fires every day at the same time: Traffic-correlated threshold. Either raise the threshold, switch to relative alerts (percent change), or use time-of-day based alerting inhibition.
- Prometheus target is down, but service is up: The scrape endpoint may be broken (wrong port, broken /metrics handler). The alert may be a monitoring gap, not a service failure. Check the up metric: `up{job="<svc>"}`.
- Alertmanager shows 0 alerts but Prometheus shows ALERTS firing: Alertmanager and Prometheus may be misconfigured to not communicate. Check `kubectl logs -n monitoring deploy/alertmanager` for "no Prometheus connected" errors.
- Alert inhibited by another rule: Alertmanager inhibition rules can hide related alerts during major incidents. Check `kubectl get secret alertmanager-config -n monitoring -o yaml` for inhibit_rules.
Cross-References¶
- Topic Packs: observability-deep-dive, prometheus-deep-dive, alerting-rules, k8s-ops
- Runbooks: prometheus_target_down.md, loki_no_logs.md