
Decision Tree: Alert Fired — Is This Real?

Category: Incident Triage
Starting Question: "An alert fired — is this a real incident or noise?"
Estimated traversal: 2-4 minutes
Domains: observability, kubernetes, linux-performance


The Tree

An alert fired — is this a real incident or noise?
├── What is the alert severity?
│   ├── Critical / Page
│   │   └── Treat as real until proven otherwise.
│   │       Acknowledge within SLA, then validate → continue below
│   ├── Warning
│   │   └── Validate before escalating → continue below
│   └── Info / Watchdog
│       └── Informational → log it, no action needed unless part of a pattern
├── Is the metric value actually above threshold right now?
│   (Go directly to the metric in Prometheus/Grafana → do not trust the alert label alone)
│   └── Look up the metric: copy the PromQL from the alert rule and run it
│       `kubectl exec -n monitoring deploy/prometheus -- \`
│       `  promtool query instant http://localhost:9090 '<promql-from-rule>'`
│       ├── Value is below threshold → alert is already recovering
│       │   ├── Was it a spike? Check the 30-min graph
│       │   │   └── Yes, brief spike → ACTION: Snooze and Investigate Threshold
│       │   └── Value never rose (stale firing alert)?
│       │       `kubectl get prometheusrule -A | grep <alert-name>`
│       │       └── ACTION: Check Alertmanager for Stale Firing State
│       └── Value is above threshold → alert is real → continue
├── How long has the metric been above threshold?
│   (Check the Grafana time series → zoom out to the last 2 hours)
│   ├── < 5 minutes (very recent)
│   │   └── Watch it for 2 more minutes before declaring a full incident
│   │       `watch -n 15 'kubectl top pods -n <namespace>'`
│   ├── 5-30 minutes (sustained)
│   │   └── Real issue → proceed to impact assessment below
│   └── > 30 minutes (long-duration)
│       └── Likely already user-impacting → escalate immediately, investigate in parallel
│           ⚠️ ESCALATION: On-Call Engineer
├── Is this a systemic issue? (Are multiple alerts firing simultaneously?)
│   Check Alertmanager: `kubectl exec -n monitoring deploy/alertmanager -- \`
│   `  amtool alert query --alertmanager.url=http://localhost:9093`
│   ├── Yes → many alerts across multiple services/nodes
│   │   └── Is there a recent change? (deployment, config, infra)
│   │       `kubectl get events --all-namespaces --sort-by=.lastTimestamp | tail -30`
│   │       ├── Yes → likely one root cause → treat as major incident
│   │       │   ⚠️ ESCALATION: Incident Commander
│   │       └── No recent change → external cause (upstream dependency, cloud provider)
│   │           Check the cloud provider status page + your dependency status pages
│   │           ⚠️ ESCALATION: Incident Commander + vendor contact
│   └── No → isolated single alert → likely a specific service issue
│       → continue to the user impact check
├── Is the symptom user-visible?
│   ├── Check synthetic monitoring / uptime probes
│   │   `kubectl get probe -A` (Prometheus Blackbox)
│   │   or check your external uptime monitor dashboard
│   │   ├── External probes showing errors → user-impacting → declare incident
│   │   │   ⚠️ ESCALATION: On-Call + Stakeholder Notification
│   │   └── External probes healthy → internal issue, not yet user-facing
│   │       Log it, start investigating, set a watchdog
│   └── No uptime monitoring available
│       └── Check the error rate in metrics: `rate(http_requests_total{status=~"5.."}[5m])`
│           └── Error rate elevated → treat as user-impacting
├── Is this alert known to be flapping?
│   Look at alert history: Grafana → Alerting → Alert History for this rule
│   ├── Fired >3 times in 24h with short durations
│   │   ├── Threshold is too tight for normal variance
│   │   │   └── ACTION: Fix Alert Threshold / Add for Duration
│   │   └── Underlying issue is real but intermittent
│   │       └── ACTION: Fix Root Cause / Add Pending Duration to Alert
│   └── First time firing → not a flapping issue
└── Is this during a maintenance window or scheduled event?
    Check your change calendar / on-call handoff notes
    ├── Yes → alert is expected → ACTION: Silence Alert for Maintenance Window
    └── No → not expected → investigate normally

Node Details

Check 1: Verify the metric value directly

Command: In the Prometheus UI, navigate to Graph, paste the PromQL from the alert rule, and click Execute. Or via the CLI:
`kubectl exec -n monitoring deploy/prometheus -- wget -qO- 'http://localhost:9090/api/v1/query?query=<encoded-promql>' | jq '.data.result'`

What you're looking for: The actual current value, not just the alert label. An alert may still show as "firing" after the metric has recovered if Alertmanager hasn't received the resolution yet.

Common pitfall: Alertmanager continues to show an alert as firing for up to resolve_timeout (default: 5 min) after the metric drops below the threshold. Do not assume the metric is still high just because the alert is still showing.
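If you are scripting the query, the PromQL must be URL-encoded before it goes into the API URL. A minimal sketch using `jq`'s `@uri` filter; the PromQL expression and the port-forwarded address are placeholders, substitute your own:

```shell
# Build a URL-encoded Prometheus instant-query URL.
# The PromQL below is a placeholder; paste the expression from your alert rule.
PROMQL='up{job="my-service"} == 0'
ENCODED=$(jq -rn --arg q "$PROMQL" '$q | @uri')
URL="http://localhost:9090/api/v1/query?query=${ENCODED}"
echo "$URL"
# Then fetch it from inside the cluster, e.g.:
#   kubectl exec -n monitoring deploy/prometheus -- wget -qO- "$URL" | jq '.data.result'
```

`--data-urlencode` on `curl -G` achieves the same thing if `curl` is available in the container.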

Check 2: Multiple alerts firing simultaneously

Command: `kubectl exec -n monitoring deploy/alertmanager -- amtool alert query --alertmanager.url=http://localhost:9093 | head -40`, or use the Alertmanager UI directly.

What you're looking for: Alerts from multiple different services or nodes firing within the same 5-minute window. This pattern indicates a shared dependency or infrastructure event rather than an isolated service bug.

Common pitfall: Alert deduplication hides grouped alerts. In the Alertmanager UI, expand all alert groups before concluding "only one alert is firing".
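To see at a glance whether the firings cluster on one service or span many, the JSON output can be grouped by label. A sketch over made-up sample data standing in for `amtool alert query -o json` (verify the output flag your amtool version supports; the alert names and namespaces are invented):

```shell
# Count firing alerts per namespace. SAMPLE is fabricated for illustration;
# in a real cluster, pipe in the JSON from:
#   kubectl exec -n monitoring deploy/alertmanager -- \
#     amtool alert query -o json --alertmanager.url=http://localhost:9093
SAMPLE='[{"labels":{"alertname":"HighCPU","namespace":"payments"}},
         {"labels":{"alertname":"PodCrashLoop","namespace":"payments"}},
         {"labels":{"alertname":"DiskFull","namespace":"infra"}}]'
echo "$SAMPLE" | jq -r 'group_by(.labels.namespace)[]
                        | "\(.[0].labels.namespace): \(length) firing"'
```

Several namespaces each with multiple firing alerts is the "systemic issue" signal; a single namespace with one alert points at an isolated service problem.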

Check 3: Alert history / flapping

Command: In Grafana: Alerting → Alert History (search for the rule name). In Prometheus: `ALERTS{alertname="<name>"}` shows currently firing alerts. For history, check your Alertmanager webhook receiver logs or long-term alerting storage.

What you're looking for: Patterns like "fires for 2 min, recovers, fires again" on a daily cycle. This is textbook threshold-too-tight behavior.

Common pitfall: An alert that fires briefly every morning at 9am is not random — it correlates with daily traffic peaks. Adjust the threshold or add `for: 10m` to require sustained breaches.
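The ">3 times in 24h" flapping judgment can be mechanized once you have firing timestamps, e.g. extracted from your webhook receiver's logs. A sketch over made-up epoch timestamps (the cutoff of 3 firings mirrors the tree above):

```shell
# Count firings within the last 24h. FIRINGS is fabricated sample data
# (epoch seconds, one firing per line); NOW would normally be $(date +%s).
FIRINGS='1700000000
1700003600
1700010000
1700050000
1700090000'
NOW=1700090000
COUNT=$(echo "$FIRINGS" | awk -v now="$NOW" 'now - $1 <= 86400 { n++ } END { print n+0 }')
echo "fired ${COUNT} times in 24h"
if [ "$COUNT" -gt 3 ]; then
  echo "flapping: review the threshold or add a for: duration"
fi
```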

Check 4: Is symptom user-visible?

Command: Blackbox exporter probes: `kubectl get servicemonitor -n monitoring | grep blackbox`. Check the probe_success metric: `probe_success{job="blackbox"}`. External check: `curl -w "%{http_code}" -o /dev/null -s https://<your-service>/healthz`.

What you're looking for: `probe_success = 0` or HTTP status != 200 from external probes confirms user-facing impact.

Common pitfall: Internal healthcheck endpoints often pass even when the service is degraded (they don't test end-to-end functionality). A service returning 200 on /healthz can still be returning 500 on /api/v1/orders.
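Reading the raw query response, failing probes can be listed directly. A sketch where SAMPLE imitates the shape of the Prometheus `/api/v1/query` response for `probe_success{job="blackbox"}` (the instance URLs are invented):

```shell
# List probes reporting failure (probe_success == 0). SAMPLE is made-up data;
# in practice, pipe in the response from the query API instead.
SAMPLE='{"data":{"result":[
  {"metric":{"instance":"https://shop.example.com"},"value":[1700000000,"1"]},
  {"metric":{"instance":"https://api.example.com"},"value":[1700000000,"0"]}
]}}'
echo "$SAMPLE" | jq -r '.data.result[]
                        | select(.value[1] == "0")
                        | "DOWN: \(.metric.instance)"'
```

Any `DOWN:` line from an external probe is confirmation of user-facing impact.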

Check 5: Recent changes

Command: `kubectl get events --all-namespaces --sort-by=.lastTimestamp | tail -20` and `kubectl rollout history deployment --all-namespaces | grep -v "<none>"`. Also check your CI/CD system for recent deploys.

What you're looking for: Deployments, ConfigMap changes, certificate rotations, and HPA scaling events in the last 30-60 minutes that coincide with the alert onset.

Common pitfall: Helm upgrades that only change values (not the image) don't show up in `kubectl rollout history`. Check `helm history <release>` as well.
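To scope the event list to the alert window programmatically, the JSON output can be filtered by age. A sketch with fabricated sample events standing in for `kubectl get events --all-namespaces -o json` (NOW is pinned so the example is reproducible; use `$(date -u +%s)` in practice):

```shell
# Keep only events from the last 60 minutes. SAMPLE is made-up data imitating
# the kubectl events JSON; only the fields used here are included.
SAMPLE='{"items":[
  {"lastTimestamp":"2023-11-14T22:00:00Z","reason":"ScalingReplicaSet","message":"Scaled up replica set payments-7d9f"},
  {"lastTimestamp":"2023-11-14T10:00:00Z","reason":"Pulled","message":"Container image pulled"}
]}'
NOW=1700000000
echo "$SAMPLE" | jq -r --argjson now "$NOW" '
  .items[]
  | select(($now - (.lastTimestamp | fromdateiso8601)) <= 3600)
  | "\(.lastTimestamp)  \(.reason): \(.message)"'
```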


Terminal Actions

Action: Acknowledge and Investigate

Do:
1. Acknowledge the alert in PagerDuty / OpsGenie / Alertmanager to silence further pages
2. Open the alert's runbook link (every critical alert should have one)
3. Collect initial data: time of onset, affected services, current metric value
4. Post in the incident channel: "Investigating [alert name] — [current status]"

Verify: Alert acknowledged. Investigation underway with initial findings posted.

Action: Snooze and Investigate Threshold

Do:
1. In the Alertmanager UI: create a silence for this alert for 24h to prevent repeat pages
2. `kubectl annotate --overwrite prometheusrule <rule-name> -n monitoring note="threshold under review"`
3. Check whether `for: <duration>` is set in the alert rule — adding `for: 5m` eliminates spikes
4. Review normal metric variance over 7 days to set a better threshold

Verify: Alert no longer fires for brief spikes. Document the change in your runbook.

Runbook: prometheus_target_down.md

Action: Fix Alert Threshold / Add for Duration

Do:
1. `kubectl get prometheusrule <name> -n monitoring -o yaml > /tmp/rule-backup.yaml`
2. Edit the saved copy to add or increase `for: 5m` (require a sustained breach, not just a spike)
3. Or adjust the threshold expression: `> 0.95` → `> 0.98`
4. `kubectl apply -f /tmp/rule-backup.yaml`
5. Monitor for 24h to confirm it doesn't fire spuriously

Verify: Alert does not fire on brief spikes. Still fires on sustained issues.
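For reference, a hypothetical PrometheusRule fragment with the `for:` clause in place; the alert name, expression, and threshold are placeholders to adapt to your rule:

```yaml
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: example-error-rate        # placeholder name
  namespace: monitoring
spec:
  groups:
    - name: example.rules
      rules:
        - alert: HighErrorRate
          expr: |
            sum(rate(http_requests_total{status=~"5.."}[5m]))
              / sum(rate(http_requests_total[5m])) > 0.05
          for: 5m                  # breach must be sustained for 5 minutes
          labels:
            severity: warning
          annotations:
            summary: "5xx error rate above 5% for 5 minutes"
```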

Action: Fix Root Cause / Add Pending Duration

Do:
1. Investigate the underlying intermittent issue (check logs, events, and metrics around each firing)
2. Fix the root cause (connection leak, retry storm, etc.) — this is the priority
3. As an interim measure: add `for: 10m` to the alert rule to require a sustained breach before paging

Verify: Alert stops flapping. Root cause no longer produces the metric spikes.

Action: Silence Alert for Maintenance Window

Do:
1. In the Alertmanager UI: Silences → New Silence
2. Set matchers: alertname=<name>, and optionally scope to specific labels
3. Set the start/end time to cover the maintenance window + a 15-min buffer
4. Add a comment: "Maintenance window — [change ticket ID]"

Verify: Alert fires in Prometheus but is silenced in Alertmanager. Remove the silence after maintenance.
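The same silence can be created from the CLI. A dry-run sketch that only composes and prints the `amtool` command so it can be reviewed first; the alert name and ticket ID are placeholders, and the duration covers a 2h window plus the 15-min buffer:

```shell
# Compose (but do not run) an amtool silence for a maintenance window.
ALERT='HighErrorRate'            # placeholder alert name
TICKET='CHG-1234'                # placeholder change ticket
CMD="amtool silence add alertname=${ALERT} \
  --alertmanager.url=http://localhost:9093 \
  --duration=2h15m \
  --comment='Maintenance window - ${TICKET}'"
echo "$CMD"
# Review the printed command, then run it by hand (or via: eval "$CMD")
```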

Action: Check Alertmanager for Stale Firing State

Do:
1. `kubectl exec -n monitoring deploy/alertmanager -- amtool alert query --alertmanager.url=http://localhost:9093 --filter='alertname=<name>'`
2. If the alert is listed but the metric is below threshold, it will auto-resolve after resolve_timeout
3. To force resolution: delete the silence or wait; do not manually delete alerts

Verify: Alert clears in Alertmanager within 5 minutes of metric recovery.

Escalation: On-Call Engineer

When: Alert is sustained >10 min, the metric is above threshold, and the symptom may be user-facing.

Who: On-call SRE per the rotation in PagerDuty/OpsGenie.

Include in page: Alert name, duration, current metric value, whether user-facing probes are affected, last deployment time.

Escalation: Incident Commander

When: Multiple services alerting simultaneously, user impact confirmed, no obvious single root cause.

Who: Incident commander (senior on-call or on-call manager).

Include in page: List of all firing alerts, onset time, recent changes (deployments, infra), whether the cloud provider status page shows issues.


Edge Cases

  • Alert fires in test/staging but not production: Thresholds may be configured identically but traffic patterns differ. Lower environments may fire on trivial load. Scope alert labels to env.
  • Alert fires every day at the same time: Traffic-correlated threshold. Either raise the threshold, switch to relative alerts (percent change), or use time-of-day based alerting inhibition.
  • Prometheus target is down, but service is up: The scrape endpoint may be broken (wrong port, broken /metrics handler). The alert may be a monitoring gap, not a service failure. Check the up metric: `up{job="<svc>"}`.
  • Alertmanager shows 0 alerts but Prometheus shows ALERTS firing: Prometheus may be misconfigured and not sending alerts to Alertmanager. Check the `alerting:` section of the Prometheus config and `kubectl logs -n monitoring deploy/prometheus` for notification errors.
  • Alert inhibited by another rule: Alertmanager inhibition rules can hide related alerts during major incidents. Check `kubectl get secret alertmanager-config -n monitoring -o yaml` for inhibit_rules.

Cross-References