
Runbook: Alert Storm (Flapping / Too Many Alerts)

Domain: Observability
Alert: >20 alerts firing simultaneously, or a PagerDuty storm of rapid successive pages
Severity: P1 (immediate operational impact)
Est. Resolution Time: 20-45 minutes
Escalation Timeout: 30 minutes — page if not resolved
Last Tested: 2026-03-19
Prerequisites: Prometheus/Alertmanager access, kubectl access, PagerDuty or on-call tool access

Quick Assessment (30 seconds)

# Run this first — it tells you the scope of the problem
kubectl port-forward -n monitoring svc/alertmanager-operated 9093:9093 &
# Then open http://localhost:9093 to see all firing alerts grouped by common labels
If output shows: All alerts share a common label (same node, namespace, or cluster) → You have a root cause — continue to Step 1 (do not fix individual alerts).
If output shows: Alerts come from many different services with no common label → This may be a cascade from a shared dependency (network, storage, DNS) — continue to Step 1 and look harder for the common thread.
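The common-label check can be mechanized against the JSON the Alertmanager v2 API returns. A minimal sketch, assuming the `/api/v2/alerts` payload shape (a list of alert objects with a `labels` map) — the sample data below is made up for illustration; in practice pipe `curl -s http://localhost:9093/api/v2/alerts` into a script like this:

```python
# Sketch: find the (label, value) pair shared by the most firing alerts.
# Sample data mimics Alertmanager's /api/v2/alerts response shape.
import json
from collections import Counter

sample = json.loads("""
[
  {"labels": {"alertname": "PodCrashLooping", "node": "ip-10-0-1-45", "namespace": "production"}},
  {"labels": {"alertname": "HighMemory",      "node": "ip-10-0-1-45", "namespace": "production"}},
  {"labels": {"alertname": "SlowQueries",     "node": "ip-10-0-1-45", "namespace": "payments"}},
  {"labels": {"alertname": "TargetDown",      "node": "ip-10-0-2-07", "namespace": "staging"}}
]
""")

def common_labels(alerts, top=3):
    """Count (label, value) pairs across all firing alerts; a pair that
    appears on nearly every alert marks the likely blast radius."""
    counts = Counter()
    for alert in alerts:
        for key, value in alert["labels"].items():
            if key != "alertname":  # alertname differs per symptom, skip it
                counts[(key, value)] += 1
    return counts.most_common(top)

for (key, value), n in common_labels(sample):
    print(f"{n}/{len(sample)} alerts share {key}={value}")
```

With the sample above, `node=ip-10-0-1-45` dominates — the same signal the manual grep in Step 1 is looking for.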

Step 1: Identify the Root Cause by Finding the Common Label Across Alerts

Why: An alert storm is almost never 20 independent problems happening simultaneously. It is almost always one infrastructure failure (node down, network partition, DNS failure) causing dozens of downstream symptoms. Finding the shared label identifies the blast radius and points to the real fix.

# Query all currently firing alerts and extract their labels
curl -s http://localhost:9093/api/v2/alerts | python3 -m json.tool | grep -E '"alertname"|"node"|"namespace"|"cluster"|"instance"'

# Alternatively, check the Prometheus alerts page
curl -s 'http://localhost:9090/api/v1/alerts' | python3 -m json.tool | grep -E '"state": "firing"' -A10

# Check if a recent deployment triggered the storm
kubectl get events -n <NAMESPACE> --sort-by='.lastTimestamp' | tail -30
Expected output:
"alertname": "PodCrashLooping",
"node": "ip-10-0-1-45.ec2.internal",    ← all alerts share this node label
"namespace": "production"
If this fails: If alerts lack common labels, check for a common time of onset — all alerts starting at the same minute suggests a deployment, cron job, or external event triggered them.
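The common-onset check can also be scripted. Each alert object in the v2 API carries a `startsAt` RFC 3339 timestamp; bucketing those by minute shows whether one event triggered the storm. A sketch with made-up timestamps:

```python
# Sketch: bucket alert start times by minute; one dominant bucket suggests
# a single triggering event (deploy, cron job, infra failure).
# Timestamps are illustrative; pull real ones from /api/v2/alerts "startsAt".
from collections import Counter
from datetime import datetime

starts = [
    "2026-03-19T14:02:11Z",
    "2026-03-19T14:02:40Z",
    "2026-03-19T14:02:55Z",
    "2026-03-19T13:15:03Z",   # an older alert that predates the storm
]

def onset_minutes(timestamps):
    """Return (minute, count) pairs, most frequent first."""
    minutes = Counter(
        datetime.fromisoformat(ts.replace("Z", "+00:00")).strftime("%H:%M")
        for ts in timestamps
    )
    return minutes.most_common()

print(onset_minutes(starts))  # 14:02 dominates → look for what changed at 14:02
```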

Step 2: Silence Non-Critical Alerts in Alertmanager While Investigating

Why: When 20+ alerts are firing, the noise makes it impossible to reason about what matters. Silencing the symptom alerts (while preserving root cause alerts) gives you time and clarity to fix the actual problem without being interrupted repeatedly.

# Port-forward to Alertmanager if not already done
kubectl port-forward -n monitoring svc/alertmanager-operated 9093:9093 &

# Add a silence for the common label (e.g., all alerts on a specific node)
# Using amtool (install: go install github.com/prometheus/alertmanager/cmd/amtool@latest)
amtool --alertmanager.url=http://localhost:9093 silence add \
  node="<PROBLEM_NODE>" \
  --comment "Alert storm investigation — root cause: <SUSPECTED_CAUSE>" \
  --duration 1h \
  --author "<YOUR_NAME>"

# Or silence by namespace
amtool --alertmanager.url=http://localhost:9093 silence add \
  namespace="<PROBLEM_NAMESPACE>" \
  --comment "Alert storm investigation" \
  --duration 1h \
  --author "<YOUR_NAME>"
Expected output:
Created silence <SILENCE_UUID>
CRITICAL: Do NOT silence the root cause alert (e.g., NodeNotReady, KubeNodeNotReady). Only silence downstream symptom alerts. If you silence the root cause, you lose visibility into whether the fix worked.
If this fails: If amtool is not installed, use the Alertmanager UI at http://localhost:9093/#/silences to create a silence manually.
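A third fallback is Alertmanager's HTTP API itself: silences can be created by POSTing to `/api/v2/silences`. A minimal sketch of building the payload — field names follow the v2 silence schema; the matcher value and author are placeholders, not real values:

```python
# Sketch: build a silence payload for POST /api/v2/silences.
# Node value and author are placeholders.
import json
from datetime import datetime, timedelta, timezone

def build_silence(label, value, author, comment, hours=1):
    """Return a dict in the shape Alertmanager's v2 silence endpoint expects."""
    now = datetime.now(timezone.utc)
    return {
        "matchers": [{"name": label, "value": value, "isRegex": False}],
        "startsAt": now.isoformat(),
        "endsAt": (now + timedelta(hours=hours)).isoformat(),
        "createdBy": author,
        "comment": comment,
    }

payload = build_silence("node", "<PROBLEM_NODE>", "<YOUR_NAME>",
                        "Alert storm investigation")
print(json.dumps(payload, indent=2))
# Then POST it, e.g.:
#   curl -s -X POST -H 'Content-Type: application/json' \
#     -d "$PAYLOAD" http://localhost:9093/api/v2/silences
```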

Step 3: Find the Triggering Event

Why: The storm started at a specific moment. Identifying what changed at that moment (a deployment, a config change, a cron job, an infrastructure event) tells you what to roll back or fix.

# Check for recent deployments across all namespaces
kubectl get events --all-namespaces --sort-by='.lastTimestamp' | grep -E 'Scaled|Updated|Deployed|Started' | tail -20

# Check for node events
kubectl get events --all-namespaces --sort-by='.lastTimestamp' | grep -E 'NotReady|Evicted|OOM|DiskPressure' | tail -20

# Check recent Helm releases
helm list --all-namespaces | head -20
helm history <RELEASE_NAME> -n <NAMESPACE> | tail -5

# Check if there was a cloud infrastructure event
# (Check your cloud provider's status page or health dashboard)
Expected output:
NAMESPACE     LAST SEEN   REASON           OBJECT                  MESSAGE
production    2m          ScalingReplicaSet deploy/myapp-v2-broken  Scaled down replica set ...
If this fails: If no obvious event is found, check the cloud provider status page and the network/storage layers — alert storms with no recent deployment often trace to infrastructure events.

Step 4: Confirm Alerts Are Symptom vs. Cause

Why: Fixing 15 symptom alerts individually (pod restarts, high memory, slow queries) when they all stem from a single node failure is a waste of time and can make the situation worse. You must fix the root cause.

# Example: if a node is down, check it directly
kubectl get nodes
kubectl describe node <PROBLEM_NODE> | tail -30

# If a namespace has many pod failures, check for a resource quota or limit issue
kubectl describe namespace <NAMESPACE> | grep -A10 'Resource Quotas'

# Check if there is a single failing service all others depend on (e.g., database, message queue, config service)
kubectl get pods -n <NAMESPACE> -o wide | grep -v Running
Expected output (a clear root cause):
NAME                STATUS     ROLES
ip-10-0-1-45...     NotReady   <none>     ← node is down; all pod alerts from this node are symptoms
If this fails: If you cannot identify the root cause within 15 minutes, escalate immediately — see the escalation table below. Do not continue debugging symptoms in isolation.
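Once a root-cause label is in hand (e.g., the node from Step 1), the symptom-vs-cause split can be stated mechanically: alerts carrying that label are downstream symptoms (silence candidates, except the root-cause alert itself), and anything left over needs separate investigation. A sketch with illustrative alert names:

```python
# Sketch: partition firing alerts by a suspected root-cause label.
# Alert names and node values are illustrative.
alerts = [
    {"alertname": "KubeNodeNotReady", "node": "ip-10-0-1-45"},  # root cause — never silence
    {"alertname": "PodCrashLooping",  "node": "ip-10-0-1-45"},
    {"alertname": "HighMemory",       "node": "ip-10-0-1-45"},
    {"alertname": "TargetDown",       "node": "ip-10-0-2-07"},
]

def partition(alerts, label, value):
    """Split alerts into those explained by the root-cause label and the rest."""
    explained = [a for a in alerts if a.get(label) == value]
    unrelated = [a for a in alerts if a.get(label) != value]
    return explained, unrelated

explained, unrelated = partition(alerts, "node", "ip-10-0-1-45")
print(f"{len(explained)} alerts explained by the node failure, "
      f"{len(unrelated)} need a second look")
```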

Step 5: Fix the Root Cause Using the Appropriate Runbook

Why: The alert storm does not resolve until the underlying failure is resolved. All individual alert runbooks assume you are dealing with an isolated incident — in a storm scenario, the root cause runbook takes priority.

# Example: if a node is unhealthy, drain and replace it
kubectl drain <PROBLEM_NODE> --ignore-daemonsets --delete-emptydir-data
# Then trigger node replacement through your cloud provider or cluster autoscaler

# Example: if a deployment is bad, roll it back
kubectl rollout undo deployment/<DEPLOYMENT_NAME> -n <NAMESPACE>
kubectl rollout status deployment/<DEPLOYMENT_NAME> -n <NAMESPACE>

# Example: if DNS is failing
kubectl get pods -n kube-system -l k8s-app=kube-dns
kubectl rollout restart deployment/coredns -n kube-system
Expected output (recovery):
deployment.apps/<DEPLOYMENT_NAME> successfully rolled out
If this fails: Escalate — see below. Do not attempt to fix multiple root causes simultaneously; if the storm has more than one root cause, handle them sequentially.

Step 6: Expire the Silence After the Fix and Verify the Alert Count Drops

Why: Silences are temporary but should be removed as soon as the fix is confirmed. Leaving old silences in place hides future alerts on the same label, creating blind spots.

# List active silences to find the one you created
amtool --alertmanager.url=http://localhost:9093 silence query

# Expire the silence early (replace <SILENCE_UUID> with the UUID from Step 2)
amtool --alertmanager.url=http://localhost:9093 silence expire <SILENCE_UUID>

# Verify alerts are draining — count should drop below 5 within 5 minutes of the fix
curl -s http://localhost:9093/api/v2/alerts | python3 -m json.tool | grep '"alertname"' | wc -l
Expected output:
# silence query: zero active silences (or only ones created for other purposes)
# alert count: drops from 20+ to 0-3 within one scrape cycle (30-60 seconds)
If this fails: If alert count stays high after the fix and silence is expired, the root cause is not fully resolved — re-examine Steps 4 and 5.

Verification

# Confirm the alert storm has cleared
curl -s http://localhost:9093/api/v2/alerts | python3 -m json.tool | grep '"state": "active"' | wc -l
Success looks like: Alert count returns to baseline (typically 0-2 in a healthy system), PagerDuty stops paging. If still broken: Escalate — see below.

Escalation

Condition: Not resolved in 30 min
Who to Page: Platform / SRE on-call lead
What to Say: "Alert storm: <N> alerts firing, root cause not yet identified; investigated common labels <LABEL_1>, <LABEL_2>"

Condition: Data loss suspected
Who to Page: DBA / Data Lead
What to Say: "Possible data loss during alert storm: services were down for <N> minutes; database may have received incomplete writes"

Condition: Scope expanding
Who to Page: Platform team
What to Say: "Alert storm spreading to additional namespaces/clusters; suspected cause: <SUSPECTED_CAUSE>; current blast radius: <BLAST_RADIUS>"

Post-Incident

  • Update monitoring if alert was noisy or missing
  • File postmortem if P1/P2
  • Update this runbook if steps were wrong or incomplete
  • Add Alertmanager inhibition rules so that when a node is down, downstream pod alerts are automatically suppressed
  • Review alert grouping configuration — alerts should be grouped by root cause, not by symptom
  • Set up a "high alert volume" meta-alert: if more than 10 alerts fire in 2 minutes, page the on-call lead directly
  • Communicate the storm timeline to affected teams after resolution
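Two of the follow-ups above can be sketched concretely. Assuming the default kube-prometheus-stack alert names (KubeNodeNotReady, KubePodCrashLooping, KubePodNotReady — adjust to your environment), an inhibition rule for alertmanager.yml and a meta-alert rule for Prometheus might look like:

```yaml
# alertmanager.yml — suppress pod-level symptom alerts on a node that is
# already known to be down (alert names assumed; both alerts must carry
# a matching "node" label for the inhibition to apply)
inhibit_rules:
  - source_matchers:
      - alertname = KubeNodeNotReady
    target_matchers:
      - alertname =~ "KubePodCrashLooping|KubePodNotReady"
    equal: ['node']

# prometheus rule file — meta-alert that pages when alert volume spikes
groups:
  - name: alert-storm-meta
    rules:
      - alert: AlertStormDetected
        expr: count(ALERTS{alertstate="firing"}) > 10
        for: 2m
        labels:
          severity: critical
        annotations:
          summary: "More than 10 alerts firing simultaneously — possible alert storm"
```

These are sketches under the stated assumptions, not drop-in config: the `severity: critical` routing label and the thresholds should match whatever your routing tree and paging policy already use.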

Common Mistakes

  1. Trying to fix each alert independently instead of finding the root cause: This is the most damaging mistake during a storm. You will spend 45 minutes fixing symptoms while the root cause continues generating new ones. Always look for the common label first.
  2. Silencing the root cause alert: Silencing NodeNotReady to reduce noise means you lose the signal that the node is still down. Only silence symptom alerts (pod crashes, high latency) that are downstream of the known root cause.
  3. Not communicating to the team during the storm: An alert storm is visible to everyone in your organization who is on-call or watching dashboards. Post a status update in Slack/incident channel within the first 5 minutes so others are not duplicating your investigation.
  4. Forgetting to expire silences after the fix: A silence left active for 23 more hours means you will miss real alerts on those same labels. Always expire silences explicitly after confirming resolution.

Cross-References

  • Topic Pack: Alertmanager Configuration and Routing (deep background on routing trees, inhibition rules, silences, and grouping)
  • Related Runbook: prometheus-target-down.md — a common root cause for alert storms is a cluster-wide scrape failure
  • Related Runbook: grafana-blank.md — alert storms are often accompanied by blank dashboards if the root cause affects the monitoring stack
