
Incident Replay: Network Loop — Broadcast Storm

Setup

  • System context: Office network with 3 access switches and 1 distribution switch. A user plugged a rogue switch into two wall ports, creating a physical loop. Network is down.
  • Time: Monday 09:00 UTC
  • Your role: Network engineer

Round 1: Alert Fires

[Pressure cue: "Entire office floor offline. Switch CPUs at 100%. ARP tables full. Spanning tree BPDU counters skyrocketing."]

What you see: All switch ports show massive broadcast traffic. CPU utilization on all switches is at 100%. STP topology change notifications are flooding. One switch shows a port with an abnormally high BPDU count: Gi1/0/24, where a rogue device is creating a loop.

Choose your action:

  • A) Apply a quick workaround to restore service
  • B) Investigate the root cause systematically
  • C) Escalate to the vendor or upstream provider
  • D) Check if a recent change caused the issue

If you chose A:

[Result: Workaround provides temporary relief but masks the underlying issue. You will need to circle back.]

If you chose B:

[Result: Systematic investigation reveals the root cause. A user connected an unmanaged switch to two wall jacks, creating a physical loop. The access port did not have BPDU Guard or Loop Guard enabled. STP reconverged slowly because of the flood. Proceed to Round 2.]
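One way to make the systematic investigation concrete is to rank access ports by received-BPDU count and flag the outliers. The following is a minimal sketch, not the scenario's own tooling: the function name `find_bpdu_suspects`, the `min_count` threshold, the uplink port `Gi1/0/48`, and all counter values are hypothetical, and real counters would come from the switch CLI or SNMP.

```python
def find_bpdu_suspects(bpdu_rx_by_port, uplinks=frozenset(), min_count=100):
    """Return non-uplink ports with at least `min_count` received BPDUs,
    ordered from highest count to lowest."""
    suspects = {
        port: count
        for port, count in bpdu_rx_by_port.items()
        if port not in uplinks and count >= min_count
    }
    return sorted(suspects, key=suspects.get, reverse=True)


# Hypothetical counter snapshot (assumed values, for illustration only).
sample = {
    "Gi1/0/1": 0,
    "Gi1/0/2": 0,
    "Gi1/0/23": 2,
    "Gi1/0/24": 48210,  # the looped access port from the scenario
    "Gi1/0/48": 310,    # assumed uplink; it receives BPDUs in normal operation
}

print(find_bpdu_suspects(sample, uplinks={"Gi1/0/48"}))
```

Uplinks are excluded because they legitimately receive BPDUs from the distribution switch; an access port facing end users normally should not receive any.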

If you chose C:

[Result: Vendor/upstream confirms the issue is on your side. Time wasted on external coordination.]

If you chose D:

[Result: Change log review helps narrow the timeline but does not directly identify the technical cause. Partial progress.]

Round 2: First Triage Data

[Pressure cue: "Root cause identified. Apply the fix."]

What you see: A user connected an unmanaged switch to two wall jacks, creating a physical loop. The access port did not have BPDU Guard or Loop Guard enabled. STP reconverged slowly because of the flood.

Choose your action:

  • A) Apply the targeted fix
  • B) Apply the fix and verify with testing
  • C) Apply a broader fix that addresses the class of problem
  • D) Document and schedule the fix for the next maintenance window

If you chose B:

[Result: Shut down the offending port immediately. Enable BPDU Guard and storm control on all access ports. Physically locate and remove the rogue switch. Enable root guard on distribution switch ports. Service restored and verified. Proceed to Round 3.]
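The fix steps above can be sketched as generated configuration. The scenario names no vendor, so this example assumes Cisco IOS-style syntax (suggested only by the `Gi1/0/24` port naming); the function names and the 1.00% broadcast threshold are illustrative assumptions, and the right threshold depends on the site's normal traffic.

```python
def render_port_shutdown(port):
    """Config lines that administratively shut down the looped port."""
    return "\n".join([f"interface {port}", " shutdown"])


def render_access_port_hardening(ports, broadcast_level=1.0):
    """Config lines enabling BPDU Guard and broadcast storm control
    on each access port (IOS-style syntax, assumed)."""
    lines = []
    for port in ports:
        lines += [
            f"interface {port}",
            " spanning-tree portfast",
            " spanning-tree bpduguard enable",
            # Threshold is a percentage of link bandwidth; 1.00 is an assumed value.
            f" storm-control broadcast level {broadcast_level:.2f}",
            " storm-control action shutdown",
        ]
    return "\n".join(lines)


print(render_port_shutdown("Gi1/0/24"))
print(render_access_port_hardening(["Gi1/0/1", "Gi1/0/2"]))
```

Generating the lines rather than typing them per port keeps the change uniform across all access ports, which matters when applying it under incident pressure.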

If you chose A:

[Result: Fix applied but not verified. May not be complete.]

If you chose C:

[Result: Broader fix is correct long-term but takes longer to implement during an incident.]

If you chose D:

[Result: Delaying the fix extends the outage or degradation. Apply now if possible.]

Round 3: Root Cause Identification

[Pressure cue: "Service restored. Document and prevent recurrence."]

What you see: Root cause confirmed: A user connected an unmanaged switch to two wall jacks, creating a physical loop. The access port did not have BPDU Guard or Loop Guard enabled. STP reconverged slowly because of the flood.

Choose your action:

  • A) Document the fix in the runbook
  • B) Add monitoring to detect this condition
  • C) Add the fix to automation/configuration management
  • D) All of the above

If you chose D:

[Result: Documentation, monitoring, and automation all updated. Defense in depth prevents recurrence. Proceed to Round 4.]
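For the monitoring part, a broadcast storm can be detected from the change in per-port broadcast packet counters between two polls. This is a minimal sketch under stated assumptions: the counters are taken to be 64-bit inbound broadcast counters (for example IF-MIB `ifHCInBroadcastPkts`), and the 1000 pps threshold, the 30-second interval, and the sample values are hypothetical.

```python
def broadcast_pps(prev_count, curr_count, interval_s, counter_bits=64):
    """Broadcast packets per second between two counter samples,
    tolerating a single counter wrap."""
    if interval_s <= 0:
        raise ValueError("interval_s must be positive")
    delta = (curr_count - prev_count) % (2 ** counter_bits)
    return delta / interval_s


def storm_alerts(prev, curr, interval_s, threshold_pps=1000.0):
    """Map each port whose broadcast rate exceeds the threshold to its rate."""
    alerts = {}
    for port, count in curr.items():
        if port not in prev:
            continue  # no baseline yet for this port
        rate = broadcast_pps(prev[port], count, interval_s)
        if rate > threshold_pps:
            alerts[port] = rate
    return alerts


# Hypothetical samples taken 30 seconds apart.
prev = {"Gi1/0/1": 500, "Gi1/0/24": 1_000}
curr = {"Gi1/0/1": 800, "Gi1/0/24": 3_001_000}
print(storm_alerts(prev, curr, interval_s=30))
```

As the round notes, an alert like this shortens detection time but does not prevent the loop; prevention still depends on BPDU Guard and storm control being configured on the ports.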

If you chose A:

[Result: Documentation helps but relies on humans remembering to check it.]

If you chose B:

[Result: Monitoring detects faster but does not prevent.]

If you chose C:

[Result: Automation prevents recurrence but needs monitoring for edge cases.]

Round 4: Remediation

[Pressure cue: "Verify everything and close the incident."]

Actions:

  1. Verify service is functioning correctly end-to-end
  2. Verify monitoring detects the condition
  3. Update runbooks and configuration management
  4. Schedule post-mortem review
  5. Check for similar issues across the infrastructure
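Checking for similar issues across the infrastructure could be partly automated by auditing saved switch configurations for access ports that lack BPDU Guard. The sketch below assumes IOS-style configuration text with one-space-indented interface stanzas; the function names and the sample configuration are hypothetical, and it deliberately ignores any global BPDU Guard default, so it may over-report on switches that rely on one.

```python
def parse_interface_stanzas(config_text):
    """Split IOS-style config text into {interface name: [stanza lines]}."""
    stanzas = {}
    current = None
    for raw in config_text.splitlines():
        if raw.startswith("interface "):
            current = raw.split(None, 1)[1].strip()
            stanzas[current] = []
        elif current is not None and raw.startswith(" "):
            stanzas[current].append(raw.strip())
        else:
            current = None  # "!" separators or other top-level lines end a stanza
    return stanzas


def access_ports_missing_bpduguard(config_text):
    """Names of access-mode interfaces without per-port BPDU Guard."""
    return [
        name
        for name, lines in parse_interface_stanzas(config_text).items()
        if "switchport mode access" in lines
        and "spanning-tree bpduguard enable" not in lines
    ]


# Hypothetical configuration excerpt, for illustration only.
SAMPLE_CONFIG = """\
interface GigabitEthernet1/0/23
 switchport mode access
 spanning-tree portfast
 spanning-tree bpduguard enable
!
interface GigabitEthernet1/0/24
 switchport mode access
 spanning-tree portfast
!
interface GigabitEthernet1/0/48
 switchport mode trunk
!
"""

print(access_ports_missing_bpduguard(SAMPLE_CONFIG))
```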

Damage Report

  • Total downtime: Varies based on path chosen
  • Blast radius: Entire office floor offline; all 3 access switches and the distribution switch at 100% CPU
  • Optimal resolution time: 8 minutes
  • If every wrong choice was made: 45 minutes + additional damage

Cross-References