Pattern: STP Disabled + Loop Created
ID: FP-039 | Family: Configuration Landmine | Frequency: Uncommon | Blast Radius: Cluster-Wide | Detection Difficulty: Obvious (but chaotic)
The Shape
Spanning Tree Protocol (STP) prevents Layer 2 loops by selectively blocking redundant network paths. When STP is disabled for performance reasons (faster convergence, simpler config) and a physical or logical loop is later introduced (wrong cable, misconfigured trunk), flooded frames circulate indefinitely, because Ethernet frames carry no TTL to expire them. Each switch floods broadcast, multicast, and unknown-unicast frames out of every port except the one they arrived on; in a meshed topology the copies multiply at every hop, and within seconds the network is saturated. All devices on the VLAN become unreachable. This is a broadcast storm.
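The flooding behavior described above can be sketched with a toy simulation. This is a minimal model, not a faithful switch implementation: the topologies `RING` and `MESH`, the function names, and the breadth-first tree used as a stand-in for STP's real root election are all assumptions made for illustration. It ignores hosts, MAC learning, and link capacity, and tracks only one broadcast frame.

```python
from collections import deque


def build_neighbors(links):
    """Map each switch to the switches it is directly cabled to."""
    neighbors = {}
    for a, b in links:
        neighbors.setdefault(a, []).append(b)
        neighbors.setdefault(b, []).append(a)
    return neighbors


def flood_counts(links, origin, hops):
    """Frames in flight after each hop for one broadcast sent by `origin`.

    Model: every switch re-floods a received frame out of every link
    except the one it arrived on; nothing ever expires the frame.
    """
    neighbors = build_neighbors(links)
    # A frame in flight is represented as (sending switch, receiving switch).
    in_flight = [(origin, peer) for peer in neighbors.get(origin, [])]
    counts = [len(in_flight)]
    for _ in range(hops - 1):
        in_flight = [
            (dst, peer)
            for src, dst in in_flight
            for peer in neighbors[dst]
            if peer != src
        ]
        counts.append(len(in_flight))
    return counts


def bfs_tree(links, root):
    """Loop-free subset of links (a BFS tree), standing in for what STP leaves unblocked."""
    neighbors = build_neighbors(links)
    seen, tree, queue = {root}, [], deque([root])
    while queue:
        node = queue.popleft()
        for peer in neighbors[node]:
            if peer not in seen:
                seen.add(peer)
                tree.append((node, peer))
                queue.append(peer)
    return tree


# Hypothetical topologies: a three-switch ring and a four-switch full mesh.
RING = [("A", "B"), ("B", "C"), ("C", "A")]
MESH = [("A", "B"), ("A", "C"), ("A", "D"), ("B", "C"), ("B", "D"), ("C", "D")]

print(flood_counts(RING, "A", 5))                 # [2, 2, 2, 2, 2]
print(flood_counts(MESH, "A", 5))                 # [3, 6, 12, 24, 48]
print(flood_counts(bfs_tree(MESH, "A"), "A", 5))  # [3, 0, 0, 0, 0]
```

In the ring, the two copies of a single broadcast never die; in the mesh, they double at every hop. Once the redundant links are removed from the forwarding topology, the frame is delivered once and disappears.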
How You'll See It
In Datacenter
All servers on a VLAN simultaneously lose network connectivity. Switch CPU utilization
spikes to 100% (processing the flood of broadcast frames). Switch interface counters show
explosive growth in broadcast/multicast packets. Port LEDs on all switches blink rapidly
and continuously. `ping` to anything on the affected VLAN times out. The switch's
management interface may also become unreachable if it's on the same VLAN.
In Linux/Infrastructure
On a server attached to the affected VLAN: `netstat -s` shows massive receive drops. `ip -s
link show eth0` shows RX packet and drop counters growing by up to millions per second. The NIC's interrupt load
saturates a CPU core. Applications on the server time out on all network operations.
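The counter growth described above can be turned into per-second rates with a small sampler. This is a sketch under stated assumptions: it reads the standard Linux sysfs counters under `/sys/class/net/<iface>/statistics/`, the interface name `eth0` is taken from the text, and the `rx_pps_threshold` value is a hypothetical placeholder that would need tuning against a baseline for the host in question.

```python
import time
from pathlib import Path

# Standard Linux sysfs counter names for a network interface.
COUNTERS = ("rx_packets", "rx_dropped", "rx_errors", "multicast")


def read_counters(iface, root="/sys/class/net"):
    """Read the cumulative receive counters for one interface."""
    stats = Path(root) / iface / "statistics"
    return {name: int((stats / name).read_text()) for name in COUNTERS}


def per_second(before, after, interval_s):
    """Convert two cumulative samples into per-second rates."""
    return {name: (after[name] - before[name]) / interval_s for name in before}


def looks_like_storm(rates, rx_pps_threshold=200_000):
    """Crude check; the default threshold is a hypothetical value, not a standard."""
    return rates["rx_packets"] >= rx_pps_threshold


def sample(iface="eth0", interval_s=1.0):
    """Take two samples `interval_s` apart and return the per-second rates."""
    first = read_counters(iface)
    time.sleep(interval_s)
    second = read_counters(iface)
    return per_second(first, second, interval_s)
```

Calling `sample("eth0")` on an affected host and passing the result to `looks_like_storm` gives a quick yes/no signal. A high receive rate alone does not prove a loop, so treat it as a prompt to check the switches, not as a diagnosis.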
In Kubernetes
If the affected VLAN carries pod or node traffic: all pods lose network connectivity simultaneously.
`kubectl get nodes` shows all nodes as NotReady, or times out once the API server itself is unreachable. etcd loses quorum and the control plane
becomes unresponsive. The cluster appears down even though every pod and node is physically fine.
The Tell
All devices on a VLAN lose connectivity simultaneously. Switch CPU sits at 100% with high broadcast/multicast packet counts. A physical loop exists (two ports on the same switch connected together, or a cable connecting two switches in a ring without STP). Removing a single cable or re-enabling STP restores connectivity within seconds.
Common Misdiagnosis
| Looks Like | But Actually | How to Tell the Difference |
|---|---|---|
| DDoS | Self-inflicted broadcast storm | Traffic source is internal broadcasts, not external; no external attack signature |
| Upstream network failure | Local loop | Switch CPU is high; all devices on ONE VLAN affected; upstream link OK |
| Server misconfiguration | Network loop | All servers on the VLAN are affected simultaneously |
The Fix (Generic)
- Immediate: Disconnect cables one by one until the storm stops (or shut down suspect ports from the switch CLI). Isolate the VLAN if possible.
- Short-term: Re-enable STP on affected VLANs; use RSTP (Rapid Spanning Tree Protocol), which typically converges within a few seconds rather than the 30 to 50 seconds of classic STP.
- Long-term: Enable BPDU Guard on all access ports (shuts the port down if an STP BPDU arrives on a port that should never receive one); enable Loop Guard on trunk ports; use RSTP instead of disabling STP.
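
The long-term settings above could be rendered into switch configuration with a small generator such as the one below. This is a sketch that assumes Cisco IOS-style syntax (`spanning-tree mode rapid-pvst`, `spanning-tree bpduguard enable`, `spanning-tree guard loop`); other vendors use different commands, and the function name and port names are hypothetical. Review the output against your platform's documentation before applying it.

```python
def render_stp_hardening(access_ports, trunk_ports):
    """Return Cisco IOS-style config lines (assumed syntax; verify per platform)."""
    # Run rapid spanning tree instead of disabling STP.
    lines = ["spanning-tree mode rapid-pvst"]
    # BPDU Guard on access ports: a received BPDU shuts the port down.
    for port in access_ports:
        lines += [f"interface {port}", " spanning-tree bpduguard enable"]
    # Loop Guard on trunk ports.
    for port in trunk_ports:
        lines += [f"interface {port}", " spanning-tree guard loop"]
    return lines


# Hypothetical port names, for illustration only.
config = render_stp_hardening(
    access_ports=["GigabitEthernet1/0/1", "GigabitEthernet1/0/2"],
    trunk_ports=["TenGigabitEthernet1/1/1"],
)
print("\n".join(config))
```

Generating the lines from explicit port lists keeps the access/trunk distinction visible in review, which matters because BPDU Guard on a trunk port would shut down a legitimate switch-to-switch link.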
Real-World Examples
- Example 1: Network team disabled STP on a server VLAN "for 5ms faster convergence." Six months later, a technician connected a patch cable between two ports on the same switch for a test and forgot to remove it. The resulting broadcast storm cut network connectivity to 200 servers for 18 minutes.
- Example 2: A VM was configured with a VLAN trunk interface and a bridge. The bridge inadvertently created a software loop between two network interfaces. With STP disabled on the physical switch, the loop wasn't blocked. The entire datacenter VLAN saturated within 3 seconds.
War Story
It was 2am. All 300 servers in our primary datacenter dropped off the network simultaneously. No hardware failures, no power issues. Switch CPU: 100%. I'd been on-call for 6 months and never seen this. Senior engineer on call: "broadcast storm. Find the loop." We started disconnecting cables on the core switches. On the third cable we disconnected, everything came back. Turns out a contractor had connected two ports on a ToR switch to "test the cable." STP was disabled on that VLAN (had been for 2 years, for "performance"). No one knew why anymore. We re-enabled RSTP on all VLANs the next morning. 2ms average latency increase. Worth every millisecond.
Cross-References
- Topic Packs: networking, datacenter
- Case Studies: networking/network-loop-broadcast-storm/
- Footguns: networking/footguns.md — "Disabling STP on VLANs without understanding loops"
- Related Patterns: FP-022 (dependency chain collapse — the cluster impact when the network goes down), FP-013 (simultaneous timer expiry — another "configuration removes protection")