Pattern: STP Disabled + Loop Created

ID: FP-039 | Family: Configuration Landmine | Frequency: Uncommon | Blast Radius: Cluster-Wide | Detection Difficulty: Obvious (but chaotic)

The Shape

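Spanning Tree Protocol (STP) prevents Layer 2 loops by blocking redundant links so that only one active path exists between any two switches. When STP is disabled for perceived performance reasons (faster convergence, simpler config) and a physical or logical loop is later introduced (wrong cable, misconfigured trunk), broadcast, multicast, and unknown-unicast frames circulate indefinitely, because Ethernet frames carry no TTL or hop count to expire them. Each switch floods such a frame out of every port except the one it arrived on. The copies never die, every new broadcast adds to them, and with multiple redundant links they also multiply at each hop. Within seconds the network is saturated and all devices on the VLAN become unreachable. This is a broadcast storm.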
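A minimal sketch of why a loop without STP never drains, assuming an idealized model in which every switch floods each frame out of all inter-switch links except the one it arrived on, and each hop takes one tick. The function name `simulate_flood` and the three topologies are illustrative, not taken from a real network.

```python
from collections import Counter


def simulate_flood(links, origin, ticks):
    """Return the number of frame copies in flight at each tick.

    links  -- iterable of (switch_a, switch_b) cables
    origin -- switch where a single broadcast frame enters
    ticks  -- number of forwarding steps to simulate
    """
    neighbors = {}
    for a, b in links:
        neighbors.setdefault(a, set()).add(b)
        neighbors.setdefault(b, set()).add(a)

    # (sender, receiver) -> number of copies currently on that link
    in_flight = Counter({(origin, n): 1 for n in neighbors[origin]})
    history = [sum(in_flight.values())]

    for _ in range(ticks):
        nxt = Counter()
        for (src, dst), count in in_flight.items():
            # Flood out of every link except the one the frame arrived on.
            for n in neighbors[dst]:
                if n != src:
                    nxt[(dst, n)] += count
        in_flight = nxt
        history.append(sum(in_flight.values()))
    return history


tree = [("A", "B"), ("B", "C")]                      # no loop
ring = [("A", "B"), ("B", "C"), ("C", "A")]          # one loop
mesh = [("A", "B"), ("A", "C"), ("A", "D"),
        ("B", "C"), ("B", "D"), ("C", "D")]          # several loops

print(simulate_flood(tree, "A", 4))  # [1, 1, 0, 0, 0]   frame dies out
print(simulate_flood(ring, "A", 4))  # [2, 2, 2, 2, 2]   circulates forever
print(simulate_flood(mesh, "A", 4))  # [3, 6, 12, 24, 48] doubles every hop
```

In the loop-free tree the frame disappears once it reaches the edge. In the ring a single broadcast keeps two copies alive indefinitely, and in the mesh the count doubles per hop, which is why real storms saturate links within seconds.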

How You'll See It

In Datacenter

All servers on a VLAN simultaneously lose network connectivity. Switch CPU utilization spikes to 100% (processing the flood of broadcast frames). Switch error counters show explosive growth in broadcast/multicast packets. Port LEDs on all switches blink rapidly and continuously. ping to anything on the affected VLAN times out. The switch's management interface may also become unreachable if it's on the same VLAN.

In Linux/Infrastructure

On a server attached to the affected VLAN: netstat -s shows massive receive drops. ip -s link show eth0 shows the RX packet and dropped counters climbing by up to millions per second. The NIC's interrupt load saturates a CPU core. Applications on the server time out on all network operations.
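The counter growth described above can be measured by sampling the kernel's per-interface statistics twice and dividing by the interval. The sketch below assumes the standard Linux sysfs layout under `/sys/class/net/<iface>/statistics`; the helper names and the 500,000 packets-per-second threshold are hypothetical and would need tuning for a given NIC and workload.

```python
import time
from pathlib import Path

# Counter files exposed by the Linux kernel for each interface.
COUNTERS = ("rx_packets", "rx_dropped", "rx_errors", "multicast")


def read_counters(iface, root="/sys/class/net"):
    """Read the current values of the selected RX counters for one interface."""
    stats = Path(root) / iface / "statistics"
    return {name: int((stats / name).read_text()) for name in COUNTERS}


def rates(before, after, seconds):
    """Convert two counter snapshots into per-second rates."""
    return {name: (after[name] - before[name]) / seconds for name in before}


def looks_like_storm(rate, pps_threshold=500_000):
    """Heuristic only: very high RX packet rate combined with receive drops.

    pps_threshold is an assumed value, not a documented limit.
    """
    return rate["rx_packets"] >= pps_threshold and rate["rx_dropped"] > 0


def sample_rates(iface="eth0", interval=1.0):
    """Take two snapshots `interval` seconds apart and return the rates."""
    before = read_counters(iface)
    time.sleep(interval)
    after = read_counters(iface)
    return rates(before, after, interval)
```

Calling `looks_like_storm(sample_rates("eth0"))` on an affected host gives a quick yes/no signal. A high rate on its own does not prove a loop, so it should be read together with the switch-side symptoms.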

In Kubernetes

If the affected VLAN carries pod or node traffic, all pods lose network connectivity simultaneously. kubectl get nodes shows all nodes as NotReady. If the etcd members talk to each other over the same VLAN, etcd loses quorum and the control plane becomes unresponsive. The cluster looks down, yet every node and pod is physically fine.

The Tell

All devices on a VLAN lose connectivity simultaneously. Switch CPU at 100% with high broadcast/multicast packet counts. Physical loop exists (two ports on the same switch connected together, or a cable connecting two switches in a ring without STP). Removing a single cable or re-enabling STP restores connectivity instantly.

Common Misdiagnosis

Looks Like | But Actually | How to Tell the Difference
DDoS | Self-inflicted broadcast storm | Traffic source is internal broadcasts, not external; no external attack signature
Upstream network failure | Local loop | Switch CPU is high; all devices on ONE VLAN affected; upstream link OK
Server misconfiguration | Network loop | All servers on the VLAN are affected simultaneously

The Fix (Generic)

  1. Immediate: Disconnect cables one by one until the storm stops (or shut ports down administratively on the switches). Isolate the VLAN if possible.
  2. Short-term: Re-enable STP on affected VLANs; use RSTP (Rapid Spanning Tree), which reconverges far faster than classic STP's tens of seconds.
  3. Long-term: Enable BPDU Guard on all access ports (it shuts the port down if a BPDU arrives on a port that should only face end hosts); enable Loop Guard on trunk ports; keep RSTP running instead of disabling STP.
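The loop hunt in step 1 can be narrowed down if a cabling inventory exists: any link whose two ends are already connected through other links closes a loop. The sketch below finds such links with union-find. It is an illustration of the idea of reducing a topology to a loop-free tree, not the actual STP algorithm, which elects a root bridge and compares path costs. The switch names and the cabling list are hypothetical.

```python
def redundant_links(links):
    """Return the links that close a loop, in the order they appear.

    Removing (or blocking) exactly these links leaves a loop-free topology.
    A cable between two ports of the same switch is reported as well.
    """
    parent = {}

    def find(node):
        # Find the representative of node's connected group (path halving).
        parent.setdefault(node, node)
        while parent[node] != node:
            parent[node] = parent[parent[node]]
            node = parent[node]
        return node

    redundant = []
    for a, b in links:
        root_a, root_b = find(a), find(b)
        if root_a == root_b:
            # Both ends are already connected: this cable creates a loop.
            redundant.append((a, b))
        else:
            parent[root_a] = root_b
    return redundant


# Hypothetical cabling: a core/ToR triangle plus a self-looped patch cable.
cabling = [
    ("core1", "tor1"),
    ("core1", "tor2"),
    ("tor1", "tor2"),   # closes the core1-tor1-tor2 triangle
    ("tor3", "tor3"),   # two ports on the same switch patched together
]
print(redundant_links(cabling))  # [('tor1', 'tor2'), ('tor3', 'tor3')]
```

Which link of a loop gets reported depends on the order of the input list. A running STP or RSTP instance makes that choice deterministically from bridge IDs and port costs, and re-evaluates it whenever the cabling changes.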

Real-World Examples

  • Example 1: Network team disabled STP on a server VLAN "for 5ms faster convergence." Six months later, a technician connected a patch cable between two ports on the same switch for a test and forgot to remove it. The resulting broadcast storm took 200 servers off the network for 18 minutes.
  • Example 2: A VM was configured with a VLAN trunk interface and a bridge. The bridge inadvertently created a software loop between two network interfaces. With STP disabled on the physical switch, the loop wasn't blocked. The entire datacenter VLAN saturated within 3 seconds.

War Story

It was 2am. All 300 servers in our primary datacenter dropped off the network simultaneously. No hardware failures, no power issues. Switch CPU: 100%. I'd been on-call for 6 months and never seen this. Senior engineer on call: "broadcast storm. Find the loop." We started disconnecting cables on the core switches. On the third cable we disconnected, everything came back. Turns out a contractor had connected two ports on a ToR switch to "test the cable." STP was disabled on that VLAN (had been for 2 years, for "performance"). No one knew why anymore. We re-enabled RSTP on all VLANs the next morning. 2ms average latency increase. Worth every millisecond.

Cross-References