# Postmortem: Core Switch Firmware Bug Causes Cascading Network Partition
| Field | Value |
|---|---|
| ID | PM-004 |
| Date | 2025-09-03 |
| Severity | SEV-1 |
| Duration | 3h 42m (detection to resolution) |
| Time to Detect | 2m |
| Time to Mitigate | 3h 42m |
| Customer Impact | 61% of production services became unreachable for the full 3h 42m. Approximately 34,000 concurrent users lost access to all application features. Internal tooling (CI/CD, monitoring, VPN gateway) also partitioned for the duration. |
| Revenue Impact | ~$248,000 estimated (3.7h × avg $67k/h for affected service slice) |
| Teams Involved | Network Operations (NetOps), Datacenter Engineering, SRE, Platform Engineering, Infrastructure Leadership |
| Postmortem Author | Owen Mackenzie (Senior Network Operations Engineer) |
| Postmortem Date | 2025-09-07 |
## Executive Summary
On 2025-09-03, a Juniper QFX5120-32C core switch running firmware version 21.4R3-S4 triggered an STP (Spanning Tree Protocol) recalculation storm following a routine LACP flap on an uplink between core-sw-01 and agg-sw-03. The firmware bug caused the switch to broadcast topology change notifications continuously rather than stabilizing after convergence, forcing all downstream switches to flush their MAC address tables in a loop. This created a cascading L2 forwarding disruption that initially disrupted approximately 18% of traffic and grew to affect 61% of rack-to-rack connectivity within 4 minutes. Because the management plane was on the same L2 broadcast domain, SSH access to all affected switches was lost simultaneously. Physical console access to all six affected switches was required to diagnose and mitigate. Recovery required downgrading the firmware on three core switches to 21.4R3-S2 and manually restoring STP topology. Total outage duration was 3 hours 42 minutes.
## Timeline (All times UTC)
| Time | Event |
|---|---|
| 18:04 | LACP PDU timeout on uplink between core-sw-01 (Rack C2, Juniper QFX5120) and agg-sw-03 (Rack F4, Juniper QFX5110). Uplink member port et-0/0/28 experiences a 3-second signal loss. LACP marks the port inactive. This is a routine event that occurs 2–4 times per month due to transceiver heat cycling. |
| 18:04 | core-sw-01 generates an STP Topology Change Notification (TCN) for the topology change. Normal behavior. |
| 18:04 | Firmware bug triggers: core-sw-01 enters a loop, rebroadcasting the TCN every 80ms rather than once per topology change event. Downstream switches begin flushing MAC tables continuously. |
| 18:05 | MAC table flush storm causes L2 forwarding to degrade as switches fall back to flooding for unknown destinations. Traffic levels spike 340% on all inter-switch trunks as flooded unicast saturates links. |
| 18:06 | Datadog synthetic checks begin failing across multiple services. PagerDuty fires: [CRITICAL] 14 synthetic checks failing simultaneously. SRE on-call (Yara Holst) acknowledges. |
| 18:06 | NetOps automated monitoring (network-mon-01) loses SNMP reachability to core-sw-01, core-sw-02, agg-sw-01, agg-sw-02, agg-sw-03. SSH to all management IPs unreachable — management plane is on the same broadcast domain. |
| 18:07 | Yara attempts to SSH to core-sw-01 management IP (10.0.1.1). Connection times out. Attempts to SSH to agg-sw-01 (10.0.2.1). Times out. Pages NetOps on-call (Owen Mackenzie) and Datacenter Engineering on-call (Sanjay Patel). SEV-1 declared. |
| 18:09 | Owen joins. First hypothesis: upstream ISP link failure causing BGP withdrawal. Checks BGP monitoring dashboard — BGP sessions to ISP peers (CenturyLink, Cogent) show Established. ISP hypothesis rejected. |
| 18:11 | Second hypothesis: DDoS causing link saturation. Checks Arbor NetScout — no anomalous inbound traffic patterns. DDoS hypothesis rejected. |
| 18:14 | Sanjay arrives at DC floor (onsite). Notes all switch panel LEDs are in a normal state — green, no amber or red on port LEDs. Begins physical console access to core-sw-01 via serial console server (console-srv-01). Console server is on a separate management VLAN backed by cellular uplink — still reachable. |
| 18:19 | Sanjay connects to core-sw-01 console. Runs show spanning-tree. Output shows STP state oscillating: bridge roles flipping between Root and Designated every few seconds. |
| 18:21 | Owen identifies the TCN storm pattern in the STP output. Cross-references with Juniper security advisories on NetOps laptop (not network-dependent). Finds JSA-2025-0044: known firmware bug in QFX5120 21.4R3-S4 causing STP TCN storm on LACP flap events. Advisory published 2025-08-14. |
| 18:23 | Firmware upgrade changelog review: core-sw-01, core-sw-02, and core-sw-03 were upgraded to 21.4R3-S4 on 2025-07-29 during a scheduled maintenance window. The JSA advisory was published 16 days after the upgrade. |
| 18:25 | Decision: downgrade core-sw-01 to 21.4R3-S2 (last known-good). Sanjay begins downgrade sequence from console. Estimated time: 12–15 minutes including reboot. |
| 18:26 | Platform Engineering on-call (Tamar Cohen) joins. Begins identifying which racks are affected (Racks C1–C4, F1–F4 = 47 of 78 production racks). Racks E1–E8 and G1–G8 are unaffected (connected to agg-sw-04, agg-sw-05 which are on QFX5110 firmware, not affected). |
| 18:38 | core-sw-01 comes back online with 21.4R3-S2. TCN storm stops on C-row racks. STP reconverges in 28 seconds. C-row hosts begin recovering. |
| 18:41 | SSH access to agg-sw-01 and agg-sw-02 restored. Owen verifies STP topology stable on C-row. |
| 18:44 | core-sw-02 console access obtained by Sanjay (second serial port on console-srv-01). Downgrade initiated. |
| 18:58 | core-sw-02 back online with 21.4R3-S2. F1–F2 rows recovering. |
| 19:02 | Sanjay physically moves to console-srv-02 in Rack B1 to access core-sw-03 console. |
| 19:09 | core-sw-03 downgrade initiated. |
| 19:22 | core-sw-03 back online. All C and F row racks recovered. agg-sw-03 reconnects. |
| 19:40 | Full network topology stable. All 78 production racks have L2 and L3 connectivity. Synthetic monitoring checks passing. |
| 21:46 | Extended monitoring watch completed. SNMP polling confirmed stable for 2h 6m. Incident formally closed. |
## Impact
### Customer Impact
- 34,000 concurrent users lost access to all application features for 3h 42m
- Affected services: API gateway, web frontend, authentication service, search, recommendations, order service (all on C/F row racks)
- Unaffected services: Analytics pipeline, batch jobs, cold storage (E/G row racks) — no customer-visible functions
- All CI/CD pipelines paused automatically (Kubernetes control plane on C-row was unavailable); 7 active deployments were cancelled and required re-queue
- VPN gateway (C-row) unreachable for 3h 42m — remote engineers could not SSH to jump hosts; 4 on-call engineers required to use cellular hotspot fallback for tooling access
### Internal Impact
- NetOps: 2 engineers × 5h = 10 engineer-hours
- Datacenter Engineering: 2 engineers × 4h = 8 engineer-hours (including physical on-site time)
- SRE: 3 engineers × 4h = 12 engineer-hours
- Platform Engineering: 2 engineers × 2h = 4 engineer-hours
- Leadership bridge call: ~8 attendees × 1.5h = 12 hours total
- The emergency firmware downgrade displaced a scheduled agg-sw-04 maintenance window (rescheduled to 2025-09-17)
### Data Impact
None. The network partition was a transport-layer event — no application state was modified, corrupted, or lost. All in-flight requests either completed before the partition or received TCP RST and were retried by clients. No database writes were interrupted mid-transaction (PostgreSQL's synchronous_commit ensured no partial commits); confirmed by DRE post-incident query.
## Root Cause
### What Happened (Technical)
The Juniper QFX5120-32C running firmware 21.4R3-S4 contains a defect in its spanning tree implementation (documented in Juniper advisory JSA-2025-0044, published 2025-08-14). When an LACP member port experiences a PDU timeout and is marked inactive, the STP subsystem on the affected switch generates a Topology Change Notification. Under normal firmware behavior, this TCN is sent once per topology change event, downstream switches flush their MAC tables, and STP reconverges within 30–60 seconds.
The defect in 21.4R3-S4 causes the STP process to enter an internal loop condition where the TCN is rebroadcast every 80ms for as long as the triggering LACP event persists (in this case, until the port physically recovered). This continuous TCN storm caused all downstream switches (agg-sw-01 through agg-sw-05 and all ToR switches) to flush their MAC address tables every 80ms. With cleared MAC tables, switches have no forwarding entries for unicast traffic and must flood all unicast frames out all ports. This flooded unicast traffic saturated inter-switch trunks, which in turn delayed MAC table re-learning, creating a feedback loop.
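The difference between the healthy and buggy behavior can be illustrated with a toy model (a sketch only, not Junos code; the frame arrival rate and re-learn delay are assumed figures): a single flush is absorbed within a few frames, while an 80 ms flush cadence keeps the MAC table permanently cold and forces continuous unicast flooding.

```python
# Toy model of the MAC-flush storm. Assumptions (not measured values):
# a unicast frame arrives every 10 ms, and a flushed destination entry
# is re-learned 30 ms after the most recent flush.

FRAME_INTERVAL_MS = 10
RELEARN_MS = 30

def flooded_frames(duration_ms, tcn_interval_ms):
    """Count frames flooded (unknown destination) over the window.
    tcn_interval_ms=None models healthy firmware: one flush at t=0.
    A numeric interval models the bug: a flush every tcn_interval_ms."""
    flooded = 0
    last_flush = 0
    for t in range(0, duration_ms, FRAME_INTERVAL_MS):
        if tcn_interval_ms is not None and t - last_flush >= tcn_interval_ms:
            last_flush = t          # repeated TCN wipes the table again
        if t - last_flush < RELEARN_MS:
            flooded += 1            # no forwarding entry yet: flood the frame
    return flooded

healthy = flooded_frames(10_000, None)  # single flush, then stable forwarding
buggy = flooded_frames(10_000, 80)      # flush every 80 ms, never converges
print(healthy, buggy)  # 3 375 over a 10-second window
```

Even in this crude model the buggy cadence floods two orders of magnitude more traffic, which is the feedback loop that saturated the trunks.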
The management plane (specifically, all switch management IP addresses) was on VLAN 10, which is carried on the same trunk infrastructure as production traffic. When the MAC table storm caused trunk saturation, management VLAN traffic was indistinguishable from production traffic and was also disrupted. This meant that SSH to all affected switches was unavailable during the outage, and all network management operations required physical console access.
The affected switches (core-sw-01, core-sw-02, core-sw-03) had been upgraded to 21.4R3-S4 on 2025-07-29. Juniper published the advisory for this defect on 2025-08-14 — 16 days after the upgrade. The NetOps team's process for applying vendor security advisories requires monthly review of active advisories; the next scheduled review was 2025-09-10, 7 days after the incident.
The network topology had a known fragility: the C-row and F-row racks shared a single pair of core switches (core-sw-01 and core-sw-02) with no independent L3 path for management traffic. This was documented in the 2025 Q1 infrastructure review as a topology risk, rated medium priority, with a remediation plan to install a dedicated out-of-band management network (target: Q3 2025). The project was deferred to Q4 due to hardware procurement delays.
### Contributing Factors
- Firmware advisory review was monthly rather than continuous: The NetOps team reviews Juniper security advisories on a monthly cadence. The JSA-2025-0044 advisory was published 16 days after the affected firmware was deployed, placing it squarely in the gap between review cycles. A continuous or weekly advisory review would have detected the advisory within days and triggered an emergency downgrade before any LACP flap event occurred.
- No out-of-band management network: All switch management IPs were on VLAN 10, carried on the same trunks as production traffic. When production trunks became saturated and MAC tables thrashed, management traffic was equally disrupted. SSH access to all affected switches was lost simultaneously. Physical console access, requiring an engineer physically on the datacenter floor, became the only management path. An out-of-band management network (dedicated VLAN on a separate physical path, or cellular-backed OOB console servers) was planned but not yet implemented.
- STP topology fragility was documented but not remediated: The Q1 2025 infrastructure review identified the single-pair core switch design as a topology risk for C/F-row racks. Remediation was deprioritized due to hardware procurement delays. The planned fix (adding a third core switch in an ECMP triangle topology) would have provided an alternate L2 path that the TCN storm could not have completely severed.
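The advisory-cadence gap above is ultimately a matching problem: does any newly published advisory intersect firmware actually deployed? A weekly automated scan (as later proposed in the action items) could hinge on that step. This sketch omits feed parsing, and the advisory record fields (`id`, `platforms`, `versions`) are assumptions, not Juniper's actual feed schema.

```python
# Hypothetical advisory matcher. The advisory dict fields and the
# agg-sw-04 firmware version are invented for illustration; the
# QFX5120 / 21.4R3-S4 / JSA-2025-0044 details come from the postmortem.

deployed = {
    "core-sw-01": ("QFX5120", "21.4R3-S4"),
    "core-sw-02": ("QFX5120", "21.4R3-S4"),
    "agg-sw-04": ("QFX5110", "20.4R3-S1"),  # version invented for illustration
}

def match_advisories(advisories, inventory):
    """Return {advisory_id: [devices]} for advisories whose affected
    platform/version pairs intersect the deployed inventory."""
    hits = {}
    for adv in advisories:
        affected = {(p, v) for p in adv["platforms"] for v in adv["versions"]}
        devices = [d for d, pv in inventory.items() if pv in affected]
        if devices:
            hits[adv["id"]] = devices
    return hits

advisories = [{"id": "JSA-2025-0044",
               "platforms": ["QFX5120"],
               "versions": ["21.4R3-S4"]}]
print(match_advisories(advisories, deployed))
# {'JSA-2025-0044': ['core-sw-01', 'core-sw-02']}
```

Run weekly against the switch inventory, any non-empty result would page NetOps and open a ticket, closing the mid-cycle gap that bit this incident.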
### What We Got Lucky About
- The L2 partition split cleanly along rack-row boundaries: C/F-row racks (affected) vs E/G-row racks (unaffected). This meant the unaffected racks formed a coherent subset — analytics, batch processing, and cold storage continued running without interruption. Had the partition bisected individual service clusters (e.g., taking down 3 of 5 Kafka brokers), the blast radius would have been substantially larger and recovery more complex.
- The console server (console-srv-01) was backed by a separate cellular uplink that was completely unaffected by the L2 disruption. Without console server access, Sanjay would have needed to bring a physical laptop to each switch's serial port directly, adding approximately 45–60 minutes to recovery time.
## Detection
### How We Detected
Synthetic monitoring checks (Datadog) began failing within two minutes of the LACP flap that triggered the TCN storm. PagerDuty fired a critical alert at T+2m with 14 simultaneous synthetic check failures, which immediately indicated a network-level event rather than an application-level fault (isolated application failures would affect 1–3 checks, not 14 simultaneously). The SRE on-call acknowledged within 60 seconds.
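The signal that made this read as a network event, many checks across unrelated services failing in the same window, can be codified as a triage hint. The threshold below is an assumption for illustration, not a value from the actual monitoring config.

```python
# Illustrative triage heuristic (threshold is an assumption): a failure
# burst spanning many distinct services suggests shared infrastructure,
# not a single misbehaving application.

def triage_hint(failing_checks, service_threshold=5):
    """failing_checks: iterable of (check_name, service) tuples observed
    in the same alert window. Returns a coarse classification."""
    services = {svc for _, svc in failing_checks}
    if len(services) >= service_threshold:
        return "network-level"      # shared layer: escalate to NetOps
    return "application-level"      # isolated: page the owning team

# Services here mirror the affected list from the Impact section.
burst = [(f"synthetic-{i}", svc) for i, svc in enumerate(
    ["api-gateway", "web", "auth", "search", "recs", "orders"])]
print(triage_hint(burst))  # network-level
```

Encoding this hint into the alert payload would have saved the responders the inference step, small here, but useful when the on-call is less experienced.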
### Why We Didn't Detect Sooner
Detection was fast (2 minutes). The primary gap is that detection came from service-level synthetic checks rather than from network-level SNMP/telemetry alerts. The SNMP monitoring tool (network-mon-01) lost reachability to all affected switches simultaneously, which caused all its alerts to fire as "unreachable" rather than providing diagnostic information about STP state. A network telemetry system (e.g., Junos Telemetry Interface streaming to an external collector on the OOB network) would have captured STP state data even during the management plane disruption.
## Response
### What Went Well
- The console server's cellular uplink provided a reliable management path when in-band SSH was completely unavailable. The value of this cellular-backed console server was demonstrated concretely — it was the sole management vector for all switch operations during the incident.
- The cross-reference to Juniper advisory JSA-2025-0044 was made within 17 minutes of gaining console access and confirmed root cause definitively. Owen's familiarity with Juniper advisory search was a key factor.
- The firmware downgrade procedure (QFX image staging, ISSU-style reboot) was well-documented in NetOps runbooks and executed correctly under pressure. Each core switch was downgraded in 12–15 minutes, close to the estimated time.
- The partition along rack-row boundaries was identified early (T+22m), which allowed Platform Engineering to accurately characterize which services were affected and communicate this to leadership and customers.
### What Went Poorly
- Only one engineer (Sanjay) was physically on-site. console-srv-01 had two serial ports, but the third core switch (core-sw-03) required physical movement to console-srv-02 in a different rack. With two on-site engineers, all three switches could have been downgraded in parallel, potentially reducing recovery time by 40–50 minutes.
- The monthly firmware advisory review cadence was too slow to catch a critical advisory published mid-cycle. The 16-day gap between deployment and advisory publication, combined with the 30-day review cycle, created a situation where a critical bug was running in production for 36 days before it was triggered.
- There was no runbook for "SSH to all switches unreachable simultaneously" — a scenario that required pivoting to console access. The first 7 minutes of the response were spent attempting SSH and confirming it was completely unavailable, rather than immediately dispatching to console access. The runbook gap meant this was improvised rather than procedural.
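The first step of the missing runbook could even be automated. This is a hypothetical sketch (the IPs are the two named in the timeline; the probe, timeout, and messages are assumptions) that probes management SSH on all switches concurrently and recommends the console-server pivot when nothing answers, instead of burning minutes on serial SSH retries.

```python
# Hypothetical helper for the "SSH to all switches unreachable" runbook:
# probe TCP/22 on every management IP in parallel; if every probe fails,
# stop retrying in-band and pivot to the console server immediately.

import socket
from concurrent.futures import ThreadPoolExecutor

SWITCH_MGMT_IPS = ["10.0.1.1", "10.0.2.1"]  # the two IPs named in the timeline

def ssh_reachable(ip, timeout=3.0):
    """True if the management SSH port accepts a TCP connection."""
    try:
        with socket.create_connection((ip, 22), timeout=timeout):
            return True
    except OSError:
        return False

def triage(ips, probe=ssh_reachable):
    """Probe all IPs concurrently and classify the failure mode."""
    with ThreadPoolExecutor(max_workers=max(len(ips), 1)) as pool:
        results = dict(zip(ips, pool.map(probe, ips)))
    if not any(results.values()):
        return "all-unreachable: pivot to console server access"
    down = [ip for ip, ok in results.items() if not ok]
    return f"partial: investigate {down}" if down else "all-reachable"
```

The probe function is injectable so the decision logic can be exercised without a live network, which also makes the runbook step testable in CI.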
## Action Items
| ID | Action | Priority | Owner | Status | Due Date |
|---|---|---|---|---|---|
| AI-001 | Implement dedicated out-of-band management network: dedicated VLAN on a physically separate switch pair, backed by cellular uplink; migrate all switch management IPs to OOB VLAN; complete hardware procurement | P0 | Sanjay Patel (Datacenter Engineering) | In Progress | 2025-11-14 |
| AI-002 | Switch Juniper advisory monitoring from monthly manual review to weekly automated scan: subscribe to Juniper Security Advisories RSS feed, route new advisories to NetOps Slack channel and create Jira tickets automatically for any advisory matching active firmware versions | P0 | Owen Mackenzie (NetOps) | Not Started | 2025-09-19 |
| AI-003 | Immediately downgrade core-sw-01, core-sw-02, core-sw-03 to 21.4R3-S2 and freeze firmware upgrades until JSA-2025-0044 is resolved in a patched release; schedule firmware re-upgrade review for 2025-10-01 | P0 | Owen Mackenzie (NetOps) | Complete | 2025-09-04 |
| AI-004 | Write and publish runbook for "in-band SSH to all switches unreachable": escalation path → console server access → physical console access; include console server IP, cellular APN credentials, and serial port map for all production switches | P1 | Owen Mackenzie (NetOps) | Not Started | 2025-09-19 |
| AI-005 | Resume Q4 core switch topology remediation project: add core-sw-04 to form an ECMP triangle for C/F-row racks, eliminating the single-pair bottleneck; schedule for completion by 2025-12-15 | P1 | Datacenter Engineering + Infrastructure Leadership | Not Started | 2025-12-15 |
| AI-006 | Add second on-call engineer to NetOps and Datacenter Engineering rotation for the 18:00–06:00 UTC window; current single-person on-call cannot parallelize multi-switch console remediation | P2 | Infrastructure Leadership (Valentina Cruz) | Not Started | 2025-10-01 |
## Lessons Learned
- The management plane must be physically separated from the data plane. When management traffic runs on the same infrastructure as production traffic, any data plane failure that degrades the network will also take out the management plane. This is not a new principle; it is foundational to datacenter network design. The consequence of ignoring it is exactly what occurred: all switches became simultaneously unmanageable, turning a recoverable firmware issue into a multi-hour physical console operation. Out-of-band management networks are not optional infrastructure; they are a prerequisite for safe operations.
- Firmware advisory cadence must match the blast radius of the affected systems. Core network infrastructure has a blast radius that is disproportionate to its software surface area. A firmware bug on a core switch can partition the entire datacenter. Advisory review for core network firmware should be weekly or continuous, not monthly. The asymmetry between the 16-day advisory publication lag and the 30-day review cycle created an unnecessary 36-day exposure window.
- Documented risks that are not mitigated are accepted risks; make sure leadership accepts them explicitly. The STP topology fragility was documented in the Q1 infrastructure review and rated medium priority. It was then deferred. At the point of deferral, leadership implicitly accepted the risk of a TCN storm partition event. Making that acceptance explicit ("we are accepting the risk of a full C/F row partition until Q4 2025, estimated probability X, estimated impact $Y") creates accountability and may change the prioritization decision.
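The explicit risk-acceptance statement that lesson calls for can be made concrete with back-of-envelope expected-loss arithmetic. Every figure below except this incident's roughly $248k impact is invented purely for illustration.

```python
# Illustrative expected-loss framing for an explicit risk acceptance.
# Only the ~$248k impact figure comes from this postmortem; the
# probability and deferral length are assumptions.

p_partition_per_quarter = 0.15  # assumed chance of a C/F-row partition per quarter
impact_usd = 248_000            # observed impact of this incident
quarters_deferred = 2           # Q3 -> Q4 remediation slip

expected_loss = p_partition_per_quarter * impact_usd * quarters_deferred
print(f"Expected loss of deferral: ${expected_loss:,.0f}")
# Expected loss of deferral: $74,400
```

Even with rough inputs, putting a dollar figure next to a deferral decision tends to change how it is prioritized.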
## Cross-References
- Failure Pattern: Infrastructure — Firmware Bug in Core Network Device; Design Flaw — Management Plane Co-Located with Data Plane; Process Gap — Vendor Advisory Review Cadence Too Slow
- Topic Packs: spanning-tree-protocol, datacenter-networking, firmware-lifecycle-management, out-of-band-management
- Runbook: runbooks/network/core-switch-console-recovery.md (to be created per AI-004)
- Decision Tree: Triage path — "Multiple synthetic checks failing simultaneously, SSH to switches unreachable" → immediately dispatch physical on-site engineer to console server → do not spend >5 min attempting in-band SSH → check STP state via console → if TCN storm, identify firmware version and cross-reference advisory database → initiate firmware downgrade