Incident Replay: Network Bonding Failover Not Working
Setup
- System context: Production database server with two 10GbE NICs in an active-backup bond (bond0). Connected to redundant ToR switches for high availability.
- Time: Wednesday 03:22 UTC
- Your role: On-call SRE / Linux systems engineer
Round 1: Alert Fires
[Pressure cue: "PagerDuty fires — database server unreachable. Automated failover to replica triggered. You have 10 minutes before SLA breach."]
What you see: The primary database server lost network connectivity. Monitoring shows the server went dark 2 minutes ago. A switch maintenance was scheduled on ToR switch A, which should have triggered bond failover to switch B.
Choose your action:
- A) SSH to the replica and promote it to primary immediately
- B) Access the server via iDRAC/IPMI console to check bond status
- C) Check the switch B port status from the network team's dashboard
- D) Reboot the server remotely via IPMI to force a fresh bond negotiation
If you chose A:
[Result: Replica is already promoted (auto-failover). But you still need to fix the primary to restore HA. This does not address the root cause.]
If you chose B (recommended):
[Result: iDRAC console shows the server OS is running. cat /proc/net/bonding/bond0 shows both slave interfaces are down. The backup NIC did not take over. Proceed to Round 2.]
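Under time pressure it helps to reduce the bonding procfs output to one line per slave. A minimal sketch; the field names follow the standard /proc/net/bonding format, but the sample text below is illustrative, not captured from this incident:

```shell
#!/bin/sh
# Summarize per-slave MII status from /proc/net/bonding/bond0-style input.
# In production you would run: cat /proc/net/bonding/bond0 | bond_summary
bond_summary() {
    awk '/^Slave Interface:/ { iface = $3 }
         /^MII Status:/ && iface != "" { print iface ":" $3; iface = "" }' "$@"
}

# Illustrative sample matching the incident state (both slaves down):
bond_summary <<'EOF'
Bonding Mode: fault-tolerance (active-backup)
Currently Active Slave: None
MII Status: down

Slave Interface: eth0
MII Status: down

Slave Interface: eth1
MII Status: down
EOF
```

The bond-level MII Status line is skipped on purpose; only lines that follow a Slave Interface header are reported, so the output here is eth0:down and eth1:down.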
If you chose C:
[Result: Switch B shows the port as "up/up" with link detected. So the physical layer is fine on the switch side. Partial clue — eventually leads to Round 2.]
If you chose D:
[Result: Server reboots, bond comes up on switch B's NIC. But you have not found root cause — it will fail again next maintenance. Band-aid fix.]
Round 2: First Triage Data
[Pressure cue: "Database auto-failover succeeded but the primary is still offline. HA is degraded until it is back."]
What you see:
Bond mode is active-backup. Primary slave is eth0 (switch A — currently down for maintenance). The backup slave eth1 shows MII Status: down even though the switch port is up.
Choose your action:
- A) Check ethtool eth1 for link status and driver details
- B) Restart the networking service to re-initialize the bond
- C) Change the bond mode to balance-rr to use both links
- D) Check for cable issues on eth1
If you chose A (recommended):
[Result: ethtool eth1 shows "Link detected: yes" but the driver is loaded with incorrect firmware. dmesg | grep eth1 reveals "firmware mismatch — NIC in degraded mode." The NIC firmware was not updated during the last patching cycle. Proceed to Round 3.]
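When comparing a suspect NIC against a healthy one, it is handy to pull just the driver and firmware-version fields out of ethtool -i output. A sketch; the driver name and version strings below are made up for illustration:

```shell
#!/bin/sh
# Extract the driver and firmware-version fields from `ethtool -i <iface>`
# output so they can be diffed against a known-good baseline.
# Production usage would be: ethtool -i eth1 | drv_fw
drv_fw() {
    awk -F': ' '/^driver:/           { d = $2 }
                /^firmware-version:/ { f = $2 }
                END { print d " / " f }' "$@"
}

# Illustrative sample input (not real ethtool output from this server):
printf 'driver: examplenic\nversion: 5.15\nfirmware-version: 7.13.1\n' | drv_fw
```

Running the same one-liner on the healthy peer NIC makes a version mismatch jump out immediately.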
If you chose B:
[Result: Service restart brings bond0 up with eth1 as active slave. Connectivity restored temporarily, but NIC firmware is still mismatched. Fragile. Eventually leads to Round 3 data.]
If you chose C:
[Result: Changing bond mode requires taking the bond down — extended outage. Also does not fix the underlying NIC issue.]
If you chose D:
[Result: Cable is fine — switch shows link up, and ethtool shows link detected. 10 minutes spent on a physical walkdown for nothing.]
Round 3: Root Cause Identification
[Pressure cue: "CTO asks why our 'redundant' network setup had a single point of failure."]
What you see: Root cause: eth1's NIC firmware was not updated in the last maintenance cycle. The mismatched firmware put the NIC in a degraded mode where it reported link but could not pass traffic. The bond driver saw MII status as down because the NIC was not forwarding packets.
Choose your action:
- A) Update eth1 firmware and restart the bond
- B) Replace eth1 with a spare NIC card
- C) Force the bond to use eth1 despite the MII status
- D) Configure the bond to use ARP monitoring instead of MII
If you chose A (recommended):
[Result: Current firmware version confirmed with ethtool -i eth1, then the vendor's flash tool applied to update it. Bond reconfigured, both slaves active. Connectivity restored with full HA. Proceed to Round 4.]
If you chose B:
[Result: NIC replacement works but takes 30+ minutes (shutdown, physical swap, driver load, bond reconfiguration). Overkill when firmware update would fix it.]
If you chose C:
[Result: Bond with a non-functional slave causes packet loss. Worse than the original problem.]
If you chose D:
[Result: ARP monitoring would detect the issue better than MII but does not fix the firmware problem. The NIC still cannot pass traffic reliably.]
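For reference, ARP monitoring is enabled through the bonding driver's arp_interval and arp_ip_target parameters. A minimal config sketch; the target address 10.0.0.1 is a placeholder for a reliably reachable gateway, and note that the bonding driver uses ARP or MII monitoring, not both at once:

```
# /etc/modprobe.d/bonding.conf (sketch; 10.0.0.1 is a placeholder target)
# arp_interval is in milliseconds. The driver ARPs the target periodically,
# so a slave is only "up" if it can actually pass traffic end to end.
options bonding mode=active-backup arp_interval=1000 arp_ip_target=10.0.0.1
```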
Round 4: Remediation
[Pressure cue: "Server is back and HA is restored. Close the incident."]
Actions:
1. Verify bond status: cat /proc/net/bonding/bond0 — both slaves active
2. Test failover: ifdown eth0 && sleep 5 && ifup eth0 — verify seamless switch
3. Add NIC firmware version to the automated patching checklist
4. Move the bond from MII to ARP monitoring, which validates end-to-end traffic rather than just link state
5. Schedule firmware audit across all bonded servers in the fleet
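A starting point for that fleet audit: on each host, walk the bond's slaves (listed in sysfs at /sys/class/net/bond0/bonding/slaves) and flag firmware versions that differ from a baseline. A sketch, assuming ethtool is installed; EXPECTED_FW is a placeholder for the vetted version for this NIC model:

```shell
#!/bin/sh
# Sketch of a per-host firmware audit for bonded NICs.
# EXPECTED_FW is a placeholder, not the real fleet baseline.
EXPECTED_FW="${EXPECTED_FW:-7.13.1}"

audit_iface() {
    # $1 = interface name; prints a MISMATCH line if firmware differs
    fw=$(ethtool -i "$1" 2>/dev/null | awk -F': ' '/^firmware-version:/ { print $2 }')
    if [ -n "$fw" ] && [ "$fw" != "$EXPECTED_FW" ]; then
        echo "MISMATCH $1: have $fw, want $EXPECTED_FW"
    fi
}

# Slaves of bond0 are listed in sysfs; skip cleanly on hosts with no bond.
if [ -r /sys/class/net/bond0/bonding/slaves ]; then
    for iface in $(cat /sys/class/net/bond0/bonding/slaves); do
        audit_iface "$iface"
    done
fi
```

Wrapped in the fleet's existing config-management or SSH fan-out tooling, the MISMATCH lines become the audit report.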
Damage Report
- Total downtime: 8 minutes (auto-failover to replica covered the gap)
- Blast radius: Primary database offline; replica served traffic with slightly higher latency
- Optimal resolution time: 15 minutes (diagnose bond -> identify firmware -> update -> verify)
- If every wrong choice was made: 75+ minutes, plus the risk of data inconsistency during an extended primary/replica split
Cross-References
- Primer: Datacenter & Server Hardware
- Primer: Linux Ops
- Primer: LACP
- Footguns: Networking