# Solution: Bonding Failover Not Working

## Triage
- Examine the bond status:
  - `cat /proc/net/bonding/bond0` -- check mode, slaves, and link monitoring parameters
  - `ip -d link show bond0` -- check bond parameters at the kernel level
  - `ip link show eno1 ; ip link show eno2` -- check interface states
- Check link detection:
  - `ethtool eno1 | grep -i link` -- check whether ethtool detects link down
  - `ethtool eno2 | grep -i link` -- same for the backup
- Review the bond configuration:
  - `nmcli con show bond0` -- check all bond parameters in NetworkManager
  - Look specifically at `bond.options` -- is `miimon` present?
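The `miimon` check in the triage steps can be scripted; a minimal sketch, assuming the bond name `bond0` used throughout this runbook and the standard bonding sysfs attribute:

```shell
# Report whether link monitoring is enabled for a bond.
# Usage: check_miimon <bond-name>
check_miimon() {
    local val
    # Read the driver's polling interval; treat a missing bond as 0.
    val=$(cat "/sys/class/net/$1/bonding/miimon" 2>/dev/null || echo 0)
    if [ "${val:-0}" -eq 0 ]; then
        echo "WARNING: $1 has miimon=0 - link failures will not trigger failover"
    else
        echo "$1 polls link state every ${val} ms"
    fi
}

check_miimon bond0
```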
## Root Cause
The bond is configured as `mode=active-backup` (mode 1), which is correct for failover. However, the `miimon` parameter is set to 0, which means link monitoring is completely disabled. Without link monitoring, the bonding driver has no mechanism to detect that eno1's link has gone down. It continues to send all traffic through the "active" slave (eno1) even though the switch port is shut.
The interface shows "UP" in `ip link` because the kernel link state and the bonding driver's link monitoring are independent subsystems. The kernel may detect carrier loss (`NO-CARRIER`), but without `miimon` the bonding driver never checks.
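On the affected host, the split between the two subsystems can be observed directly via sysfs; these are the standard bonding driver attributes (interface names as in this runbook):

```shell
# Kernel's view of the slave's carrier: 1 = link up, 0 = no carrier.
# (The read fails with "Invalid argument" if the interface is administratively down.)
cat /sys/class/net/eno1/carrier

# The bonding driver's view: which slave it is currently sending traffic through.
cat /sys/class/net/bond0/bonding/active_slave

# The driver's polling interval in ms; 0 means carrier loss is never acted on.
cat /sys/class/net/bond0/bonding/miimon
```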
The switch port configuration also reveals a secondary issue: both ports are configured as individual access ports in VLAN 200, which is correct for active-backup mode. There is no LACP/port-channel mismatch.
## Fix
- Enable MII link monitoring:
  - `miimon=100` polls link status every 100 ms
  - Apply during a maintenance window: re-activating the bond briefly interrupts traffic
- Or use ARP monitoring instead:
  - `arp_interval=1000` with `arp_ip_target=10.20.0.1` (the gateway IP)
  - ARP monitoring detects upper-layer failures (e.g., a switch ACL blocking traffic while the link stays up)
  - Note: `miimon` and `arp_interval` are mutually exclusive in the bonding driver; choose one
- Test failover:
  - Shut down the switch port for eno1
  - Verify bond0 detects the failure within 200-300 ms and switches to eno2
  - Confirm a continuous ping shows at most 1-2 dropped packets
  - Re-enable the switch port; verify eno1 becomes the active slave again (due to `primary=eno1`)
- Persist the configuration:
  - Verify the setting persists across reboots: `nmcli con show bond0 | grep bond.options`
  - Test with a server reboot during the maintenance window
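The fix can be sketched as an `nmcli` sequence; a hedged example, assuming the connection profile is named `bond0` as in the triage steps (the `+bond.options` syntax appends to the existing option set rather than replacing it):

```shell
# Add miimon=100 to the bond's options without touching mode/primary.
nmcli con modify bond0 +bond.options "miimon=100"

# Re-activate the connection to apply; expect a brief traffic interruption.
nmcli con up bond0

# Verify the option persisted in the connection profile.
nmcli -g bond.options con show bond0

# Verify the running driver picked it up.
grep -i "polling interval" /proc/net/bonding/bond0
```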
## Rollback / Safety

- If the bond goes down during the `nmcli con down`/`nmcli con up` cycle, physical console access is needed.
- Schedule this during a maintenance window; the bond restart causes a brief (1-2 s) outage.
- If `miimon=100` causes false positives (link flapping), increase to `miimon=200` or `miimon=500`.
## Common Traps

- Assuming bonds "just work": a bond without link monitoring is just a static mapping; it has no failover capability. Always set `miimon` or `arp_interval`.
- Confusing bond mode with switch config: active-backup (mode 1) does NOT require switch-side configuration; LACP (mode 4) does. Running LACP on the switch against active-backup on the server will not work.
- Using `arp_interval` with `miimon`: they are mutually exclusive; setting both will cause unpredictable behavior.
- Not testing failover: many teams configure bonds and never test them. Schedule periodic failover tests.
- NetworkManager interference: older NM versions can fight with manual bond configs. Always use `nmcli` to configure bonds on RHEL 8+.