Solution: Bonding Failover Not Working

Triage

Examine the bond status:
cat /proc/net/bonding/bond0 -- check mode, slaves, link monitoring parameters
ip -d link show bond0 -- check bond parameters at kernel level
ip link show eno1 ; ip link show eno2 -- check interface states
Check link detection:
ethtool eno1 | grep -i link -- check if ethtool detects link down
ethtool eno2 | grep -i link -- same for backup
Review the bond configuration:
nmcli con show bond0 -- check all bond parameters in NetworkManager
Look specifically for bond.options -- is miimon present?
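
The triage checks above can be condensed into one helper. This is a minimal sketch, not part of the original procedure: bond_summary is a hypothetical function name, and it assumes the usual field labels found in /proc/net/bonding/bond0 ("Bonding Mode", "Currently Active Slave", "MII Polling Interval (ms)").

```shell
# bond_summary [FILE]: print the fields that matter for failover triage.
# FILE defaults to the bond0 status file; the function name is illustrative.
bond_summary() {
    awk -F': ' '
        /^Bonding Mode:/                { print "mode=" $2 }
        /^Currently Active Slave:/      { print "active=" $2 }
        /^MII Polling Interval \(ms\):/ { print "miimon=" $2 }
    ' "${1:-/proc/net/bonding/bond0}"
}
```

Running bond_summary with no argument reads the live bond0 status. A line reading miimon=0 is the symptom described under Root Cause.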

Root Cause

The bond is configured as mode=active-backup (mode 1), which is correct for failover. However, the miimon parameter is set to 0, which means link monitoring is completely disabled. Without link monitoring, the bonding driver has no mechanism to detect that eno1's link has gone down. It continues to send all traffic through the "active" slave (eno1) even though the switch port is shut.

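eno1 still shows the UP flag in ip link because that flag reflects administrative state, and because the kernel's link state and the bonding driver's link monitoring are independent subsystems. The kernel does register the carrier loss (ip link reports NO-CARRIER), but with miimon=0 the bonding driver never checks the slave's link state, so it keeps treating eno1 as usable.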
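
To see the two views side by side, the kernel's carrier flag can be compared with the status the bonding driver reports for the same slave. The sketch below is illustrative only: slave_link_views is a hypothetical name, and it assumes the kernel exposes carrier at /sys/class/net/IFACE/carrier and that the bond status file uses "Slave Interface" and "MII Status" labels.

```shell
# slave_link_views IFACE [PROC_FILE] [SYS_NET_DIR]
# Print the kernel carrier flag and the bonding driver's status for one slave.
slave_link_views() {
    iface=$1
    proc_file=${2:-/proc/net/bonding/bond0}
    sys_net=${3:-/sys/class/net}

    # Kernel view: 1 = carrier present, 0 = no carrier.
    carrier=$(cat "$sys_net/$iface/carrier" 2>/dev/null || echo unknown)

    # Bonding view: the "MII Status" line inside this slave's stanza.
    bond_status=$(awk -F': ' -v want="$iface" '
        /^Slave Interface:/        { in_slave = ($2 == want) }
        in_slave && /^MII Status:/ { print $2; exit }
    ' "$proc_file")

    echo "$iface kernel_carrier=$carrier bond_mii_status=${bond_status:-unknown}"
}
```

With the switch port shut and miimon=0, slave_link_views eno1 would be expected to report kernel_carrier=0 alongside bond_mii_status=up, which is the mismatch described above.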

The switch port configuration rules out a secondary issue: both ports are configured as individual access ports in VLAN 200, which is correct for active-backup mode. There is no LACP/port-channel mismatch.

Fix

Enable MII link monitoring:

nmcli con modify bond0 bond.options "mode=active-backup,miimon=100,primary=eno1"
nmcli con down bond0 && nmcli con up bond0

miimon=100 polls link status every 100ms
The con down/up cycle briefly interrupts traffic on bond0, so apply it during a maintenance window
Alternatively, use ARP monitoring in place of MII monitoring:
arp_interval=1000 with arp_ip_target=10.20.0.1 (gateway IP)
ARP monitoring detects upper-layer failures (e.g., switch ACL blocking traffic but link still up)
Note: miimon and arp_interval are mutually exclusive in the bonding driver; choose one
Test failover:
Shut down the switch port for eno1
Verify bond0 detects the failure within 200-300ms and switches to eno2
Confirm continuous ping shows at most 1-2 dropped packets
Re-enable the switch port; verify eno1 becomes the active slave again (due to primary=eno1)
Persist the configuration:
Verify the setting persists across reboots: nmcli con show bond0 | grep bond.options
Test with a server reboot during the maintenance window
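
The ping check in the failover test can be scored automatically. The helpers below are a sketch under stated assumptions: ping_lost and failover_ok are hypothetical names, and the parsing assumes the summary line begins with "N packets transmitted, M received" as printed by iputils ping.

```shell
# ping_lost: read ping output on stdin and print the number of lost packets.
# Relies on the summary line "N packets transmitted, M received, ...".
ping_lost() {
    awk '/packets transmitted/ { print $1 - $4 }'
}

# failover_ok [MAX_LOST]: succeed if at most MAX_LOST packets were lost.
# MAX_LOST defaults to 2, matching the 1-2 packet budget in the test steps.
failover_ok() {
    lost=$(ping_lost)
    [ -n "$lost" ] && [ "$lost" -le "${1:-2}" ]
}
```

One possible use is to start ping -w 30 10.20.0.1 | failover_ok 2 and shut the switch port for eno1 while it runs. The exit status then indicates whether the loss stayed within budget.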

Rollback / Safety

If the bond goes down during the nmcli con down/up cycle, physical console access is needed.
Schedule this during a maintenance window; the bond restart causes a brief (1-2s) outage.
If miimon=100 causes false positives (flapping), increase to miimon=200 or miimon=500.
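
One way to spot the flapping mentioned above is to watch the per-slave failure counters in the bond status file. This is a rough sketch: link_failures is a hypothetical helper, and it assumes each slave stanza contains a "Link Failure Count" line.

```shell
# link_failures [FILE]: print "<slave> <link failure count>" for each slave.
# FILE defaults to the bond0 status file; the function name is illustrative.
link_failures() {
    awk -F': ' '
        /^Slave Interface:/    { slave = $2 }
        /^Link Failure Count:/ { print slave, $2 }
    ' "${1:-/proc/net/bonding/bond0}"
}
```

Running link_failures twice a few minutes apart gives a simple comparison. A counter that keeps climbing with no corresponding switch or cable event suggests false positives.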

Common Traps

Assuming bonds "just work": A bond without link monitoring is just a static mapping; it has no failover capability. Always set miimon or arp_interval.
Confusing bond mode with switch config: Active-backup (mode 1) does NOT require switch-side configuration. LACP (mode 4) does. Using LACP on the switch with active-backup on the server will not work.
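Using arp_interval with miimon: They are mutually exclusive. Setting both does not give you two monitors; recent kernels turn one off when the other is enabled, so the bond may not be monitoring what you expect.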
Not testing failover: Many teams configure bonds and never test them. Schedule periodic failover tests.
NetworkManager interference: Older NM versions can fight with manual bond configs. Always use nmcli to configure bonds on RHEL 8+.
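
The first and third traps can be caught with a quick check of the options string before it is applied. The sketch below is illustrative: opt_value and check_bond_options are hypothetical helpers, and they assume a comma-separated key=value string like the one passed to bond.options in the Fix section.

```shell
# opt_value OPTIONS KEY: print the value of KEY from a comma-separated string.
opt_value() {
    printf '%s\n' "$1" | tr ',' '\n' |
        awk -F= -v k="$2" '$1 == k { print $2; exit }'
}

# check_bond_options OPTIONS: flag a bond with no link monitor, or with both.
check_bond_options() {
    miimon=$(opt_value "$1" miimon)
    arp=$(opt_value "$1" arp_interval)
    miimon=${miimon:-0}   # an absent option is treated as 0 (disabled)
    arp=${arp:-0}

    if [ "$miimon" -eq 0 ] && [ "$arp" -eq 0 ]; then
        echo "no-monitoring"; return 1
    fi
    if [ "$miimon" -gt 0 ] && [ "$arp" -gt 0 ]; then
        echo "both-monitors"; return 1
    fi
    echo "ok"
}
```

For example, check_bond_options "mode=active-backup,miimon=100,primary=eno1" prints ok, while the original miimon=0 configuration is reported as no-monitoring.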