Solution: Bonding Failover Not Working

Triage

  1. Examine the bond status:
    • cat /proc/net/bonding/bond0 -- check mode, slaves, link monitoring parameters
    • ip -d link show bond0 -- check bond parameters at kernel level
    • ip link show eno1 ; ip link show eno2 -- check interface states
  2. Check link detection:
    • ethtool eno1 | grep -i link -- check whether ethtool reports the link as down
    • ethtool eno2 | grep -i link -- same for the backup interface
  3. Review the bond configuration:
    • nmcli con show bond0 -- check all bond parameters in NetworkManager
    • Look specifically at bond.options -- is miimon present and non-zero?
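
The triage checks above can be condensed into a small helper. This is a minimal sketch, not part of the original procedure: the function names bond_monitor_summary and bond_has_monitoring are made up for illustration, and it assumes the usual /proc/net/bonding field labels ("MII Polling Interval (ms)", "ARP Polling Interval (ms)").

```shell
#!/bin/sh
# Hypothetical helpers; names are illustrative, not from any standard tool.

# Print only the fields relevant to failover from a bonding status file.
# $1: path to the status file (defaults to /proc/net/bonding/bond0).
bond_monitor_summary() {
    status_file="${1:-/proc/net/bonding/bond0}"
    grep -E '^(Bonding Mode|Currently Active Slave|MII Polling Interval \(ms\)|ARP Polling Interval \(ms\)):' "$status_file"
}

# Exit 0 only if MII or ARP monitoring reports a non-zero polling interval.
bond_has_monitoring() {
    status_file="${1:-/proc/net/bonding/bond0}"
    awk -F': ' '
        /^(MII|ARP) Polling Interval \(ms\):/ && $2 + 0 > 0 { found = 1 }
        END { exit (found ? 0 : 1) }
    ' "$status_file"
}
```

On the affected host, running bond_monitor_summary followed by `bond_has_monitoring || echo "no link monitoring configured"` should flag the problem described under Root Cause.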

Root Cause

The bond is configured as mode=active-backup (mode 1), which is correct for failover. However, the miimon parameter is set to 0, which means link monitoring is completely disabled. Without link monitoring, the bonding driver has no mechanism to detect that eno1's link has gone down. It continues to send all traffic through the "active" slave (eno1) even though the switch port is shut.

eno1 can still show "UP" in ip link because that flag reflects the administrative state of the interface, while carrier is tracked separately. The kernel may detect the carrier loss (NO-CARRIER), but with miimon=0 the bonding driver never polls the slave's link state, so it keeps treating eno1 as usable.
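
One way to confirm this mismatch is to compare the kernel's carrier flag with the per-slave "MII Status" that the bonding driver records. The sketch below assumes the standard /sys/class/net/<iface>/carrier file and the usual /proc/net/bonding layout; the function names are hypothetical, and the optional path arguments exist only so the helpers can be pointed at saved copies.

```shell
#!/bin/sh
# Hypothetical helpers; names are illustrative.

# Print the "MII Status" the bonding driver records for one slave.
# $1: slave name, $2: status file (defaults to /proc/net/bonding/bond0).
bond_slave_mii_status() {
    slave="$1"
    status_file="${2:-/proc/net/bonding/bond0}"
    awk -v slave="$slave" '
        /^Slave Interface: / { current = $3 }
        current == slave && /^MII Status: / { print $3; exit }
    ' "$status_file"
}

# Print the kernel carrier flag (1 = carrier, 0 = no carrier).
# $1: interface, $2: sysfs root (defaults to /sys/class/net).
kernel_carrier() {
    cat "${2:-/sys/class/net}/$1/carrier"
}

# Exit 0 if the kernel sees no carrier while the bond still lists the slave as up.
# $1: interface, $2: status file (optional), $3: sysfs root (optional).
bond_view_is_stale() {
    iface="$1"
    [ "$(kernel_carrier "$iface" "${3:-}")" = "0" ] &&
        [ "$(bond_slave_mii_status "$iface" "${2:-}")" = "up" ]
}
```

If the diagnosis holds, `bond_view_is_stale eno1 && echo "bond has not noticed eno1 is down"` would print the message while the switch port is shut.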

The switch port configuration is not a contributing factor: both ports are configured as individual access ports in VLAN 200, which is correct for active-backup mode. There is no LACP/port-channel mismatch.

Fix

  1. Enable MII link monitoring:
    nmcli con modify bond0 bond.options "mode=active-backup,miimon=100,primary=eno1"
    nmcli con down bond0 && nmcli con up bond0
    
    • miimon=100 polls link status every 100ms
    • The down/up cycle briefly interrupts traffic on bond0, so run it during a maintenance window
  2. Optionally use ARP monitoring instead of miimon:
    • arp_interval=1000 with arp_ip_target=10.20.0.1 (gateway IP)
    • ARP monitoring detects upper-layer failures (e.g., switch ACL blocking traffic but link still up)
    • Note: miimon and arp_interval are mutually exclusive in the bonding driver; choose one
  3. Test failover:
    • Shut down the switch port for eno1
    • Verify bond0 detects the failure within 200-300ms and switches to eno2
    • Confirm continuous ping shows at most 1-2 dropped packets
    • Re-enable the switch port; verify eno1 becomes the active slave again (due to primary=eno1)
  4. Persist the configuration:
    • Verify the setting persists across reboots: nmcli con show bond0 | grep bond.options
    • Test with a server reboot during the maintenance window
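
During the failover test it helps to log when the active slave actually changes. The following is a rough sketch under stated assumptions: the function names are hypothetical, the status file uses the usual "Currently Active Slave:" label, and fractional poll intervals (for example 0.1) depend on a sleep implementation that accepts them, such as GNU sleep.

```shell
#!/bin/sh
# Hypothetical helpers; names are illustrative.

# Print the slave the bond currently transmits on.
# $1: status file (defaults to /proc/net/bonding/bond0).
bond_active_slave() {
    awk -F': ' '/^Currently Active Slave: / { print $2; exit }' \
        "${1:-/proc/net/bonding/bond0}"
}

# Poll the active slave and print a timestamped line whenever it changes.
# $1: status file, $2: poll interval in seconds, $3: number of polls (0 = run until interrupted).
watch_active_slave() {
    status_file="${1:-/proc/net/bonding/bond0}"
    interval="${2:-1}"
    polls="${3:-0}"
    previous="__unset__"
    count=0
    while [ "$polls" -eq 0 ] || [ "$count" -lt "$polls" ]; do
        current=$(bond_active_slave "$status_file")
        if [ "$current" != "$previous" ]; then
            printf '%s active slave: %s\n' "$(date +%T)" "${current:-none}"
            previous="$current"
        fi
        count=$((count + 1))
        sleep "$interval"
    done
}
```

Running `watch_active_slave /proc/net/bonding/bond0 0.1` in one terminal and a continuous ping in another gives a simple record of the switch from eno1 to eno2 and back. The timestamps have one-second resolution, so the ping loss count remains the better measure of how long the failover took.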

Rollback / Safety

  • If the bond goes down during the nmcli con down/up cycle, physical console access is needed.
  • Schedule this during a maintenance window; the bond restart causes a brief (1-2s) outage.
  • If miimon=100 causes false positives (flapping), increase to miimon=200 or miimon=500.

Common Traps

  • Assuming bonds "just work": A bond without link monitoring is just a static mapping; it has no failover capability. Always set miimon or arp_interval.
  • Confusing bond mode with switch config: Active-backup (mode 1) does NOT require switch-side configuration. LACP (mode 4) does. Using LACP on the switch with active-backup on the server will not work.
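  • Using arp_interval with miimon: They are mutually exclusive. Do not set both: the bonding driver disables one monitor when the other is enabled, so which one ends up active depends on the order in which the options are applied.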
  • Not testing failover: Many teams configure bonds and never test them. Schedule periodic failover tests.
  • NetworkManager interference: Older NM versions can fight with manual bond configs. Always use nmcli to configure bonds on RHEL 8+.
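
The first and third traps can be caught before a change is applied by checking the options string itself. This is an illustrative sketch only: bond_opt and check_bond_monitoring are hypothetical names, and the parser assumes the comma-separated key=value form used with bond.options in this document.

```shell
#!/bin/sh
# Hypothetical helpers; names are illustrative.

# Print the value of one key from a comma-separated bond options string.
# $1: options string, $2: key.
bond_opt() {
    printf '%s\n' "$1" | tr ',' '\n' |
        awk -F= -v key="$2" '$1 == key { print $2; exit }'
}

# Check that exactly one link monitor (miimon or arp_interval) is enabled.
# $1: options string. Prints a verdict and returns non-zero on a problem.
check_bond_monitoring() {
    miimon=$(bond_opt "$1" miimon)
    arp=$(bond_opt "$1" arp_interval)
    miimon=${miimon:-0}
    arp=${arp:-0}
    if [ "$miimon" -gt 0 ] && [ "$arp" -gt 0 ]; then
        echo "conflict: miimon and arp_interval are both set"
        return 1
    elif [ "$miimon" -eq 0 ] && [ "$arp" -eq 0 ]; then
        echo "no link monitoring: set miimon or arp_interval"
        return 1
    fi
    echo "ok"
}
```

For example, `check_bond_monitoring "$(nmcli -g bond.options con show bond0)"` would check the live profile, assuming nmcli prints the options in the same comma-separated form.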