
Portal | Level: L2: Operations | Topics: LACP / Link Aggregation, Server Hardware, Linux Networking Tools | Domain: Datacenter & Hardware

Scenario: NIC Flapping in LACP Bond

Situation

At 14:22, the network monitoring dashboard shows intermittent packet loss (15-40%) on web-lb-02, a load balancer handling production HTTPS traffic. The on-call engineer sees the bond interface bond0 repeatedly losing and re-adding its member interfaces every 30-90 seconds. Application teams report sporadic 502 errors from the load balancer. The problem started after last night's switch maintenance window.

What You Know

  • web-lb-02 has two 25GbE NICs (ens3f0 and ens3f1) bonded in LACP (802.3ad) mode
  • The NICs connect to two different top-of-rack switches (TOR-A and TOR-B) in an MLAG pair
  • A switch firmware upgrade and config push happened last night on both TOR switches
  • Before the maintenance, the bond had been stable for 6+ months
  • The host OS (RHEL 8) was not changed

Investigation Steps

1. Check Bond Status and LACP Partner State

Command(s):

# View bond status and member states
cat /proc/net/bonding/bond0

# Check which interfaces are currently active
ip link show bond0
ip link show ens3f0
ip link show ens3f1

# Watch bond state changes in real time
journalctl -u NetworkManager -f --no-pager | grep -i bond

# Check kernel bond messages
dmesg | grep -iE "bond|lacp|ens3f"
What to look for: In /proc/net/bonding/bond0, check Partner Mac Address for each slave -- if it shows 00:00:00:00:00:00, the switch is not responding to LACP PDUs. Look at MII Status (should be "up") and LACP rate (slow = every 30s, fast = every 1s). If the Aggregator ID differs between slaves, they are not in the same LAG from the switch perspective. In dmesg, look for bond0: link status definitely down/up messages showing the flap pattern.
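The zero-partner-MAC check above is easy to script for quick triage. Below is a minimal sketch that scans bond status output with awk; the BOND_STATUS sample is hypothetical -- on the real host, replace it with the contents of /proc/net/bonding/bond0.

```shell
# Hypothetical sample of /proc/net/bonding/bond0 output; on the host use:
#   BOND_STATUS=$(cat /proc/net/bonding/bond0)
BOND_STATUS='Slave Interface: ens3f0
Partner Mac Address: 00:00:00:00:00:00
Slave Interface: ens3f1
Partner Mac Address: 44:38:39:00:00:01'

# Flag any slave whose LACP partner MAC is all zeros: the switch on that
# port is not answering LACP PDUs at all.
echo "$BOND_STATUS" | awk '
  /^Slave Interface:/ { slave = $3 }
  /^Partner Mac Address:/ {
    if ($4 == "00:00:00:00:00:00")
      print slave ": no LACP partner (switch not answering PDUs)"
    else
      print slave ": partner " $4
  }'
```

Against the sample above this prints a warning for ens3f0 and the partner MAC for ens3f1.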

2. Rule Out Physical Layer Issues

Command(s):

# Verify link speed and duplex on both NICs
ethtool ens3f0 | grep -E "Speed|Duplex|Link detected|Auto-negotiation"
ethtool ens3f1 | grep -E "Speed|Duplex|Link detected|Auto-negotiation"

# Check for link errors, CRC errors, or drops
ethtool -S ens3f0 | grep -iE "error|drop|crc|fcs|align"
ethtool -S ens3f1 | grep -iE "error|drop|crc|fcs|align"

# Check NIC driver and firmware version
ethtool -i ens3f0

# Verify no physical layer issues
ethtool --phy-statistics ens3f0 2>/dev/null
What to look for: Both NICs should show 25000Mb/s, Full duplex. If one shows a lower speed or "Link detected: no" intermittently, it could be a cable/optic issue rather than LACP. Rising rx_crc_errors or rx_fcs_errors indicate a physical layer problem (bad cable, dirty optic, failing transceiver). If link is solid but LACP PDUs are not being exchanged, the issue is logical, not physical.
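A single snapshot of the counters is less useful than their rate of change: a large but static rx_crc_errors count may be months old. This sketch computes the growth between two samples; S1 and S2 are hypothetical one-line excerpts -- on the real host capture them with, e.g., S1=$(ethtool -S ens3f0); sleep 10; S2=$(ethtool -S ens3f0).

```shell
# Hypothetical counter samples taken 10s apart (real ones come from ethtool -S)
S1='     rx_crc_errors: 1021'
S2='     rx_crc_errors: 1187'

# Extract the named counter's value from a sample
counter() { printf '%s\n' "$1" | awk '$1 == "rx_crc_errors:" {print $2; exit}'; }

before=$(counter "$S1")
after=$(counter "$S2")
echo "rx_crc_errors grew by $((after - before)) between samples"
```

A nonzero delta during the flap window points at the physical layer; a flat counter with an unstable bond points back at LACP.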

3. Verify LACP Configuration Matches Switch Side

Command(s):

# Check the LACP configuration on the Linux side
cat /proc/net/bonding/bond0 | grep -A5 "802.3ad info"

# Verify bond configuration
cat /etc/sysconfig/network-scripts/ifcfg-bond0
# Or for newer systems using NetworkManager
nmcli connection show bond0 | grep -E "bond\.|802"

# Capture LACP PDUs to see what the switch is sending
sudo tcpdump -i ens3f0 -vv ether proto 0x8809 -c 5

# Compare with the other NIC
sudo tcpdump -i ens3f1 -vv ether proto 0x8809 -c 5
What to look for: The tcpdump output shows raw LACPDU frames. Compare Actor (switch) and Partner (host) info. Check that System (MAC), Key, and Port values are consistent. If the switches were reconfigured and the LACP system-id or key changed, the bond sees them as different systems and cannot form an aggregation. Also verify the LACP timeout (short vs long) matches between host and switch -- a mismatch causes one side to time out and declare the link dead.
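The timeout mismatch can be pulled straight out of the capture: in the LACPDU State Flags, a set Timeout flag means that side is using the short (fast) timeout. The excerpt below is a trimmed, hypothetical rendering of tcpdump -vv output (field layout varies by tcpdump version), showing a host requesting fast while the switch advertises slow.

```shell
# Hypothetical trimmed excerpt of: tcpdump -i ens3f0 -vv ether proto 0x8809
PDU='Actor Information TLV (0x01), length 20
  State Flags [Activity, Aggregation, Synchronization, Collecting, Distributing]
Partner Information TLV (0x02), length 20
  State Flags [Activity, Timeout, Aggregation, Synchronization]'

# Report which LACP rate each side advertises, based on the Timeout flag
echo "$PDU" | awk '
  /Actor Information/   { side = "actor" }
  /Partner Information/ { side = "partner" }
  /State Flags/ {
    rate = ($0 ~ /Timeout/) ? "fast (short timeout)" : "slow (long timeout)"
    print side " advertises " rate
  }'
```

In a capture taken on the host, the Actor is the switch and the Partner is the host, so this sample shows exactly the mismatch described below.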

Root Cause

During the switch maintenance, the configuration push changed the LACP system priority on the TOR switches and moved their LACP rate from fast (1-second PDU interval) to slow (30-second PDU interval). The Linux bond was still configured with lacp_rate=fast, so it expected a PDU every second. When PDUs stopped arriving within the 3-second fast timeout (3 missed PDUs), the bond driver declared each member link failed and removed it from the aggregate; when the next slow-interval PDU arrived up to 30 seconds later, the link was re-added. The result was a continuous flap cycle: up for a few seconds, declared down after 3 seconds of silence, revived by the next PDU.
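The timing of the cycle follows directly from the LACP constants -- a link is declared dead after 3 missed PDUs at the locally expected interval:

```shell
# LACP timeout arithmetic: 3 missed PDUs at the expected interval
FAST_INTERVAL=1   # seconds between PDUs at fast rate
SLOW_INTERVAL=30  # seconds between PDUs at slow rate
MISSED=3

fast_timeout=$((MISSED * FAST_INTERVAL))  # host gives up after 3s of silence
slow_gap=$SLOW_INTERVAL                   # switch's next PDU arrives ~30s later
echo "link declared dead after ${fast_timeout}s; revived ~${slow_gap}s later"
```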

Fix

Immediate:

# Option 1: Change the Linux bond to match the switch (slow LACP rate)
# This is the fastest fix if you cannot change the switch config right now
sudo ip link set bond0 down
echo "slow" | sudo tee /sys/class/net/bond0/bonding/lacp_rate
sudo ip link set bond0 up

# Verify the change took effect
cat /proc/net/bonding/bond0 | grep "LACP rate"

# Make it persistent (RHEL/CentOS)
sudo nmcli connection modify bond0 bond.options "mode=802.3ad,lacp_rate=slow,miimon=100,xmit_hash_policy=layer3+4"
sudo nmcli connection up bond0

# Option 2 (preferred): Ask the network team to restore fast LACP on the switches
# This is the better long-term fix since fast LACP detects failures in 3 seconds vs 90 seconds
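Whichever option is used, confirm the flapping has actually stopped by counting bond link-state transitions in the kernel log after the change (the grep pattern matches the dmesg messages noted in step 1):

```shell
# Count bond0 link up/down transitions since the fix was applied; a stable
# bond should show zero new transitions
flaps=$(sudo journalctl -k --since "5 min ago" --no-pager \
  | grep -c "bond0: .*link status")
echo "bond0 link transitions in the last 5 minutes: $flaps"
```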

Preventive:

  • Document LACP parameters (rate, system priority, hash policy) in a shared runbook that both network and server teams reference before maintenance
  • Add LACP PDU rate monitoring -- alert if the LACP partner timeout changes unexpectedly
  • Use configuration management for switch configs (e.g., Ansible network modules) with pre/post change diffs
  • Include server-side bond validation in the post-maintenance checklist for switch upgrades
  • Set up bond state monitoring that alerts on member link flaps, not just total bond failure
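The runbook item can be enforced with a trivial post-maintenance check on the host. A sketch, assuming the runbook records slow as the agreed rate (EXPECTED is an assumption -- set it from your own runbook):

```shell
# Assert the bond's LACP rate matches the value agreed in the runbook
EXPECTED="slow"  # assumption: the rate your runbook says the switches use
actual=$(awk '/LACP rate/ {print $3}' /proc/net/bonding/bond0)

if [ "$actual" = "$EXPECTED" ]; then
  echo "OK: bond0 lacp_rate is $actual"
else
  echo "ALERT: bond0 lacp_rate is $actual, runbook says $EXPECTED" >&2
fi
```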

Common Mistakes

  • Assuming it is a cable problem because the link keeps going up and down -- LACP flapping looks like a physical issue but is almost always a configuration mismatch
  • Restarting NetworkManager or rebooting the server -- this does not fix a switch-side LACP config mismatch
  • Only checking one side -- LACP requires both the host and the switch to agree on parameters; you must verify both
  • Forgetting to check the MLAG/vPC peer-link between the two TOR switches -- if that link failed during maintenance, the switches present different system IDs and the bond cannot aggregate across them
  • Not capturing LACP PDUs with tcpdump -- this is the fastest way to see exactly what the switch is advertising

Interview Angle

Q: A bonded network interface is flapping. How would you troubleshoot it?

Good answer shape: Start by checking /proc/net/bonding/bond0 to see the bond mode, member states, and partner information. Use ethtool to rule out physical link issues (speed mismatch, CRC errors, link detection). Then capture LACP PDUs with tcpdump -i <iface> ether proto 0x8809 to see what the switch is actually sending. The most common cause of LACP flapping is a parameter mismatch between host and switch -- especially LACP rate (fast vs slow) or system priority changes after switch maintenance. Mention that you would coordinate with the network team to verify switch-side config. A great answer also mentions checking MLAG/vPC status between paired switches.

