Portal | Level: L2: Operations | Topics: LACP / Link Aggregation, Server Hardware, Linux Networking Tools | Domain: Datacenter & Hardware
Scenario: NIC Flapping in LACP Bond
Situation
At 14:22, the network monitoring dashboard shows intermittent packet loss (15-40%) on web-lb-02, a load balancer handling production HTTPS traffic. The on-call engineer sees the bond interface bond0 repeatedly losing and re-adding its member interfaces every 30-90 seconds. Application teams are reporting sporadic 502 errors from the upstream load balancer. The problem started after a switch maintenance window last night.
What You Know
- web-lb-02 has two 25GbE NICs (ens3f0 and ens3f1) bonded in LACP (802.3ad) mode
- The NICs connect to two different top-of-rack switches (TOR-A and TOR-B) in an MLAG pair
- A switch firmware upgrade and config push happened last night on both TOR switches
- Before the maintenance, the bond had been stable for 6+ months
- The host OS (RHEL 8) was not changed
Investigation Steps
1. Check Bond Status and LACP Partner State
Command(s):
# View bond status and member states
cat /proc/net/bonding/bond0
# Check which interfaces are currently active
ip link show bond0
ip link show ens3f0
ip link show ens3f1
# Watch bond state changes in real time
journalctl -u NetworkManager -f --no-pager | grep -i bond
# Check kernel bond messages
dmesg | grep -iE "bond|lacp|ens3f"
What to look for: In /proc/net/bonding/bond0, check the Partner Mac Address for each slave. If it shows 00:00:00:00:00:00, the switch is not responding to LACP PDUs. Look at MII Status (should be "up") and LACP rate (slow = one PDU every 30s, fast = one PDU every 1s). If the Aggregator ID differs between slaves, the two members are not in the same LAG from the switch's perspective. In dmesg, look for "bond0: link status definitely down" and "link status definitely up" messages, which show the flap pattern.
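The Aggregator ID comparison described above can be scripted. The following is a minimal sketch, not a definitive tool: the function name bond_summary is hypothetical, and it assumes the per-member layout of the bonding status file in which each "Slave Interface:" section carries unindented "MII Status:" and "Aggregator ID:" lines.

```shell
# Summarize per-member state from a bonding status file (sketch).
# Usage: bond_summary [file]   (defaults to /proc/net/bonding/bond0)
# Assumption: each "Slave Interface:" section is followed by unindented
# "MII Status:" and "Aggregator ID:" lines for that member.
bond_summary() {
    awk '
        /^Slave Interface:/ { slave = $3; order[++n] = slave }
        slave == ""         { next }                 # skip the bond-level header
        /^MII Status:/      { mii[slave] = $3 }
        /^Aggregator ID:/   { agg[slave] = $3 }
        END {
            for (i = 1; i <= n; i++) {
                s = order[i]
                printf "%s mii=%s aggregator=%s\n", s, mii[s], agg[s]
                if (i > 1 && agg[s] != agg[order[1]]) mismatch = 1
            }
            if (mismatch) print "WARNING: members are in different aggregators"
        }
    ' "${1:-/proc/net/bonding/bond0}"
}
```

Running bond_summary with no argument on the host prints one line per member interface, followed by a warning if the members ended up in different aggregators.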
2. Check Physical Link and NIC Negotiation
Command(s):
# Verify link speed and duplex on both NICs
ethtool ens3f0 | grep -E "Speed|Duplex|Link detected|Auto-negotiation"
ethtool ens3f1 | grep -E "Speed|Duplex|Link detected|Auto-negotiation"
# Check for link errors, CRC errors, or drops
ethtool -S ens3f0 | grep -iE "error|drop|crc|fcs|align"
ethtool -S ens3f1 | grep -iE "error|drop|crc|fcs|align"
# Check NIC driver and firmware version
ethtool -i ens3f0
# Verify no physical layer issues
ethtool --phy-statistics ens3f0 2>/dev/null
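What to look for: Rising rx_crc_errors or rx_fcs_errors counters indicate a physical layer problem (bad cable, dirty optic, failing transceiver). A single nonzero value may be historical, so compare two readings taken a minute or so apart. If the link is solid but LACP PDUs are not being exchanged, the issue is logical, not physical.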
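To tell an actively failing link from stale counters, two snapshots of the statistics can be compared. This is a minimal sketch: counter_growth is a hypothetical helper name, and it assumes the usual "name: value" line layout that ethtool -S prints.

```shell
# Print counters whose value grew between two "ethtool -S" snapshots (sketch).
# Usage: counter_growth before.txt after.txt
# Assumption: each statistic is printed as "name: value" on its own line.
counter_growth() {
    awk -F': *' '
        NR == FNR { before[$1] = $2; next }            # first file: remember values
        ($1 in before) && ($2 + 0 > before[$1] + 0) {  # second file: report increases
            name = $1
            sub(/^[ \t]+/, "", name)
            printf "%s +%d\n", name, $2 - before[$1]
        }
    ' "$1" "$2"
}

# Example (on the host):
#   ethtool -S ens3f0 > /tmp/before.txt; sleep 60; ethtool -S ens3f0 > /tmp/after.txt
#   counter_growth /tmp/before.txt /tmp/after.txt | grep -iE "error|drop|crc|fcs|align"
```

No output from the filtered comparison suggests the error counters are not currently increasing, which points away from a physical fault.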
3. Verify LACP Configuration Matches Switch Side
Command(s):
# Check the LACP configuration on the Linux side
grep -A5 "802.3ad info" /proc/net/bonding/bond0
# Verify bond configuration
cat /etc/sysconfig/network-scripts/ifcfg-bond0
# Or for newer systems using NetworkManager
nmcli connection show bond0 | grep -E "bond\.|802"
# Capture LACP PDUs to see what the switch is sending
sudo tcpdump -i ens3f0 -vv ether proto 0x8809 -c 5
# Compare with the other NIC
sudo tcpdump -i ens3f1 -vv ether proto 0x8809 -c 5
What to look for: The tcpdump output shows decoded LACPDU frames. In frames received from the switch, the Actor fields describe the switch and the Partner fields describe the host; in frames the host sends, the roles are reversed. Check that the System (MAC), Key, and Port values are consistent between captures. If the switches were reconfigured and the LACP system-id or key changed, the bond sees them as different systems and cannot form an aggregation. Also verify that the LACP timeout (short vs long) matches between host and switch, because a mismatch causes one side to time out and declare the link dead.
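The Actor fields can be pulled out of a capture for side-by-side comparison of the two NICs. The sketch below rests on an assumption about tcpdump's verbose LACP decoding: that each frame prints an "Actor Information TLV" line followed by a "System ..., Key ..." line and a "State Flags [...]" line, with "Timeout" listed in the flags when the sender uses the short timeout. The function name lacp_actor_summary is hypothetical; adjust the patterns if your tcpdump version prints a different layout.

```shell
# Summarize the Actor section of decoded LACPDUs read from stdin (sketch).
# Assumption: input is "tcpdump -vv ... ether proto 0x8809" text in which the
# Actor TLV is followed by a "System <mac>, ..., Key <n>, ..." line and a
# "State Flags [...]" line.
lacp_actor_summary() {
    awk '
        /Actor Information TLV/          { in_actor = 1; next }
        /Information TLV|Terminator TLV/ { in_actor = 0 }
        in_actor && /System / && /Key / {
            line = $0; gsub(/,/, "", line)
            n = split(line, f, " ")
            for (i = 1; i < n; i++) {
                if (f[i] == "System" && f[i+1] != "Priority") sys = f[i+1]
                if (f[i] == "Key") key = f[i+1]
            }
        }
        in_actor && /State Flags/ {
            timeout = ($0 ~ /Timeout/) ? "short" : "long"
            print "system=" sys, "key=" key, "timeout=" timeout
        }
    '
}

# Example (on the host):
#   sudo tcpdump -i ens3f0 -vv ether proto 0x8809 -c 5 | lacp_actor_summary
```

Because the capture includes frames in both directions, expect two distinct system values per NIC: the host's own and the switch's. In an MLAG pair the switch-side system value should be the same on both NICs.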
Root Cause
During the switch maintenance, the configuration push changed two LACP settings on the TOR switches: the system priority, and the LACP rate, which went from fast (1-second PDU interval) to slow (30-second PDU interval). The Linux bond was still configured with lacp_rate=fast and expected a PDU every second. Because the switches now sent PDUs only every 30 seconds, no PDU arrived within the host's 3-second fast timeout (three missed 1-second intervals), so the bond driver declared each member link failed and removed it from the aggregate. When the next slow-interval PDU arrived, the link was re-added. The result was a continuous flap cycle: a member comes up on a PDU, is declared down after 3 seconds of silence, and is re-added when the next PDU arrives roughly 30 seconds after the previous one.
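The timing behind this flap cycle is simple arithmetic: the timeout is three PDU intervals, so fast gives 3 x 1 = 3 seconds and slow gives 3 x 30 = 90 seconds. A small sketch makes the mismatch explicit. The helper names are hypothetical, and the check follows this scenario's premise that the switch transmits at its own configured rate.

```shell
# PDU interval in seconds for a given LACP rate.
lacp_interval() {
    case "$1" in
        fast) echo 1 ;;
        slow) echo 30 ;;
        *) echo "usage: lacp_interval fast|slow" >&2; return 1 ;;
    esac
}

# A partner is declared lost after three missed PDU intervals.
lacp_timeout() {
    i=$(lacp_interval "$1") || return 1
    echo $(( i * 3 ))
}

# Succeeds when a receiver configured with rate $1 would time out before a
# sender transmitting at rate $2 delivers its next PDU.
would_time_out() {
    [ "$(lacp_timeout "$1")" -lt "$(lacp_interval "$2")" ]
}

# Example: host expects fast, switch sends slow.
#   would_time_out fast slow && echo "host times out between PDUs"
```

Only the fast-receiver, slow-sender combination trips the check, which is the combination created by the configuration push.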
Fix
Immediate:
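# Option 1: Change the Linux bond to match the switch (slow LACP rate)
# This is the fastest fix if you cannot change the switch config right now
# Note: lacp_rate can only be changed while the bond is down, so expect a
# brief traffic interruption on bond0 while these three commands run
sudo ip link set bond0 down
echo "slow" | sudo tee /sys/class/net/bond0/bonding/lacp_rate
sudo ip link set bond0 up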
# Verify the change took effect
grep "LACP rate" /proc/net/bonding/bond0
# Make it persistent (RHEL/CentOS)
sudo nmcli connection modify bond0 bond.options "mode=802.3ad,lacp_rate=slow,miimon=100,xmit_hash_policy=layer3+4"
sudo nmcli connection up bond0
# Option 2 (preferred): Ask the network team to restore fast LACP on the switches
# This is the better long-term fix since fast LACP detects failures in 3 seconds vs 90 seconds
Preventive:
- Document LACP parameters (rate, system priority, hash policy) in a shared runbook that both network and server teams reference before maintenance
- Add LACP PDU rate monitoring -- alert if LACP partner timeout changes unexpectedly
- Use configuration management for switch configs (e.g., Ansible network modules) with pre/post change diffs
- Include server-side bond validation in the post-maintenance checklist for switch upgrades
- Set up bond state monitoring that alerts on member link flaps, not just total bond failure
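The last item, alerting on member link flaps, could start as a simple log check. The following is a minimal sketch rather than a production monitor: bond_flap_alert is a hypothetical name, the default threshold of 3 is arbitrary, and it assumes the kernel logs member removals with the phrase "link status definitely down" as described in the investigation steps.

```shell
# Alert when kernel log text on stdin contains too many member link-down events.
# Usage: dmesg | bond_flap_alert [threshold]
# Assumption: each member removal is logged with "link status definitely down".
bond_flap_alert() {
    threshold=${1:-3}
    # grep -c prints 0 and exits non-zero when nothing matches; ignore that status
    flaps=$(grep -c 'link status definitely down' || true)
    if [ "$flaps" -ge "$threshold" ]; then
        echo "ALERT: $flaps member link-down events (threshold $threshold)"
        return 1
    fi
    echo "OK: $flaps member link-down events"
}

# Example (on the host), looking at the last 10 minutes of kernel messages:
#   journalctl -k --since "10 minutes ago" --no-pager | bond_flap_alert 3
```

The non-zero return on alert lets a cron job or monitoring agent treat the check like any other failing probe.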
Common Mistakes
- Assuming it is a cable problem because the link keeps going up and down -- LACP flapping looks like a physical issue, but when ethtool shows a stable link with no growing error counters it is far more often a configuration mismatch
- Restarting NetworkManager or rebooting the server -- this does not fix a switch-side LACP config mismatch
- Only checking one side -- LACP requires both the host and the switch to agree on parameters; you must verify both
- Forgetting to check the MLAG/vPC peer-link between the two TOR switches -- if that link failed during maintenance, the switches present different system IDs and the bond cannot aggregate across them
- Not capturing LACP PDUs with tcpdump -- this is the fastest way to see exactly what the switch is advertising
Interview Angle
Q: A bonded network interface is flapping. How would you troubleshoot it?
Good answer shape: Start by checking /proc/net/bonding/bond0 to see the bond mode, member states, and partner information. Use ethtool to rule out physical link issues (speed mismatch, CRC errors, link detection). Then capture LACP PDUs with tcpdump -i <iface> ether proto 0x8809 to see what the switch is actually sending. The most common cause of LACP flapping is a parameter mismatch between host and switch -- especially LACP rate (fast vs slow) or system priority changes after switch maintenance. Mention that you would coordinate with the network team to verify switch-side config. A great answer also mentions checking MLAG/vPC status between paired switches.
Wiki Navigation
Prerequisites
- Datacenter & Server Hardware (Topic Pack, L1)
- Networking Deep Dive (Topic Pack, L1)
Related Content
- Networking Deep Dive (Topic Pack, L1) — LACP / Link Aggregation, Linux Networking Tools
- Bare-Metal Provisioning (Topic Pack, L2) — Server Hardware
- Case Study: API Latency Spike — BGP Route Leak, Fix Is Network ACL (Case Study, L2) — Linux Networking Tools
- Case Study: ARP Flux Duplicate IP (Case Study, L2) — Linux Networking Tools
- Case Study: BIOS Settings Reset After CMOS (Case Study, L1) — Server Hardware
- Case Study: Bonding Failover Not Working (Case Study, L1) — LACP / Link Aggregation
- Case Study: Cable Management Wrong Port (Case Study, L1) — Server Hardware
- Case Study: DHCP Relay Broken (Case Study, L1) — Linux Networking Tools
- Case Study: Database Replication Lag — Root Cause Is RAID Degradation (Case Study, L2) — Server Hardware
- Case Study: Duplex Mismatch Symptoms (Case Study, L1) — Linux Networking Tools