
LACP Footguns

Mistakes that cause bond failures, silent single-link operation, or network outages.


1. Using LACP mode without configuring the switch

You create a mode 4 (802.3ad) bond on the host. The switch ports are still in individual access mode. LACP PDUs go unanswered, the bond forms but only one link is active, and you think you have redundancy when you do not.

Fix: LACP requires switch-side configuration. Verify with cat /proc/net/bonding/bond0 — all members must show the same Aggregator ID. Coordinate with the network team.

Debug clue: In /proc/net/bonding/bond0, look for "Aggregator ID" on each slave. If slaves have different Aggregator IDs, they are in separate aggregation groups and not actually bonded. Also check "Partner Mac Address" — if it shows 00:00:00:00:00:00, the switch is not responding to LACP PDUs at all.
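These checks are easy to script. A minimal sketch in Python; the sample text is illustrative and abbreviated, the exact field layout of /proc/net/bonding/bond0 varies by kernel version, and the function name is mine:

```python
import re

# Illustrative, abbreviated sample of /proc/net/bonding/bond0 output.
SAMPLE = """\
Bonding Mode: IEEE 802.3ad Dynamic link aggregation
MII Status: up

Slave Interface: eth0
MII Status: up
Aggregator ID: 1
Partner Mac Address: 00:00:00:00:00:00

Slave Interface: eth1
MII Status: up
Aggregator ID: 2
Partner Mac Address: aa:bb:cc:dd:ee:ff
"""

def check_lacp_health(bond_proc_text):
    """Return a list of warning strings (empty list = looks healthy)."""
    warnings = []
    agg_ids = set()
    # Each member's stanza begins with "Slave Interface:"
    for section in bond_proc_text.split("Slave Interface:")[1:]:
        name = section.splitlines()[0].strip()
        m = re.search(r"Aggregator ID:\s*(\d+)", section)
        if m:
            agg_ids.add(int(m.group(1)))
        m = re.search(r"Partner Mac Address:\s*([0-9a-fA-F:]+)", section)
        if m and m.group(1) == "00:00:00:00:00:00":
            warnings.append(f"{name}: partner MAC all zeros, switch not answering LACP PDUs")
    if len(agg_ids) > 1:
        warnings.append(f"members split across aggregator IDs {sorted(agg_ids)}, not actually bundled")
    return warnings
```

Running check_lacp_health on the sample above flags both failure modes: the zeroed partner MAC on eth0 and the split aggregator IDs.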


2. Both sides set to LACP passive

Both the host and the switch are in LACP passive mode. Neither side initiates LACP PDU exchange. The bond never forms. Traffic may still flow on individual links, masking the problem.

Fix: At least one side must be active. The Linux bonding driver always runs mode 4 (802.3ad) as LACP active, so a passive/passive deadlock usually means the switch side is passive: set the switch port-channel to active (on Cisco IOS, channel-group N mode active). Verify that partner info appears in /proc/net/bonding/bond0.


3. Speed/duplex mismatch between bond members

One NIC negotiates 10G, the other 1G. LACP requires all members to have the same speed and duplex. The mismatched member gets a different Aggregator ID and is not bundled. You see "2 members" but only one carries traffic.

Fix: Check ethtool on each member. Ensure identical speed/duplex. Replace cables or SFPs if one link is not negotiating correctly.
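Spotting the odd member out is simple once you have collected speed/duplex per member (from ethtool or /sys/class/net/&lt;iface&gt;/speed). A sketch, with a hypothetical function name and interface names:

```python
from collections import Counter

def find_speed_mismatches(member_stats):
    """member_stats: {iface: (speed_mbps, duplex)}.
    Returns members whose speed/duplex differs from the majority;
    LACP leaves these out of the aggregate (different Aggregator ID)."""
    majority, _count = Counter(member_stats.values()).most_common(1)[0]
    return {iface: st for iface, st in member_stats.items() if st != majority}
```

For a 2-member bond there is no majority to vote with; just compare the two directly.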


4. Leaving miimon at 0 (link monitoring disabled)

You create a bond but leave miimon at 0 (disabled). A link goes down physically. The bond driver does not notice for minutes because it is not polling link state. Traffic continues to be hashed to the dead link.

Fix: Always set miimon 100 (100 ms polling). On LACP bonds, the LACP PDU timeout (3x the PDU interval) is a complementary check: it also catches failures where the carrier stays up but PDUs stop flowing.

Default trap: miimon defaults to 0 (disabled) in the bonding driver. With lacp_rate slow (default), LACP PDUs are sent every 30 seconds, so detection takes 90 seconds (3x). Use lacp_rate fast for 1-second PDUs and 3-second detection.
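The detection-time arithmetic above, as a sketch (the 3-missed-PDUs rule comes from the LACP timeout behavior described above):

```python
# LACP declares the partner gone after 3 missed PDUs.
LACP_PDU_INTERVAL = {"slow": 30, "fast": 1}  # seconds between PDUs

def lacp_detection_seconds(lacp_rate):
    return 3 * LACP_PDU_INTERVAL[lacp_rate]

def miimon_detection_seconds(miimon_ms):
    # miimon polls carrier state; worst case is one full polling interval.
    # 0 means polling is disabled and MII detection never fires.
    return miimon_ms / 1000 if miimon_ms else float("inf")
```

So lacp_rate slow means 90 seconds to notice a dead partner, fast means 3 seconds, and miimon 100 catches a physical carrier loss within about 0.1 seconds.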


5. Using balance-rr (mode 0) without understanding packet reordering

Mode 0 round-robins packets across links. TCP expects packets in order. Out-of-order delivery triggers retransmissions and dramatically reduces throughput — sometimes worse than a single link.

Fix: Use mode 4 (802.3ad) for LACP with layer3+4 hash policy. A single flow stays on one link, but aggregate bandwidth benefits from many flows.

Gotcha: TCP retransmissions from out-of-order delivery on mode 0 can reduce effective throughput to 30-40% of a single link. The retransmit overhead is worse than having no bond at all. This is why mode 0 is almost never the right choice for TCP-heavy workloads.
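A toy simulation shows why round-robin across links with unequal latency reorders packets. The model is deliberately simplified (one packet sent per tick, fixed per-link delay); it is not a model of real TCP behavior, just of the reordering itself:

```python
def simulate_round_robin(n_packets, link_delays):
    """Send packets round-robin over links with fixed per-link delays
    (one packet per tick) and count arrivals that are out of sequence."""
    arrivals = []
    for seq in range(n_packets):
        link = seq % len(link_delays)
        arrivals.append((seq + link_delays[link], seq))  # (arrival_time, seq)
    arrivals.sort()  # the receiver sees packets in arrival order
    return sum(1 for prev, cur in zip(arrivals, arrivals[1:]) if cur[1] < prev[1])
```

With equal delays every packet arrives in order; give one link even a few ticks of extra latency and the receiver sees a steady stream of out-of-sequence arrivals, each of which TCP may treat as loss.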


6. Wrong hash policy causing traffic imbalance

You use layer2 hash policy. In environments with a single gateway MAC, all traffic hashes to the same link because the destination MAC is always the gateway. Your 2x10G bond operates as 1x10G.

Fix: Use layer3+4 (source/dest IP + port) for IP-based traffic. This distributes across flows, not MAC addresses.
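The imbalance can be illustrated with a toy hash. Note the assumptions: SHA-256 stands in for the kernel's hash (the real xmit_hash_policy uses its own XOR-based math), and the MACs, IPs, and ports are made up. What matters is only what goes into the hash key:

```python
import hashlib

def pick_link(key, n_links):
    # Toy hash for illustration; not the kernel's actual algorithm.
    return int.from_bytes(hashlib.sha256(key).digest()[:4], "big") % n_links

GATEWAY_MAC = "aa:bb:cc:00:00:01"  # every flow's destination MAC

# layer2: the key is (src MAC, dst MAC), identical for every flow
# through the gateway, so every flow lands on the same link.
l2 = {pick_link(f"02:00:00:00:00:05->{GATEWAY_MAC}".encode(), 2)
      for _ in range(100)}

# layer3+4: the key includes IPs and ports, so distinct flows spread out.
l34 = {pick_link(f"10.0.0.5:{40000 + p}->10.0.1.9:443".encode(), 2)
       for p in range(100)}
```

With layer2 keying, 100 flows use exactly one of the two links; with layer3+4 keying, both links carry traffic.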


7. Changing bond configuration while traffic is flowing

You modify the bond mode or hash policy on a live bond with active connections. The kernel rebuilds the bond, all member links are briefly removed and re-added. Active connections are disrupted. TCP sessions reset.

Fix: Schedule a maintenance window. If using nmcli, bring the bond down before changing options. If changing hash policy only, some changes can be applied live but test first.


8. Forgetting to set primary on active-backup bonds

With active-backup (mode 1) and no primary set, the kernel picks the first available member. After a failover and recovery, the bond may not return to the preferred link. If one link has better performance or monitoring, traffic stays on the suboptimal path.

Fix: Set the primary explicitly: ip link set bond0 type bond primary eth0. With primary_reselect always (the default), traffic returns to the primary as soon as it recovers.


9. Assuming the bond doubles throughput for a single connection

You configure a 2x10G bond and expect 20 Gbps for a single file transfer. A single TCP flow is hashed to one link — you get 10 Gbps maximum. Management asks why the "20G bond" only does 10G. This is not a misconfiguration; it is how hashing works.

Fix: Set expectations correctly. Bonding adds bandwidth for many concurrent flows and adds redundancy for all traffic. A single flow will never exceed a single link's speed.


10. Not monitoring bond state in production

The bond is configured and works at deployment. Six months later, a cable is disturbed during rack maintenance. One member goes down. The bond fails over silently. Nobody notices until the second link fails and there is a full outage with no redundancy.

Fix: Monitor bond member count and state. Alert when any member is down: grep "MII Status: down" /proc/net/bonding/bond0 matches only when a link is down (the first "MII Status" line is the bond's own state, the rest are per-member). Add this to your Prometheus node exporter or Nagios checks.
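A minimal parsing sketch for such a check; the sample text is illustrative and the function name is mine:

```python
# Illustrative, abbreviated /proc/net/bonding/bond0 sample. The first
# "MII Status" line is the bond's own, so only slave stanzas are inspected.
MON_SAMPLE = """\
MII Status: up

Slave Interface: eth0
MII Status: up

Slave Interface: eth1
MII Status: down
"""

def bond_members_down(bond_proc_text):
    """Return the names of bond members reporting 'MII Status: down'."""
    down = []
    for section in bond_proc_text.split("Slave Interface:")[1:]:
        name = section.splitlines()[0].strip()
        if "MII Status: down" in section:
            down.append(name)
    return down
```

Export the length of that list as a gauge and alert on anything above zero; alerting only on "bond0 is down" misses exactly the degraded-but-up state described below.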

War story: Dell documented a case where LACP bond ports on Data Domain appliances silently fell out of the link aggregation group, causing degraded performance and data unavailability. No alerts fired because bond-level monitoring only checked "bond0 is up" — not whether all members were active. The bond was running on a single link for weeks before anyone noticed.