Incident Replay: LACP Mismatch — One Link Hot
Setup
- System context: Server with a 2-port LACP bond to a switch stack. All traffic is going over one link instead of both. The second link shows 'up' but carries no traffic.
- Time: Monday 09:00 UTC
- Your role: Network engineer
Round 1: Alert Fires
[Pressure cue: "Monitoring shows server web-prod-01 using only 50% of expected network bandwidth. One NIC at 100% utilization, the other at 0%."]
What you see:
`cat /proc/net/bonding/bond0` shows both slaves as up, but all traffic flows through eth0. `ethtool -S eth1` shows 0 tx/rx packets. LACP shows both links in the bundle.
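The symptom is that link state and traffic disagree: bonding reports both slaves 'up' while one moves nothing. A minimal Python sketch of that cross-check (the dictionaries stand in for values you would parse from `/proc/net/bonding/bond0` and `ethtool -S`; the counter numbers are hypothetical):

```python
def idle_but_up(mii_status, tx_packets):
    """Flag slaves that bonding reports 'up' but that have moved no packets.

    mii_status: {slave: 'up' | 'down'} as parsed from /proc/net/bonding/bond0
    tx_packets: {slave: count} as reported by `ethtool -S <slave>`
    """
    return [s for s, st in mii_status.items()
            if st == "up" and tx_packets.get(s, 0) == 0]

# The Round 1 picture: eth1 is 'up' in the bundle but has sent nothing.
suspects = idle_but_up({"eth0": "up", "eth1": "up"},
                       {"eth0": 8_421_337, "eth1": 0})
```

A slave flagged here is exactly the case worth investigating at the protocol layer rather than the physical layer.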
Choose your action:
- A) Restart networking on the server
- B) Check the LACP hashing algorithm and switch LACP configuration
- C) Replace the cable on eth1
- D) Change the bond mode to balance-rr
If you chose A:
[Result: The restart re-initializes the bond, but traffic still flows only over eth0. Same symptom as before.]
If you chose B (recommended):
[Result: The server bond mode is 802.3ad (LACP) with xmit_hash_policy=layer2. The switch shows LACP mode 'passive' on one port and 'active' on the other. The passive port never initiated LACP negotiation, and while the link shows 'up' in the bundle, the switch is not distributing traffic to it. Proceed to Round 2.]
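The negotiation rule behind this failure is small enough to model: an LACP link only negotiates if at least one end is 'active' (i.e., sends LACPDUs unprompted), while 'passive' ends only respond. A toy sketch of that rule:

```python
def lacp_negotiates(local_mode, partner_mode):
    # LACP negotiation starts only if at least one end is 'active';
    # two 'passive' ends wait on each other forever and never bundle.
    return "active" in (local_mode, partner_mode)
```

This is why the common guidance is to configure the switch side 'active': it removes any dependency on the peer initiating.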
If you chose C:
[Result: Cable is fine — link is up. The issue is LACP negotiation, not physical.]
If you chose D:
[Result: balance-rr without switch support causes duplicate/out-of-order packets. Wrong fix.]
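Round 1 also surfaced xmit_hash_policy=layer2, which matters even after LACP is fixed: layer2 hashing chooses a slave from the source and destination MAC only, so every flow to a single peer (for example, everything routed via the default gateway) rides one link. A simplified model of that policy (the kernel's real hash also mixes in the packet type, and the MAC addresses below are made up):

```python
def layer2_slave(src_mac, dst_mac, n_slaves):
    # Simplified xmit_hash_policy=layer2: XOR the last octets of the two
    # MAC addresses, modulo the slave count. IPs and ports are ignored.
    s = int(src_mac.split(":")[-1], 16)
    d = int(dst_mac.split(":")[-1], 16)
    return (s ^ d) % n_slaves

GATEWAY = "00:00:5e:00:01:01"  # hypothetical gateway MAC
# A thousand distinct TCP flows through the gateway still share one
# src/dst MAC pair, so they all hash to the same slave:
chosen = {layer2_slave("02:42:ac:11:00:0a", GATEWAY, 2) for _ in range(1000)}
```

Switching to xmit_hash_policy=layer3+4 (hashing IPs and ports) is the usual remedy when most traffic crosses a router, at the cost of occasional per-flow reordering edge cases.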
Round 2: Apply the Fix
[Pressure cue: "Problem scoped. Apply the fix."]
What you see: Round 1 identified the mismatch: one switch port in the port-channel is configured for LACP 'passive' while the other is 'active'. You need to apply the correct fix and verify that traffic uses both links.
Choose your action:
- A) Quick targeted fix: set the passive switch port to LACP active
- B) Comprehensive fix: set both switch ports to LACP active, then verify negotiation and traffic distribution on both links
- C) Apply a workaround while planning the proper fix
- D) Escalate to a specialist team
If you chose A:
[Result: Quick fix resolves the immediate issue but may not be durable. Proceed cautiously.]
If you chose B (recommended):
[Result: Comprehensive fix applied with verification steps. Issue resolved. Proceed to Round 3.]
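Verification can be more than eyeballing an interface: sample per-slave transmit counters twice (e.g. `/sys/class/net/<slave>/statistics/tx_bytes`) while generating traffic, and require every slave to have moved data between samples. A sketch with made-up counter values:

```python
def both_links_carrying(before, after, min_delta_bytes=1):
    # True only if every bond slave transmitted at least min_delta_bytes
    # between the two counter samples.
    return all(after[s] - before[s] >= min_delta_bytes for s in before)

before = {"eth0": 1_000_000, "eth1": 500_000}    # hypothetical first sample
after  = {"eth0": 1_900_000, "eth1": 1_300_000}  # hypothetical second sample
ok = both_links_carrying(before, after)
```

A still-idle slave after the switch change would point at the hashing policy rather than LACP negotiation.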
If you chose C:
[Result: Workaround buys time but the root cause remains. Acceptable short-term.]
If you chose D:
[Result: Specialist is unavailable or adds delay. Try the fix yourself first.]
Round 3: Root Cause Identification
[Pressure cue: "Fix applied. Document root cause and prevention."]
What you see: The root cause is confirmed, and the process or configuration gap that allowed it (no check that port-channel members share the same LACP mode) is identified.
Choose your action:
- A) Fix the specific instance only
- B) Fix the instance and add monitoring
- C) Fix the instance, add monitoring, and update procedures
- D) Comprehensive: fix + monitor + procedure + automation
If you chose D (recommended):
[Result: All layers addressed. Immediate fix, detection, process, and automation. Proceed to Round 4.]
If you chose A:
[Result: Fixes this case but the same mistake can recur.]
If you chose B:
[Result: Better detection next time but does not prevent recurrence.]
If you chose C:
[Result: Good coverage but automation reduces human error further.]
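For the monitoring layer, the useful signal is per-slave traffic share rather than total bond throughput, which in this incident merely looked "low". A minimal sketch of such a check (the threshold and rate values are assumptions to tune):

```python
def one_link_hot(slave_rates, share_threshold=0.9):
    # Alert when a single slave carries almost all bond traffic; this
    # catches the 'one link hot' condition even while total throughput
    # still looks acceptable.
    total = sum(slave_rates.values())
    if total == 0:
        return False  # an idle bond is not an imbalance
    return max(slave_rates.values()) / total > share_threshold

alert = one_link_hot({"eth0": 940.0, "eth1": 0.0})  # Mbit/s, hypothetical
```

Pairing this with an alert on LACP partner/churn state in `/proc/net/bonding/bond0` would have caught the passive port before capacity became a problem.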
Round 4: Remediation
[Pressure cue: "Service restored. Verify and close."]
Actions:
1. Verify the service is functioning correctly
2. Verify monitoring confirms the recovery
3. Update runbooks and procedures
4. Schedule follow-up actions (automation, infrastructure changes)
5. Close the incident with a post-mortem
Damage Report
- Total downtime: Varies based on path taken
- Blast radius: Affected service and dependent systems
- Optimal resolution time: 15 minutes
- If every wrong choice was made: 60 minutes + additional damage
Cross-References
- Primer: Networking
- Primer: Networking Troubleshooting
- Footguns: Networking