# Solution: Link Flaps - Bad Optic

## Triage
- Confirm the link flap pattern on the server. Note the frequency and pattern of up/down transitions.
- Check the current link state and speed negotiation.
- Read the SFP+ module diagnostics (Digital Optical Monitoring). Key values: laser output power (TX), receiver signal (RX), module temperature, supply voltage. Compare against the module's rated thresholds.
- Check the interface error counters.
- Check the switch side.
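The triage steps above map to commands like the following, assuming the flapping interface is `eth2` on the server and `Eth1/15` on the switch (the names used later in this runbook); a sketch to adapt, not an exact transcript:

```shell
# 1. Confirm the flap pattern: look for repeated "Link is Up/Down" messages
dmesg -T | grep -i eth2 | tail -20

# 2. Current link state and negotiated speed
ethtool eth2

# 3. SFP+ Digital Optical Monitoring: TX/RX power, temperature, voltage
ethtool -m eth2

# 4. Interface error counters
ethtool -S eth2 | grep -Ei 'err|crc|drop'

# 5. Switch side (Cisco-style CLI)
show interface Eth1/15
show logging | include Eth1/15
```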
## Root Cause
The DAC cable installed two weeks ago is defective. The `ethtool -m` output shows the TX signal power at -8.2 dBm, well below the -5.0 dBm minimum threshold for this cable type. As the cable degrades (temperature cycling, marginal connection), the signal drops below the receiver sensitivity threshold and the link goes down. When the link re-negotiates, it briefly works before dropping again.
Switch-side logs confirm CRC errors incrementing on the port, consistent with marginal signal quality.
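The threshold comparison can be scripted for alerting. A minimal sketch using the values from this incident (the measured power would normally be parsed out of `ethtool -m` output; the variable names are illustrative):

```shell
# Flag a transceiver whose measured TX power is below its rated minimum.
# -8.2 dBm (measured) and -5.0 dBm (rated minimum) are this incident's values.
tx_power=-8.2   # dBm, read from `ethtool -m eth2`
min_tx=-5.0     # dBm, from the module's rated thresholds

awk -v p="$tx_power" -v m="$min_tx" 'BEGIN {
  if (p + 0 < m + 0)
    print "REPLACE: TX power " p " dBm is below the " m " dBm minimum"
  else
    print "OK: TX power within rated range"
}'
```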
## Fix
- Reseat the cable first (both ends) to rule out a seating issue: pull and firmly re-insert the DAC cable at the server NIC and at the switch port.
- Wait 5 minutes and monitor for continued flapping.
- If flapping continues, replace the cable: swap in a known-good DAC cable or an SFP+ optic + fiber pair.
- If using third-party optics on Cisco, ensure compatibility or use `service unsupported-transceiver` on the switch.
- After the cable swap, verify:
  ```shell
  # Server side
  ethtool eth2                 # Link detected: yes, Speed: 10000Mb/s
  ethtool -m eth2              # TX/RX power within normal range
  ethtool -S eth2 | grep crc   # CRC errors should stop incrementing

  # Switch side
  clear counters interface Eth1/15
  # Wait 5 min, then:
  show interface Eth1/15 counters errors   # Should show zero new errors
  ```
- If a bond/team interface exists, the failover should have handled the link-down events. Verify the bond is healthy and both members are active after the fix.
- Monitor for 24 hours to confirm stability.
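For the bond health check, a sketch assuming a Linux bonding interface named `bond0` (the name and mode vary by site):

```shell
# Full bond status: every slave should show "MII Status: up"
cat /proc/net/bonding/bond0

# Quick per-slave view: each "Slave Interface" line is followed by its MII status
grep -A1 'Slave Interface' /proc/net/bonding/bond0
```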
## Rollback / Safety
- Cable replacement is non-destructive. If the new cable also flaps, the issue may be the NIC port or switch port.
- To test the switch port, move the cable to an adjacent unused port and update the switch config.
- To test the server NIC, try a different SFP+ slot if available.
- If a maintenance window is needed, coordinate with the storage team since replication will pause during the swap.
## Common Traps
- Trap: Assuming it's a switch configuration issue and spending hours on port config. Always check physical layer first for intermittent link issues.
- Trap: Not checking `ethtool -m` for SFP+ diagnostics. This is the fastest way to identify a bad optic/cable.
- Trap: Replacing the cable but not checking the SFP+ module. If using an optic + fiber (not DAC), the SFP+ transceiver itself could be the failing component.
- Trap: Ignoring the switch-side `err-disable` feature. If the switch detects too many flaps, it may shut the port. Check `show interface status err-disabled`.
- Trap: Not labeling and quarantining the bad cable. If it goes back into spares inventory, someone will reuse it and hit the same problem.
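If the port does land in err-disabled, a sketch of clearing it on a Cisco-style switch (exact syntax varies by platform and software version; `Eth1/15` is the port from this runbook):

```
show interface status err-disabled
configure terminal
  errdisable recovery cause link-flap
  errdisable recovery interval 300
  interface Eth1/15
    shutdown
    no shutdown
end
```

Only re-enable the port after the bad cable has been replaced; otherwise the switch will likely err-disable it again.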