Solution: Link Flaps - Bad Optic

Triage

  1. Confirm the link flap pattern on the server:

    dmesg -T | grep -i "eth2.*link"
    
    Note the frequency and pattern of up/down transitions.
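To quantify the flap rate rather than eyeballing it, the dmesg lines can be tallied with a short script. A minimal sketch, using illustrative ixgbe-style log lines (real output varies by driver):

```python
import re
from collections import Counter

# Hypothetical sample lines; in practice feed in `dmesg -T | grep -i "eth2.*link"`.
sample = """\
[Tue Jan  9 10:01:12 2024] ixgbe 0000:3b:00.1 eth2: NIC Link is Down
[Tue Jan  9 10:01:18 2024] ixgbe 0000:3b:00.1 eth2: NIC Link is Up 10 Gbps
[Tue Jan  9 10:07:44 2024] ixgbe 0000:3b:00.1 eth2: NIC Link is Down
[Tue Jan  9 10:07:51 2024] ixgbe 0000:3b:00.1 eth2: NIC Link is Up 10 Gbps
"""

# Count up vs. down transitions for the interface.
counts = Counter(
    "down" if "Link is Down" in line else "up"
    for line in sample.splitlines()
    if re.search(r"eth2.*Link is (Up|Down)", line)
)
print(counts)  # Counter({'down': 2, 'up': 2})
```

A high, roughly periodic down count is the signature of a marginal physical link rather than a one-off event.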

  2. Check current link state and speed negotiation:

    ethtool eth2
    ip link show eth2
    

  3. Read SFP+ module diagnostics (Digital Optical Monitoring):

    ethtool -m eth2
    
    Key values: Laser output power (TX), Receiver signal (RX), module temperature, supply voltage. Compare against the module's rated thresholds.
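The comparison against thresholds can be scripted by parsing the DOM lines out of the `ethtool -m` output. A sketch with an illustrative output excerpt; the threshold constants here are assumptions for this example and should be taken from the module's own alarm/warning fields:

```python
import re

# Illustrative `ethtool -m eth2` excerpt (field names follow ethtool's DOM output).
sample = """\
Laser output power                        : 0.1514 mW / -8.20 dBm
Receiver signal average optical power     : 0.0912 mW / -10.40 dBm
Module temperature                        : 41.2 degrees C
Module voltage                            : 3.2810 V
"""

# Hypothetical thresholds; read the real ones from the module's rated limits.
TX_MIN_DBM = -5.0
RX_MIN_DBM = -12.0

def dbm(line):
    """Pull the dBm figure out of a 'x mW / y dBm' line."""
    m = re.search(r"/\s*(-?\d+\.\d+)\s*dBm", line)
    return float(m.group(1)) if m else None

for line in sample.splitlines():
    if "Laser output power" in line:
        tx = dbm(line)
    elif "Receiver signal" in line:
        rx = dbm(line)

print(f"TX {tx} dBm {'LOW' if tx < TX_MIN_DBM else 'ok'}")  # TX -8.2 dBm LOW
print(f"RX {rx} dBm {'LOW' if rx < RX_MIN_DBM else 'ok'}")  # RX -10.4 dBm ok
```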

  4. Check interface error counters:

    ethtool -S eth2 | grep -i "err\|crc\|drop"
    

  5. Check the switch side:

    show interface Eth1/15
    show interface Eth1/15 counters errors
    show logging | include Eth1/15
    

Root Cause

The DAC cable installed two weeks ago is defective. The ethtool -m output shows TX laser output power at -8.2 dBm, well below the -5.0 dBm minimum for this cable type. As the cable degrades (temperature cycling, a marginal connection), the signal drops below the receiver sensitivity threshold and the link goes down; when the link re-negotiates, it briefly comes up before dropping again.

Switch-side logs confirm CRC errors incrementing on the port, consistent with marginal signal quality.
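To put the -8.2 dBm reading in perspective, dBm is a log scale (P_dBm = 10 · log10(P_mW / 1 mW)), so a few dB of shortfall is a large power deficit. A quick worked conversion using the figures from the root cause above:

```python
def dbm_to_mw(dbm):
    """Convert dBm to milliwatts: P_mW = 10 ** (P_dBm / 10)."""
    return 10 ** (dbm / 10)

tx_dbm = -8.2        # measured TX power (from ethtool -m above)
tx_floor_dbm = -5.0  # minimum spec for this cable type (from the root cause above)

print(round(dbm_to_mw(tx_dbm), 4))        # 0.1514 mW actually launched
print(round(dbm_to_mw(tx_floor_dbm), 4))  # 0.3162 mW required minimum
print(round(tx_dbm - tx_floor_dbm, 1))    # -3.2 dB shortfall: under half the required power
```

A 3.2 dB deficit means the transmitter is launching less than half the specified minimum power, so normal thermal and connector variation is enough to push the receiver below its sensitivity floor.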

Fix

  1. Reseat the cable first (both ends) to rule out a seating issue:
     • Pull and firmly re-insert the DAC cable at both the server NIC and the switch port.
     • Wait 5 minutes and monitor for continued flapping.

  2. If flapping continues, replace the cable:
     • Swap in a known-good DAC cable or an SFP+ optic + fiber pair.
     • If using third-party optics on a Cisco switch, confirm compatibility or enable service unsupported-transceiver on the switch.

  3. After the cable swap, verify:

    # Server side
    ethtool eth2                    # Link detected: yes, Speed: 10000Mb/s
    ethtool -m eth2                 # TX/RX power within normal range
    ethtool -S eth2 | grep crc      # CRC errors should stop incrementing
    
    # Switch side
    clear counters interface Eth1/15
    # Wait 5 min, then:
    show interface Eth1/15 counters errors   # Should show zero new errors
    

  4. If a bond/team interface exists, failover should have handled the link-down events:

    cat /proc/net/bonding/bond0
    
    Verify the bond is healthy and both members are active after the fix.

  5. Monitor for 24 hours to confirm stability.
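The bond health check can also be scripted by parsing /proc/net/bonding/bond0. A sketch using an illustrative active-backup sample (real content varies by bonding mode and kernel version):

```python
# Sample /proc/net/bonding/bond0 content; in practice read the real file.
sample = """\
Bonding Mode: fault-tolerance (active-backup)
Currently Active Slave: eth2
MII Status: up

Slave Interface: eth2
MII Status: up
Link Failure Count: 7

Slave Interface: eth3
MII Status: up
Link Failure Count: 0
"""

# Collect per-slave MII status and failure counts (skips the bond-level header).
slaves = {}
current = None
for line in sample.splitlines():
    if line.startswith("Slave Interface:"):
        current = line.split(": ")[1]
    elif line.startswith("MII Status:") and current:
        slaves[current] = line.split(": ")[1]
    elif line.startswith("Link Failure Count:") and current:
        slaves[current] += f" ({line.split(': ')[1]} past failures)"

print(slaves)  # {'eth2': 'up (7 past failures)', 'eth3': 'up (0 past failures)'}
```

A nonzero Link Failure Count is expected here (it records the flaps that already happened); what matters is that every slave reports MII Status: up and the count stops growing after the fix.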

Rollback / Safety

  • Cable replacement is non-destructive. If the new cable also flaps, the issue may be the NIC port or switch port.
  • To test the switch port, move the cable to an adjacent unused port and update the switch config.
  • To test the server NIC, try a different SFP+ slot if available.
  • If a maintenance window is needed, coordinate with the storage team since replication will pause during the swap.

Common Traps

  • Trap: Assuming it's a switch configuration issue and spending hours on port config. Always check physical layer first for intermittent link issues.
  • Trap: Not checking ethtool -m for SFP+ diagnostics. This is the fastest way to identify a bad optic/cable.
  • Trap: Replacing the cable but not checking the SFP+ module. If using an optic + fiber (not DAC), the SFP+ transceiver itself could be the failing component.
  • Trap: Ignoring the switch-side err-disable feature. If the switch detects too many flaps, it may shut the port. Check show interface status err-disabled.
  • Trap: Not labeling and quarantining the bad cable. If it goes back into spares inventory, someone will reuse it and hit the same problem.