Solution: Link Flaps - Bad Optic

Triage

  1. Confirm the link flap pattern on the server:

    dmesg -T | grep -i "eth2.*link"
    
    Note the frequency and pattern of up/down transitions.
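To quantify the flap rate rather than eyeballing it, the dmesg lines can be tallied with a short script. A minimal sketch, using illustrative ixgbe-style log lines (real output varies by driver):

```python
import re
from collections import Counter

# Hypothetical sample lines; in practice feed in `dmesg -T | grep -i "eth2.*link"`.
sample = """\
[Tue Jan  9 10:01:12 2024] ixgbe 0000:3b:00.1 eth2: NIC Link is Down
[Tue Jan  9 10:01:18 2024] ixgbe 0000:3b:00.1 eth2: NIC Link is Up 10 Gbps
[Tue Jan  9 10:07:44 2024] ixgbe 0000:3b:00.1 eth2: NIC Link is Down
[Tue Jan  9 10:07:51 2024] ixgbe 0000:3b:00.1 eth2: NIC Link is Up 10 Gbps
"""

# Count up vs. down transitions for the interface.
counts = Counter(
    "down" if "Link is Down" in line else "up"
    for line in sample.splitlines()
    if re.search(r"eth2.*Link is (Up|Down)", line)
)
print(counts)  # Counter({'down': 2, 'up': 2})
```

A high, roughly periodic down count is the signature of a marginal physical link rather than a one-off event.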

  2. Check current link state and speed negotiation:

    ethtool eth2
    ip link show eth2
    

  3. Read SFP+ module diagnostics (Digital Optical Monitoring):

    ethtool -m eth2
    
    Key values: Laser output power (TX), Receiver signal (RX), module temperature, supply voltage. Compare against the module's rated thresholds.
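The comparison against thresholds can be scripted by parsing the DOM lines out of the `ethtool -m` output. A sketch with an illustrative output excerpt; the threshold constants here are assumptions for this example and should be taken from the module's own alarm/warning fields:

```python
import re

# Illustrative `ethtool -m eth2` excerpt (field names follow ethtool's DOM output).
sample = """\
Laser output power                        : 0.1514 mW / -8.20 dBm
Receiver signal average optical power     : 0.0912 mW / -10.40 dBm
Module temperature                        : 41.2 degrees C
Module voltage                            : 3.2810 V
"""

# Hypothetical thresholds; read the real ones from the module's rated limits.
TX_MIN_DBM = -5.0
RX_MIN_DBM = -12.0

def dbm(line):
    """Pull the dBm figure out of a 'x mW / y dBm' line."""
    m = re.search(r"/\s*(-?\d+\.\d+)\s*dBm", line)
    return float(m.group(1)) if m else None

for line in sample.splitlines():
    if "Laser output power" in line:
        tx = dbm(line)
    elif "Receiver signal" in line:
        rx = dbm(line)

print(f"TX {tx} dBm {'LOW' if tx < TX_MIN_DBM else 'ok'}")  # TX -8.2 dBm LOW
print(f"RX {rx} dBm {'LOW' if rx < RX_MIN_DBM else 'ok'}")  # RX -10.4 dBm ok
```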

  4. Check interface error counters:

    ethtool -S eth2 | grep -i "err\|crc\|drop"
    

  5. Check the switch side:

    show interface Eth1/15
    show interface Eth1/15 counters errors
    show logging | include Eth1/15
    

Root Cause

The DAC cable installed two weeks ago is defective. The ethtool -m output shows TX laser output power at -8.2 dBm, well below the -5.0 dBm minimum for this cable type. As the cable degrades (temperature cycling, a marginal connection), the signal drops below the receiver sensitivity threshold and the link goes down; when the link re-negotiates, it briefly comes up before dropping again.

Switch-side logs confirm CRC errors incrementing on the port, consistent with marginal signal quality.
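To put the -8.2 dBm reading in perspective, dBm is a log scale (P_dBm = 10 · log10(P_mW / 1 mW)), so a few dB of shortfall is a large power deficit. A quick worked conversion using the figures from the root cause above:

```python
def dbm_to_mw(dbm):
    """Convert dBm to milliwatts: P_mW = 10 ** (P_dBm / 10)."""
    return 10 ** (dbm / 10)

tx_dbm = -8.2        # measured TX power (from ethtool -m above)
tx_floor_dbm = -5.0  # minimum spec for this cable type (from the root cause above)

print(round(dbm_to_mw(tx_dbm), 4))        # 0.1514 mW actually launched
print(round(dbm_to_mw(tx_floor_dbm), 4))  # 0.3162 mW required minimum
print(round(tx_dbm - tx_floor_dbm, 1))    # -3.2 dB shortfall: under half the required power
```

A 3.2 dB deficit means the transmitter is launching less than half the specified minimum power, so normal thermal and connector variation is enough to push the receiver below its sensitivity floor.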

Fix

  1. Reseat the cable first (both ends) to rule out a seating issue:
     • Pull and firmly re-insert the DAC cable at both the server NIC and the switch port.
     • Wait 5 minutes and monitor for continued flapping.

  2. If flapping continues, replace the cable:
     • Swap in a known-good DAC cable or an SFP+ optic + fiber pair.
     • If using third-party optics on a Cisco switch, confirm compatibility or enable service unsupported-transceiver on the switch.

  3. After the cable swap, verify:

    # Server side
    ethtool eth2                    # Link detected: yes, Speed: 10000Mb/s
    ethtool -m eth2                 # TX/RX power within normal range
    ethtool -S eth2 | grep crc      # CRC errors should stop incrementing
    
    # Switch side
    clear counters interface Eth1/15
    # Wait 5 min, then:
    show interface Eth1/15 counters errors   # Should show zero new errors
    

  4. If a bond/team interface exists, failover should have handled the link-down events:

    cat /proc/net/bonding/bond0
    
    Verify the bond is healthy and both members are active after the fix.

  5. Monitor for 24 hours to confirm stability.
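The bond health check can also be scripted by parsing /proc/net/bonding/bond0. A sketch using an illustrative active-backup sample (real content varies by bonding mode and kernel version):

```python
# Sample /proc/net/bonding/bond0 content; in practice read the real file.
sample = """\
Bonding Mode: fault-tolerance (active-backup)
Currently Active Slave: eth2
MII Status: up

Slave Interface: eth2
MII Status: up
Link Failure Count: 7

Slave Interface: eth3
MII Status: up
Link Failure Count: 0
"""

# Collect per-slave MII status and failure counts (skips the bond-level header).
slaves = {}
current = None
for line in sample.splitlines():
    if line.startswith("Slave Interface:"):
        current = line.split(": ")[1]
    elif line.startswith("MII Status:") and current:
        slaves[current] = line.split(": ")[1]
    elif line.startswith("Link Failure Count:") and current:
        slaves[current] += f" ({line.split(': ')[1]} past failures)"

print(slaves)  # {'eth2': 'up (7 past failures)', 'eth3': 'up (0 past failures)'}
```

A nonzero Link Failure Count is expected here (it records the flaps that already happened); what matters is that every slave reports MII Status: up and the count stops growing after the fix.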

Rollback / Safety

  • Cable replacement is non-destructive. If the new cable also flaps, the issue may be the NIC port or switch port.
  • To test the switch port, move the cable to an adjacent unused port and update the switch config.
  • To test the server NIC, try a different SFP+ slot if available.
  • If a maintenance window is needed, coordinate with the storage team since replication will pause during the swap.

Common Traps

  • Trap: Assuming it's a switch configuration issue and spending hours on port config. Always check physical layer first for intermittent link issues.
  • Trap: Not checking ethtool -m for SFP+ diagnostics. This is the fastest way to identify a bad optic/cable.
  • Trap: Replacing the cable but not checking the SFP+ module. If using an optic + fiber (not DAC), the SFP+ transceiver itself could be the failing component.
  • Trap: Ignoring the switch-side err-disable feature. If the switch detects too many flaps, it may shut the port. Check show interface status err-disabled.
  • Trap: Not labeling and quarantining the bad cable. If it goes back into spares inventory, someone will reuse it and hit the same problem.