Incident Replay: Link Flaps from Bad Optic

Setup

  • System context: Production server connected via 10GbE SFP+ optic to a ToR switch. The server handles east-west traffic for a distributed cache cluster.
  • Time: Wednesday 19:15 UTC
  • Your role: On-call SRE / datacenter technician

Round 1: Alert Fires

[Pressure cue: "Monitoring fires — cache cluster node cache-07 has intermittent connectivity. Cache hit rate dropping across the cluster. 5 minutes to auto-escalation."]

What you see: dmesg on cache-07 shows repeated NIC link up/down events every 10-30 seconds. Network monitoring shows the switch port also flapping. The server is intermittently reachable.

Choose your action:

  • A) Restart the network service on cache-07
  • B) Check ethtool eth0 for link status and error counters
  • C) Remove cache-07 from the cluster to stop the impact
  • D) Check the switch logs for port error details

If you chose A:

[Result: Network service restarts, link comes up, flaps again after 15 seconds. Not a software issue.]

If you chose B:

[Result: ethtool -S eth0 shows a high CRC error count and rx_errors incrementing rapidly. ethtool -m eth0 shows the SFP+ module temperature at 78C (threshold: 70C) and Rx power at -18 dBm (threshold: -14 dBm). The optic is overheating and the signal is weak. Proceed to Round 2.]
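The threshold comparison behind option B can be sketched in a few lines. The values are the DOM readings from this scenario; the function and variable names are illustrative, not any real tool's API:

```python
# Sketch of the DOM threshold check behind option B. Readings are the
# values from this scenario; names are illustrative, not a real API.

ALARM_TEMP_C = 70.0   # module temperature high alarm
ALARM_RX_DBM = -14.0  # receiver optical power low alarm

def check_dom(temp_c, rx_power_dbm):
    """Return a list of alarm-threshold violations for one optic."""
    faults = []
    if temp_c > ALARM_TEMP_C:
        faults.append(f"temperature {temp_c}C exceeds {ALARM_TEMP_C}C alarm")
    if rx_power_dbm < ALARM_RX_DBM:
        faults.append(f"Rx power {rx_power_dbm} dBm below {ALARM_RX_DBM} dBm alarm")
    return faults

# cache-07's failing optic: both readings out of spec
print(check_dom(78.0, -18.0))
```

Either violation alone points at the physical layer; both together strongly suggest a failing module rather than a dirty connector.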

If you chose C:

[Result: Removes the impact but does not fix the problem. Cluster is now running at reduced capacity.]

If you chose D:

[Result: Switch logs show "%LINK-3-UPDOWN: Interface TenGigabitEthernet1/0/7, changed state to down" repeating. This confirms the flapping but does not identify the cause. 5 minutes to coordinate.]
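If you want to quantify the flap rather than eyeball the log, a short parser over the syslog lines works. The sample lines below are invented for illustration; the message format follows the Cisco-style log quoted above:

```python
import re

# Sketch: count link up/down transitions per interface from switch
# syslog lines. Sample lines are invented; the message format mirrors
# the Cisco-style %LINK-3-UPDOWN message in this scenario.

FLAP_RE = re.compile(r"%LINK-3-UPDOWN: Interface (\S+), changed state to (up|down)")

def count_transitions(log_lines, interface):
    """Count state-change events logged for one interface."""
    count = 0
    for line in log_lines:
        m = FLAP_RE.search(line)
        if m and m.group(1) == interface:
            count += 1
    return count

logs = [
    "%LINK-3-UPDOWN: Interface TenGigabitEthernet1/0/7, changed state to down",
    "%LINK-3-UPDOWN: Interface TenGigabitEthernet1/0/7, changed state to up",
    "%LINK-3-UPDOWN: Interface TenGigabitEthernet1/0/7, changed state to down",
]
print(count_transitions(logs, "TenGigabitEthernet1/0/7"))  # prints 3
```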

Round 2: First Triage Data

[Pressure cue: "Cache cluster is operating at 92% capacity with cache-07 flapping. Performance degradation is noticeable."]

What you see: The SFP+ optic is a third-party module, not a vendor-certified part. DOM (Digital Optical Monitoring) shows the module is overheating and Rx power is below threshold. Adjacent ports are stable.

Choose your action:

  • A) Replace the SFP+ optic with a certified spare
  • B) Clean the fiber connectors and re-seat the optic
  • C) Move the cable to a different switch port
  • D) Reduce the link speed to 1GbE to reduce power/heat

If you chose B:

[Result: You clean the fiber ends with an IBC cleaner and re-seat the SFP+. The link comes up stable for 2 minutes, then flaps again. Cleaning improved Rx power slightly, but the optic temperature is still climbing: the module itself is failing. Proceed to Round 3.]

If you chose A:

[Result: Correct fix but you should verify it is not a dirty connector first — swapping the optic when a $2 cleaning would fix it wastes a $100 module. Good instinct though.]

If you chose C:

[Result: Moving the cable moves the problem. The flapping follows the optic, not the port. 10 minutes wasted.]

If you chose D:

[Result: 1GbE is insufficient bandwidth for cache traffic. Application performance would be worse than the flapping.]

Round 3: Root Cause Identification

[Pressure cue: "Confirm the root cause so we can close this."]

What you see: Root cause: Third-party SFP+ module is failing — overheating and producing weak signal. The optic passed initial testing at install but degraded over 6 months. Uncertified optics have no vendor support for warranty replacement.

Choose your action:

  • A) Replace with a vendor-certified SFP+ module
  • B) Replace with another third-party module from the spare bin
  • C) Replace and also audit all third-party optics in the rack
  • D) Replace and implement DOM-based monitoring alerts

If you chose C:

[Result: Certified optic installed; the link is stable and DOM readings are nominal. The audit reveals 8 more third-party optics in the rack — schedule proactive replacement. Also add DOM alerting. Proceed to Round 4.]

If you chose A:

[Result: Fixes this server but you have 8 more ticking time bombs in the rack.]

If you chose B:

[Result: Another third-party module from the same batch may have the same failure mode. You are rolling the dice.]

If you chose D:

[Result: DOM alerting is excellent for early warning but does not address the 8 other third-party optics.]
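DOM-based alerting works by paging when readings cross warning thresholds set a margin inside the module's hard alarm limits (70C, -14 dBm in this scenario), so you get warning before the link actually degrades. A minimal sketch of such a rule; the warning values and names here are illustrative:

```python
# Sketch of a DOM early-warning rule: warn thresholds sit inside the
# module's hard alarm limits (70C, -14 dBm) so monitoring pages before
# the link degrades. Values and names are illustrative.

WARN_TEMP_C = 65.0
WARN_RX_DBM = -12.0

def dom_alerts(temp_c, rx_power_dbm):
    """Return early-warning alerts for one optic's DOM readings."""
    alerts = []
    if temp_c > WARN_TEMP_C:
        alerts.append(f"warn: optic temperature {temp_c}C > {WARN_TEMP_C}C")
    if rx_power_dbm < WARN_RX_DBM:
        alerts.append(f"warn: Rx power {rx_power_dbm} dBm < {WARN_RX_DBM} dBm")
    return alerts

print(dom_alerts(78.0, -18.0))  # failing optic: two warnings
print(dom_alerts(40.0, -5.0))   # healthy optic: empty list
```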

Round 4: Remediation

[Pressure cue: "Link is stable. Verify and close."]

Actions:

  1. Verify link stability: ethtool eth0 — link up, no errors for 10+ minutes
  2. Verify DOM readings: ethtool -m eth0 — temperature and Rx power within thresholds
  3. Rejoin cache-07 to the cluster and verify cache hit rate recovers
  4. Schedule replacement of the remaining third-party optics
  5. Add DOM threshold alerts to monitoring (temperature > 65C, Rx power < -12 dBm)
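Step 1's "no errors for 10+ minutes" criterion can be expressed as a simple window check over link-event timestamps. The timestamps and names below are illustrative:

```python
# Sketch of the step-1 stability criterion: the link counts as stable
# when no up/down event has occurred within the verification window
# (10 minutes here). Timestamps are seconds and are illustrative.

def link_stable(event_times, now, window_s=600):
    """True when every recorded link event is older than the window."""
    return all(now - t > window_s for t in event_times)

flap_times = [100, 130, 155]               # events logged during the incident
print(link_stable(flap_times, now=200))    # False: last flap was 45s ago
print(link_stable(flap_times, now=1000))   # True: quiet for over 10 minutes
```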

Damage Report

  • Total downtime: 0 (server was intermittently available; cluster absorbed the impact)
  • Blast radius: Cache cluster degraded; 8% capacity reduction during flapping
  • Optimal resolution time: 15 minutes (diagnose with ethtool -> clean -> replace optic -> verify)
  • If every wrong choice was made: 45+ minutes plus extended cluster degradation

Cross-References