Incident Replay: Link Flaps from Bad Optic¶
Setup¶
- System context: Production server connected via a 10GbE SFP+ optic to a ToR switch. The server handles east-west traffic for a distributed cache cluster.
- Time: Wednesday 19:15 UTC
- Your role: On-call SRE / datacenter technician
Round 1: Alert Fires¶
[Pressure cue: "Monitoring fires — cache cluster node cache-07 has intermittent connectivity. Cache hit rate dropping across the cluster. 5 minutes to auto-escalation."]
What you see:
dmesg on cache-07 shows repeated NIC link up/down events every 10-30 seconds. Network monitoring shows the switch port also flapping. The server is intermittently reachable.
Choose your action:
- A) Restart the network service on cache-07
- B) Check ethtool eth0 for link status and error counters
- C) Remove cache-07 from the cluster to stop the impact
- D) Check the switch logs for port error details
If you chose A:¶
[Result: Network service restarts, link comes up, flaps again after 15 seconds. Not a software issue.]
If you chose B (recommended):¶
[Result: ethtool -S eth0 shows a high CRC error count and rx_errors incrementing rapidly. ethtool -m eth0 shows SFP+ module temperature at 78C (threshold: 70C) and Rx power at -18 dBm (threshold: -14 dBm). The optic is overheating and the signal is weak. Proceed to Round 2.]
If you chose C:¶
[Result: Removes the impact but does not fix the problem. Cluster is now running at reduced capacity.]
If you chose D:¶
[Result: Switch logs show "%LINK-3-UPDOWN: Interface TenGigabitEthernet1/0/7, changed state to down" repeating. Confirms the flapping but does not identify the cause; coordinating with the network team costs 5 minutes.]
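The option-B checks can be sketched in shell. The interface name eth0 and the thresholds come from the scenario; the DOM text below is an illustrative stand-in for live ethtool -m output, whose exact field labels vary by NIC driver:

```shell
#!/bin/sh
# Sketch of the Round 1 triage: parse DOM readings and compare them against
# the scenario's thresholds (70C module temperature, -14 dBm Rx power).
# On a live host the readings come from:
#   ethtool -S eth0   # CRC / rx_errors counters
#   ethtool -m eth0   # DOM: module temperature, Rx power
# A captured sample stands in here so the logic is self-contained.

dom_sample='Module temperature : 78.0 degrees C
Receiver signal average optical power : -18.0 dBm'

temp=$(printf '%s\n' "$dom_sample" | awk -F' : ' '/temperature/ {print $2+0}')
rxpwr=$(printf '%s\n' "$dom_sample" | awk -F' : ' '/optical power/ {print $2+0}')

# Flag each reading that crosses its threshold.
awk -v t="$temp"  'BEGIN { exit !(t > 70) }'  && echo "ALERT: optic overheating (${temp}C, threshold 70C)"
awk -v p="$rxpwr" 'BEGIN { exit !(p < -14) }' && echo "ALERT: weak Rx power (${rxpwr} dBm, threshold -14 dBm)"
```

With the failing optic's readings (78C, -18 dBm), both alert lines fire.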
Round 2: First Triage Data¶
[Pressure cue: "Cache cluster is operating at 92% capacity with cache-07 flapping. Performance degradation is noticeable."]
What you see: The SFP+ optic is a third-party module, not a vendor-certified part. DOM (Digital Optical Monitoring) shows the module is overheating and Rx power is below threshold. Adjacent ports are stable.
Choose your action:
- A) Replace the SFP+ optic with a certified spare
- B) Clean the fiber connectors and re-seat the optic
- C) Move the cable to a different switch port
- D) Reduce the link speed to 1GbE to reduce power/heat
If you chose B (recommended):¶
[Result: You clean the fiber ends with a one-click (IBC-style) cleaner and re-seat the SFP+. The link comes up and stays stable for 2 minutes, then flaps again. Cleaning improved Rx power slightly, but the optic temperature is still climbing: the module itself is failing. Proceed to Round 3.]
If you chose A:¶
[Result: Correct fix but you should verify it is not a dirty connector first — swapping the optic when a $2 cleaning would fix it wastes a $100 module. Good instinct though.]
If you chose C:¶
[Result: Moving the cable moves the problem. The flapping follows the optic, not the port. 10 minutes wasted.]
If you chose D:¶
[Result: 1GbE is insufficient bandwidth for cache traffic. Application performance would be worse than the flapping.]
Round 3: Root Cause Identification¶
[Pressure cue: "Confirm the root cause so we can close this."]
What you see: Root cause: Third-party SFP+ module is failing — overheating and producing weak signal. The optic passed initial testing at install but degraded over 6 months. Uncertified optics have no vendor support for warranty replacement.
Choose your action:
- A) Replace with a vendor-certified SFP+ module
- B) Replace with another third-party module from the spare bin
- C) Replace and also audit all third-party optics in the rack
- D) Replace and implement DOM-based monitoring alerts
If you chose C (recommended):¶
[Result: Certified optic installed. Link is stable, DOM readings nominal. Audit reveals 8 more third-party optics in the rack — schedule proactive replacement. Also add DOM alerting. Proceed to Round 4.]
If you chose A:¶
[Result: Fixes this server but you have 8 more ticking time bombs in the rack.]
If you chose B:¶
[Result: Another third-party module from the same batch may have the same failure mode. You are rolling the dice.]
If you chose D:¶
[Result: DOM alerting is excellent for early warning but does not address the 8 other third-party optics.]
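The audit in option C can be sketched as a comparison of each optic's vendor string against an allow-list. The inventory and vendor names below are illustrative assumptions; on a live host the vendor string comes from ethtool -m for each interface:

```shell
#!/bin/sh
# Audit sketch: flag optics whose vendor is not on the certified list.
# On a live host you would collect the vendor per interface with something like:
#   ethtool -m "$iface" | grep -i 'vendor name'
# A captured inventory (iface vendor) stands in here.

inventory='eth0 FiberStore
eth1 CISCO-FINISAR
eth2 Generic
eth3 CISCO-FINISAR'

certified='CISCO-FINISAR'   # assumed allow-list for this switch vendor

# List interfaces whose optic vendor does not match the allow-list.
suspect=$(printf '%s\n' "$inventory" | awk -v ok="$certified" '$2 != ok {print $1}' | xargs)
echo "Non-certified optics on: $suspect"
```

Running this against the sample inventory flags eth0 and eth2 for proactive replacement.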
Round 4: Remediation¶
[Pressure cue: "Link is stable. Verify and close."]
Actions:
1. Verify link stability: ethtool eth0 — link up; ethtool -S eth0 — error counters flat for 10+ minutes
2. Verify DOM readings: ethtool -m eth0 — temperature and Rx power within thresholds
3. Rejoin cache-07 to the cluster and verify cache hit rate recovers
4. Schedule replacement of remaining third-party optics
5. Add DOM threshold alerts to monitoring (temperature > 65C, Rx power < -12 dBm)
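The step-5 alert conditions can be expressed as a small check. dom_alert is a hypothetical helper, not any particular monitoring system's syntax; the thresholds are the ones listed above, set tighter than the failure points seen in this incident (65C vs 70C, -12 dBm vs -14 dBm) to give early warning:

```shell
#!/bin/sh
# Alerting sketch: evaluate the Round 4 DOM thresholds for one optic.
dom_alert() {
    temp="$1"; rxpwr="$2"; status=ok
    awk -v t="$temp"  'BEGIN { exit !(t > 65) }'  && status=alert
    awk -v p="$rxpwr" 'BEGIN { exit !(p < -12) }' && status=alert
    echo "$status"
}

dom_alert 45 -7     # healthy optic          -> prints "ok"
dom_alert 78 -18    # this incident's optic  -> prints "alert"
```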
Damage Report¶
- Total downtime: 0 (server was intermittently available; cluster absorbed the impact)
- Blast radius: Cache cluster degraded; 8% capacity reduction during flapping
- Optimal resolution time: 15 minutes (diagnose with ethtool -> clean -> replace optic -> verify)
- If every wrong choice was made: 45+ minutes plus extended cluster degradation
Cross-References¶
- Primer: Datacenter & Server Hardware
- Primer: Networking
- Footguns: Datacenter
- Footguns: Networking