Mellanox Switches: Footguns

Mistakes that cause outages, packet loss, or silent performance degradation with Mellanox/NVIDIA Networking switches.

1. Firmware Upgrade Without Reading Release Notes

Mellanox firmware upgrades are not always seamless. Some versions deprecate CLI commands, change default behavior, or require an intermediate version hop. Upgrading directly from an old version to the latest can leave the switch in a broken state.

What happens: Switch boots into the new firmware but features are misconfigured or missing. In MLAG setups, version mismatch between peers can break the MLAG domain.

Fix: Always read the release notes for every version between your current and target. Check the "upgrade path" section. Upgrade one MLAG peer at a time and verify reconvergence before touching the other.
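The upgrade-path check can be sketched as a small planning helper. This is only an illustration: `plan_upgrade_hops` and the `gate_versions` list are hypothetical, the version numbers are made up, and the list of mandatory intermediate releases has to be transcribed by hand from the release notes.

```python
def version_key(version):
    """Turn "3.10.0" into (3, 10, 0) so versions compare numerically."""
    return tuple(int(part) for part in version.split("."))


def plan_upgrade_hops(current, target, gate_versions):
    """Return the versions to install, in order, ending at the target.

    gate_versions is a hand-maintained list (an assumption of this sketch)
    of releases that the release notes say must not be skipped.
    """
    cur, tgt = version_key(current), version_key(target)
    if tgt <= cur:
        return []  # already at or past the target, nothing to install
    gates = sorted(
        (g for g in gate_versions if cur < version_key(g) < tgt),
        key=version_key,
    )
    return gates + [target]


# Made-up version numbers, for illustration only.
print(plan_upgrade_hops("3.4.0", "3.10.0", ["3.6.0", "3.8.0", "3.12.0"]))
```

In an MLAG pair, the same hop list would be applied to one peer at a time, with a reconvergence check after each hop.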

2. PFC Misconfiguration Causing Head-of-Line Blocking or PFC Storms

Priority Flow Control (PFC) is the mechanism that makes Ethernet lossless for RDMA/RoCE. But PFC is a blunt instrument: when the buffer for a priority fills past its threshold, the switch sends pause frames that stop the upstream sender on that priority. If PFC is misconfigured, one slow receiver can back-pressure an entire fabric.

What happens: A storage node falls behind processing RDMA writes. PFC pause frames propagate upstream, eventually pausing traffic on spine ports. Unrelated workloads on the same fabric experience packet loss or stalls. This is a PFC storm.

Fix:

- Enable PFC watchdog (`priority-flow-control watchdog enable`). It detects stuck PFC states and disables PFC on the affected port.
- Scope PFC to only the priority class that needs it (typically priority 3 for RoCE). Never enable PFC on all priorities.
- Use ECN alongside PFC so senders back off before buffers fill.
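The scoping rule lends itself to an automated audit. The sketch below assumes you have already collected, by whatever means, a per-port mapping of priority to PFC state. The function name and the input shape are hypothetical, and priority 3 is only the typical RoCE choice mentioned above.

```python
def pfc_scope_problems(pfc_enabled, lossless_priorities=frozenset({3})):
    """Report priorities where PFC state differs from the intended scope.

    pfc_enabled: dict mapping priority (0-7) to True/False for one port.
    lossless_priorities: priorities that are supposed to be lossless.
    """
    enabled = {prio for prio, on in pfc_enabled.items() if on}
    problems = []
    extra = sorted(enabled - lossless_priorities)
    missing = sorted(lossless_priorities - enabled)
    if extra:
        problems.append(f"PFC enabled on unexpected priorities: {extra}")
    if missing:
        problems.append(f"PFC missing on lossless priorities: {missing}")
    return problems


# A port with PFC switched on for every priority fails the audit.
print(pfc_scope_problems({prio: True for prio in range(8)}))
```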

3. DCBX Trust Mode Mismatch Between Switch and NIC

The switch and the NIC must agree on how to classify traffic into priority queues. If the switch trusts L2 CoS bits but the NIC is marking DSCP (or vice versa), RDMA traffic ends up in the wrong queue. PFC does not protect it, and you get drops under load.

What happens: RDMA works fine at low utilization. Under heavy load, you see packet drops and retransmissions. RoCE performance drops to TCP-like levels. WJH shows buffer congestion on non-lossless queues.

Fix: Verify trust mode is consistent end-to-end:

# On the switch
show dcb priority-flow-control

# On the NIC (Linux host)
mlnx_qos -i eth0

Both must agree: either trust L2 everywhere or trust DSCP everywhere. Most modern deployments use trust DSCP.
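A host-side check can be scripted along these lines. This is a sketch under assumptions: it expects the `mlnx_qos` output to contain a `Priority trust state:` line with the value `pcp` or `dscp`, which may vary by driver version, and the helper names are hypothetical. The switch-side mode is passed in as a plain string rather than parsed from the CLI.

```python
import re

# Host-side value -> the switch-side vocabulary used in this section.
HOST_TO_SWITCH = {"pcp": "L2", "dscp": "DSCP"}


def host_trust_mode(mlnx_qos_output):
    """Extract the trust state from mlnx_qos text, or None if absent."""
    match = re.search(
        r"Priority trust state:\s*(\w+)", mlnx_qos_output, re.IGNORECASE
    )
    return match.group(1).lower() if match else None


def trust_matches(switch_trust, mlnx_qos_output):
    """True only when the host trust state maps to the switch trust mode."""
    host = host_trust_mode(mlnx_qos_output)
    return host is not None and HOST_TO_SWITCH.get(host) == switch_trust.upper()


# Assumed sample output, trimmed to the relevant lines.
sample = "DCBX mode: OS controlled\nPriority trust state: dscp\n"
print(trust_matches("DSCP", sample), trust_matches("L2", sample))
```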

4. Forgetting `configuration write`

Onyx, like most network OS platforms, has separate running and startup configurations. If you make changes and do not run `configuration write`, those changes are lost on the next reload.

What happens: You spend an hour tuning PFC thresholds and ECN parameters. The switch reboots (power event, firmware upgrade, crash). It comes back with the old configuration. Your lossless fabric is no longer lossless.

Fix: Always run `configuration write` after making changes. Automate config backups so you can detect drift between running and startup configs.
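Drift detection can be as simple as diffing the two configurations once your backup job has captured them as text. The sketch below uses Python's standard `difflib`. How the two strings are collected is left open, and `config_drift` is a hypothetical helper name.

```python
import difflib


def config_drift(running, startup):
    """Return unified-diff lines for unsaved changes (empty list if none)."""

    def normalize(text):
        # Ignore blank lines and trailing whitespace so only real changes show.
        return [line.rstrip() for line in text.splitlines() if line.strip()]

    return list(
        difflib.unified_diff(
            normalize(startup),
            normalize(running),
            fromfile="startup",
            tofile="running",
            lineterm="",
        )
    )


# Illustrative config text, not captured from a real switch.
startup_cfg = "interface ethernet 1/1 mtu 9216\n"
running_cfg = startup_cfg + "interface ethernet 1/1 shutdown\n"
for line in config_drift(running_cfg, startup_cfg):
    print(line)
```

Lines prefixed with `+` exist only in the running config and would be lost on reload. An empty result means the switch is safe to reboot.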

5. MLAG Split-Brain from ISL Link Failure

MLAG (Multi-Chassis Link Aggregation) requires an Inter-Switch Link (ISL) between the two peer switches; Onyx documentation calls this the inter-peer link (IPL). If the ISL fails and no backup keepalive is in place, both switches think they are the primary and independently forward traffic.

What happens: Duplicate packets on downstream hosts. MAC table flapping. STP topology changes cascading through the fabric. Applications see duplicate deliveries or connection resets.

Fix:

- Use a dedicated, redundant ISL (two physical links in a port-channel, ideally on different line cards or front-panel groups).
- Configure MLAG heartbeat over the management network as a backup keepalive.
- Test ISL failure scenarios in maintenance windows so you know the blast radius.

6. WJH Not Enabled: Drops Invisible Until You Turn It On

What Just Happened (WJH) is the most powerful debugging tool on a Mellanox switch. It logs every hardware-level packet drop with the reason. But on some firmware versions, WJH forwarding and WJH ACL channels are not enabled by default.

What happens: Packets are being dropped in hardware, but `show what-just-happened` returns nothing. You spend hours troubleshooting at the application layer when the answer was one CLI command away.

Fix: Enable all WJH channels:

what-just-happened auto-export enable
what-just-happened forwarding enable
what-just-happened acl enable

Add these to your base configuration template for every switch.

7. Non-Qualified Optics Silently Degraded

Mellanox switches validate transceiver vendor and part number against a qualified optics list. Non-qualified optics may work but can be silently degraded — speed-limited, DOM readings unavailable, or intermittent link flaps.

What happens: You install third-party optics to save cost. The link comes up but at reduced speed or with intermittent CRC errors. `show interfaces ethernet 1/1 transceiver` shows "unsupported" or "not qualified."

Fix:

- Use NVIDIA-qualified optics for production links. The cost difference is small compared to the debugging time for intermittent optical issues.
- If you must use third-party optics, test thoroughly under load and monitor DOM readings.
- Some firmware versions allow overriding the qualification check, but this is unsupported and voids warranty coverage on that port.

8. REST API Enabled Without Authentication in Lab, Deployed to Production

The Onyx REST API provides full read-write access to switch configuration. In lab environments, it is common to disable authentication for convenience. If that configuration is copied to production, anyone with network access to the management interface can reconfigure the switch.

What happens: An attacker or misconfigured automation tool hits the REST API and modifies switch configuration — disabling ports, changing routing, or exfiltrating the running config (which may contain SNMP community strings or other credentials).

Fix:

- Always require authentication on the REST API (`web https enable`, configure users with strong passwords).
- Restrict management interface access with ACLs: only allow known management subnets.
- Use HTTPS, not HTTP, for the REST API.
- Audit REST API access logs regularly.

9. Running MLAG with Mismatched Firmware Versions

MLAG peers must run the same firmware version (or at least versions within the same supported compatibility range). Mismatched versions can cause subtle protocol disagreements.

What happens: MLAG appears healthy (`show mlag` shows "operational"), but under certain traffic patterns, one peer makes different forwarding decisions than the other. You see intermittent packet loss or asymmetric load distribution.

Fix: Always bring both MLAG peers to the same firmware version. Do it as a rolling upgrade: upgrade one peer, verify convergence, then upgrade the second, keeping the window in which the peers run different versions as short as possible. Check release notes for MLAG compatibility between versions.

10. Ignoring Buffer Utilization Until It Is Too Late

Spectrum ASICs use shared memory buffers. This is a strength (any port can use available buffer space) but also means one congested port can consume buffer space that other ports need.

What happens: A storage burst fills the shared buffer. PFC kicks in and pauses senders. But because the buffer is shared, the congestion spills over and affects unrelated traffic on other ports.

Fix:

- Monitor buffer utilization proactively: `show interfaces ethernet counters buffer`
- Configure per-port or per-priority buffer limits to prevent one flow from hogging all buffer space.
- Use WJH and ECN together for early warning: ECN marks packets before buffers are full, giving senders time to back off.
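To see why ECN gives earlier warning than PFC, it helps to look at how marking ramps up with queue depth. The sketch below assumes the common RED/WRED-style linear profile: no marking below a minimum threshold, every packet marked above a maximum threshold, and a linear ramp in between. The threshold values are illustrative, not Spectrum defaults, and the function name is hypothetical.

```python
def ecn_mark_probability(queue_bytes, min_threshold, max_threshold,
                         max_probability=1.0):
    """Probability that a packet is ECN-marked at a given queue depth.

    Assumes a RED-style profile: 0 below min_threshold, 1 at or above
    max_threshold, and a linear ramp up to max_probability in between.
    """
    if queue_bytes <= min_threshold:
        return 0.0
    if queue_bytes >= max_threshold:
        return 1.0
    span = max_threshold - min_threshold
    return max_probability * (queue_bytes - min_threshold) / span


# Illustrative thresholds in KB: marking starts at 150 and saturates at 1500.
for depth in (100, 825, 2000):
    print(depth, ecn_mark_probability(depth, 150, 1500, max_probability=0.5))
```

For senders to slow down before a pause frame is ever sent, the marking thresholds need to sit below the depth at which PFC triggers.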

11. Oversubscription Assumptions with Breakout Cables

Spectrum switches support port breakout (e.g., one 100G QSFP28 port into 4x 25G). But breakout does not increase total bandwidth — it splits the existing bandwidth across more ports.

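What happens: You break out 32x 100G ports into 128x 25G ports and connect 128 servers. Each server gets 25G, but the switch still has 3.2 Tbps of capacity (128 x 25G is the same 3.2 Tbps as 32 x 100G), and none of it is left for uplinks. Any port you take back for an uplink has to carry traffic from many more hosts than before: 96x 25G downlinks (2.4 Tbps) feeding 8x 100G uplinks (800 Gbps) is 3:1 oversubscribed. If all servers burst simultaneously, the uplinks saturate.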

Fix: Calculate oversubscription ratios before deploying breakout cables. For storage fabrics (which need to be non-blocking), avoid breakout or use it only on low-traffic management ports.
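The ratio is just host-facing bandwidth divided by uplink bandwidth, so it is easy to compute for each candidate port layout before any cables are ordered. A minimal sketch, with a hypothetical helper name and illustrative port counts:

```python
def oversubscription_ratio(downlinks, uplinks):
    """Ratio of host-facing bandwidth to uplink bandwidth.

    Each argument is a list of (port_count, speed_gbps) tuples.
    A result of 1.0 means non-blocking; 3.0 means 3:1 oversubscribed.
    """
    down = sum(count * speed for count, speed in downlinks)
    up = sum(count * speed for count, speed in uplinks)
    if up == 0:
        raise ValueError("no uplink capacity: every port faces a host")
    return down / up


# 24 ports broken out to 96x 25G for hosts, 8 ports kept at 100G for uplinks.
print(oversubscription_ratio([(96, 25)], [(8, 100)]))   # 3.0
# 16x 100G down and 16x 100G up is non-blocking.
print(oversubscription_ratio([(16, 100)], [(16, 100)]))  # 1.0
```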

12. Not Monitoring Temperature in High-Density Racks

Mellanox switches run hot, especially the Spectrum-3 and Spectrum-4 models. In a fully loaded rack with inadequate airflow, switch ASICs will thermal-throttle, reducing throughput.

What happens: During a heatwave or cooling failure, the switch silently reduces port speeds to protect itself. You see throughput degradation that looks like a network issue but is actually thermal.

Fix:

- Monitor `show environment temperatures` via SNMP and alert when ASIC temperature exceeds 85C.
- Ensure proper hot/cold aisle separation and adequate airflow for network switches.
- Spectrum-4 (SN5000) switches require careful rack planning: they pull significant wattage and generate proportional heat.