# Mellanox Switches
Mellanox (now NVIDIA Networking) switches are widely deployed in high-performance datacenter fabrics, providing low-latency InfiniBand and Ethernet connectivity. Understanding their CLI, firmware management, and RDMA configuration is essential for anyone managing HPC clusters or high-throughput network infrastructure.
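As a taste of the tooling involved, firmware on Mellanox/NVIDIA devices is typically inspected with the Mellanox Firmware Tools (MFT). A minimal sketch, assuming MFT is installed on a Linux host; the device path shown is illustrative, not a fixed name:

```shell
# Enumerate Mellanox/NVIDIA devices via the mst service (MFT must be installed)
sudo mst start

# Query each detected device: PSID, currently flashed firmware, available updates
sudo mlxfwmanager --query

# Per-device detail with flint (the /dev/mst/... path below is illustrative)
sudo flint -d /dev/mst/mt4119_pciconf0 query
```

Running these before and after any fabric change gives you a recorded baseline of firmware versions, which is the first thing to compare when chasing the silent degradations described below.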
## Why this matters
In HPC and AI/ML environments, network fabric performance directly impacts job completion times. A firmware mismatch or misconfigured subnet manager can silently degrade throughput or cause intermittent failures that are extremely difficult to diagnose without deep familiarity with the platform.
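A sketch of the quick checks that surface these silent failures, assuming a Linux host on the fabric with the `infiniband-diags` package installed:

```shell
# Which subnet manager is mastering the fabric? (LID, GUID, state, priority)
sminfo

# Per-HCA status: firmware version, link state, rate, assigned LID
ibstat

# Full-fabric sweep: flags bad links, speed/width mismatches, SM conflicts
ibdiagnet
```

Comparing `ibstat` firmware versions across nodes, and confirming `sminfo` reports the subnet manager you expect, catches most firmware-mismatch and duplicate-SM problems before they show up as mysterious throughput dips.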
## Prerequisites
Familiarity with basic networking concepts (VLANs, MTU, IP routing) and datacenter hardware operations.
## Key concepts covered
- Switch OS options: Onyx (MLNX-OS) vs Cumulus Linux vs SONiC
- InfiniBand fundamentals: subnet managers, partitions, and QoS
- RDMA and RoCE: when and why to use remote direct memory access over Ethernet
- Firmware lifecycle: upgrade procedures, version compatibility, and rollback
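To connect the InfiniBand-vs-RoCE distinction above to something observable, here is a hedged sketch using `ibv_devinfo` from libibverbs on a Linux host; the grep pattern is just a convenience filter:

```shell
# Show each RDMA device and whether its port runs native InfiniBand or RoCE
ibv_devinfo | grep -E 'hca_id|link_layer|state|active_mtu'

# link_layer: InfiniBand -> native IB; the port needs a subnet manager on the fabric
# link_layer: Ethernet   -> RoCE; the port needs lossless Ethernet (PFC/ECN)
#                           and a consistent MTU end to end
```

The `link_layer` field is the quickest way to confirm which mode a ConnectX port is actually in, and `active_mtu` is worth checking against your switch configuration before blaming the application.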
## Contents
Start with the primer to understand the hardware and OS landscape, then move to operational recipes and pitfalls.
| # | File | What it covers |
|---|---|---|
| 1 | Primer | Switch models, Onyx/Cumulus OS, InfiniBand vs Ethernet modes, and RDMA basics |
| 2 | Street Ops | Firmware upgrades, port configuration, diagnostics, and cable troubleshooting |
| 3 | Footguns & Pitfalls | Firmware mismatches, subnet manager conflicts, and MTU traps |