Mellanox Switches

Mellanox (now NVIDIA Networking) switches are widely deployed in high-performance datacenter fabrics, providing low-latency InfiniBand and Ethernet connectivity. Understanding their CLI, firmware management, and RDMA configuration is essential for anyone managing HPC clusters or high-throughput network infrastructure.

Why this matters

In HPC and AI/ML environments, network fabric performance directly impacts job completion times. A firmware mismatch or misconfigured subnet manager can silently degrade throughput or cause intermittent failures that are extremely difficult to diagnose without deep familiarity with the platform.

Prerequisites

Familiarity with basic networking concepts (VLANs, MTU, IP routing) and datacenter hardware operations.

Key concepts covered

  • Switch OS options: Onyx (MLNX-OS) vs Cumulus Linux vs SONiC
  • InfiniBand fundamentals: subnet managers, partitions, and QoS
  • RDMA and RoCE: when and why to use remote direct memory access over Ethernet
  • Firmware lifecycle: upgrade procedures, version compatibility, and rollback
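The firmware-lifecycle point above hinges on comparing versions correctly: dotted firmware versions must be compared numerically, field by field, not as strings. Here is a minimal illustrative sketch of a fleet-wide compatibility audit; the switch names, firmware versions, and helper functions are hypothetical, not part of any NVIDIA tooling.

```python
# Hypothetical sketch: audit a fleet for firmware below a required minimum.
# Version strings and hostnames are invented for illustration.

def parse_fw_version(version: str) -> tuple[int, ...]:
    """Parse a dotted firmware version (e.g. "16.35.2000") into a comparable tuple."""
    return tuple(int(part) for part in version.split("."))

def is_compatible(installed: str, minimum: str) -> bool:
    """True if the installed firmware meets or exceeds the required minimum."""
    return parse_fw_version(installed) >= parse_fw_version(minimum)

# Hypothetical fleet inventory: hostname -> reported firmware version.
fleet = {
    "leaf01": "16.35.2000",
    "leaf02": "16.32.1010",
    "spine01": "16.35.3006",
}

MINIMUM = "16.35.2000"
outdated = [name for name, fw in fleet.items() if not is_compatible(fw, MINIMUM)]
print(outdated)  # switches needing an upgrade
```

Note that tuple comparison handles the numeric ordering (16.35.3006 > 16.35.2000 even though "3006" < "2000" lexicographically in some schemes), which a naive string comparison would get wrong for versions like "16.9.x" vs "16.10.x".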

Contents

Start with the primer to understand the hardware and OS landscape, then move to operational recipes and pitfalls.

| # | File | What it covers |
|---|------|----------------|
| 1 | Primer | Switch models, Onyx/Cumulus OS, InfiniBand vs Ethernet modes, and RDMA basics |
| 2 | Street Ops | Firmware upgrades, port configuration, diagnostics, and cable troubleshooting |
| 3 | Footguns & Pitfalls | Firmware mismatches, subnet manager conflicts, and MTU traps |