Mellanox Switches — Primer

Why This Matters

Mellanox (now NVIDIA Networking) switches are the backbone of modern high-performance datacenter fabrics. If you work in any environment running HPC, AI/ML training clusters, or large-scale storage fabrics, you will encounter Spectrum-based switches running Onyx (MLNX-OS). These are not commodity switches — they are purpose-built for lossless, low-latency Ethernet with native RDMA/RoCE support. Knowing how to operate them is the difference between a fabric that delivers line-rate RDMA and one that drops packets silently.

Unlike Cisco or Arista where the CLI is broadly similar, Onyx has its own conventions, its own firmware lifecycle, and its own tooling (WJH, UFM). If you approach a Mellanox switch expecting IOS-XE or EOS behavior, you will get bitten.

Core Concepts

1. Mellanox to NVIDIA Networking Lineage

Who made it: Mellanox Technologies was founded in 1999 in Yokneam, Israel by Eyal Waldman and several co-founders. The company specialized in InfiniBand and high-speed Ethernet interconnects. NVIDIA announced the acquisition in March 2019 for $6.9 billion and completed it in April 2020; at the time it was the largest acquisition in NVIDIA's history. The goal was to own the datacenter networking stack alongside its GPU compute business.

The switch product line continues under the "NVIDIA Networking" brand, but the operating system is still called Onyx (formerly MLNX-OS). In practice, you will see the following names used interchangeably in documentation, firmware filenames, and vendor conversations:

  • Mellanox — the original company and still the name on older hardware labels
  • MLNX-OS — the original OS name, still appears in firmware filenames
  • Onyx — the current marketing name for the switch OS
  • NVIDIA Networking — the corporate umbrella post-acquisition

The ConnectX NIC line and BlueField DPU line are separate products. This topic covers switches only.

2. Spectrum ASIC Family

Every Mellanox Ethernet switch is built around a Spectrum ASIC (the InfiniBand switch line uses different silicon and is out of scope here). The ASIC generation determines throughput, port density, buffer depth, and supported features:

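| ASIC                | Launched | Max Port Speed | Typical Models         | Use Case                  |
|---------------------|----------|----------------|------------------------|---------------------------|
| Spectrum (SN2000)   | 2016     | 100G           | SN2100, SN2410, SN2700 | ToR, leaf                 |
| Spectrum-2 (SN3000) | 2019     | 200G           | SN3420, SN3700         | Spine, leaf, storage      |
| Spectrum-3 (SN4000) | 2021     | 400G           | SN4600, SN4700         | Spine, AI fabric          |
| Spectrum-4 (SN5000) | 2023     | 800G           | SN5600                 | AI/HPC spine, ultra-scale |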

Key points:

  • Each generation roughly doubles bandwidth density
  • All Spectrum ASICs support hardware-based WJH (What Just Happened)
  • Buffer architecture is shared memory (not per-port), which is better for bursty traffic
  • Spectrum-3 and later add enhanced telemetry and deeper packet inspection
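The "roughly doubles" point can be checked with quick arithmetic on the 32-port models named in this primer (SN2700, SN3700, SN4700). The following is a minimal sketch: it multiplies port count by port speed and ignores breakout modes and oversubscription. The function and variable names are made up for this example.

```python
# Front-panel capacity = port count x per-port speed.
# Port counts and speeds are the 32-port models listed in this primer.
MODELS = {
    "SN2700": (32, 100),  # Spectrum,   32 x 100G
    "SN3700": (32, 200),  # Spectrum-2, 32 x 200G
    "SN4700": (32, 400),  # Spectrum-3, 32 x 400G
}


def front_panel_tbps(ports: int, speed_gbps: int) -> float:
    """Aggregate one-direction front-panel bandwidth in Tb/s."""
    return ports * speed_gbps / 1000


capacities = {name: front_panel_tbps(*spec) for name, spec in MODELS.items()}

for name, tbps in capacities.items():
    print(f"{name}: {tbps} Tb/s")
```

With the same port count, each step up in ASIC generation doubles the front-panel total, which is where the rule of thumb comes from.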

Under the hood: WJH (What Just Happened) is a hardware-level packet drop recording feature unique to Spectrum ASICs. When a packet is dropped — whether by ACL, buffer overflow, or routing miss — the ASIC records the reason, the packet header, and a timestamp directly in hardware. This eliminates the "where did the packet go?" guessing game that plagues debugging on other switch platforms. The command show what-just-happened is the single most powerful debugging tool on Onyx.
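Once drop events have been collected, a first triage step is usually to count them by reason. The sketch below assumes the events are already parsed into records with a reason, a timestamp, and header fields, as described above. The DropRecord shape, its field names, and the sample values are hypothetical and do not reflect the real WJH output format.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass(frozen=True)
class DropRecord:
    """Hypothetical shape of one drop event; not the real WJH output format."""
    timestamp: str  # when the ASIC recorded the drop
    reason: str     # e.g. ACL deny, buffer overflow, routing miss
    src_ip: str     # taken from the recorded packet header
    dst_ip: str


def top_drop_reasons(records, limit=3):
    """Return the most frequent drop reasons, most common first."""
    return Counter(r.reason for r in records).most_common(limit)


# Made-up sample data for illustration only.
sample = [
    DropRecord("2024-01-01T00:00:01", "acl_deny", "10.0.0.1", "10.0.1.5"),
    DropRecord("2024-01-01T00:00:02", "buffer_overflow", "10.0.0.2", "10.0.1.5"),
    DropRecord("2024-01-01T00:00:03", "acl_deny", "10.0.0.1", "10.0.1.6"),
]

summary = top_drop_reasons(sample)
print(summary)
```

Grouping by reason first tells you whether you are chasing a policy problem (ACL), a congestion problem (buffer), or a forwarding problem (routing miss) before you look at individual packets.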

3. Onyx (MLNX-OS) CLI Fundamentals

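Onyx uses a mode-based CLI similar to Cisco IOS, but with its own command vocabulary. The prompt shows where you are: > for the standard prompt, # after enable, and (config) # in configuration mode. Day to day, you work in two ways: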

Operational mode — view state, run show commands:

switch > show interfaces ethernet 1/1 status
switch > show ip route
switch > show lldp interfaces

Configuration mode — make changes:

switch > enable
switch # configure terminal
switch (config) # interface ethernet 1/1
switch (config interface ethernet 1/1) # speed 100G auto
switch (config interface ethernet 1/1) # no shutdown
switch (config interface ethernet 1/1) # exit
switch (config) # configuration write

Critical difference from Cisco: the native save command is configuration write, not copy running-config startup-config.

Gotcha: Cisco muscle memory is a leading source of errors on Mellanox switches. Save with configuration write; some releases also accept write memory as an alias, but confirm that in the command reference for your firmware before scripting around it. Similarly, show run works but the output format differs from IOS. Interface naming uses ethernet 1/1 (not GigabitEthernet0/1), and MLAG configuration has completely different syntax from Cisco vPC. Do not assume command equivalence.
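One way to catch Cisco-isms before they reach a switch is to lint command scripts offline. The sketch below is a hypothetical helper, not an NVIDIA tool: it flags the two Cisco-style save commands named above and suggests the native Onyx command. The mapping covers only what this primer states.

```python
# Cisco-style save commands mapped to the Onyx command named in this primer.
CISCO_TO_ONYX = {
    "write memory": "configuration write",
    "copy running-config startup-config": "configuration write",
}


def lint_commands(lines):
    """Return (line_number, original, suggestion) for each Cisco-ism found."""
    findings = []
    for number, line in enumerate(lines, start=1):
        # Normalize case and whitespace so "Write  Memory" still matches.
        normalized = " ".join(line.lower().split())
        if normalized in CISCO_TO_ONYX:
            findings.append((number, line.strip(), CISCO_TO_ONYX[normalized]))
    return findings


script = [
    "interface ethernet 1/1",
    "no shutdown",
    "exit",
    "write memory",
]
findings = lint_commands(script)
print(findings)
```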

Default trap: Onyx interfaces default to admin-down on most firmware versions. New installs require explicit no shutdown on every interface — unlike some Cisco platforms where interfaces come up by default.
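Because every port may need an explicit no shutdown, it helps to generate the command list rather than type it. The sketch below only builds the CLI lines shown in the configuration-mode example above; how they are delivered (console, SSH, automation tool) is left open, and the function name is made up for this example.

```python
def build_bringup_commands(ports):
    """Return CLI lines that enable each port, then save the configuration."""
    commands = ["enable", "configure terminal"]
    for port in ports:
        commands += [
            f"interface ethernet {port}",
            "no shutdown",
            "exit",
        ]
    commands.append("configuration write")
    return commands


# Ports 1/1 through 1/4 as an example range.
ports = [f"1/{n}" for n in range(1, 5)]
commands = build_bringup_commands(ports)
print("\n".join(commands))
```

Review the generated list before applying it, since enabling a port that faces an unprepared link can have side effects on a live fabric.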

Key show commands:

| Command                           | Purpose                      |
|-----------------------------------|------------------------------|
| show interfaces ethernet status   | All port states, speed, link |
| show interfaces ethernet counters | Packet/byte/error counters   |
| show ip route                     | Routing table                |
| show ip bgp summary               | BGP neighbor states          |
| show lldp interfaces              | LLDP neighbor discovery      |
| show mac-address-table            | Learned MACs                 |
| show what-just-happened           | WJH packet drop log          |
| show mlag                         | MLAG domain status           |
| show version                      | Firmware, uptime, model      |

4. Switch Platform Models and Use Cases

Mellanox switches slot into datacenter fabrics at specific tiers:

Top-of-Rack (ToR) / Leaf:

  • SN2100 (16x 100G): compact 1U, half-width option, popular in storage clusters
  • SN2700 (32x 100G): standard leaf for Clos fabrics
  • SN3420 (48x 25G + 12x 100G): server-facing leaf with high port density

Spine:

  • SN3700 (32x 200G): Spectrum-2 spine
  • SN4600C (64x 100G): high-radix Spectrum-3 spine (the SN4600 without the C suffix is the 64x 200G variant)
  • SN4700 (32x 400G): AI/HPC spine

Storage Fabric:

  • SN2100 and SN2700 are common in dedicated storage networks (NVMe-oF, RDMA)
  • Low latency and lossless Ethernet are the selling points here

5. Licensing Model and Feature Tiers

Onyx licensing is simpler than Cisco or Arista:

  • Base license — included with hardware, covers L2/L3 switching, BGP, OSPF, VXLAN, MLAG, RDMA/RoCE, WJH, REST API, SNMP
  • Enhanced license — adds features like NAT, PBR (policy-based routing), advanced ACL counters
  • UFM license — separate product for fabric-wide management (see below)

Most datacenter deployments run on the base license. The feature set is generous compared to competitors where equivalent features require expensive add-on licenses.
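As a planning aid, the tier lists above can be turned into a small lookup. This is a sketch built only from the features named in this section; the feature strings and the function name are illustrative, and the authoritative list is whatever your NVIDIA license documentation says.

```python
# Feature sets copied from the tier descriptions in this section.
BASE = {
    "L2/L3 switching", "BGP", "OSPF", "VXLAN", "MLAG",
    "RDMA/RoCE", "WJH", "REST API", "SNMP",
}
ENHANCED = {"NAT", "PBR", "advanced ACL counters"}


def required_tier(features):
    """Return 'base' or 'enhanced' for a list of wanted features."""
    wanted = set(features)
    unknown = wanted - BASE - ENHANCED
    if unknown:
        # Refuse to guess about features this section does not mention.
        raise ValueError(f"not covered in this primer: {sorted(unknown)}")
    return "enhanced" if wanted & ENHANCED else "base"


print(required_tier(["BGP", "VXLAN", "MLAG"]))
print(required_tier(["BGP", "NAT"]))
```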

Interview tip: A common question for datacenter network roles: "Why Mellanox over Cisco or Arista?" The key differentiators are: (1) native lossless Ethernet/RoCE support baked into the ASIC, (2) WJH for hardware-level drop analysis, (3) generous base licensing (BGP, VXLAN, RDMA included), and (4) shared buffer architecture that handles micro-bursts better than per-port buffered switches. For AI/ML training clusters, the RDMA/RoCE support is the deciding factor.
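The micro-burst argument is easier to defend with numbers. The toy model below compares a statically partitioned buffer with a fully shared pool for a single bursting port. All sizes are hypothetical and not taken from any Spectrum datasheet, and the model assumes every other port is idle and that no per-port cap applies, which real shared-buffer implementations do not guarantee.

```python
def dropped_bytes(burst: int, capacity: int) -> int:
    """Bytes that do not fit when a burst hits a buffer of the given capacity."""
    return max(0, burst - capacity)


TOTAL_BUFFER = 16 * 1024 * 1024  # hypothetical 16 MiB of packet buffer
PORTS = 32                       # hypothetical port count
BURST = 2 * 1024 * 1024          # hypothetical 2 MiB micro-burst to one port

# Static partitioning: each port owns an equal, fixed slice of the buffer.
per_port_drop = dropped_bytes(BURST, TOTAL_BUFFER // PORTS)

# Shared pool: the bursting port may borrow idle buffer from the others.
shared_drop = dropped_bytes(BURST, TOTAL_BUFFER)

print(f"per-port buffer drops {per_port_drop} bytes")
print(f"shared buffer drops {shared_drop} bytes")
```

In this simplified case the partitioned design drops three quarters of the burst while the shared pool absorbs all of it; with many ports bursting at once the advantage shrinks.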

6. Management Interfaces

CLI (SSH/console): Primary operational interface. Serial console for initial setup and recovery.

REST API: Onyx exposes a JSON REST API on the management interface:

# Get interface status
# The URL is quoted so the shell does not treat "&" as a background operator
curl -k -u admin:password 'https://switch-mgmt/admin/launch?script=rh&template=json-request' \
  -d '{"cmd": "show interfaces ethernet 1/1 status"}'

The REST API is useful for automation but has quirks — it wraps CLI commands rather than exposing a native object model.
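For automation, the same request can be assembled in Python. The sketch below mirrors the curl example using only the standard library and builds the request without sending it, so it can be inspected or unit-tested offline. The host, credentials, and URL path are the placeholders from the curl example, not verified endpoint details, and the helper name is made up.

```python
import base64
import json
import urllib.request


def build_cli_request(host, user, password, cmd):
    """Build (but do not send) the POST request that the curl example makes."""
    url = f"https://{host}/admin/launch?script=rh&template=json-request"
    body = json.dumps({"cmd": cmd}).encode("utf-8")
    # Same credentials encoding that "curl -u user:password" produces.
    token = base64.b64encode(f"{user}:{password}".encode("utf-8")).decode("ascii")
    return urllib.request.Request(
        url,
        data=body,
        method="POST",
        headers={
            "Authorization": f"Basic {token}",
            # "curl -d" defaults to this content type; kept to mirror the example.
            "Content-Type": "application/x-www-form-urlencoded",
        },
    )


request = build_cli_request(
    "switch-mgmt", "admin", "password", "show interfaces ethernet 1/1 status"
)
print(request.get_method(), request.full_url)
```

Actually sending the request would require urllib.request.urlopen with an SSL context that matches your certificate policy (the curl example uses -k to skip verification), which is deliberately left out here.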

SNMP: Standard SNMP v2c/v3 support. MIBs are available from the NVIDIA support portal.

UFM (Unified Fabric Manager): Centralized management platform for multi-switch fabrics. Provides topology discovery, health monitoring, firmware orchestration, and telemetry aggregation. Runs as a separate appliance or VM — it is not part of the switch OS.

Syslog/Streaming Telemetry: Syslog to remote collectors. Newer firmware supports gNMI-based streaming telemetry for real-time counter export.

Quick Reference

# Check firmware and hardware
show version

# Interface status at a glance
show interfaces ethernet status

# Error counters (look for CRC, input errors, drops)
show interfaces ethernet counters errors

# LLDP neighbors (who is connected to what)
show lldp interfaces

# Routing table
show ip route vrf default

# BGP summary
show ip bgp summary

# MLAG status
show mlag

# WJH — what packets were dropped and why
show what-just-happened

# Save running config
configuration write

# Transceiver diagnostics
show interfaces ethernet 1/1 transceiver

# Temperature and fans
show environment
