Mellanox Switches — Street-Level Ops

Real-world patterns and debugging techniques for Mellanox/NVIDIA Networking switches in production.

Quick Diagnosis Commands

The first five commands you run when something is wrong:

Remember: show what-just-happened (WJH) is Mellanox's killer feature. It logs every hardware-level packet drop with a reason code. No other vendor makes this as easy. Always check WJH first when troubleshooting drops.

show interfaces ethernet status               # Link up/down, speed negotiation
show what-just-happened                        # Hardware-level drop reasons
show interfaces ethernet counters errors       # CRC, runts, giants, input errors
show mlag                                      # MLAG domain health, split-brain check
show environment                               # Thermals, fans, PSU

Common Scenarios

Scenario 1: Link Down Despite Seated Cable

Symptoms: Interface shows "down" or "no carrier" despite the cable being seated.

Diagnosis steps:

# Check physical link state
show interfaces ethernet 1/1 status

# Check transceiver — is it detected, qualified?
show interfaces ethernet 1/1 transceiver

# Check speed/auto-negotiation settings
show interfaces ethernet 1/1

# Check for admin shutdown
show running-config interface ethernet 1/1

Resolution:

- If the transceiver shows "not present" — reseat the optic. Check that it is the right type (QSFP28 for 100G, QSFP56 for 200G, QSFP-DD for 400G).
- If the transceiver shows "unsupported" — it is a non-qualified optic. Onyx may block or degrade non-Mellanox optics by default. Use interface ethernet 1/1 module-type qsfp or contact NVIDIA support for the compatibility matrix.
- If there is a speed mismatch — both ends must agree on speed. Check auto-negotiation or set speed explicitly: speed 100G force.
- If the port is admin down — no shutdown in interface config.

Scenario 2: Packet Drops Visible in Counters

Symptoms: Application reports packet loss. Interface counters show rising discard or error counts.

Diagnosis steps:

# Check for drops — this is the killer command
show what-just-happened

# Detailed per-interface error breakdown
show interfaces ethernet 1/1 counters errors

# Check PFC (Priority Flow Control) counters — are we pausing?
show interfaces ethernet 1/1 counters pfc

# Check buffer utilization
show interfaces ethernet 1/1 counters buffer

WJH output explained:

WJH (What Just Happened) logs every packet dropped in hardware with the reason. Typical reasons:

WJH Reason                Meaning
L2_SRC_MAC_IS_MULTICAST   Source MAC is multicast — malformed frame
INGRESS_ACL               Dropped by ACL
TTL_VALUE_IS_TOO_SMALL    TTL expired
BUFFER_CONGESTION         Buffer overflow — traffic burst exceeded capacity
BLACKHOLE_ROUTE           Packet matched a null route
SRC_IP_IS_UNRESOLVED      ARP not resolved for next-hop

Resolution:

- Buffer congestion: tune PFC thresholds or add ECN marking. Check whether the sender is bursting.
- ACL drops: review show access-lists for unintended deny rules.
- Blackhole route: check the routing table for missing or withdrawn routes.
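Triage is faster when WJH output is aggregated by reason so the dominant drop cause stands out. A minimal sketch, assuming the reason appears as an uppercase token in each WJH line — the exact output format varies by Onyx version, so the sample lines and regex here are illustrative:

```python
import re
from collections import Counter

# Hypothetical sample of `show what-just-happened` output lines.
# Real WJH formatting differs by version; adjust the pattern to yours.
SAMPLE = """\
#1 Eth1/1 BUFFER_CONGESTION src=10.0.0.5 dst=10.0.0.9
#2 Eth1/1 BUFFER_CONGESTION src=10.0.0.5 dst=10.0.0.9
#3 Eth1/7 INGRESS_ACL src=10.0.1.2 dst=10.0.0.9
"""

# Match an uppercase reason token like BUFFER_CONGESTION or INGRESS_ACL.
REASON = re.compile(r"\b([A-Z][A-Z0-9_]{3,})\b")

def drop_reasons(wjh_text: str) -> Counter:
    """Count WJH drop reasons across the captured output."""
    counts: Counter = Counter()
    for line in wjh_text.splitlines():
        m = REASON.search(line)
        if m:
            counts[m.group(1)] += 1
    return counts

print(drop_reasons(SAMPLE).most_common())
```

Run this over a saved WJH capture and fix the top reason first; one root cause usually accounts for most of the drops.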

Scenario 3: MLAG Split-Brain

Symptoms: Both switches claim to be MLAG primary. Duplicate packets or MAC flapping on downstream hosts.

Diagnosis steps:

# Check MLAG domain status
show mlag

# Check ISL (inter-switch link) state
show mlag isl

# Check MLAG VIP reachability
ping <mlag-vip>

Resolution:

- If the ISL is down — the peer link between the two MLAG switches has failed. Both switches become primary and will independently forward traffic, causing duplicates.
- Fix the ISL link (physical cable, optic, or intermediate device).
- If the ISL cannot be restored, shut MLAG ports on the secondary switch to avoid duplicate forwarding: interface mlag-port-channel <id> shutdown.
- Prevent recurrence: use a dedicated ISL link (not shared with data traffic) and consider a backup ISL path.
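Split-brain is easy to detect from a management host: poll show mlag on both peers and flag when both report the primary role. A hedged sketch — the role strings below are assumptions and should be matched to your Onyx version's actual output:

```python
# Hypothetical split-brain check. Inputs are the `show mlag` text from
# each peer, collected however you like (SSH, REST API, etc.).

def mlag_role(show_mlag_output: str) -> str:
    """Extract the MLAG role (e.g. 'master'/'standby') from `show mlag` text."""
    for line in show_mlag_output.splitlines():
        if "role" in line.lower():
            return line.split(":")[-1].strip().lower()
    return "unknown"

def is_split_brain(peer_a_output: str, peer_b_output: str) -> bool:
    """Both peers claiming master means the ISL/keepalive path is broken."""
    return mlag_role(peer_a_output) == mlag_role(peer_b_output) == "master"

healthy = ("MLAG Role: master", "MLAG Role: standby")
broken  = ("MLAG Role: master", "MLAG Role: master")
print(is_split_brain(*healthy), is_split_brain(*broken))  # False True
```

Wire this into your monitoring with a short poll interval; split-brain causes damage quickly and should page immediately.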

Scenario 4: Firmware Upgrade

Workflow:

# 1. Download image to switch
image fetch scp://user@server/path/to/image-X.Y.Z.img

# 2. Verify the image
show images

# 3. Install to the next boot partition
image install image-X.Y.Z.img

# 4. Set next boot to the new partition
image boot next

# 5. Save config
configuration write

# 6. Reload
reload

Critical notes:

- Always read the release notes before upgrading. Some versions require intermediate steps.
- Onyx supports dual boot partitions — you can roll back by selecting the previous partition if the new firmware fails.
- In MLAG setups, upgrade one switch at a time (ISSU — In-Service Software Upgrade). Verify MLAG reconverges before upgrading the peer.
- Check show images to confirm the install target partition.
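The "verify MLAG reconverges before upgrading the peer" step is worth encoding as an explicit gate in any upgrade automation. A rough sketch — the status and port-state strings are assumptions about show mlag output, not confirmed Onyx fields:

```python
# Hedged convergence gate for staged MLAG upgrades. Adapt the string
# checks to what `show mlag` actually prints on your firmware version.

def safe_to_upgrade_peer(show_mlag_output: str) -> bool:
    """Only proceed when the domain is up and no MLAG ports are suspended."""
    text = show_mlag_output.lower()
    domain_up = "operational status: up" in text
    no_suspended = "suspended" not in text
    return domain_up and no_suspended

converged = "Operational Status: UP\nMLAG Ports: 24 active"
degraded  = "Operational Status: UP\nMLAG Ports: 3 suspended"
print(safe_to_upgrade_peer(converged), safe_to_upgrade_peer(degraded))
```

The point is the workflow shape: poll until the gate passes (with a timeout that aborts the rollout), then and only then start the peer upgrade.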

Scenario 5: RDMA/RoCE Configuration

Setting up lossless Ethernet for RDMA (RoCEv2) requires coordinated PFC and ECN configuration:

# Enable PFC on priority 3 (common for RoCE)
interface ethernet 1/1 traffic-class 3 congestion-control ecn minimum-absolute 150 maximum-absolute 1500
interface ethernet 1/1 priority-flow-control priority 3 enable

# Set DCBX trust mode
dcb priority-flow-control trust on
dcb priority-flow-control mode on

# Verify PFC is active
show interfaces ethernet 1/1 counters pfc

# Verify ECN marking
show interfaces ethernet 1/1 counters ecn

DCBX trust mode: Must match between switch and NIC. Two options:

- Trust L2 (CoS): the switch trusts the 802.1p priority bits set by the NIC
- Trust DSCP: the switch trusts the IP DSCP field

If the switch is set to trust L2 but the NIC is marking DSCP (or vice versa), traffic lands in the wrong priority queue and PFC will not protect it from drops.

Gotcha: RoCE requires lossless Ethernet end-to-end. If even one hop in the path does not have PFC enabled on the correct priority, RDMA performance collapses. A single misconfigured ToR switch can make an entire GPU cluster underperform. Verify PFC on every switch in the path, not just the endpoints.
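That end-to-end check is scriptable: collect the PFC-enabled priorities from every switch in the path and flag any hop missing the lossless priority. A minimal sketch — the inventory dict stands in for data you would gather per switch via SSH or the REST API:

```python
# Audit every hop in the RoCE path for PFC on the lossless priority.
ROCE_PRIORITY = 3  # priority 3 is the common choice for RoCE traffic

def pfc_gaps(path_pfc: dict[str, set[int]], priority: int = ROCE_PRIORITY) -> list[str]:
    """Return the switches where the RoCE priority is NOT PFC-enabled."""
    return [sw for sw, prios in path_pfc.items() if priority not in prios]

# Hypothetical fabric inventory: one ToR missing PFC on priority 3.
fabric = {
    "leaf01":  {3},
    "spine01": {3},
    "leaf02":  set(),   # misconfigured — this hop will drop under load
}
print(pfc_gaps(fabric))  # ['leaf02']
```

An empty result means every hop is lossless on the RoCE priority; anything else names the switch to fix before blaming the NICs.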

Operational Patterns

Configuration Backup and Restore

# Export running config to remote server
configuration upload scp://user@server/backups/switch-config-$(date +%F).cfg

# Restore from backup
configuration fetch scp://user@server/backups/switch-config.cfg
configuration switch-to fetched-config.cfg
configuration write

Automate daily config backups via cron on a management server using the REST API or SSH/expect.
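A minimal cron setup on the management server might look like this — hostnames, the netops user, and paths are placeholders, and it assumes SSH key authentication to the switch is already in place:

```shell
# /etc/cron.d/switch-backup — illustrative config fragment.
# 02:00 daily: pull the running config from the switch over SSH.
0 2 * * * netops ssh admin@sw-leaf01 "show running-config" > /backups/sw-leaf01-$(date +\%F).cfg 2>> /var/log/switch-backup.log
```

Note that percent signs must be escaped as \% inside crontab entries; alternatively, have cron invoke a wrapper script that drives the switch's own configuration upload command.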

Cable and Transceiver Diagnostics

# Show transceiver details (type, vendor, serial, DOM readings)
show interfaces ethernet 1/1 transceiver

# Check optical power levels (DOM - Digital Optical Monitoring)
show interfaces ethernet 1/1 transceiver diagnostics

DOM readings tell you if an optic is degrading before it fails completely. Watch for:

- Rx power dropping below -10 dBm (fiber bend, dirty connector, failing optic)
- Temperature above 70 °C (ventilation issue or optic approaching end of life)
- Tx bias current increasing (optic compensating for degradation)

Debug clue: If Rx power is normal but you still see CRC errors, the problem is likely a dirty fiber connector or a bad fiber splice, not the optic itself. Clean both ends with an IPA wipe before replacing expensive transceivers.
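Those DOM thresholds are easy to enforce in a polling script. A sketch using the limits from the text — the field names mimic values you would parse from the transceiver diagnostics output and are assumptions, not the switch's actual schema:

```python
# Flag DOM readings that suggest a degrading optic or dirty fiber.
# Thresholds come from the guidance above; tune them for your optics.

def dom_warnings(dom: dict[str, float]) -> list[str]:
    """Return human-readable warnings for out-of-range DOM readings."""
    warnings = []
    if dom.get("rx_power_dbm", 0.0) < -10.0:
        warnings.append("low Rx power: check for fiber bend / dirty connector")
    if dom.get("temperature_c", 0.0) > 70.0:
        warnings.append("optic running hot: check airflow or replace optic")
    # Tx bias is a trend signal — compare against historical samples,
    # which requires state this one-shot check does not keep.
    return warnings

reading = {"rx_power_dbm": -12.3, "temperature_c": 48.0, "tx_bias_ma": 41.0}
print(dom_warnings(reading))
```

Polling DOM daily and alerting on these thresholds turns "link died overnight" into "replace this optic next maintenance window."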

UFM for Fabric-Wide Visibility

UFM (Unified Fabric Manager) provides: - Topology map — auto-discovered view of all switches and links - Health dashboard — aggregated error counters, temperature, PSU status across the fabric - Firmware orchestration — push firmware updates to multiple switches with staged rollout - Event correlation — correlate link flaps, error spikes, and congestion across the fabric

UFM connects to switches via SNMP and the REST API. It runs as a standalone appliance or VM. For small fabrics (< 20 switches), CLI management is fine. For larger fabrics, UFM pays for itself in operational efficiency.

LLDP-Based Topology Verification

# Show all LLDP neighbors
show lldp interfaces

# Verify expected cabling
show lldp interfaces ethernet 1/1

Use LLDP neighbor output to validate that physical cabling matches the intended fabric design. Script this: pull LLDP from every switch, compare against a topology definition file, alert on mismatches.
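The comparison step can be sketched in a few lines. Here both sides are plain dicts mapping (switch, port) to the expected neighbor — in practice the "observed" side would be parsed from show lldp interfaces on each switch, and the key/value shapes are assumptions for illustration:

```python
# Compare intended cabling (from a topology definition file) against
# observed LLDP neighbors and report every mismatch.

Cabling = dict[tuple[str, str], str]  # (switch, port) -> neighbor name

def cabling_mismatches(intended: Cabling, observed: Cabling) -> list[str]:
    """Report every port whose LLDP neighbor differs from the design."""
    problems = []
    for (switch, port), want in intended.items():
        got = observed.get((switch, port), "<no neighbor>")
        if got != want:
            problems.append(f"{switch} {port}: expected {want}, saw {got}")
    return problems

design = {("leaf01", "Eth1/49"): "spine01", ("leaf01", "Eth1/50"): "spine02"}
live   = {("leaf01", "Eth1/49"): "spine01", ("leaf01", "Eth1/50"): "spine01"}
print(cabling_mismatches(design, live))
```

Run it after every cabling change and nightly; a swapped pair of uplinks is invisible to traffic tests but obvious to this diff.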

Syslog and Monitoring Integration

# Configure syslog destination
logging host 10.0.0.100 port 514

# Set logging level
logging level info

# Enable SNMP traps
snmp-server host 10.0.0.100 traps version 2c community public

Feed syslog into your centralized logging (ELK, Splunk, Loki). Key events to alert on:

- Link state changes (up/down)
- PSU or fan failures
- Temperature threshold crossings
- MLAG state changes
- Firmware upgrade events
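A pre-filter for those alert-worthy events can be as simple as a pattern list. The message formats below are illustrative, not verbatim Onyx syslog output — match the regexes to what your switches actually emit:

```python
import re

# Patterns for the alert-worthy event classes listed above.
ALERT_PATTERNS = [
    re.compile(r"link (up|down)", re.I),                      # link state changes
    re.compile(r"(psu|power supply|fan).*(fail|fault)", re.I),  # PSU/fan failures
    re.compile(r"temperature.*(threshold|exceed)", re.I),     # thermal crossings
    re.compile(r"mlag.*(state|role)", re.I),                  # MLAG transitions
]

def is_alert(line: str) -> bool:
    """True if a syslog line matches any alert-worthy pattern."""
    return any(p.search(line) for p in ALERT_PATTERNS)

logs = [
    "%INTERFACE: Eth1/3 link down",
    "%SYSTEM: Fan 2 failure detected",
    "%AAA: user admin logged in",   # routine noise — should not alert
]
print([line for line in logs if is_alert(line)])
```

In practice this logic lives in your log pipeline (a Logstash filter, Splunk alert, or Loki ruler rule) rather than a standalone script, but the matching semantics are the same.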

Scale note: For fabrics with more than 20 switches, CLI management becomes operationally expensive. UFM or NVIDIA Air (simulation) pays for itself in reduced human error during firmware rollouts and config pushes. Budget for fabric management tooling early.