Level: L2: Operations | Topics: Out-of-Band Management, Linux Networking Tools | Domain: Datacenter & Hardware

Scenario: Out-of-Band Management (iDRAC) Unreachable

Situation

At 16:05, an engineer tries to access the iDRAC web console on db-primary-01 (Dell PowerEdge R750) to check hardware health before tonight's planned maintenance window. The iDRAC at 10.20.5.61 is not responding to ping, HTTPS, or SSH. The host OS (10.30.2.18), however, is fully operational -- the database is running, applications are connected, and OS-level monitoring shows no issues. Without iDRAC access, the team cannot perform remote BIOS changes or use virtual media for tonight's maintenance.

What You Know

  • iDRAC IP is 10.20.5.61 on the dedicated management NIC (VLAN 500, management network)
  • Host OS IP is 10.30.2.18 on the production network (VLAN 200)
  • iDRAC was accessible last week -- no known changes to the server
  • The datacenter network team performed a switch replacement on the management network TOR switch two days ago
  • Other servers on the same management switch are reachable via their iDRAC/iLO
  • You have SSH access to the host OS as root

Investigation Steps

1. Check iDRAC Network Configuration from the Host OS

Command(s):

# Use ipmitool locally on the host to query the BMC network settings
sudo ipmitool lan print 1

# Check key fields: IP, subnet, gateway, MAC, VLAN, and source (DHCP vs static)
sudo ipmitool lan print 1 | grep -E "IP Address|Subnet|Gateway|MAC|VLAN|Source"

# Check if the BMC is responding to local IPMI at all
sudo ipmitool mc info

# Verify the BMC is not hung
sudo ipmitool mc selftest
What to look for: Confirm the IP address is what you expect (10.20.5.61). Check IP Address Source -- if it says "DHCP" and the DHCP server changed or is unreachable, the BMC may have fallen back to a link-local address or 0.0.0.0. Check 802.1q VLAN ID -- if this shows Disabled but the management switch port is expecting VLAN 500 tagged traffic, frames will be dropped. The mc selftest should return "passed" -- if it returns errors or hangs, the BMC firmware is in a bad state.
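The check above can be scripted. This is a minimal sketch that pulls the decisive fields out of `ipmitool lan print 1` output; the here-doc is a sample capture (addresses and MAC are illustrative), and field spacing varies by BMC firmware, so the patterns anchor on the label text rather than column offsets.

```shell
# Sample `ipmitool lan print 1` capture -- illustrative values only
sample=$(cat <<'EOF'
IP Address Source       : Static Address
IP Address              : 10.20.5.61
Subnet Mask             : 255.255.255.0
MAC Address             : d0:94:66:aa:bb:cc
Default Gateway IP      : 10.20.5.1
802.1q VLAN ID          : Disabled
EOF
)

# Split on the padded colon so values with spaces ("Static Address") stay intact
ip=$(printf '%s\n' "$sample"   | awk -F' *: *' '/^IP Address  /{print $2}')
src=$(printf '%s\n' "$sample"  | awk -F' *: *' '/^IP Address Source/{print $2}')
vlan=$(printf '%s\n' "$sample" | awk -F' *: *' '/^802.1q VLAN ID/{print $2}')

echo "ip=$ip source=$src vlan=$vlan"
[ "$vlan" = "Disabled" ] && echo "WARN: BMC sends untagged frames"
```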

2. Check VLAN Tag and Network Path

Command(s):

# Check if VLAN tagging is configured on the BMC
sudo ipmitool lan print 1 | grep "802.1q VLAN"

# Verify the dedicated management NIC is the one being used (not shared)
sudo ipmitool delloem lan get  # Dell-specific: prints the current NIC selection
# Or use racadm from the host
sudo racadm get iDRAC.NIC.Selection
sudo racadm get iDRAC.NIC.DNSRacName
sudo racadm get iDRAC.IPv4.Address
sudo racadm get iDRAC.NIC.VLANId
sudo racadm get iDRAC.NIC.VLANEnable

# The dedicated iDRAC port is invisible to the host OS -- ethtool can only see
# the host-to-iDRAC USB passthrough NIC (often usb0); check the dedicated
# port's link state from the iDRAC web/racadm side or from the switch
sudo ethtool usb0 2>/dev/null
What to look for: If VLANEnable is Disabled but the new switch expects a VLAN tag, or if VLANId is wrong (e.g., 1 instead of 500), that explains the unreachability. After the TOR switch replacement, the new switch may have a different port configuration -- the old switch may have had the management port as an untagged access port on VLAN 500, while the new switch expects the iDRAC to tag its own traffic. Also check iDRAC.NIC.Selection -- if set to Shared with failover or Shared LOM, the iDRAC is sharing the production NIC and may be affected by production network config, not the management network.
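The racadm checks above can be turned into a mechanical pass/fail test. This sketch assumes the `Attr=Value` output format that `racadm get` typically prints (verify against your firmware); the sample text stands in for a live query.

```shell
expected_vlan=500

# Sample `racadm get` style output -- assumed Attr=Value format, values illustrative
sample='[Key=iDRAC.Embedded.1#NIC.1]
VLANEnable=Disabled
VLANId=1'

vlan_enable=$(printf '%s\n' "$sample" | awk -F= '/^VLANEnable=/{print $2}')
vlan_id=$(printf '%s\n' "$sample"     | awk -F= '/^VLANId=/{print $2}')

# Flag any server whose tagging does not match what the switch port expects
if [ "$vlan_enable" != "Enabled" ] || [ "$vlan_id" != "$expected_vlan" ]; then
  echo "MISMATCH: VLANEnable=$vlan_enable VLANId=$vlan_id (switch expects tagged VLAN $expected_vlan)"
fi
```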

3. Check ARP and Reachability from Adjacent Hosts

Command(s):

# From another server on the same management VLAN, check ARP
# (SSH to a neighboring server first)
ssh admin@mgmt-jumpbox
ping -c 3 10.20.5.61
arp -n | grep 10.20.5.61

# Check if the MAC address of the iDRAC is in the switch's MAC table
# (if you have switch access)
# On the switch: show mac address-table | include <idrac-mac>

# From the host itself, query the BMC over the local system interface (KCS)
sudo ipmitool -I open channel info 1

# Soft-reset the iDRAC (graceful BMC reboot; does not affect the running host)
sudo racadm racreset soft
What to look for: If arp shows no entry or an incomplete entry for 10.20.5.61, the iDRAC's frames are not reaching the network. This confirms a Layer 2 issue (VLAN mismatch, port down, or cable disconnected). If the MAC address is not in the switch's MAC table, the iDRAC NIC may not be physically connected or the port may be administratively shut down on the new switch.
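The ARP result can also be classified mechanically. This sketch parses the iproute2 `ip neigh` line format (a modern alternative to `arp -n`); the sample entry is illustrative.

```shell
# Sample `ip neigh` entry for the iDRAC IP -- illustrative; a resolved entry
# would end in REACHABLE/STALE and include "lladdr <mac>"
entry='10.20.5.61 dev eth0 FAILED'
state=$(printf '%s\n' "$entry" | awk '{print $NF}')

case "$state" in
  REACHABLE|STALE|DELAY|PROBE) verdict="L2 OK: switch delivered an ARP reply" ;;
  INCOMPLETE|FAILED)           verdict="L2 broken: no ARP reply (VLAN mismatch, port down, or cable)" ;;
  *)                           verdict="unknown state: $state" ;;
esac
echo "$verdict"
```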

Root Cause

When the management network TOR switch was replaced, the port configuration was migrated from the old switch but the port connecting to db-primary-01's iDRAC was configured as a trunk port expecting the iDRAC to send VLAN 500-tagged traffic. The old switch had this port as an access port (untagged VLAN 500). The iDRAC's VLAN tagging was disabled (VLANEnable = Disabled), meaning it sent untagged Ethernet frames. The new switch dropped these untagged frames because no native VLAN was configured on the trunk port. All other servers worked because their iDRAC VLAN tagging was enabled (a previous admin had configured them differently during a past migration).

Fix

Immediate:

# Option 1 (preferred): Enable VLAN tagging on the iDRAC to match the new switch config
sudo racadm set iDRAC.NIC.VLANEnable Enabled
sudo racadm set iDRAC.NIC.VLANId 500
sudo racadm set iDRAC.NIC.VLANPriority 0

# Apply changes -- this resets the iDRAC network stack
sudo racadm racreset soft

# Wait 2-3 minutes for iDRAC to restart, then test
sleep 180
ping -c 3 10.20.5.61

# Option 2: Ask the network team to change the switch port to access mode (untagged VLAN 500)
# This avoids touching the server config but requires switch access

# Verify iDRAC is now reachable
ipmitool -I lanplus -H 10.20.5.61 -U root -P <password> mc info
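Rather than a fixed `sleep 180`, a small polling loop gives faster feedback and a bounded wait. `wait_for` is a hypothetical helper; the commented ping invocation shows the intended use after the iDRAC reset.

```shell
# Poll a command until it succeeds or the attempt budget runs out
wait_for() {
  tries=$1; shift
  while [ "$tries" -gt 0 ]; do
    "$@" >/dev/null 2>&1 && return 0
    tries=$((tries - 1))
    sleep 1
  done
  return 1
}

# Intended use after `racadm racreset soft`:
#   wait_for 180 ping -c1 -W1 10.20.5.61 && echo "iDRAC back"
wait_for 3 true && echo "reachable"
```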

Preventive:

  • Maintain a spreadsheet/CMDB mapping every server's iDRAC MAC, IP, VLAN config, and NIC mode (dedicated vs shared)
  • Before switch replacements, export the full port config and diff it against the new switch config for every port
  • Standardize iDRAC network configuration across the fleet -- either all servers use VLAN tagging or none do; mixed configurations cause exactly this kind of issue
  • Add iDRAC/iLO ping checks to monitoring (e.g., Nagios check_ping on the management network) so you discover OOB unreachability immediately, not weeks later when you need it
  • Use configuration management (Ansible + racadm or the Redfish API) to enforce consistent iDRAC network settings across all servers
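As a sketch of the configuration-drift audit suggested above: diff an expected-state file against values gathered from the fleet. The filenames and CSV layout (hostname,vlan_enable,vlan_id) are hypothetical; in practice the "actual" side would come from racadm or Redfish queries per host.

```shell
# Expected iDRAC VLAN config per host (hypothetical inventory file)
cat > expected.csv <<'EOF'
db-primary-01,Enabled,500
db-replica-01,Enabled,500
EOF
# Values as gathered from the fleet (hypothetical; would come from racadm)
cat > actual.csv <<'EOF'
db-primary-01,Disabled,1
db-replica-01,Enabled,500
EOF

# join on hostname (inputs must be sorted on it), then flag rows whose
# VLANEnable or VLANId differs between expected and actual
mismatches=$(join -t, expected.csv actual.csv \
  | awk -F, '$2 != $4 || $3 != $5 {print $1}')
echo "drifted: $mismatches"
```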

Common Mistakes

  • Immediately doing a full BMC cold reset (ipmitool mc reset cold) before investigating -- this can cause a 5-10 minute outage on the BMC and does not fix network configuration issues
  • Assuming the iDRAC hardware is dead because it does not respond to ping -- in most cases, iDRAC unreachability is a network configuration issue, not hardware failure
  • Forgetting that ipmitool can be run locally on the host OS to query and configure the BMC -- you do not need network access to the iDRAC to troubleshoot it
  • Not checking whether the iDRAC is using the dedicated NIC or sharing the production LOM -- Shared LOM mode means the iDRAC shares a physical port with the host OS and is affected by host-side network config, VLANs, and switch port settings for that production port
  • Changing the iDRAC IP or VLAN config remotely over the network -- if the change breaks connectivity, you have locked yourself out and must use the local OS-level ipmitool/racadm to recover

Interview Angle

Q: Your out-of-band management (iDRAC/iLO) is unreachable but the server's OS is fine. How do you troubleshoot?

Good answer shape: Since the host OS is up, use ipmitool lan print 1 locally to check the BMC's network configuration -- IP address, subnet, gateway, VLAN tagging, and whether it is using a dedicated or shared NIC. The most common causes are VLAN misconfiguration (especially after switch changes), DHCP lease expiration, or the BMC NIC being set to shared mode when it should be dedicated. I would check that the VLAN settings match the switch port configuration, verify the iDRAC's MAC address appears in the switch's MAC table, and use racadm racreset soft if the BMC network stack needs a restart. A strong answer emphasizes that OOB management is critical infrastructure and should be monitored proactively, not discovered broken when you need it.

