
Portal | Level: L1: Foundations | Topics: Firmware / BIOS / UEFI, Out-of-Band Management | Domain: Datacenter & Hardware

Scenario: Server Won't Boot After Firmware Update

Situation

At 02:15 AM, a scheduled BIOS firmware update was applied to a Dell PowerEdge R740 (production database replica, db-replica-03). The server was rebooted to complete the update but never came back online. Monitoring shows the host has been unreachable for 40 minutes. The on-call DBA is escalating because replication lag on the remaining replicas is climbing.

What You Know

  • BIOS was updated from version 2.12.2 to 2.14.1 via Lifecycle Controller
  • The server had a custom boot order: UEFI PXE disabled, BOSS card (M.2 RAID 1) as first boot device
  • iDRAC is still responding on the management network
  • Pre-update the server was healthy with no hardware alerts
  • The datacenter technician reports the front panel shows an amber blinking LED

Investigation Steps

1. Check iDRAC Virtual Console and POST Status

Command(s):

# Check overall chassis state (open the iDRAC virtual console in the
# web UI to actually watch where POST is stuck)
ipmitool -I lanplus -H 10.20.1.43 -U root -P <password> chassis status

# Check system event log for POST errors
ipmitool -I lanplus -H 10.20.1.43 -U root -P <password> sel list | tail -20

# Check if chassis is powered on
ipmitool -I lanplus -H 10.20.1.43 -U root -P <password> power status

What to look for: chassis power state should be "on". Watch for SEL entries of type "OEM" or "BIOS", such as BIOS POST Error, Memory training failure, or No bootable device.
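The SEL check above can be wrapped in a small triage filter. A minimal sketch, assuming POSIX shell; the SEL excerpt below is hypothetical sample data, and in practice you would pipe in the real ipmitool sel list output:

```shell
# Sketch: keep only the POST/boot-related lines from a SEL listing.
# filter_post_errors reads a SEL listing on stdin.
filter_post_errors() {
    grep -iE 'POST Error|No bootable|Memory training'
}

# Production usage would be something like:
#   ipmitool -I lanplus -H 10.20.1.43 -U root -P "$PASS" sel list | filter_post_errors
# Hypothetical sample SEL output, for illustration only:
sample='1 | 02/15/25 | 02:18:04 | OEM | BIOS POST Error
2 | 02/15/25 | 02:19:11 | Memory | Correctable ECC | Asserted
3 | 02/15/25 | 02:24:30 | System Boot | No bootable device found'

echo "$sample" | filter_post_errors
```

This trims a long SEL down to the entries that matter for a boot failure, which is useful when paging through `sel list` output over a slow management link.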

2. Check Boot Order and BIOS Settings via iDRAC

Command(s):

# Using racadm (Dell-specific) to check boot order
racadm --nocertwarn -r 10.20.1.43 -u root -p <password> get BIOS.BiosBootSettings.BootMode

racadm --nocertwarn -r 10.20.1.43 -u root -p <password> get BIOS.BiosBootSettings.BootSeq

# Check if BIOS settings were reset to defaults during update
racadm --nocertwarn -r 10.20.1.43 -u root -p <password> get BIOS.SysProfileSettings.SysProfile

What to look for: BootMode should be Uefi (not Bios). BootSeq should list the BOSS card first. BIOS updates frequently reset the boot order to defaults, putting PXE or optical drives first. If SysProfile no longer shows PerfOptimized, the performance profile, and likely other tuning, was also reset.
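These checks lend themselves to a scripted comparison against expected values. A minimal sketch, assuming POSIX shell and racadm output containing a line of the form `BootMode=Uefi`; the sample value is hypothetical:

```shell
# Sketch: compare a BIOS attribute from racadm output against the
# expected value and flag drift.
check_attr() {
    # $1 = racadm output, $2 = attribute name, $3 = expected value
    actual=$(printf '%s\n' "$1" | sed -n "s/^$2=//p")
    if [ "$actual" = "$3" ]; then
        echo "OK   $2=$actual"
    else
        echo "DRIFT $2=$actual (expected $3)"
    fi
}

# Production usage would be something like:
#   out=$(racadm --nocertwarn -r 10.20.1.43 -u root -p "$PASS" \
#         get BIOS.BiosBootSettings.BootMode)
# Hypothetical post-update output, for illustration:
bootmode_out='BootMode=Bios'
check_attr "$bootmode_out" BootMode Uefi
```

Running this against each critical attribute (BootMode, BootSeq, SysProfile) after any firmware change gives a quick yes/no answer instead of eyeballing raw racadm output.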

3. Check for Memory Training or Hardware Init Failure

Command(s):

# Pull Lifecycle Controller logs for POST details
racadm --nocertwarn -r 10.20.1.43 -u root -p <password> lclog view -s "POST"

# Query POST progress via a Dell OEM raw command (platform-specific,
# not standard IPMI -- verify the opcode for your model before relying on it)
ipmitool -I lanplus -H 10.20.1.43 -U root -P <password> raw 0x30 0x01

# Force a BMC cold reset if iDRAC itself is in a bad state
# (this restarts only the iDRAC, not the host)
ipmitool -I lanplus -H 10.20.1.43 -U root -P <password> mc reset cold

What to look for: after a BIOS update, the server performs full memory retraining, which can take 10-20 minutes on systems with large amounts of RAM (512GB+). The Lifecycle log will indicate whether memory initialization is still in progress or failed on a specific DIMM.
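Given that retraining window, it helps to codify how long to wait before escalating. A minimal sketch of such a wait loop, assuming POSIX shell; `host_is_up` is a hypothetical stub standing in for a real probe (e.g. `ping -c1 -W2 db-replica-03`):

```shell
# Sketch: poll for the host with a bounded number of checks before
# declaring it dead, so retraining isn't mistaken for a hang.
wait_for_host() {
    # $1 = max checks, $2 = seconds between checks
    max_tries=$1; interval=$2; tries=0
    while [ "$tries" -lt "$max_tries" ]; do
        tries=$((tries + 1))
        if host_is_up; then
            echo "host up after $tries checks"
            return 0
        fi
        sleep "$interval"
    done
    echo "host still down after $max_tries checks"
    return 1
}

# Hypothetical stub: pretend the host answers on the 3rd probe.
host_is_up() {
    ATTEMPT=$((ATTEMPT + 1))
    [ "$ATTEMPT" -ge 3 ]
}

ATTEMPT=0
wait_for_host 10 0   # interval 0 for the demo; use e.g. 60 in production
```

With a 60-second interval and 20 checks, this waits out the worst-case retraining window before anyone opens a hardware case.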

Root Cause

The BIOS update reset all BIOS settings to factory defaults. This changed the boot mode from UEFI to Legacy BIOS, wiped the custom boot order (BOSS card was removed from the sequence), and reset the performance profile. The server was completing POST but cycling through PXE boot on all four NICs before falling through to "No Boot Device Found" -- the amber LED indicated a non-critical configuration error, not a hardware failure.

Fix

Immediate:

# Restore boot mode to UEFI
racadm --nocertwarn -r 10.20.1.43 -u root -p <password> set BIOS.BiosBootSettings.BootMode Uefi

# Set BOSS card as first boot device
racadm --nocertwarn -r 10.20.1.43 -u root -p <password> set BIOS.BiosBootSettings.BootSeq "BOSS-S1, NIC.Slot.3-1"

# Apply the pending BIOS changes and reboot
racadm --nocertwarn -r 10.20.1.43 -u root -p <password> jobqueue create BIOS.Setup.1-1 -r pwrcycle

Preventive:

  • Export the BIOS configuration before every firmware update: racadm get -t xml -f bios_backup.xml
  • Include BIOS config restoration as a mandatory post-update step in the firmware update runbook
  • Use configuration management (e.g., Dell OpenManage or Redfish API calls in Ansible) to enforce BIOS settings and detect drift
  • Set up iDRAC alerts for boot failures so they fire immediately rather than waiting for host-level monitoring to time out
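The export-before-update step pairs naturally with a diff-based drift check afterwards. A minimal sketch, assuming POSIX shell; the XML snippets are hypothetical miniatures of a real racadm export:

```shell
# Sketch: detect BIOS config drift by diffing a pre-update export
# against a post-update export of the same attributes.
detect_drift() {
    # $1 = baseline export, $2 = current export
    drift=$(diff "$1" "$2" | grep '^[<>]') || true
    if [ -z "$drift" ]; then
        echo "no drift"
        return 0
    fi
    echo "drift detected:"
    echo "$drift"
    return 1
}

workdir=$(mktemp -d)
cat > "$workdir/bios_before.xml" <<'EOF'
<Attribute Name="BootMode">Uefi</Attribute>
<Attribute Name="SysProfile">PerfOptimized</Attribute>
EOF
cat > "$workdir/bios_after.xml" <<'EOF'
<Attribute Name="BootMode">Bios</Attribute>
<Attribute Name="SysProfile">PerfPerWattOptimizedDapc</Attribute>
EOF

result=$(detect_drift "$workdir/bios_before.xml" "$workdir/bios_after.xml") || true
echo "$result"
rm -rf "$workdir"
```

In production the two inputs would be successive `racadm get -t xml -f <file>` exports; a non-empty diff is the trigger to re-apply the known-good configuration.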

Common Mistakes

  • Assuming the server is bricked and opening a hardware support case -- this wastes hours when the fix is a 2-minute config change
  • Power cycling repeatedly without checking the virtual console or SEL -- you cannot fix a boot order problem by rebooting
  • Not waiting long enough after BIOS update -- memory retraining on high-RAM systems can look like a hang for 15+ minutes
  • Forgetting to create a BIOS jobqueue entry -- racadm set stages the change but does not apply it until a job is created

Interview Angle

Q: A server doesn't come back after a firmware update. Walk me through how you'd troubleshoot it remotely.

Good answer shape: Start with out-of-band management (iDRAC/iLO) -- check power state, virtual console, and system event log. Mention that BIOS updates commonly reset settings to defaults, especially boot order and boot mode (UEFI vs Legacy). Explain that you'd verify the boot configuration, restore it if needed, and that you always export the BIOS config before firmware updates. Bonus: mention that memory retraining after BIOS updates can cause extended POST times that look like a hang.

