# Incident Replay: BIOS Settings Reverted After CMOS Battery Replacement

## Setup
- System context: Dell PowerEdge R640 in a 200-node production cluster. Server was pulled for CMOS battery replacement during a scheduled maintenance window.
- Time: Sunday 06:15 UTC
- Your role: Datacenter technician / on-call SRE
## Round 1: Alert Fires
[Pressure cue: "Nagios fires CRITICAL — server boot time exceeded 15-minute threshold. Auto-escalation in 5 minutes."]
What you see: Server powered on after CMOS battery swap but is not rejoining the cluster. iDRAC shows the host is at the BIOS POST screen. Remote console shows boot order changed to PXE-first.
Choose your action:

- A) Manually PXE boot and re-image the server from scratch
- B) Access the iDRAC virtual console to inspect current BIOS settings
- C) Power-cycle the server and hope it boots correctly
- D) Check the DHCP server for a new PXE lease from this host
### If you chose A

[Result: You kick off a 45-minute re-image cycle. The server comes back, but you've lost the local RAID config and any non-standard BIOS tuning. Rollback needed. Try again.]

### If you chose B (recommended)

[Result: Virtual console shows BIOS defaults loaded. Boot order is wrong (PXE first instead of the RAID controller), Turbo Boost is disabled, virtualization extensions are off. Proceed to Round 2.]

### If you chose C

[Result: Server reboots, hits PXE again, times out, falls to the BIOS shell. Same problem. 5 minutes wasted.]

### If you chose D

[Result: DHCP shows a DISCOVER from the server's MAC. Confirms it is PXE booting but does not explain why. Partial clue; eventually leads to Round 2 after 8 minutes.]
## Round 2: First Triage Data
[Pressure cue: "Cluster capacity is at 92%. Another node failure and pods start getting evicted."]
What you see: BIOS settings are factory defaults. The CMOS battery replacement cleared NVRAM. You need the correct BIOS profile for this server class.
Choose your action:

- A) Manually reconfigure each BIOS setting from memory
- B) Check if a BIOS configuration profile was exported before maintenance
- C) Copy settings from a known-good sibling server via the Redfish API
- D) Flash the BIOS firmware to force a settings reload
### If you chose A

[Result: You miss the SR-IOV setting and the custom NUMA interleaving config. Server boots but NIC passthrough fails later. Partial fix; 20 more minutes to find what you missed.]

### If you chose B (recommended)

[Result: You find a saved XML BIOS profile in the provisioning repo at `configs/r640-prod-bios.xml`. This is the golden config. Proceed to Round 3.]

### If you chose C

[Result: Works if a matching sibling is available. You pull its config with `racadm get -t xml -f bios.xml`. Gets you to Round 3, but identifying a matching server takes about 10 extra minutes.]

### If you chose D

[Result: A firmware flash does not restore BIOS settings; the settings live in NVRAM, not in the firmware image. You are now on newer firmware, still with default settings, and a 25-minute flash cycle wasted.]
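Whether the profile comes from the repo (option B) or a sibling (option C), the export is a Dell Server Configuration Profile in XML. A hedged sketch of pulling one from a known-good sibling, with a placeholder address and credentials:

```shell
# Placeholder sibling iDRAC; pick a server of the same model and firmware level.
SIBLING=10.0.0.43

# Export the sibling's configuration (BIOS attributes included) as XML.
racadm -r "$SIBLING" -u root -p 'calvin' get -t xml -f r640-sibling-bios.xml

# Sanity-check that the export actually contains attribute entries:
grep -c '<Attribute' r640-sibling-bios.xml
```

Matching on model alone is not enough; BIOS attribute names and defaults can shift between firmware versions, which is why the sibling search costs extra time.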
## Round 3: Root Cause Identification
[Pressure cue: "Change manager is asking why a 10-minute battery swap has turned into a 30-minute outage."]
What you see: Root cause confirmed: CMOS battery replacement cleared NVRAM, which stores BIOS settings. The maintenance SOP did not include a pre-swap BIOS export step.
Choose your action:

- A) Apply the saved BIOS profile via iDRAC and reboot
- B) Apply settings via the Redfish API for faster automation
- C) Manually re-enter settings from the XML file through the BIOS GUI
- D) Skip the BIOS restore and just fix the boot order to get the server up fast
### If you chose A (recommended)

[Result: Profile imported via iDRAC. Server reboots with correct settings: boot order, Turbo Boost, VT-x, and SR-IOV all restored. Proceed to Round 4.]

### If you chose B

[Result: Also correct. A Redfish PATCH to `/redfish/v1/Systems/System.Embedded.1/Bios/Settings` stages the config, which applies on the next reboot. Slightly faster if scripted. Proceed to Round 4.]

### If you chose C

[Result: Works but takes 15 minutes of clicking through BIOS menus. High risk of typos. Gets to Round 4 eventually.]

### If you chose D

[Result: Server boots the OS, but performance is degraded (no Turbo Boost) and VMs fail to start (no VT-x). You get paged again in 2 hours. Not a real fix.]
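Both recommended restore paths can be scripted. A hedged sketch; the iDRAC address and credentials are placeholders, and the attribute names mirror Dell's BIOS attribute registry but should be checked against your firmware version:

```shell
IDRAC=10.0.0.42   # placeholder address -- substitute your own

# Option A, scripted: import the golden profile as a Server Configuration Profile.
racadm -r "$IDRAC" -u root -p 'calvin' set -t xml -f configs/r640-prod-bios.xml

# Option B: stage individual BIOS attributes over Redfish; they take effect
# on the next reboot, not immediately.
curl -sk -u root:calvin -X PATCH \
  -H 'Content-Type: application/json' \
  -d '{"Attributes": {"ProcTurboMode": "Enabled", "ProcVirtualization": "Enabled"}}' \
  "https://$IDRAC/redfish/v1/Systems/System.Embedded.1/Bios/Settings"
```

The profile import (option A) is the safer default here: it restores every attribute in the export at once, whereas the PATCH route only fixes the attributes you remember to list.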
## Round 4: Remediation
[Pressure cue: "Server is back in the cluster. Now prevent this from happening again."]
Actions:
1. Verify the server rejoined the cluster: `kubectl get nodes | grep <hostname>`
2. Confirm BIOS settings match the golden profile: `racadm get BIOS.SysProfileSettings`
3. Update the maintenance SOP to include mandatory BIOS export before any CMOS work
4. Add a BIOS config backup to the nightly provisioning automation
5. Tag the golden BIOS profile in version control with the firmware version
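Steps 3 and 4 above can be captured as a small pre-maintenance script. A sketch under assumed conventions; the backup path, credentials, and argument handling are illustrative, not an existing tool:

```shell
#!/bin/sh
# Pre-maintenance BIOS export: run before any CMOS or battery work.
# Usage (hypothetical): pre-swap-export.sh <idrac-address>
IDRAC="$1"
STAMP=$(date -u +%Y%m%dT%H%MZ)

# Export the full configuration and keep it under version control
# alongside the golden profile, tagged with a timestamp.
racadm -r "$IDRAC" -u root -p 'calvin' get -t xml -f "backups/bios-$IDRAC-$STAMP.xml"
```

The same command wired into nightly provisioning automation covers step 4: a fresh export always exists, even if a technician skips the SOP step.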
## Damage Report
- Total downtime: 35 minutes (server out of cluster)
- Blast radius: Cluster capacity reduced by one node; no pod evictions occurred
- Optimal resolution time: 12 minutes (inspect -> import profile -> reboot -> verify)
- If every wrong choice was made: 90+ minutes plus secondary incidents from missing settings
## Cross-References
- Primer: Datacenter & Server Hardware
- Primer: Redfish API
- Primer: Dell PowerEdge Servers
- Footguns: Firmware