
Incident Replay: OS Install Fails — RAID Controller Not Detected

Setup

  • System context: New Dell PowerEdge R750 being provisioned. PXE boot to the OS installer succeeds but no disks are visible during installation.
  • Time: Tuesday 10:00 UTC
  • Your role: Datacenter provisioning engineer

Round 1: Alert Fires

[Pressure cue: "Provisioning pipeline reports failure for 3 of 8 new servers. Deployment deadline is tomorrow. The other 5 installed fine."]

What you see: The OS installer (Ubuntu 22.04) boots via PXE and reaches the disk selection screen. No disks are listed. The server has a PERC H755 RAID controller with 8 SAS drives: 6 in a RAID-6 virtual disk plus 2 hot spares.

Choose your action:

  • A) Check whether the RAID virtual disk was created in the PERC controller BIOS
  • B) Try a different OS installer image
  • C) Check the PERC controller firmware version against the OS compatibility matrix
  • D) Switch the controller mode from RAID to HBA/passthrough

If you chose A:

[Result: PERC BIOS (Ctrl+R at boot) shows the RAID-6 virtual disk is properly configured — 6 drives in RAID-6 + 2 hot spares. The VD exists. The issue is the OS installer not seeing it.]

If you chose C:

[Result: The PERC H755 firmware is 52.16.1. The Ubuntu 22.04 installer kernel (5.15) does not include a megaraid_sas driver that supports this firmware version; it needs kernel 5.19+. The 5 successful servers had PERC H740s (older, fully supported). Proceed to Round 2.]
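One way to confirm a driver gap from a shell inside the installer environment is to compare the controller's PCI modalias string with the alias patterns shipped with the installer kernel. The sketch below assumes you have copied the contents of the kernel's modules.alias file and the device's modalias string into Python; the function name and all PCI IDs in the sample data are hypothetical placeholders, not the real IDs of any PERC controller.

```python
from fnmatch import fnmatchcase


def drivers_claiming(modalias: str, modules_alias_text: str) -> list[str]:
    """Return the kernel modules whose alias patterns match a device modalias."""
    matches = []
    for line in modules_alias_text.splitlines():
        parts = line.split()
        # Expected shape: "alias <pattern> <module>"; skip comments and blanks.
        if len(parts) != 3 or parts[0] != "alias":
            continue
        _, pattern, module = parts
        if fnmatchcase(modalias, pattern) and module not in matches:
            matches.append(module)
    return matches


# Hypothetical IDs for illustration only; read the real string from
# /sys/bus/pci/devices/<address>/modalias on the affected server.
SAMPLE_ALIASES = """\
# Aliases extracted from modules
alias pci:v00001000d0000AAAAsv*sd*bc*sc*i* megaraid_sas
alias pci:v00008086d0000BBBBsv*sd*bc*sc*i* e1000e
"""
old_controller = "pci:v00001000d0000AAAAsv00001028sd00001F00bc01sc04i00"
new_controller = "pci:v00001000d0000CCCCsv00001028sd00001F01bc01sc04i00"

print(drivers_claiming(old_controller, SAMPLE_ALIASES))  # ['megaraid_sas']
print(drivers_claiming(new_controller, SAMPLE_ALIASES))  # []
```

An empty result for the new controller suggests that no module in that kernel claims the device, which would match the symptom of an empty disk selection screen. It does not prove the cause on its own, so the vendor compatibility matrix remains the authoritative check.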

If you chose B:

[Result: Trying an older installer makes it worse — even less hardware support. Trying a newer daily build might work but is untested.]

If you chose D:

[Result: Switching to HBA mode destroys the RAID configuration. The disks would appear individually but you lose RAID-6 protection.]

Round 2: First Triage Data

[Pressure cue: "5 servers installed, 3 blocked. Provisioning team needs a workaround today."]

What you see: The installer kernel lacks the driver for the PERC H755 on this firmware. Options: load the driver manually during install, use a newer kernel, or update the PERC firmware to a version compatible with the older driver.

Choose your action:

  • A) Inject the megaraid_sas driver into the installer initrd
  • B) Use the Ubuntu 23.04 installer (kernel 6.2), which includes the newer driver
  • C) Downgrade the PERC firmware to a version compatible with the installer kernel
  • D) Install the OS on a USB drive and add the driver post-install

If you chose A:

[Result: Download the Dell driver pack, extract the megaraid_sas.ko built for kernel 5.15, and inject it into the PXE initrd. Re-PXE the servers; the RAID virtual disk now appears in the installer. Proceed to Round 3.]
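The injection step can be done in several ways. One common approach relies on the Linux kernel accepting an initramfs made of several concatenated cpio archives, so an overlay archive holding the extra module can be appended to the stock initrd without unpacking it. The sketch below shows only that concatenation step. It assumes the overlay archive has already been built with the module placed under the matching lib/modules/ tree, and it does not handle module index regeneration or explicit loading of the module. The function name and file names are hypothetical.

```python
import shutil
import tempfile
from pathlib import Path


def append_overlay(base_initrd: Path, overlay: Path, output: Path) -> int:
    """Write base_initrd followed by overlay to output; return bytes written."""
    with output.open("wb") as out:
        for part in (base_initrd, overlay):
            with part.open("rb") as src:
                shutil.copyfileobj(src, out)
    return output.stat().st_size


# Demonstration with stand-in bytes instead of real archives.
with tempfile.TemporaryDirectory() as tmp:
    base = Path(tmp) / "initrd"                 # stock installer initrd (stand-in)
    extra = Path(tmp) / "driver-overlay.cpio"   # hypothetical overlay archive
    combined = Path(tmp) / "initrd.with-driver"
    base.write_bytes(b"BASE")
    extra.write_bytes(b"OVERLAY")
    size = append_overlay(base, extra, combined)
    combined_bytes = combined.read_bytes()

print(size)  # 11
```

Writing to a new output file rather than modifying the stock initrd in place keeps the original image available for the controller models that already install cleanly.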

If you chose B:

[Result: Works but 23.04 is not an LTS release — the production standard is 22.04 LTS. Using a non-standard OS creates support and patching burden.]

If you chose C:

[Result: Firmware downgrade is risky and may not be possible if the RAID VD was created on the newer firmware. Potential data loss.]

If you chose D:

[Result: Installing on USB is fragile and non-standard. The provisioning pipeline expects standard disk install.]

Round 3: Root Cause Identification

[Pressure cue: "Drivers injected, installs proceeding. Update the provisioning pipeline."]

What you see: Root cause: The provisioning pipeline's PXE image did not include drivers for the newest PERC controller model. The 5 successful servers had older controllers. Driver compatibility was not tested before the new hardware was racked.

Choose your action:

  • A) Permanently add the PERC H755 driver to the PXE initrd
  • B) Create a driver injection framework in the provisioning pipeline
  • C) Switch to Ubuntu 24.04 LTS, which includes the driver natively
  • D) All of the above (build framework + upgrade path + immediate fix)

If you chose D:

[Result: This combines the immediate fix (inject the driver), the medium-term fix (a driver framework for any new hardware), and the long-term fix (an OS upgrade path to 24.04). Proceed to Round 4.]
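A driver injection framework can start as little more than a lookup table that the pipeline consults before building a PXE image. The following is a minimal sketch under that assumption: the class, its fields, and the overlay file name are all hypothetical, and a real pipeline would likely key on PCI IDs or firmware versions rather than on model names alone.

```python
from dataclasses import dataclass, field


@dataclass
class DriverRegistry:
    """Maps controller model names to the initrd overlays they need."""

    overlays: dict[str, list[str]] = field(default_factory=dict)
    natively_supported: set[str] = field(default_factory=set)

    def plan(self, controller: str) -> list[str]:
        """Return overlays to inject; raise if the model has no known plan."""
        if controller in self.natively_supported:
            return []
        if controller in self.overlays:
            return list(self.overlays[controller])
        raise LookupError(f"no driver plan for controller {controller!r}")


registry = DriverRegistry(
    # Hypothetical overlay file name, for illustration only.
    overlays={"PERC H755": ["megaraid_sas-overlay.cpio.gz"]},
    natively_supported={"PERC H740"},
)

print(registry.plan("PERC H740"))  # []
print(registry.plan("PERC H755"))  # ['megaraid_sas-overlay.cpio.gz']
```

Raising on an unknown model is a deliberate choice in this sketch: it turns "new hardware nobody tested" into a pipeline failure before racking, instead of an empty disk list during installation.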

If you chose A:

[Result: Fixes this controller but the same problem will recur with the next new hardware.]

If you chose B:

[Result: Framework is the right architectural approach but takes development time.]

If you chose C:

[Result: Good long-term but 24.04 may not be validated for production yet.]

Round 4: Remediation

[Pressure cue: "All 8 servers provisioned. Document the driver requirements."]

Actions:

  1. Verify that all 3 servers installed successfully and boot into the installed OS
  2. Verify that the RAID virtual disks are detected: lsblk and megacli -LDInfo -Lall -aALL
  3. Update the provisioning runbook with the PERC H755 driver requirements
  4. Add a hardware compatibility pre-check to the provisioning pipeline
  5. Test the updated PXE image against all current server models in the fleet
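The disk detection check in step 2 can be automated by parsing the JSON that lsblk emits with its -J option. The sketch below assumes the output of a command such as lsblk -J -o NAME,TYPE,SIZE has been captured as a string; the function name and the sample values are illustrative, not output from the servers in this scenario.

```python
import json


def visible_disks(lsblk_json: str) -> list[str]:
    """Return the names of top-level block devices whose type is 'disk'."""
    data = json.loads(lsblk_json)
    return [
        dev["name"]
        for dev in data.get("blockdevices", [])
        if dev.get("type") == "disk"
    ]


# Illustrative sample in the shape produced by `lsblk -J -o NAME,TYPE,SIZE`.
SAMPLE = """
{"blockdevices": [
  {"name": "sda", "type": "disk", "size": "7.3T",
   "children": [{"name": "sda1", "type": "part", "size": "1G"}]},
  {"name": "sr0", "type": "rom", "size": "1024M"}
]}
"""

print(visible_disks(SAMPLE))  # ['sda']
```

A post-install job could fail the provisioning run when this list is empty, which would have flagged the 3 affected servers without anyone watching the installer screen.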

Damage Report

  • Total downtime: 0 (new provisioning, not production)
  • Blast radius: 3 servers delayed by 4 hours
  • Optimal resolution time: 30 minutes (identify driver gap -> inject -> reinstall)
  • If every wrong choice was made: 8+ hours including firmware downgrade risks and non-standard OS installs

Cross-References