Incident Replay: OS Install Fails — RAID Controller Not Detected¶
Setup¶
- System context: New Dell PowerEdge R750 being provisioned. PXE boot to the OS installer succeeds but no disks are visible during installation.
- Time: Tuesday 10:00 UTC
- Your role: Datacenter provisioning engineer
Round 1: Alert Fires¶
[Pressure cue: "Provisioning pipeline reports failure for 3 of 8 new servers. Deployment deadline is tomorrow. The other 5 installed fine."]
What you see: The OS installer (Ubuntu 22.04) boots via PXE and reaches the disk selection screen. No disks are listed. The server has a PERC H755 RAID controller with 8 SAS drives configured in RAID-6.
Choose your action:
- A) Check if the RAID virtual disk was created in the PERC controller BIOS
- B) Try a different OS installer image
- C) Check the PERC controller firmware version against the OS compatibility matrix
- D) Switch the controller mode from RAID to HBA/passthrough
If you chose A:¶
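[Result: The PERC configuration utility (reached through F2 System Setup > Device Settings on this server generation; Ctrl+R applies to older PERC cards) shows the RAID-6 virtual disk is properly configured: 6 drives in RAID-6 plus 2 hot spares. The VD exists, so the problem is that the OS installer cannot see it.]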
If you chose C (recommended):¶
[Result: The PERC H755 firmware is 52.16.1. The Ubuntu 22.04 installer kernel (5.15) does not include the megaraid_sas driver for this firmware version; it needs kernel 5.19+. The 5 successful servers had PERC H740s (older, fully supported). Proceed to Round 2.]
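The comparison behind this result can be sketched as a small version gate. This is a minimal sketch, not an official compatibility matrix: the `MIN_KERNEL` table and the function names are hypothetical, and the only entry is the 5.19 threshold stated in this scenario.

```python
import re
from typing import Optional, Tuple

# Hypothetical table: minimum kernel (major, minor) whose in-tree driver
# handles each controller. The single entry uses this scenario's numbers.
MIN_KERNEL = {"PERC H755": (5, 19)}


def parse_kernel(release: str) -> Tuple[int, int]:
    """Extract (major, minor) from a release string such as '5.15.0-25-generic'."""
    match = re.match(r"(\d+)\.(\d+)", release)
    if match is None:
        raise ValueError(f"unrecognized kernel release: {release!r}")
    return int(match.group(1)), int(match.group(2))


def kernel_supports(controller: str, release: str) -> Optional[bool]:
    """Return True/False for known controllers, None when the table has no entry."""
    minimum = MIN_KERNEL.get(controller)
    if minimum is None:
        return None  # unknown hardware: do not guess
    return parse_kernel(release) >= minimum
```

Returning `None` for controllers missing from the table keeps "not listed" distinct from "known unsupported", which matters when new hardware arrives before the table is updated.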
If you chose B:¶
[Result: Trying an older installer makes it worse — even less hardware support. Trying a newer daily build might work but is untested.]
If you chose D:¶
[Result: Switching to HBA mode destroys the RAID configuration. The disks would appear individually but you lose RAID-6 protection.]
Round 2: First Triage Data¶
[Pressure cue: "5 servers installed, 3 blocked. Provisioning team needs a workaround today."]
What you see: The installer kernel lacks the driver for the PERC H755 on this firmware. Options: load the driver manually during install, use an installer with a newer kernel, or downgrade the PERC firmware to a version the older driver supports.
Choose your action:
- A) Inject the megaraid_sas driver into the installer initrd
- B) Use Ubuntu 23.04 installer (kernel 6.2) which includes the newer driver
- C) Downgrade PERC firmware to a version compatible with the installer kernel
- D) Install the OS on a USB drive and add the driver post-install
If you chose A (recommended):¶
[Result: Download the Dell driver pack, extract the megaraid_sas.ko for kernel 5.15, and inject it into the PXE initrd. Re-PXE the servers; the RAID virtual disk now appears in the installer. Proceed to Round 3.]
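A kernel module normally loads only on the kernel it was built for, so before repacking the initrd it can help to confirm that the extracted module targets the installer's kernel. The sketch below is a first-pass check under assumptions: it compares only the release token of the `vermagic` field reported by `modinfo -F vermagic`, the function names are hypothetical, and the release strings in the usage note are example values rather than values taken from this incident.

```python
import subprocess


def module_vermagic(ko_path: str) -> str:
    """Return the vermagic string that modinfo reports for a .ko file."""
    result = subprocess.run(
        ["modinfo", "-F", "vermagic", ko_path],
        check=True,
        capture_output=True,
        text=True,
    )
    return result.stdout.strip()


def vermagic_matches(vermagic: str, kernel_release: str) -> bool:
    """True if the first vermagic token equals the target kernel release."""
    fields = vermagic.split()
    return bool(fields) and fields[0] == kernel_release
```

For example, `vermagic_matches(module_vermagic("megaraid_sas.ko"), "5.15.0-25-generic")` would gate the injection step, assuming that string is what `uname -r` prints inside the installer environment.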
If you chose B:¶
[Result: Works but 23.04 is not an LTS release — the production standard is 22.04 LTS. Using a non-standard OS creates support and patching burden.]
If you chose C:¶
[Result: Firmware downgrade is risky and may not be possible if the RAID VD was created on the newer firmware. Potential data loss.]
If you chose D:¶
[Result: Installing on USB is fragile and non-standard. The provisioning pipeline expects standard disk install.]
Round 3: Root Cause Identification¶
[Pressure cue: "Drivers injected, installs proceeding. Update the provisioning pipeline."]
What you see: Root cause: The provisioning pipeline's PXE image did not include drivers for the newest PERC controller model. The 5 successful servers had older controllers. Driver compatibility was not tested before the new hardware was racked.
Choose your action:
- A) Permanently add the PERC H755 driver to the PXE initrd
- B) Create a driver injection framework in the provisioning pipeline
- C) Switch to Ubuntu 24.04 LTS which includes the driver natively
- D) All of the above (build framework + upgrade path + immediate fix)
If you chose D (recommended):¶
[Result: Immediate fix (inject driver), medium-term (driver framework for any new hardware), long-term (OS upgrade path to 24.04). Proceed to Round 4.]
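One way to picture the medium-term driver framework is a manifest that maps each controller model to the action the pipeline should take before imaging. This is a sketch under assumptions: `DriverRule`, `MANIFEST`, `plan_batch`, and the hostnames in the usage note are hypothetical, and the two entries simply restate what this scenario says about the H740 and H755.

```python
from dataclasses import dataclass
from typing import Dict


@dataclass(frozen=True)
class DriverRule:
    module: str          # kernel module the controller needs
    in_base_image: bool  # True if the stock PXE image already handles it


# Hypothetical manifest, populated from this incident's findings.
MANIFEST: Dict[str, DriverRule] = {
    "PERC H740": DriverRule(module="megaraid_sas", in_base_image=True),
    "PERC H755": DriverRule(module="megaraid_sas", in_base_image=False),
}


def plan_batch(inventory: Dict[str, str]) -> Dict[str, str]:
    """Map hostname -> 'ok', 'inject:<module>', or 'unknown' per controller."""
    actions = {}
    for host, controller in inventory.items():
        rule = MANIFEST.get(controller)
        if rule is None:
            actions[host] = "unknown"  # new hardware: block and review
        elif rule.in_base_image:
            actions[host] = "ok"
        else:
            actions[host] = f"inject:{rule.module}"
    return actions
```

Calling `plan_batch({"node-01": "PERC H740", "node-06": "PERC H755"})` would flag only `node-06` for injection, which is the split this incident discovered by trial.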
If you chose A:¶
[Result: Fixes this controller but the same problem will recur with the next new hardware.]
If you chose B:¶
[Result: Framework is the right architectural approach but takes development time.]
If you chose C:¶
[Result: Good long-term but 24.04 may not be validated for production yet.]
Round 4: Remediation¶
[Pressure cue: "All 8 servers provisioned. Document the driver requirements."]
Actions:
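1. Verify all 3 servers installed successfully and boot from the RAID virtual disk into the installed OS
2. Verify RAID virtual disks are detected: lsblk and the controller CLI (for example perccli /c0/vall show, or the legacy MegaCli equivalent megacli -LDInfo -Lall -aALL where that tool is installed)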
3. Update the provisioning runbook with PERC H755 driver requirements
4. Add hardware compatibility pre-check to the provisioning pipeline
5. Test the updated PXE image against all current server models in the fleet
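The disk-detection check in step 2 can be automated by parsing `lsblk` JSON output instead of reading it by eye. This is a minimal sketch: the function names are hypothetical, and it assumes a util-linux version whose `lsblk` supports `--json`.

```python
import json
import subprocess
from typing import List


def read_lsblk() -> str:
    """Run lsblk and return its JSON output (top-level devices only)."""
    result = subprocess.run(
        ["lsblk", "--json", "--nodeps", "--output", "NAME,TYPE,SIZE"],
        check=True,
        capture_output=True,
        text=True,
    )
    return result.stdout


def list_disks(lsblk_json: str) -> List[str]:
    """Return the names of devices whose type is 'disk'."""
    data = json.loads(lsblk_json)
    return [
        dev["name"]
        for dev in data.get("blockdevices", [])
        if dev.get("type") == "disk"
    ]
```

A post-install hook could fail the provisioning job when `list_disks(read_lsblk())` is empty, which is the condition the installer hit in Round 1.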
Damage Report¶
- Total downtime: 0 (new provisioning, not production)
- Blast radius: 3 servers delayed by 4 hours
- Optimal resolution time: 30 minutes (identify driver gap -> inject -> reinstall)
- If every wrong choice was made: 8+ hours including firmware downgrade risks and non-standard OS installs
Cross-References¶
- Primer: Datacenter & Server Hardware
- Primer: Bare-Metal Provisioning
- Primer: Dell PowerEdge Servers
- Footguns: Datacenter