Incident Replay: PXE Boot Fails — UEFI Mismatch

Setup

  • System context: New server batch being provisioned via PXE. DHCP and TFTP infrastructure serves both legacy BIOS and UEFI boot clients. 4 of 10 new servers fail to PXE boot.
  • Time: Wednesday 09:00 UTC
  • Your role: Provisioning engineer

Round 1: Alert Fires

[Pressure cue: "Provisioning dashboard shows 4 servers stuck at 'Waiting for PXE boot.' Deployment schedule has these servers needed by end of day."]

What you see: The 4 failing servers show "PXE-E16: No valid offer received" on the console. The other 6 servers PXE booted fine. All servers are the same model (Dell R760).

Choose your action:

  • A) Check DHCP server logs for DISCOVER packets from the failing servers
  • B) Restart the TFTP server
  • C) Check if the failing servers have different BIOS/UEFI settings than the working ones
  • D) Try manually assigning IPs to the failing servers' MACs in DHCP

If you chose A:

[Result: DHCP logs show DISCOVER packets from the failing servers with architecture type 0x0007 (UEFI x64). The working servers sent type 0x0000 (Legacy BIOS). The DHCP server is only configured to serve the legacy PXE boot filename (pxelinux.0), not the UEFI filename (shimx64.efi). Proceed to Round 2.]

If you chose B:

[Result: TFTP is fine — the failing servers never get far enough to contact TFTP. The problem is at the DHCP offer stage.]

If you chose C:

[Result: Correct instinct — the failing servers are set to UEFI boot while the working ones are Legacy. But you still need to check why UEFI PXE fails. Slower path to Round 2.]

If you chose D:

[Result: Static DHCP assignments do not help — the issue is the boot filename in the DHCP offer, not the IP assignment.]
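
The architecture type seen in the DHCP logs comes from DHCP option 93 in the client's DISCOVER packet. As a rough illustration of what the log line is reporting, the sketch below walks a raw DHCP options field (the bytes after the magic cookie) and extracts the option 93 values. The function names and the sample bytes are hypothetical, and the name table covers only the two types that appear in this incident.

```python
import struct

# Only the two architecture types seen in this incident; other values exist.
ARCH_NAMES = {
    0x0000: "Legacy BIOS",
    0x0007: "UEFI x64",
}


def client_arch_types(options: bytes) -> list:
    """Return the 16-bit architecture types carried in DHCP option 93, if any."""
    i = 0
    while i < len(options):
        code = options[i]
        if code == 0:          # pad option: single byte, no length field
            i += 1
            continue
        if code == 255:        # end option
            break
        if i + 1 >= len(options):
            break              # truncated option, stop parsing
        length = options[i + 1]
        value = options[i + 2 : i + 2 + length]
        if code == 93:
            count = len(value) // 2
            # Option 93 is a list of big-endian 16-bit architecture types.
            return list(struct.unpack(f"!{count}H", value[: count * 2]))
        i += 2 + length
    return []


def describe_arch(arch_type: int) -> str:
    """Human-readable label for an architecture type, with a hex fallback."""
    return ARCH_NAMES.get(arch_type, f"unknown (0x{arch_type:04x})")


# Hypothetical options field: message type DISCOVER, then option 93 = 0x0007.
sample = bytes([53, 1, 1, 93, 2, 0x00, 0x07, 255])
for arch in client_arch_types(sample):
    print(describe_arch(arch))
```

In practice the DHCP server log or a packet capture already shows this value; the sketch is only meant to make clear where 0x0007 and 0x0000 come from.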

Round 2: First Triage Data

[Pressure cue: "Need these servers by EOD. What is the fix?"]

What you see: The DHCP server offers pxelinux.0 (Legacy BIOS bootloader) to all clients. UEFI clients need shimx64.efi or grubx64.efi. The DHCP config does not differentiate based on client architecture.

Choose your action:

  • A) Switch the 4 failing servers to Legacy BIOS boot mode
  • B) Add DHCP option 93 (client architecture) matching to serve the correct boot filename
  • C) Serve the UEFI boot filename to all clients
  • D) Set up a separate DHCP scope for UEFI clients on a different VLAN

If you chose B:

[Result: You add conditional logic to the DHCP config: if option architecture-type = 00:07, serve shimx64.efi; otherwise serve pxelinux.0. Restart dhcpd. UEFI clients now get the correct bootloader. Proceed to Round 3.]

If you chose A:

[Result: Legacy BIOS mode gets the servers booting, but UEFI is the correct mode for this hardware. Secure Boot requires UEFI, and booting from GPT disks larger than 2 TB generally depends on it as well. This is a step backward.]

If you chose C:

[Result: Legacy BIOS clients cannot boot shimx64.efi. You break the 6 working servers.]

If you chose D:

[Result: Overkill and operationally complex. DHCP option matching handles this cleanly on a single scope.]
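
The architecture matching in this round boils down to one decision: type 0x0007 gets shimx64.efi, everything else gets pxelinux.0. The sketch below expresses that rule as a Python function and renders a matching config fragment. It assumes ISC dhcpd syntax, where option 93 has to be declared before it can be matched; the function names are hypothetical and the rendered text should be checked against your own dhcpd.conf before use.

```python
LEGACY_BOOTFILE = "pxelinux.0"
UEFI_X64_BOOTFILE = "shimx64.efi"
UEFI_X64_ARCH = 0x0007


def boot_filename(arch_type: int) -> str:
    """Pick the boot filename for a client, mirroring the rule from this round."""
    if arch_type == UEFI_X64_ARCH:
        return UEFI_X64_BOOTFILE
    return LEGACY_BOOTFILE


def render_dhcpd_snippet() -> str:
    """Render the conditional as dhcpd.conf text (assumes ISC dhcpd syntax)."""
    # dhcpd compares the 16-bit option as colon-separated hex bytes, e.g. 00:07.
    arch_hex = f"{UEFI_X64_ARCH >> 8:02x}:{UEFI_X64_ARCH & 0xFF:02x}"
    return "\n".join([
        "option architecture-type code 93 = unsigned integer 16;",
        f"if option architecture-type = {arch_hex} {{",
        f'    filename "{UEFI_X64_BOOTFILE}";',
        "} else {",
        f'    filename "{LEGACY_BOOTFILE}";',
        "}",
    ])


print(render_dhcpd_snippet())
```

Keeping the rule in one place like this also makes it easy to extend if other architecture types show up in the fleet later.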

Round 3: Root Cause Identification

[Pressure cue: "PXE booting now. Why did this happen?"]

What you see: Root cause: The PXE infrastructure was set up when all servers used Legacy BIOS. New server models default to UEFI. The DHCP configuration was never updated to handle both boot modes.

Choose your action:

  • A) Document the DHCP config change and update the provisioning runbook
  • B) Standardize all servers on UEFI going forward
  • C) Add automated testing for both UEFI and Legacy PXE boot paths
  • D) All of the above

If you chose D:

[Result: Config documented, UEFI standardized for new hardware, automated PXE boot tests added to provisioning CI. Proceed to Round 4.]

If you chose A:

[Result: Documentation helps but does not prevent the next gap.]

If you chose B:

[Result: Good standard but legacy servers still exist and need PXE support.]

If you chose C:

[Result: Testing catches regressions but you also need the documentation and standards.]

Round 4: Remediation

[Pressure cue: "All 10 servers provisioned. Close out."]

Actions:

  1. Verify all 4 UEFI servers completed PXE installation
  2. Verify DHCP is serving correct boot filenames for both architectures
  3. Update provisioning documentation with UEFI/Legacy handling
  4. Add PXE boot smoke test to the provisioning pipeline
  5. Audit fleet for any other servers that may have UEFI/Legacy mismatches
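
One possible building block for the smoke test in the actions above is a check that the TFTP server can hand out both bootloaders. The sketch below sends a TFTP read request for each file and treats a DATA reply as success. It is a partial check under stated assumptions: it does not exercise the DHCP offer path where this incident actually occurred, the function names are hypothetical, and the TFTP host is whatever your environment uses.

```python
import socket
import struct

BOOT_FILES = ("pxelinux.0", "shimx64.efi")
TFTP_PORT = 69


def build_rrq(filename: str) -> bytes:
    """Build a TFTP read request (opcode 1) in octet mode."""
    return struct.pack("!H", 1) + filename.encode("ascii") + b"\x00octet\x00"


def tftp_file_available(host: str, filename: str, timeout: float = 3.0) -> bool:
    """Return True if the server answers the read request with a DATA packet."""
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        sock.settimeout(timeout)
        sock.sendto(build_rrq(filename), (host, TFTP_PORT))
        try:
            reply, server = sock.recvfrom(1024)
        except OSError:        # timeout or ICMP unreachable
            return False
        if len(reply) < 4:
            return False
        opcode = struct.unpack("!H", reply[:2])[0]
        if opcode != 3:        # 3 = DATA; 5 = ERROR (e.g. file not found)
            return False
        # Abort the transfer politely instead of downloading the whole file.
        sock.sendto(struct.pack("!HH", 5, 0) + b"smoke test done\x00", server)
        return True


def missing_boot_files(host: str) -> list:
    """List the boot files the TFTP server did not serve."""
    return [name for name in BOOT_FILES if not tftp_file_available(host, name)]
```

A pipeline step could call `missing_boot_files` with the TFTP server's address and fail the run if the returned list is not empty.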

Damage Report

  • Total downtime: 0 (new provisioning, not production)
  • Blast radius: 4 servers delayed by 3 hours
  • Optimal resolution time: 15 minutes (check DHCP logs -> add arch matching -> restart -> verify)
  • If every wrong choice was made: 6+ hours with BIOS mode changes and VLAN reconfiguration

Cross-References