Bare-Metal Provisioning Footguns¶
Mistakes that brick servers, corrupt installs, or leave your fleet in an inconsistent state.
1. PXE-booting the wrong server¶
You type the wrong BMC IP. The IPMI command power-cycles a production database server and force-PXE-boots it. The OS is wiped. The data is gone.
Fix: Triple-check BMC IPs against your inventory. Use host-specific DHCP reservations so only the intended MAC gets a kickstart. Name your BMC connections clearly in your SSH config.
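One way to scope a kickstart to a single machine is a host-specific reservation in ISC dhcpd. A minimal sketch — the hostname, MAC, IPs, and filename below are placeholders:

```
# Only this MAC gets the PXE boot parameters for the reinstall.
host web-42 {
  hardware ethernet aa:bb:cc:dd:ee:ff;   # the target server's NIC MAC, not its BMC
  fixed-address 10.0.10.42;
  next-server 10.0.0.5;                  # TFTP/provisioning server
  filename "undionly.kpxe";
}
```

Servers without a matching host block fall through to normal boot, so a fat-fingered power-cycle can't accidentally reinstall them.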
2. Kickstart without zerombr and clearpart¶
The server has existing partitions or RAID metadata from a previous install. The kickstart installer hits a conflict and drops to an interactive prompt — which nobody can see because it's headless.
Fix: Always include zerombr and clearpart --all --initlabel in kickstart files. Wipe disks explicitly.
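In kickstart syntax that looks like the following (the partition scheme is an example; substitute your standard layout):

```
# Initialize any invalid/previous partition table without prompting
zerombr
# Remove all existing partitions and write a fresh disk label
clearpart --all --initlabel
# Example layout only -- replace with your standard scheme
autopart --type=lvm
```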
3. No serial console in kernel arguments¶
The server boots the installer but the console output goes to a VGA port that's connected to nothing. You have no visibility into what's happening. The install may hang on a prompt.
Fix: Always add console=tty0 console=ttyS1,115200n8 to the kernel boot arguments (the serial port number and baud rate vary by platform; check which UART your BMC's serial-over-LAN is wired to). Configure the kickstart for fully unattended operation.
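In a PXELINUX config that looks something like this (the paths and URL are placeholders, and ttyS1 is an example; many platforms route serial-over-LAN to ttyS0 instead):

```
DEFAULT install
LABEL install
  KERNEL vmlinuz
  APPEND initrd=initrd.img console=tty0 console=ttyS1,115200n8 inst.ks=http://provision.example.com/ks.cfg inst.text
```

The inst.ks and inst.text options are Anaconda (RHEL-family) installer arguments; inst.text forces text mode so the install never waits on a graphical prompt nobody can see.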
4. BMC on the production network¶
The BMC interface has full hardware control — power, console, boot device. If it's on the same network as production traffic, anyone who compromises a server can pivot to hardware-level control of every other server.
Fix: BMCs go on an isolated OOB management VLAN. No routing to production networks. Access only via jumpbox.
CVE: CVE-2019-6260 ("Pantsdown") — a vulnerability in the ASPEED AST2400/AST2500 BMC chips found in servers from many vendors allowed unauthenticated read/write access to BMC physical memory via the host interface. BMCs on production networks made this remotely exploitable.
5. Default BMC credentials in production¶
Every vendor ships well-known default credentials, and provisioning pipelines rarely change them. Stale defaults mean anyone with network access to the BMC gets unauthorized power-cycles and console access.
Fix: Rotate BMC credentials as part of the provisioning pipeline. Store in a secrets manager. Audit quarterly.
Default trap: Dell iDRAC ships with root/calvin. HPE iLO ships with a sticker password that gets lost. Supermicro IPMI defaults to ADMIN/ADMIN. Shodan indexes thousands of internet-exposed BMCs with default credentials.
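A rotation step can be scripted around ipmitool. This is a sketch, not a drop-in tool: it assumes user ID 2 is the built-in admin account (true for iDRAC's root, but verify with `ipmitool user list 1` on your hardware), and the secrets-manager write is left as a comment.

```python
import secrets
import string
import subprocess

def generate_password(length: int = 16) -> str:
    """Random password from letters and digits (some BMCs reject symbols)."""
    alphabet = string.ascii_letters + string.digits
    return "".join(secrets.choice(alphabet) for _ in range(length))

def rotate_cmd(bmc_ip: str, admin_user: str, old_pw: str, new_pw: str,
               user_id: int = 2) -> list[str]:
    """Build the ipmitool invocation that sets a new password for user_id.

    User ID 2 is an assumption (common for the built-in admin); confirm
    the ID on your hardware before running this fleet-wide.
    """
    return ["ipmitool", "-I", "lanplus", "-H", bmc_ip,
            "-U", admin_user, "-P", old_pw,
            "user", "set", "password", str(user_id), new_pw]

def rotate(bmc_ip: str, admin_user: str, old_pw: str) -> str:
    new_pw = generate_password()
    subprocess.run(rotate_cmd(bmc_ip, admin_user, old_pw, new_pw), check=True)
    # Write new_pw to your secrets manager here, before returning,
    # so a crash after the IPMI call can't strand the credential.
    return new_pw
```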
6. TFTP-only PXE for large images¶
Transferring a 100 MB initrd over TFTP at the default 512-byte block size takes forever: TFTP is lock-step UDP, one block per round trip, so any packet loss stalls the transfer until a retransmit timeout. Half your servers fail to provision because TFTP timed out.
Fix: Chainload to iPXE and download images over HTTP. TFTP should only serve the tiny iPXE binary (~200 KB).
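The standard chainload trick in ISC dhcpd keys off the user-class option that iPXE sets on its second DHCP request (the hostname below is a placeholder):

```
# Plain PXE firmware gets the tiny iPXE binary over TFTP;
# once iPXE is running, its second DHCP request gets an HTTP boot script.
if exists user-class and option user-class = "iPXE" {
    filename "http://provision.example.com/boot.ipxe";
} else {
    filename "undionly.kpxe";
}
```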
7. No post-install validation¶
The kickstart completes. You add the server to the load balancer. But the NIC bonding is wrong, NTP isn't synced, and SELinux is disabled. You find out when the audit hits.
Fix: Run an automated validation script as the last step of provisioning. Don't mark a server as ready until validation passes. Phone-home to the provisioning server with results.
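A minimal validation harness might look like the following sketch. The two checks and the phone-home URL are examples (the `timedatectl` and `getenforce` calls assume a systemd host with SELinux); a real baseline would also cover bonding, DNS, and disk layout.

```python
import json
import subprocess
import urllib.request

def run_checks() -> dict[str, bool]:
    """Each check shells out and returns pass/fail. Extend per your baseline."""
    return {
        "ntp_synced": "yes" in subprocess.run(
            ["timedatectl", "show", "-p", "NTPSynchronized", "--value"],
            capture_output=True, text=True).stdout,
        "selinux_enforcing": subprocess.run(
            ["getenforce"], capture_output=True, text=True
        ).stdout.strip() == "Enforcing",
    }

def ready(results: dict[str, bool]) -> bool:
    """A server is ready only if every single check passed."""
    return all(results.values())

def phone_home(url: str, results: dict[str, bool]) -> None:
    """POST results back to the provisioning server (URL is an example)."""
    req = urllib.request.Request(url, data=json.dumps(results).encode(),
                                 headers={"Content-Type": "application/json"})
    urllib.request.urlopen(req)
```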
8. Firmware update without staging¶
You push a BIOS update to 50 servers at once. The update has a bug that causes boot loops. All 50 servers are now down simultaneously.
Fix: Stage firmware updates. Update 1 server first. Validate for 24 hours. Then do batches of 5-10. Never update the entire fleet at once.
War story: A 2019 HPE firmware update for Gen10 servers caused boot failures when the update was applied to systems with certain NVMe drive configurations. Customers who updated entire racks simultaneously lost all compute capacity until HPE issued a hotfix 48 hours later.
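The staging policy above can be expressed as a batch generator. This sketch only encodes the canary-first ordering and batch size; soak times and abort-on-failure belong in the caller that consumes the batches.

```python
from collections.abc import Iterator

def staged_batches(hosts: list[str], batch_size: int = 5) -> Iterator[list[str]]:
    """Yield a single canary host first, then the rest in small batches.

    The caller updates one batch, validates (and soaks, e.g. 24 h after
    the canary), and stops iterating on any failure.
    """
    if not hosts:
        return
    yield [hosts[0]]                      # canary: one server, validated alone
    rest = hosts[1:]
    for i in range(0, len(rest), batch_size):
        yield rest[i:i + batch_size]
```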
9. ONIE installer cached on switch¶
You update the Cumulus Linux image on your provisioning server. But the switch still has the old installer cached in ONIE. It installs the old version. You don't notice until the switch config fails because it expects the new version.
Fix: Force ONIE to re-discover by clearing the installer cache. Verify the installed version after provisioning.
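Version verification can be as simple as parsing /etc/os-release fetched from the switch after install (e.g. over SSH). A sketch, assuming the NOS follows the os-release convention with a VERSION_ID field, as Debian-based systems like Cumulus Linux generally do:

```python
def installed_version(os_release_text: str) -> str:
    """Extract VERSION_ID from /etc/os-release text fetched off the switch."""
    for line in os_release_text.splitlines():
        if line.startswith("VERSION_ID="):
            return line.split("=", 1)[1].strip().strip('"')
    return ""

def verify_version(expected: str, os_release_text: str) -> bool:
    """Fail provisioning loudly if the switch booted an old cached installer."""
    return installed_version(os_release_text) == expected
```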
10. No inventory reconciliation¶
You provision 10 servers. Three of them have the wrong hostname because the inventory CSV had a duplicate MAC. Two servers now fight over the same IP. The DNS is wrong. Config management applies the wrong role.
Fix: Validate inventory data before provisioning. Check for duplicate MACs, IPs, and hostnames. Use a CMDB as the authoritative source and generate kickstart configs from it programmatically.
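The duplicate checks can be sketched in a few lines of Python. The column names (hostname, mac, ip) are an assumption about your CSV layout; MACs are normalized to lowercase before comparison so aa:bb and AA:BB collide as they should.

```python
import csv
from collections import Counter
from io import StringIO

def find_duplicates(rows: list[dict[str, str]]) -> dict[str, list[str]]:
    """Return values that appear more than once, keyed by field name."""
    dupes: dict[str, list[str]] = {}
    for field in ("mac", "ip", "hostname"):
        counts = Counter(row[field].strip().lower() for row in rows)
        repeated = [value for value, n in counts.items() if n > 1]
        if repeated:
            dupes[field] = repeated
    return dupes

def validate_inventory(csv_text: str) -> dict[str, list[str]]:
    """Parse inventory CSV (columns: hostname,mac,ip) and report duplicates.

    Refuse to generate kickstart configs if this returns anything.
    """
    rows = list(csv.DictReader(StringIO(csv_text)))
    return find_duplicates(rows)
```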