Datacenter Footguns¶
Mistakes that take down hardware, lose data, or cause outages that no software fix can solve.
1. Single power feed to a rack¶
You connect all servers in a rack to one PDU on one power circuit. The circuit breaker trips. Every server in the rack goes down simultaneously. Your "redundant" application wasn't redundant — all replicas were in the same rack.
Fix: Dual power supplies connected to A+B power feeds from separate PDUs/circuits. Spread replicas across racks.
War story: In 2017, a major British Airways data center lost both A and B power feeds simultaneously due to an engineer's incorrect UPS configuration change. The resulting power surge damaged servers across multiple racks. The outage lasted 3 days, affected 75,000 passengers, and cost an estimated 80 million GBP. Dual feeds only provide redundancy if they are truly independent — separate UPS units, separate transfer switches, and separate utility circuits.
2. RAID 5 on large drives¶
You build a RAID 5 array with 8TB drives. A drive fails. During the 12-hour rebuild, a second drive fails or hits an unrecoverable read error (URE) — statistically likely when large drives are read end-to-end under heavy I/O. The entire array is lost.
Fix: Use RAID 6 (dual parity) or RAID 10 for large drives. RAID 5 is only suitable for small arrays with small drives. Always have verified backups regardless of RAID level.
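The rebuild risk can be estimated from the drive's URE spec sheet. A rough sketch, modeling URE arrivals as a Poisson process (a simplification) and assuming the common consumer-drive spec of one URE per 10^14 bits read:

```shell
# Approximate probability (in %) of at least one URE while reading the
# surviving drives during a rebuild. Args: surviving drive count, TB per
# drive, and the drive's URE spec in bits-read-per-error.
ure_risk_pct() {
  awk -v n="$1" -v tb="$2" -v spec="$3" 'BEGIN {
    bits = n * tb * 1e12 * 8                      # bits that must be read error-free
    printf "%.0f", 100 * (1 - exp(-bits / spec))
  }'
}
ure_risk_pct 7 8 1e14   # 7 surviving 8TB drives, 1-per-1e14 spec -> 99
```

With enterprise drives rated at one URE per 10^15 bits the same rebuild drops to roughly a one-in-three risk, which is why the drive class matters as much as the RAID level.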
3. Firmware update on the production cluster at 2pm¶
You update the BIOS or NIC firmware on a production server during business hours. The update requires a reboot, which takes 15 minutes of memory training and POST checks. Or the server doesn't come back at all: the flash failed.
Fix: Firmware updates during maintenance windows only. Test on identical non-production hardware first. Have BMC/IPMI access to recover from failed updates.
4. PXE boot targeting the wrong VLAN¶
You set up PXE boot for new servers but the DHCP scope overlaps with production. A production server reboots, hits PXE, and gets reimaged with a blank OS. Your production database server is now running a fresh Ubuntu install.
Fix: Isolate PXE/provisioning on a separate VLAN. Use DHCP reservations. Disable PXE boot in BIOS on production servers after initial provisioning.
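A sketch of the isolation in ISC dhcpd terms, assuming a dedicated provisioning VLAN; all addresses here are illustrative. Production subnets simply declare no next-server or filename, so a production box that accidentally PXE boots gets nothing to chain-load:

```
# dhcpd.conf -- provisioning VLAN only; production subnets carry no PXE options
subnet 10.99.0.0 netmask 255.255.255.0 {
  range 10.99.0.100 10.99.0.200;
  next-server 10.99.0.10;        # TFTP server (DHCP option 66 equivalent)
  filename "grubx64.efi";        # boot file (option 67); pxelinux.0 for legacy BIOS
}
```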
5. No labeled cables¶
You need to swap a network cable. None of them are labeled. You unplug what you think is the right one. It was the uplink for the entire rack. 48 servers go offline.
Fix: Label both ends of every cable. Use color-coded cables (e.g., yellow for management, blue for production). Maintain a cable map. Use a flashlight and trace before unplugging.
6. Running a degraded RAID array "until the replacement arrives"¶
A drive fails in your RAID array. You order a replacement. The drive arrives in 3 days. During those 3 days, you run degraded with no fault tolerance. A second drive fails on day 2.
Fix: Keep hot spares. Modern RAID controllers automatically rebuild to a hot spare. Order replacements immediately and treat RAID degradation as a P1 incident.
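Degradation is only a P1 if something notices it. A monitoring sketch for Linux software RAID, parsing /proc/mdstat, where the member map shows one `_` per missing disk (e.g. `[UU_]`):

```shell
# Print the name of every md array whose member map contains '_' (missing disk).
degraded_md_arrays() {
  awk '/^md/ { dev = $1 }                # remember the current array name
       /\[[U_]*_[U_]*\]/ { print dev }'  # bracket group with an underscore = degraded
}
# usage: degraded_md_arrays < /proc/mdstat   (cron this; page if output is nonempty)
```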
7. Ignoring SMART warnings¶
Your monitoring shows SMART warnings — reallocated sectors, pending sectors, read errors. You dismiss them as "the drive is probably fine." The drive fails completely during peak traffic, taking a RAID rebuild with it.
Fix: Replace drives at the first sign of SMART degradation. Automate SMART monitoring and alerting. A $200 drive is cheaper than a data recovery engagement.
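A sketch of the "replace at first sign" policy, parsing the attribute table printed by `smartctl -A /dev/sdX` (column 1 is the attribute ID, column 10 the raw value; IDs 5, 197, and 198 are reallocated, pending, and offline-uncorrectable sectors):

```shell
# Exit 0 (replace the drive) if any of the three sector-health attributes is nonzero.
smart_replace_needed() {
  awk '($1 == 5 || $1 == 197 || $1 == 198) && $10 > 0 { bad = 1 }
       END { exit !bad }'
}
# usage: smartctl -A /dev/sda | smart_replace_needed && echo "replace /dev/sda"
```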
8. Testing UPS by pulling the plug unannounced¶
You want to verify the UPS works. You pull the power cord from the wall. The UPS batteries are 5 years old and can only hold the load for 30 seconds instead of 15 minutes. Servers start shutting down.
Fix: Test UPS in a controlled manner during maintenance windows. Replace batteries on schedule (every 3-5 years). Monitor UPS health continuously. Do load testing with graceful shutdown scripted.
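The check worth automating is whether the reported battery runtime still covers a graceful shutdown with margin. A sketch assuming NUT, whose `upsc <ups> battery.runtime` reports estimated seconds remaining:

```shell
# Exit 0 if battery runtime covers the shutdown sequence with a 2x safety margin.
runtime_ok() {  # args: runtime_seconds shutdown_seconds
  awk -v r="$1" -v s="$2" 'BEGIN { exit !(r >= 2 * s) }'
}
# usage: runtime_ok "$(upsc myups battery.runtime)" 300   (alert if this fails)
```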
9. Hot aisle / cold aisle violations¶
Someone installs a server backwards — exhaust facing the cold aisle. Or blanking panels are missing. Hot air recirculates into the cold aisle. Inlet temperatures rise. Thermal throttling kicks in. Performance degrades silently across multiple servers.
Fix: Enforce hot/cold aisle containment. Install blanking panels in empty rack units. Monitor inlet temperatures per server. Orient all servers consistently.
10. No out-of-band management configured¶
Your server's OS is frozen. You can't SSH in. You don't have IPMI/BMC configured. The server is in a data center 500 miles away. Someone has to physically visit to press the power button.
Fix: Configure IPMI/BMC on every server during initial provisioning. Verify it works. Keep BMC credentials in a secure, accessible location. Test remote console access before you need it.
11. Capacity planning by gut feel¶
You order 10 servers because "that feels like enough." Traffic doubles in Q4. You need more capacity but the lead time for new hardware is 8 weeks. You're running at 95% utilization with no headroom for a traffic spike.
Fix: Track utilization trends. Plan for 60-70% peak utilization to leave headroom. Order hardware before you need it. Know your vendor's lead time and plan 2 quarters ahead.
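The headroom target translates directly into an order quantity: if n servers run at u% peak and you want peak utilization at or below t%, you need ceil(u × n / t) servers. A sketch:

```shell
servers_needed() {  # args: current_peak_pct current_servers target_pct
  awk -v u="$1" -v n="$2" -v t="$3" 'BEGIN {
    need = u * n / t
    printf "%d", (need > int(need)) ? int(need) + 1 : need   # ceiling
  }'
}
servers_needed 95 10 65   # 10 servers at 95% -> 15 (peak drops to ~63%)
```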
OOB Management & Provisioning Footguns¶
12. Leaving BMC credentials at factory defaults¶
Every iDRAC ships with root/calvin. Every iLO ships with a sticker password that someone photographed during rack-and-stack. An attacker (or a bored intern) discovers the BMC on your management VLAN and now has full hardware control -- power cycle, virtual media mount, console access, firmware flash.
Fix: Automate BMC credential provisioning as part of initial rack config. Use Redfish or racadm/hponcfg scripts. No server leaves staging with default creds.
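A sketch of the staging step using plain ipmitool. User slot 2 is typically the default admin account on Dell and Supermicro BMCs, but verify on your hardware; root/calvin are the iDRAC factory defaults mentioned above, and DRY_RUN=1 prints the commands instead of running them:

```shell
# Rotate the default admin slot on a freshly racked BMC. Assumes the
# factory defaults still work (i.e. the box just left the pallet).
provision_bmc_creds() {  # args: bmc_host new_user new_pass
  run() { if [ "${DRY_RUN:-0}" = 1 ]; then echo "$@"; else "$@"; fi; }
  local base="ipmitool -I lanplus -H $1 -U root -P calvin"
  run $base user set name 2 "$2"
  run $base user set password 2 "$3"
  run $base user priv 2 4 1   # privilege level 4 = ADMINISTRATOR, channel 1
}
```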
13. Putting BMC interfaces on the production network¶
The BMC is a tiny computer running ancient firmware with a patching cadence measured in years. Exposing it to the production VLAN means every CVE in the BMC firmware is exploitable from any compromised workload.
Fix: Dedicated management VLAN with ACLs. BMC traffic never touches production. Access the management VLAN through a jump host or VPN. No exceptions.
14. Power-cycling a server instead of graceful shutdown via IPMI¶
The server is unresponsive. You fire ipmitool chassis power cycle immediately. The OS had a hung process but the filesystem was still live. You just caused a dirty shutdown -- potential fsck on boot, journal corruption, database crash recovery that takes 45 minutes.
Fix: Try ipmitool chassis power soft first (sends ACPI shutdown signal). Wait 60 seconds. Check SOL console for activity. Only hard-cycle if soft power fails and you have confirmed the OS is truly unresponsive.
15. Not testing PXE boot before decommissioning the old provisioning server¶
You migrate your DHCP/TFTP/kickstart infrastructure to a new server. Next week someone racks new hardware. PXE boot fails -- DHCP options 66/67 (TFTP server and boot filename) still point to the old server.
Fix: After any provisioning infrastructure change, PXE boot a test machine end-to-end. Keep the old server available for 30 days as a rollback.
16. Hardcoding MAC addresses in DHCP reservations for provisioning¶
A NIC gets replaced during hardware repair. The new NIC has a different MAC. The server PXE boots into the wrong profile -- or doesn't boot at all. At scale, your DHCP config becomes an unmanageable spreadsheet of MACs.
Fix: Use DHCP pools scoped to your provisioning VLAN. Identify servers by serial number, asset tag, or system UUID in the kickstart/preseed workflow -- not by MAC address.
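A sketch of that identity lookup inside a kickstart %pre script, keying on the DMI serial number instead of the MAC. The serial prefixes, profile names, and provisioning URL are all hypothetical:

```shell
# %pre fragment: pick an install profile from the chassis serial number.
profile_for_serial() {
  case "$1" in
    DB-*)  echo db-node.cfg ;;    # hypothetical naming convention
    WEB-*) echo web-node.cfg ;;
    *)     echo default.cfg ;;
  esac
}
# serial="$(dmidecode -s system-serial-number)"
# curl -o /tmp/profile.ks "http://provision.example/$(profile_for_serial "$serial")"
```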
17. Forgetting to set boot order back after PXE provisioning¶
You set a server to PXE boot for OS install. The install completes. You forget to reset the boot order. Next reboot, the server PXE boots again and starts reinstalling the OS. Production workload gone.
Fix: Use ipmitool chassis bootdev pxe for one-shot PXE (bootdev applies to the next boot only unless you pass options=persistent; add options=efiboot on UEFI systems). Or have the kickstart post-install script reset the boot order.
18. Running firmware updates on BMC during production hours¶
BMC firmware updates require a BMC reboot. During the BMC reboot (2-5 minutes), you lose all out-of-band access. If the host OS crashes during that window, you cannot recover remotely. A failed firmware flash can brick the BMC.
Fix: Schedule BMC firmware updates during maintenance windows. Update one server at a time in each rack. Verify BMC comes back online before moving to the next.
19. Ignoring SOL console output during boot failures¶
The server is stuck in a boot loop. You check the BMC web UI, see the power state, try power cycling. You never open the SOL console. The answer is right there -- a BIOS error about a failed DIMM, a GRUB prompt waiting for input, a kernel panic with a clear message.
Fix: SOL console is your first tool for boot failures. ipmitool -I lanplus -H <bmc-ip> -U admin -P <pass> sol activate. Read the screen. 90% of boot failures have a visible error message.
20. Not validating kickstart/preseed files before deploying to production¶
Your kickstart file has a typo in the disk partitioning section. The next 30 servers all install with the wrong partition layout.
Fix: Validate kickstart files with ksvalidator (from pykickstart). Test every change against a VM before deploying. Version control your kickstart files with code review.
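A sketch of that gate as a CI or pre-commit step. The validator command is passed in as a parameter so the same loop works for other linters; ksvalidator itself comes from pykickstart, per the fix above:

```shell
# Fail the build if any config file fails validation; report every failure.
validate_with() {  # args: validator_cmd files...
  cmd=$1; shift
  rc=0
  for f in "$@"; do
    "$cmd" "$f" || { echo "INVALID: $f" >&2; rc=1; }
  done
  return $rc
}
# usage: validate_with ksvalidator kickstarts/*.ks
```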
21. Using the BMC virtual media mount for production OS installs¶
You mount an ISO via the BMC virtual console to install the OS on 10 servers. Each install takes 45 minutes because the ISO is being streamed over the BMC's 100Mbps management interface.
Fix: Virtual media is for emergencies and one-offs. Any repeatable install workflow belongs in PXE + kickstart/preseed. Invest the time to set up PXE once; it pays for itself after 3 servers.