Portal | Level: L1: Foundations | Topics: Rack & Stack, Out-of-Band Management, RAID, Firmware / BIOS / UEFI | Domain: Datacenter & Hardware
Datacenter & Server Hardware Ops - Primer¶
Why This Matters¶
Every cloud instance runs on physical hardware in a datacenter. When you manage on-prem infrastructure, troubleshoot cloud performance, or design for reliability, understanding the physical layer gives you an edge most DevOps engineers lack.
Core Concepts¶
The Rack¶
A standard datacenter rack is 42U (rack units) tall. One U = 1.75 inches. A 1U server is the thinnest; a 2U or 4U server has more drive bays, PCIe slots, and cooling capacity.
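The U math above comes up constantly in capacity planning. A quick sketch (the equipment list and sizes are made-up examples):

```shell
# Remaining space in a 42U rack after racking some gear.
RACK_U=42
used=0
for u in 2 2 4 1 1 1; do   # e.g. two 2U servers, one 4U server, ToR switch, patch panel, PDU
  used=$((used + u))
done
echo "Used: ${used}U, free: $((RACK_U - used))U"
# One U = 1.75 in, so a full 42U rack offers 42 * 1.75 = 73.5 inches of mounting space.
```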
Fun fact: The 42U standard comes from the EIA-310 specification (Electronic Industries Alliance). The 19-inch width dates back to 1920s telephone relay racks. The same physical format has been used for over 100 years — from telephone switches to AI GPU clusters.
Key components in a rack:

- Servers (compute) - 1U-4U each
- Network switches - Top-of-Rack (ToR) or End-of-Row (EoR)
- Patch panels - structured cabling termination
- PDUs (Power Distribution Units) - A-feed and B-feed for redundancy
- Cable management - vertical and horizontal organizers
Out-of-Band Management¶
Name origin: "Out-of-band" (OOB) means the management channel is separate from the primary data path. The term comes from telecommunications, where signaling (control) is carried on different frequencies or channels than voice (data). In datacenter context, the BMC's dedicated network port is "out of band" from the server's production NICs — so you can manage a server even when its OS, production network, or primary NIC is completely dead.
Every enterprise server has a dedicated management interface that works even when the OS is dead:

- Dell: iDRAC (Integrated Dell Remote Access Controller)
- HP/HPE: iLO (Integrated Lights-Out)
- Supermicro: IPMI/BMC
- Generic: IPMI (Intelligent Platform Management Interface)
These provide: remote console, power control, hardware monitoring, firmware updates, virtual media mounting.
BIOS/UEFI¶
The firmware layer between hardware and OS. Key settings:

- Boot order - UEFI vs Legacy, PXE boot position
- Performance profiles - max performance vs power saving
- Virtualization - VT-x/VT-d, SR-IOV for NIC passthrough
- Security - TPM, Secure Boot
Storage: RAID¶
RAID (Redundant Array of Independent Disks) provides redundancy and/or performance:
| Level | Min Disks | Redundancy | Use Case |
|---|---|---|---|
| RAID 0 | 2 | None (striping) | Temp data, scratch space |
| RAID 1 | 2 | Mirror | OS drives, boot volumes |
| RAID 5 | 3 | 1 disk failure | General purpose (read-heavy) |
| RAID 6 | 4 | 2 disk failures | Large arrays, safety margin |
| RAID 10 | 4 | 1 per mirror pair | Databases, write-heavy |
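Usable capacity follows directly from the table. A sketch of the arithmetic (disk count and size are example values):

```shell
# Usable TB for n disks of size s (TB) at each RAID level.
n=8; s=4   # example: eight 4TB disks
echo "RAID 0:  $((n * s)) TB usable"        # striping, no redundancy
echo "RAID 5:  $(((n - 1) * s)) TB usable"  # one disk's worth of parity
echo "RAID 6:  $(((n - 2) * s)) TB usable"  # two disks' worth of parity
echo "RAID 10: $((n / 2 * s)) TB usable"    # half the disks hold mirror copies
# RAID 1 is a two-disk mirror: usable = s (here, 4 TB per pair)
```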
Controllers: Dell PERC, HP Smart Array, LSI MegaRAID. CLI tools: perccli, storcli, megacli.
Remember: RAID level mnemonic: "0 is Zero protection, 1 is a Mirror, 5 has One parity, 6 has Two parities, 10 is 1+0 (mirrors then stripes)." For production databases, RAID 10 is the gold standard — best write performance with per-pair redundancy. RAID 5 is common for read-heavy workloads, but rebuild times on large disks (8TB+) are dangerous — a second disk failure during rebuild means total data loss.
Gotcha: RAID is not a backup. RAID protects against hardware disk failure. It does NOT protect against accidental deletion, corruption, ransomware, or controller failure. A deleted file on RAID 10 is deleted across all mirrors simultaneously. Always combine RAID with backups.
Disk Health: SMART¶
Self-Monitoring, Analysis and Reporting Technology. Every disk tracks its own health metrics:

- Reallocated Sector Count - bad sectors remapped; rising count = failing disk
- Current Pending Sector - sectors waiting to be remapped
- Temperature - sustained high temps reduce lifespan
- Power-On Hours - total runtime
Tool: smartctl -a /dev/sda
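smartctl output is easy to check from scripts. A sketch using awk over a sample attribute line (the sample and its raw value are illustrative; column layout varies slightly by drive and smartctl version — a real check would pipe `smartctl -A /dev/sda` into the same awk):

```shell
# Sample line in smartctl's attribute-table format (RAW_VALUE is the last column).
sample='  5 Reallocated_Sector_Ct   0x0033   100   100   010    Pre-fail  Always       -       12'
realloc=$(echo "$sample" | awk '/Reallocated_Sector_Ct/ {print $NF}')
if [ "$realloc" -gt 0 ]; then
  echo "WARNING: $realloc reallocated sectors -- plan a disk replacement"
fi
```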
Provisioning Pipeline¶
How a bare-metal server goes from "box on the dock" to "running in production":
Rack & cable -> BIOS config -> iDRAC setup -> PXE boot -> OS install
-> Ansible config -> Join cluster -> Workload scheduling
Key technologies: PXE, DHCP, TFTP, kickstart/preseed/cloud-init, MAAS, Foreman, Ironic.
Name origin: PXE stands for Preboot eXecution Environment (pronounced "pixie"). It was developed by Intel in 1999 as part of the Wired for Management specification. The NIC firmware contains a PXE ROM that acts as a mini DHCP client and TFTP downloader — enough to bootstrap a full OS installer over the network without any local storage.
Power & Cooling¶
- Redundant power: A+B feeds from separate circuits/UPS
- PDU types: basic, metered (shows per-outlet draw), switched (remote on/off), ATS (automatic transfer switch)
- Cooling: hot aisle / cold aisle containment, blanking panels prevent recirculation
- Monitoring: inlet temp sensors, ASHRAE A1 recommended: 18-27C (64-80F)
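Inlet temperature is easy to alert on from BMC sensor output. A sketch against a sample sensor line (the sample is illustrative; real field layout varies by vendor — the real source would be `ipmitool sensor list`):

```shell
# Sample 'ipmitool sensor' line: name | value | unit | status
sample='Inlet Temp       | 29.000     | degrees C  | ok'
temp=$(echo "$sample" | awk -F'|' '{print int($2)}')
# 27C is the ASHRAE A1 recommended upper bound noted above
if [ "$temp" -gt 27 ]; then
  echo "ALERT: inlet ${temp}C exceeds ASHRAE A1 recommended max (27C)"
fi
```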
What Experienced People Know¶
- A single loose cable can take down a production service. Label everything.
- iDRAC/iLO saves you a 2am datacenter drive. Always verify OOB access works before you need it.
- RAID rebuild on a degraded array with large disks (8TB+) can take 12-24 hours. During rebuild, another disk failure means data loss.
- "It's a hardware problem" is often a firmware problem. Check firmware versions first.
- UPS battery replacement is scheduled maintenance; missing it is an outage waiting to happen.
- The BMC is always on. A "powered off" server still has an active network endpoint. Plan your security model around this.
- PXE is fragile. DHCP races, TFTP timeouts, and UEFI vs BIOS mismatches cause most provisioning failures.
- Serial console is your last resort before driving to the datacenter. Always configure it.
- IPMI over LAN uses UDP 623 and has had multiple remote code execution CVEs. Treat the OOB network as a high-security zone.
- Kickstart %post failures are silent by default. Always add set -e and logging to your post scripts.
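A minimal %post skeleton with the logging that bullet recommends (paths are examples; --log is standard kickstart syntax):

```
%post --log=/root/ks-post.log
set -e                          # abort on the first failing command
echo "post-install started: $(date)"
# ... your configuration steps here ...
%end
```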
OOB Management Deep Dive¶
The BMC — A Computer Inside Your Computer¶
Every modern server has a Baseboard Management Controller: a small, always-on computer with its own NIC, IP, OS, and web interface. It runs even when the host is powered off.
| Vendor | Name | CLI Tool |
|---|---|---|
| Generic | IPMI (standard) | ipmitool |
| HP | iLO | hpilo, web UI |
| Dell | iDRAC | racadm, web UI |
| Supermicro | IPMI/BMC | ipmitool |
| Lenovo | XClarity / IMM | OneCLI, web UI |
What the BMC can do: Power on/off/cycle/reset, remote console (KVM over IP), mount virtual media (ISO over network), read hardware sensors, access BIOS/UEFI settings, view/clear system event logs (SEL).
# Power status
ipmitool -I lanplus -H 10.0.1.50 -U admin -P secret chassis power status
# Power cycle
ipmitool -I lanplus -H 10.0.1.50 -U admin -P secret chassis power cycle
# Read sensor data
ipmitool -I lanplus -H 10.0.1.50 -U admin -P secret sensor list
# Set boot device to PXE for next boot only
ipmitool -I lanplus -H 10.0.1.50 -U admin -P secret chassis bootdev pxe
Serial-over-LAN (SOL)¶
When the BMC KVM is laggy or broken, serial console is the fallback:
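For example (flags mirror the ipmitool examples above; host and credentials are placeholders, and the wrapper echoes instead of executing so the sketch runs without hardware — drop the echo to use it for real):

```shell
# Connect/disconnect Serial-over-LAN via the BMC (lanplus is required for SOL).
# Placeholder host/credentials; echo makes this a dry run -- remove it to execute.
bmc() { echo ipmitool -I lanplus -H 10.0.1.50 -U admin -P secret "$@"; }

bmc sol activate     # attach to the host's serial console
bmc sol deactivate   # from another shell: kill a stuck session
# Inside a live session, the escape sequence ~. (tilde, dot) exits.
```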
SOL is critical for watching boot failures, kernel panics, and GRUB problems that occur before the network stack loads.
The PXE Boot Sequence¶
Server powers on -> NIC sends DHCP Discover (option 60: PXEClient)
-> DHCP replies with IP + next-server (TFTP) + filename (bootloader)
-> NIC downloads bootloader via TFTP -> Bootloader downloads kernel + initrd
-> Kernel boots, installer starts -> Installer fetches kickstart/preseed via HTTP
-> OS installs to disk, reboots -> Cloud-init runs on first boot
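The DHCP/TFTP side of this sequence can be served by a single dnsmasq instance. A minimal sketch (interface name, subnet, and boot filenames are examples; proxy mode leaves your existing DHCP server authoritative for addresses):

```
# /etc/dnsmasq.d/pxe.conf (illustrative)
interface=eth1
dhcp-range=192.168.10.0,proxy           # proxyDHCP: supply boot info only
enable-tftp
tftp-root=/srv/tftp
pxe-service=x86PC,"PXE boot",pxelinux   # BIOS clients get pxelinux.0
dhcp-match=set:efi64,option:client-arch,7
dhcp-boot=tag:efi64,grubx64.efi         # UEFI x64 clients get a GRUB binary
```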
Kickstart, Preseed, and Cloud-init¶
These automate OS installation so no human touches a keyboard. Kickstart for RHEL/CentOS/Rocky, preseed for Debian/Ubuntu, cloud-init for post-install configuration on all distros.
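A minimal cloud-init user-data file for that first-boot step (hostname, user, and key are placeholders):

```
#cloud-config
hostname: node01          # placeholder
timezone: UTC
ssh_pwauth: false         # key-based login only
users:
  - name: ops
    groups: [sudo]        # group name varies by distro (wheel on RHEL-family)
    ssh_authorized_keys:
      - ssh-ed25519 AAAA... ops@example   # placeholder key
```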
OOB Network Security¶
War story: In 2013, researcher Dan Farmer found that IPMI 2.0's RAKP authentication protocol leaks password hashes to unauthenticated attackers — a design flaw in the specification itself, not a vendor bug. This means any BMC reachable over the network can have its admin password hash extracted and cracked offline. This is why the OOB network must be physically isolated from untrusted networks. IPMI CVEs (including remote code execution) are discovered regularly; treat every BMC as a high-value attack target.
- Dedicated OOB VLAN. Never put BMC IPs on the production network.
- Change default credentials. IPMI ships with admin/admin or similar.
- Disable IPMI-over-LAN if using Redfish. Reduce attack surface.
- Patch BMC firmware. BMC vulns give full host control.
Redfish¶
Redfish is the modern replacement for IPMI — a RESTful API (JSON over HTTPS) supported by all major vendors. Use it for automation instead of ipmitool where possible.
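A sketch of pulling power state from the standard ComputerSystem resource (the JSON below is an abridged sample response, not live output; a real call would be `curl -sk -u admin:password https://<bmc>/redfish/v1/Systems/<id>`):

```shell
# Abridged sample of a Redfish ComputerSystem response (illustrative).
response='{"Id": "1", "PowerState": "On", "Status": {"Health": "OK"}}'
# Extract PowerState with grep/cut so the sketch has no jq dependency.
state=$(echo "$response" | grep -o '"PowerState": *"[^"]*"' | cut -d'"' -f4)
echo "Power state: $state"
```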
See Also¶
- Primers: Dell PowerEdge Servers
- Deep dives: Dell PowerEdge, RAID & Storage
- Guides: Bare Metal Provisioning, Dell Server Management, Rack Operations
- Cheatsheet: Datacenter
- Skillcheck: Datacenter
- Scenarios: Server Won't Boot, RAID Degraded
Wiki Navigation¶
Prerequisites¶
- Linux Ops (Topic Pack, L0)
Next Steps¶
- Bare-Metal Provisioning (Topic Pack, L2)
- Case Study: BIOS Settings Reset After CMOS (Case Study, L1)
- Case Study: BMC Clock Skew Cert Failure (Case Study, L2)
- Case Study: Backup Job Failing — iSCSI Target Unreachable, VLAN Misconfigured (Case Study, L2)
- Case Study: Bonding Failover Not Working (Case Study, L1)
- Case Study: Cable Management Wrong Port (Case Study, L1)
- Case Study: Database Replication Lag — Root Cause Is RAID Degradation (Case Study, L2)
- Case Study: Disk Full Root Services Down (Case Study, L1)
Related Content¶
- Dell PowerEdge Servers (Topic Pack, L1) — Firmware / BIOS / UEFI, Out-of-Band Management, RAID
- Skillcheck: Datacenter (Assessment, L1) — Out-of-Band Management, Rack & Stack, RAID
- Bare-Metal Provisioning (Topic Pack, L2) — Out-of-Band Management, PXE / Provisioning, Server Hardware
- Redfish API (Topic Pack, L1) — Firmware / BIOS / UEFI, Out-of-Band Management, Server Hardware
- Case Study: BIOS Settings Reset After CMOS (Case Study, L1) — Firmware / BIOS / UEFI, Server Hardware
- Case Study: Cable Management Wrong Port (Case Study, L1) — Rack & Stack, Server Hardware
- Case Study: Database Replication Lag — Root Cause Is RAID Degradation (Case Study, L2) — RAID, Server Hardware
- Case Study: Firmware Update Boot Loop (Case Study, L2) — Firmware / BIOS / UEFI, Server Hardware
- Case Study: Link Flaps Bad Optic (Case Study, L1) — Rack & Stack, Server Hardware
- Case Study: OS Install Fails RAID Controller (Case Study, L2) — Firmware / BIOS / UEFI, RAID
Pages that link here¶
- Anti-Primer: Datacenter
- Bare-Metal Provisioning
- Datacenter & Server Hardware
- Datacenter Operations Cheat Sheet
- Datacenter Ops Domain
- Datacenter Skillcheck
- Dell PowerEdge Servers
- Dell PowerEdge on Linux - Deep Dive Guide
- Dell Server Management
- HBA Firmware Mismatch Causing I/O Errors
- Incident Replay: BIOS Settings Reverted After CMOS Battery Replacement
- Incident Replay: BMC Clock Skew Causes Certificate Failure
- Incident Replay: Cable Plugged Into Wrong Port
- Incident Replay: HBA Firmware Mismatch
- Incident Replay: Link Flaps from Bad Optic