Portal | Level: L1: Foundations | Topics: Rack & Stack, Out-of-Band Management, RAID, Firmware / BIOS / UEFI | Domain: Datacenter & Hardware

Datacenter & Server Hardware Ops - Primer

Why This Matters

Every cloud instance runs on physical hardware in a datacenter. When you manage on-prem infrastructure, troubleshoot cloud performance, or design for reliability, understanding the physical layer gives you an edge most DevOps engineers lack.

Core Concepts

The Rack

A standard datacenter rack is 42U (rack units) tall. One U = 1.75 inches. A 1U server is the thinnest; a 2U or 4U server has more drive bays, PCIe slots, and cooling capacity.

Fun fact: The 42U standard comes from the EIA-310 specification (Electronic Industries Alliance). The 19-inch width dates back to 1920s telephone relay racks. The same physical format has been used for over 100 years — from telephone switches to AI GPU clusters.

Key components in a rack:
  • Servers (compute) - 1U-4U each
  • Network switches - Top-of-Rack (ToR) or End-of-Row (EoR)
  • Patch panels - structured cabling termination
  • PDUs (Power Distribution Units) - A-feed and B-feed for redundancy
  • Cable management - vertical and horizontal organizers

Out-of-Band Management

Name origin: "Out-of-band" (OOB) means the management channel is separate from the primary data path. The term comes from telecommunications, where signaling (control) is carried on different frequencies or channels than voice (data). In datacenter context, the BMC's dedicated network port is "out of band" from the server's production NICs — so you can manage a server even when its OS, production network, or primary NIC is completely dead.

Every enterprise server has a dedicated management interface that works even when the OS is dead:
  • Dell: iDRAC (Integrated Dell Remote Access Controller)
  • HP/HPE: iLO (Integrated Lights-Out)
  • Supermicro: IPMI/BMC
  • Generic: IPMI (Intelligent Platform Management Interface)

These provide: remote console, power control, hardware monitoring, firmware updates, virtual media mounting.

BIOS/UEFI

The firmware layer between hardware and OS. Key settings:
  • Boot order - UEFI vs Legacy, PXE boot position
  • Performance profiles - max performance vs power saving
  • Virtualization - VT-x/VT-d, SR-IOV for NIC passthrough
  • Security - TPM, Secure Boot
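Whether the virtualization settings actually took effect can be spot-checked from a booted Linux host, since the kernel exposes the CPU feature flags. A minimal sketch, assuming a cpuinfo-format file; `check_virt_flags` is an illustrative helper, not a standard tool:

```shell
#!/bin/sh
# check_virt_flags: scan a /proc/cpuinfo-style file for the vmx (Intel VT-x)
# or svm (AMD-V) flags. If neither appears, virtualization is likely
# disabled in BIOS/UEFI.
check_virt_flags() {
  if grep -qE '(^|[[:space:]])(vmx|svm)([[:space:]]|$)' "$1"; then
    echo "hardware virtualization: enabled"
  else
    echo "hardware virtualization: disabled (check BIOS/UEFI)"
  fi
}

# On a live host you would point it at the real file:
# check_virt_flags /proc/cpuinfo
```

Note that a missing flag can also mean the CPU lacks the feature entirely; the BIOS setting is only one of the two possibilities.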

Storage: RAID

RAID (Redundant Array of Independent Disks) provides redundancy and/or performance:

Level    Min Disks  Redundancy         Use Case
RAID 0   2          None (striping)    Temp data, scratch space
RAID 1   2          Mirror             OS drives, boot volumes
RAID 5   3          1 disk failure     General purpose (read-heavy)
RAID 6   4          2 disk failures    Large arrays, safety margin
RAID 10  4          1 per mirror pair  Databases, write-heavy
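The capacity cost of each level follows directly from the table: striping keeps everything, mirroring halves it, and parity sacrifices one or two disks' worth. A minimal sketch; `raid_usable` is an illustrative helper, not a vendor tool:

```shell
#!/bin/sh
# raid_usable: usable capacity for a RAID array.
#   $1 = RAID level (0, 1, 5, 6, 10)  $2 = number of disks  $3 = TB per disk
raid_usable() {
  level=$1; n=$2; size=$3
  case "$level" in
    0)  echo $(( n * size )) ;;        # striping: all capacity, no redundancy
    1)  echo "$size" ;;                # mirror: one disk's worth
    5)  echo $(( (n - 1) * size )) ;;  # one disk of parity
    6)  echo $(( (n - 2) * size )) ;;  # two disks of parity
    10) echo $(( n / 2 * size )) ;;    # half the disks hold mirror copies
    *)  echo "unknown RAID level" >&2; return 1 ;;
  esac
}

raid_usable 5 4 8    # 4x 8TB in RAID 5  -> 24
raid_usable 10 4 8   # 4x 8TB in RAID 10 -> 16
```

The comparison makes the RAID 10 trade-off concrete: you give up a third of the RAID 5 capacity in exchange for better write performance and per-pair redundancy.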

Controllers: Dell PERC, HP Smart Array, LSI MegaRAID. CLI tools: perccli, storcli, megacli.

Remember: RAID level mnemonic: "0 is Zero protection, 1 is a Mirror, 5 has One parity, 6 has Two parities, 10 is 1+0 (mirrors then stripes)." For production databases, RAID 10 is the gold standard — best write performance with per-pair redundancy. RAID 5 is common for read-heavy workloads, but rebuild times on large disks (8TB+) are dangerous — a second disk failure during rebuild means total data loss.

Gotcha: RAID is not a backup. RAID protects against hardware disk failure. It does NOT protect against accidental deletion, corruption, ransomware, or controller failure. A deleted file on RAID 10 is deleted across all mirrors simultaneously. Always combine RAID with backups.

Disk Health: SMART

Self-Monitoring, Analysis and Reporting Technology. Every disk tracks its own health metrics:
  • Reallocated Sector Count - bad sectors remapped; rising count = failing disk
  • Current Pending Sector - sectors waiting to be remapped
  • Temperature - sustained high temps reduce lifespan
  • Power-On Hours - total runtime

Tool: smartctl -a /dev/sda
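For monitoring, the attribute that matters most is the raw Reallocated Sector Count, which `smartctl -A` prints as the last column of the attribute table. A minimal parsing sketch, assuming the ATA attribute-table output format; `reallocated_raw` is an illustrative helper:

```shell
#!/bin/sh
# reallocated_raw: extract the raw Reallocated_Sector_Ct value from
# `smartctl -A` output on stdin (the raw value is the last field).
reallocated_raw() {
  awk '/Reallocated_Sector_Ct/ { print $NF }'
}

# Sample line in the format smartctl -A prints for ATA disks:
sample='  5 Reallocated_Sector_Ct   0x0033   100   100   010    Pre-fail  Always       -       12'
count=$(printf '%s\n' "$sample" | reallocated_raw)
if [ "$count" -gt 0 ]; then
  echo "WARNING: $count reallocated sectors - plan replacement"
fi
```

On a live host the pipeline would be `smartctl -A /dev/sda | reallocated_raw`; a nonzero and rising count is the classic early-failure signal.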

Provisioning Pipeline

How a bare-metal server goes from "box on the dock" to "running in production":

Rack & cable -> BIOS config -> iDRAC setup -> PXE boot -> OS install
-> Ansible config -> Join cluster -> Workload scheduling

Key technologies: PXE, DHCP, TFTP, kickstart/preseed/cloud-init, MAAS, Foreman, Ironic.

Name origin: PXE stands for Preboot eXecution Environment (pronounced "pixie"). It was developed by Intel in 1999 as part of the Wired for Management specification. The NIC firmware contains a PXE ROM that acts as a mini DHCP client and TFTP downloader — enough to bootstrap a full OS installer over the network without any local storage.

Power & Cooling

  • Redundant power: A+B feeds from separate circuits/UPS
  • PDU types: basic, metered (shows per-outlet draw), switched (remote on/off), ATS (automatic transfer switch)
  • Cooling: hot aisle / cold aisle containment, blanking panels prevent recirculation
  • Monitoring: inlet temp sensors, ASHRAE A1 recommended: 18-27C (64-80F)
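The ASHRAE A1 range above lends itself to a trivial threshold check on inlet readings. A minimal sketch; `inlet_ok` is an illustrative helper, and in practice the reading would come from `ipmitool sensor list` or the BMC:

```shell
#!/bin/sh
# inlet_ok: compare an inlet temperature (integer Celsius) against the
# ASHRAE A1 recommended range of 18-27 C.
inlet_ok() {
  t=$1
  if [ "$t" -ge 18 ] && [ "$t" -le 27 ]; then
    echo "inlet ${t}C: within ASHRAE A1 recommended range"
  else
    echo "inlet ${t}C: OUT of recommended range - investigate"
  fi
}

inlet_ok 24   # typical cold-aisle reading
inlet_ok 31   # hot-aisle air recirculating?
```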

What Experienced People Know

  • A single loose cable can take down a production service. Label everything.
  • iDRAC/iLO saves you a 2am datacenter drive. Always verify OOB access works before you need it.
  • RAID rebuild on a degraded array with large disks (8TB+) can take 12-24 hours. During rebuild, another disk failure means data loss.
  • "It's a hardware problem" is often a firmware problem. Check firmware versions first.
  • UPS battery replacement is scheduled maintenance; missing it is an outage waiting to happen.
  • The BMC is always on. A "powered off" server still has an active network endpoint. Plan your security model around this.
  • PXE is fragile. DHCP races, TFTP timeouts, and UEFI vs BIOS mismatches cause most provisioning failures.
  • Serial console is your last resort before driving to the datacenter. Always configure it.
  • IPMI over LAN uses UDP 623 and has had multiple remote code execution CVEs. Treat the OOB network as a high-security zone.
  • Kickstart %post failures are silent by default. Always add set -e and logging to your post scripts.
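The last bullet, about silent %post failures, looks like this in practice. A hedged fragment: `--log` is a real kickstart option, while the commands inside are placeholders for your own post-install steps:

```
%post --log=/root/ks-post.log
set -eux                     # exit on first error, echo each command to the log
echo "post-install started at $(date)"
systemctl enable sshd
%end
```

With `set -eux` plus `--log`, a failed step aborts the section and leaves a trace in /root/ks-post.log instead of silently producing a half-configured host.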

OOB Management Deep Dive

The BMC — A Computer Inside Your Computer

Every modern server has a Baseboard Management Controller: a small, always-on computer with its own NIC, IP, OS, and web interface. It runs even when the host is powered off.

Vendor      Name             CLI Tool
Generic     IPMI (standard)  ipmitool
HP/HPE      iLO              hpilo, web UI
Dell        iDRAC            racadm, web UI
Supermicro  IPMI/BMC         ipmitool
Lenovo      XClarity / IMM   OneCLI, web UI

What the BMC can do: Power on/off/cycle/reset, remote console (KVM over IP), mount virtual media (ISO over network), read hardware sensors, access BIOS/UEFI settings, view/clear system event logs (SEL).

# Power status
ipmitool -I lanplus -H 10.0.1.50 -U admin -P secret chassis power status

# Power cycle
ipmitool -I lanplus -H 10.0.1.50 -U admin -P secret chassis power cycle

# Read sensor data
ipmitool -I lanplus -H 10.0.1.50 -U admin -P secret sensor list

# Set boot device to PXE for next boot only
ipmitool -I lanplus -H 10.0.1.50 -U admin -P secret chassis bootdev pxe

Serial-over-LAN (SOL)

When the BMC KVM is laggy or broken, serial console is the fallback:

ipmitool -I lanplus -H 10.0.1.50 -U admin -P secret sol activate

SOL is critical for watching boot failures, kernel panics, and GRUB problems that occur before the network stack loads.
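SOL only shows what the host actually sends to the serial port, so the bootloader and kernel must be told to use it. A hedged /etc/default/grub fragment; the baud rate and unit are typical values and should match your BMC's SOL settings:

```
# Send GRUB and kernel output to both the VGA console and ttyS0,
# so SOL captures boot failures and panics.
GRUB_TERMINAL="serial console"
GRUB_SERIAL_COMMAND="serial --unit=0 --speed=115200"
GRUB_CMDLINE_LINUX="console=tty0 console=ttyS0,115200n8"
# Then regenerate the config, e.g.: grub2-mkconfig -o /boot/grub2/grub.cfg
```

Listing `console=ttyS0` last makes the serial port the primary kernel console, which is usually what you want for lights-out boxes.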

The PXE Boot Sequence

Server powers on -> NIC sends DHCP Discover (option 60: PXEClient)
-> DHCP replies with IP + next-server (TFTP) + filename (bootloader)
-> NIC downloads bootloader via TFTP -> Bootloader downloads kernel + initrd
-> Kernel boots, installer starts -> Installer fetches kickstart/preseed via HTTP
-> OS installs to disk, reboots -> Cloud-init runs on first boot
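The DHCP/TFTP half of the sequence above can be sketched with dnsmasq, which conveniently does both. A hedged fragment; the filenames and tftp-root path are illustrative, and the client-arch matching is what handles the UEFI vs BIOS mismatches mentioned earlier:

```
enable-tftp
tftp-root=/srv/tftp
# UEFI x64 clients announce client-arch 7; tag them and hand out the
# right bootloader (BIOS clients get pxelinux, UEFI clients get GRUB).
dhcp-match=set:efi-x64,option:client-arch,7
dhcp-boot=tag:!efi-x64,pxelinux.0
dhcp-boot=tag:efi-x64,grubx64.efi
```

Serving the wrong bootloader for the firmware type is one of the most common PXE failure modes, so the arch match is worth getting right first.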

Kickstart, Preseed, and Cloud-init

These automate OS installation so no human touches a keyboard. Kickstart for RHEL/CentOS/Rocky, preseed for Debian/Ubuntu, cloud-init for post-install configuration on all distros.
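A minimal cloud-init user-data sketch for the post-install step; the user name, key, and package list are placeholders:

```
#cloud-config
# First-boot configuration applied by cloud-init after the OS install.
users:
  - name: ops
    ssh_authorized_keys:
      - ssh-ed25519 AAAA... ops@example
    sudo: ALL=(ALL) NOPASSWD:ALL
packages:
  - chrony
runcmd:
  - systemctl enable --now chronyd
```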

OOB Network Security

War story: In 2013, researcher Dan Farmer found that IPMI 2.0's RAKP authentication protocol leaks password hashes to unauthenticated attackers — a design flaw in the specification itself, not a vendor bug. This means any BMC reachable over the network can have its admin password hash extracted and cracked offline. This is why the OOB network must be physically isolated from untrusted networks. IPMI CVEs (including remote code execution) are discovered regularly; treat every BMC as a high-value attack target.

  • Dedicated OOB VLAN. Never put BMC IPs on the production network.
  • Change default credentials. IPMI ships with admin/admin or similar.
  • Disable IPMI-over-LAN if using Redfish. Reduce attack surface.
  • Patch BMC firmware. BMC vulns give full host control.

Redfish

Redfish is the modern replacement for IPMI — a RESTful API (JSON over HTTPS) supported by all major vendors. Use it for automation instead of ipmitool where possible.
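The Redfish equivalent of `chassis power status` is a GET on the Systems resource. A hedged sketch: the BMC address, credentials, and system ID are placeholders, the `/redfish/v1/Systems/<id>` path is standard Redfish, and `power_state` is an illustrative parser (in real automation you would use jq or a Redfish client library):

```shell
#!/bin/sh
# On a live BMC you would fetch the resource like this:
#   curl -sk -u admin:secret https://10.0.1.50/redfish/v1/Systems/1

# power_state: pull the "PowerState" property out of a Redfish JSON reply.
power_state() {
  sed -n 's/.*"PowerState"[[:space:]]*:[[:space:]]*"\([^"]*\)".*/\1/p'
}

# Against a canned reply:
echo '{"Id": "1", "PowerState": "On", "Model": "Example-1U"}' | power_state   # prints: On
```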

