Datacenter Advanced Operations

Runbook-style guide for PXE triage, firmware management, RAID realities, and hardware failure patterns.

PXE Boot Failure Triage

PXE boot is a chain: DHCP -> TFTP -> bootloader -> kernel + initrd -> OS installer. Each link can break independently.

Step 1: DHCP

The server sends a DHCPDISCOVER broadcast. The DHCP server must respond with an IP address AND the next-server (TFTP) address plus the boot filename.

# On the DHCP server, check for the client's request
journalctl -u isc-dhcp-server --since "10 minutes ago" | grep <MAC>
# Or for dnsmasq:
journalctl -u dnsmasq --since "10 minutes ago" | grep <MAC>

# No entry? The broadcast is not reaching the server.
# Check:
# - Server and client on the same VLAN (or DHCP relay configured on the switch)
# - ip helper-address on the switch/router pointing to DHCP server
# - DHCP server listening on the correct interface

# Verify DHCP config includes PXE options
# ISC DHCP example -- serve the right boot file per client firmware
# (option 93 carries the client architecture):
#   option arch code 93 = unsigned integer 16;
#   next-server 10.0.1.50;            # TFTP server IP
#   if option arch = 00:07 {
#     filename "grubx64.efi";         # UEFI x64 boot
#   } else {
#     filename "pxelinux.0";          # BIOS boot
#   }
# Capture DHCP traffic to confirm offers are sent
tcpdump -i eth0 -nn port 67 or port 68

# Common failures:
# - DHCP pool exhausted (no available IPs)
# - MAC address filtering active (client MAC not whitelisted)
# - UEFI vs BIOS boot: wrong filename served (pxelinux.0 vs grubx64.efi)
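
The UEFI-vs-BIOS filename mismatch is easiest to avoid by keying the boot file on DHCP option 93 (client architecture). A dnsmasq sketch, assuming a TFTP server at 10.0.1.50 and the standard file names from this guide:

```
# dnsmasq: tag UEFI x86-64 clients by option 93, serve grubx64.efi to them,
# and fall back to pxelinux.0 for everyone else (BIOS)
dhcp-match=set:efi-x86_64,option:client-arch,7
dhcp-boot=tag:efi-x86_64,grubx64.efi,,10.0.1.50
dhcp-boot=pxelinux.0,,10.0.1.50
```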

Step 2: TFTP

After receiving the DHCP offer, the server downloads the bootloader via TFTP.

# Test TFTP from another machine
tftp 10.0.1.50 -c get pxelinux.0
# If it fails: TFTP server is down or file is missing

# Check TFTP service
systemctl status tftpd-hpa
# Or: systemctl status atftpd

# Verify the boot file exists
ls -la /var/lib/tftpboot/pxelinux.0
ls -la /var/lib/tftpboot/grubx64.efi

# Check permissions (TFTP runs as nobody/tftp user)
# Files must be world-readable

# Capture TFTP traffic
tcpdump -i eth0 -nn port 69
# You should see RRQ (read request) followed by DATA packets
# If you see ERROR packets, the file is missing or permissions are wrong

# Firewall check
iptables -L -n | grep 69
# TFTP uses UDP 69 only for the initial request; the server replies from an
# ephemeral high port, so stateful firewalls need the TFTP conntrack helper:
modprobe nf_conntrack_tftp

Step 3: Bootloader and Kernel

Once the bootloader loads, it reads its config to find the kernel and initrd paths.

# For pxelinux: check config in /var/lib/tftpboot/pxelinux.cfg/
# Files are checked in this order (first match wins):
#   <UUID>             (client UUID, if the firmware supplies one)
#   01-<MAC>           (e.g., 01-aa-bb-cc-dd-ee-ff)
#   <HEX-IP>           (e.g., 0A000164 for 10.0.1.100, then progressively
#                       shorter prefixes: 0A00016, 0A0001, ..., 0A)
#   default            (fallback)
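
The naming rules above can be mirrored with a couple of tiny helpers (hypothetical names; pxelinux derives these filenames itself on the client) so you can predict which config file a given host will request:

```shell
# Hypothetical helpers that mirror pxelinux.cfg naming rules.

# 01-<MAC>: lowercase the MAC and swap ':' for '-'
mac_to_cfg() {
  echo "01-$(echo "$1" | tr 'A-F:' 'a-f-')"
}

# <HEX-IP>: each IPv4 octet as two uppercase hex digits
ip_to_cfg() {
  # word-splitting the octets into printf arguments is intentional
  printf '%02X%02X%02X%02X\n' $(echo "$1" | tr '.' ' ')
}

mac_to_cfg AA:BB:CC:DD:EE:FF   # -> 01-aa-bb-cc-dd-ee-ff
ip_to_cfg 10.0.1.100           # -> 0A000164
```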

# For GRUB: check /var/lib/tftpboot/grub/grub.cfg

# Example pxelinux config:
# DEFAULT linux
# LABEL linux
#   KERNEL vmlinuz
#   APPEND initrd=initrd.img ip=dhcp inst.repo=http://10.0.1.50/centos9

# Verify kernel and initrd are present
ls -la /var/lib/tftpboot/vmlinuz
ls -la /var/lib/tftpboot/initrd.img

# If the kernel loads but the installer fails:
# - Check the kickstart/preseed URL is reachable from the target network
# - Check HTTP server logs: access_log should show the target requesting files
curl -I http://10.0.1.50/centos9/  # from the target's network

Firmware Update Gotchas

Update Order Matters

Firmware components have dependencies. Update in the wrong order and you brick hardware.

# General safe order:
# 1. BMC/iDRAC/iLO firmware (management controller)
# 2. BIOS/UEFI
# 3. Storage controller (PERC, MegaRAID)
# 4. NIC firmware
# 5. Disk firmware (last -- most dangerous)

# Dell example using racadm:
# Check current versions
racadm getversion

# Update BMC first (via racadm remote)
racadm -r <iDRAC-IP> -u root -p <pass> update -f firmimg.d9 -l /tmp/

# Update BIOS (requires reboot)
racadm -r <iDRAC-IP> -u root -p <pass> jobqueue create BIOS.Setup.1-1
# Schedule at next reboot, do not cold-reboot during BIOS flash

Critical Rules

  • Never power-cycle during a firmware flash. Connect to UPS. If power drops during BIOS update, the server is a paperweight.
  • BMC reset after BMC update: The management controller will reboot itself. Wait 3-5 minutes before attempting to reconnect. Do not panic.
  • Stage firmware, do not apply live: Upload firmware and schedule installation for the next maintenance window. Do not apply during production hours.
  • Batch consistency: If you have 50 identical servers, they must all run the same firmware matrix. Version skew causes inconsistent behavior under load. Use a firmware compliance tool (Dell Repository Manager, HPE SUM).
# HPE: Smart Update Manager (SUM) for batch updates
hpsum /s /use_latest /allow_non_bundle_components

# Dell: Dell System Update (DSU)
dsu --inventory     # Show current firmware
dsu --apply-upgrades --auto-reboot   # Apply all available updates
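
To catch version skew across a batch, you can also diff inventories as plain text. A minimal sketch (hypothetical helper; feed it `host component version` lines collected from each node, e.g. via ssh and `dsu --inventory`):

```shell
# Hypothetical skew detector: the first host seen for each component sets
# the baseline; any host reporting a different version is flagged.
detect_skew() {
  awk '!($2 in base) { base[$2] = $3; next }
       base[$2] != $3 { print "SKEW:", $1, $2, $3, "(baseline", base[$2] ")" }'
}

printf 'node01 BIOS 2.19.1\nnode02 BIOS 2.17.0\n' | detect_skew
# -> SKEW: node02 BIOS 2.17.0 (baseline 2.19.1)
```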

RAID Controller Realities

BBU/Cache Behavior

RAID controllers use a battery backup unit (BBU) or flash-backed cache (FBWC/CacheVault) to make write-back caching safe. When the battery dies, the controller drops to write-through and performance craters.

# Check BBU/cache status
perccli /c0 show all | grep -i "cache\|bbu\|battery"
storcli /c0/bbu show

# States you care about:
# - Optimal/Ready: battery is healthy, write-back enabled
# - Learning: battery is recalibrating (periodic, normal, but writes go write-through)
# - Failed/Missing: NO battery, controller forces write-through mode

# Write-back vs write-through:
# Write-back: controller acknowledges write after writing to cache (fast, needs battery)
# Write-through: controller acknowledges write after writing to disk (slow, safe)
# Performance difference: 2-10x for random writes

# Check current write policy
perccli /c0/v0 show | grep "Cache"
# WB = write-back (good), WT = write-through (safe but slow)

# If BBU failed and you MUST have write-back (accepting risk):
# Force write-back without battery (data loss risk on power failure)
perccli /c0/v0 set wrcache=AWB   # "Always Write Back" -- DANGEROUS

Patrol Reads

Patrol reads proactively scan all disks for bad sectors. They run in the background and cause slight IO overhead.

# Check patrol read status
perccli /c0 show patrolread

# If patrol read is causing IO impact during business hours:
perccli /c0 set patrolread mode=manual
# Run patrol reads during maintenance windows instead:
perccli /c0 start patrolread

Hot Spare Behavior and Rebuild Monitoring

Types of Hot Spares

  • Dedicated: Assigned to a specific virtual disk. Rebuilds only that VD.
  • Global: Available to any virtual disk on the controller. First-come, first-served.
# List hot spares
perccli /c0 show | grep "GHS\|DHS"
# GHS = Global Hot Spare, DHS = Dedicated Hot Spare

# Assign a global hot spare
perccli /c0/e252/s7 add hotsparedrive

# When a disk fails:
# 1. Controller detects failure (seconds)
# 2. Hot spare is activated (automatic, unless manual rebuild is configured)
# 3. Rebuild starts immediately
# Monitor progress (rebuild is tracked on the physical drive, not the VD):
perccli /c0/e252/s7 show rebuild
storcli /c0/e252/s7 show rebuild

# Rebuild times depend on array size and IO load:
# 1TB drive, no load: ~2-4 hours
# 4TB drive, under load: 12-24 hours
# 8TB drive, heavy IO: 24-48 hours
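
Those ranges can be sanity-checked with back-of-envelope arithmetic: a rebuild must rewrite the entire replacement disk, so capacity divided by effective write rate gives a lower bound. A rough sketch (hypothetical helper; real rebuilds are throttled by the controller's rebuild-rate setting and competing IO):

```shell
# Lower-bound rebuild estimate: whole-disk rewrite at a sustained rate.
rebuild_hours() {  # args: capacity in TB, effective write rate in MB/s
  LC_ALL=C awk -v tb="$1" -v mbps="$2" \
      'BEGIN { printf "%.1f\n", (tb * 1000 * 1000) / mbps / 3600 }'
}

rebuild_hours 4 100   # 4 TB at 100 MB/s -> 11.1 hours, best case
```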

# CRITICAL: During rebuild, the array is degraded
# Another disk failure = DATA LOSS (for RAID 5)
# This is why RAID 6 or RAID 10 is preferred for large drives

# After rebuild completes, replace the failed disk
# Unless copyback ("replace member") is enabled, the old hot spare stays
# a regular array member -- install a new hot spare to restore redundancy
perccli /c0/e252/s3 add hotsparedrive

DIMM Error Patterns

The ECC Progression

Modern servers use ECC (Error-Correcting Code) memory. Single-bit errors are corrected silently; multi-bit errors are uncorrectable and typically crash the system with a machine check exception.

# Check for memory errors
edac-util -s                    # EDAC summary
edac-util -l                    # Per-DIMM error counts
# Or:
mcelog --client                 # Machine check exception log

# Check iDRAC/iLO system event log for memory events
racadm getsel | grep -i "memory\|dimm\|ecc"
ipmitool sel list | grep -i "memory\|correctable"

# Check kernel messages
dmesg | grep -i "edac\|mce\|memory\|ecc"
journalctl -k | grep -i "hardware error"

The Failure Pattern

Correctable ECC errors (CEs) are the canary in the coal mine:

Phase 1: Occasional CEs (1-2 per week) -- normal background rate for some DIMMs
Phase 2: Increasing CEs (multiple per day) -- DIMM is degrading, schedule replacement
Phase 3: CE storm (hundreds per hour) -- failure is imminent, replace NOW
Phase 4: Uncorrectable Error (UE) -- system crashes, data corruption possible
# Set up monitoring thresholds
# Alert at: > 10 correctable errors per hour on a single DIMM
# Page at: > 100 correctable errors per hour (replace immediately)
# Any uncorrectable error: emergency replacement
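
The thresholds above translate directly into an alerting rule. A minimal sketch (hypothetical function; wire it to whatever pulls per-DIMM hourly counts out of edac-util):

```shell
# Classify an hourly correctable-error count per the thresholds above.
ce_severity() {  # arg: CEs in the last hour on a single DIMM
  if   [ "$1" -gt 100 ]; then echo "page"    # replace immediately
  elif [ "$1" -gt 10  ]; then echo "alert"   # schedule replacement
  else                        echo "ok"      # background rate
  fi
}

ce_severity 250   # -> page
ce_severity 3     # -> ok
```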

# Identify the failing DIMM slot
edac-util -l
# Shows which memory controller and DIMM slot has errors
# Cross-reference with dmidecode:
dmidecode -t memory | grep -A 5 "Locator"
# Maps slot labels (A1, B3, etc.) to physical positions

# Some vendors support online DIMM sparing:
# The system can remap a failing DIMM to a spare rank without downtime
# Check if available:
dmidecode -t memory | grep -i "spare\|mirror"

After DIMM Replacement

# After replacing the DIMM:
# 1. Server will re-train memory (adds 1-3 minutes to POST)
# 2. Verify the new DIMM is detected
dmidecode -t memory | grep -A 10 "Locator: A3"  # replaced slot
# 3. Clear EDAC counters
echo 0 > /sys/devices/system/edac/mc/mc0/reset_counters
# 4. Clear system event log
racadm clrsel
# 5. Monitor for 24 hours to confirm no errors on the new DIMM