# Datacenter & Server Hardware - Street Ops
What experienced infrastructure engineers know that textbooks don't teach.
## Incident Runbooks
### Server Won't Boot
1. Check iDRAC/iLO - is the server powered on?
- If iDRAC unreachable: check network cable to dedicated iDRAC port
- If iDRAC shows "off": power cycle via iDRAC or physical button
2. Check POST codes / LCD panel
- Memory training failure: reseat DIMMs, check for amber LED on DIMM slot
- RAID controller init hang: check disk LEDs, look for degraded/failed array
- PXE timeout: network config or DHCP issue
3. Check the system event log (SEL) via iDRAC:

   ```shell
   racadm getsel           # Dell
   hpasmcli -s show sel    # HPE
   ```
4. Common fixes:
- Reseat RAM / PCIe cards
- Clear CMOS / reset BIOS to defaults
- Replace CMOS battery (common culprit when a server has sat powered off for months)
- Check PSU LEDs: both should be solid green
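When the SEL is long, a quick filter helps surface the hardware events that actually explain a no-boot. A minimal sketch, assuming a saved dump in the pipe-separated style `ipmitool sel elist` typically emits; field layout varies by BMC vendor, so treat the pattern as a starting point:

```shell
# Surface the SEL events that most often explain a no-boot:
# memory/ECC, power, voltage, and temperature events.
# Input: path to a saved SEL dump (e.g. ipmitool sel elist > sel.txt).
sel_triage() {
    grep -iE 'memory|ecc|power|voltage|temperature' "$1"
}
```

Point it at your SEL backup file, or pipe live output through it with `sel_triage /dev/stdin`.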
### RAID Degraded
1. Identify the failed disk:

   ```shell
   perccli /c0 show     # Dell PERC
   storcli /c0 show     # LSI/Broadcom
   # Look for "DG/VD" state: Dgrd = degraded
   ```

2. Find the physical disk:

   ```shell
   perccli /c0/e252/s0 show all       # Check each slot
   # LED blink to identify:
   perccli /c0/e252/s3 start locate
   ```

3. Check if a hot spare kicked in:

   ```shell
   perccli /c0 show     # Look for "GHS" (Global Hot Spare)
   ```
4. DO NOT:
- Reboot during rebuild (restarts the rebuild)
- Pull a second disk "to check" (kills the array)
- Ignore a degraded alert (another failure = data loss)
> **War story:** An engineer saw a RAID-5 degraded alert and pulled a second disk to "reseat it." The array collapsed instantly — RAID-5 tolerates exactly one drive failure. The data was unrecoverable. Rule: never touch a second disk in a degraded array. RAID-5 with large drives (>2TB) is especially risky because the rebuild time (hours to days) increases the window for a second failure. RAID-6 or RAID-10 are strongly preferred for drives over 2TB.
5. After replacement:
- Verify rebuild started automatically
- Monitor rebuild progress: `perccli /c0/v0 show`
- Typical rebuild time: 4-24 hours depending on array size
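For a rebuild measured in hours or days, poll progress from cron rather than eyeballing the CLI. A sketch that pulls the first percentage figure out of controller output; the exact perccli wording is an assumption (it varies by version), so check it against your controller before relying on it:

```shell
# Extract the rebuild percentage from controller output on stdin, so a
# cron job can alert when a rebuild stalls. Prints the first "NN%" figure
# without the sign. Exact perccli output wording is an assumption.
rebuild_pct() {
    grep -oE '[0-9]+%' | head -n 1 | tr -d '%'
}

# Example (hypothetical output line):
#   perccli /c0/e252/s3 show rebuild | rebuild_pct
```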
### NIC Flapping
1. Check link status:

   ```shell
   ethtool eth0            # Look for "Link detected: yes/no"
   ip -s link show eth0    # Check error counters
   dmesg | grep -i eth0    # Link up/down messages
   ```

2. Check for errors:

   ```shell
   ethtool -S eth0 | grep -i err    # NIC error counters
   # rx_crc_errors: bad cable or port
   # rx_missed_errors: NIC can't keep up (check ring buffer)
   ```
3. Common causes:
- Bad cable (swap it; most common cause)
- Bad SFP/transceiver (swap it)
- Speed/duplex mismatch (`ethtool -s eth0 speed 1000 duplex full`)
- Switch port issue (check switch logs)
4. Nuclear option:

   ```shell
   ethtool -r eth0    # Restart autonegotiation
   ip link set eth0 down && ip link set eth0 up
   ```
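To tell an occasional blip from a genuinely flapping link, count the down events over time. A sketch that greps saved `dmesg` output; the "Link is Down" wording matches common Intel/Broadcom drivers but not all of them, so the pattern is an assumption:

```shell
# Count link-down events for one interface from dmesg text on stdin.
# "Link is Down" is what e1000e/ixgbe-style drivers log; other drivers
# word it differently, so adjust the pattern for your hardware.
count_flaps() {
    grep -ic "$1.*Link is Down"
}

# Example:
#   dmesg | count_flaps eth0
```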
### Kernel Panic
1. Check iDRAC console for panic message (often scrolls past on physical console)
2. Check /var/log/journal (if persistent journal enabled):

   ```shell
   journalctl -b -1       # Previous boot
   journalctl -k -b -1    # Kernel messages from previous boot
   ```
3. Common causes:
- Bad RAM: run memtest86+ (boot from USB)
- Disk I/O errors: check dmesg for "I/O error" or "Buffer I/O error"
- Driver issues: recent kernel update? Check if rollback fixes it
- OOM killer went nuclear: check /var/log/messages for "Out of memory"
4. Enable kdump for future panics:

   ```shell
   systemctl enable kdump
   # Captures a crash dump for analysis on the next panic
   ```
Remember the kernel-panic diagnosis mnemonic **R-D-D-O**: RAM (memtest86+), Disk I/O errors (dmesg), Drivers (recent kernel update?), OOM killer (check /var/log/messages). Work through them in that order; on production servers, hardware faults are the most common root cause of kernel panics.
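kdump silently does nothing if the kernel never reserved crash memory at boot, so verify the `crashkernel=` parameter as well as the service. A minimal check, written against a file argument so it can be pointed at `/proc/cmdline`:

```shell
# Succeed only if the kernel command line reserves crash memory.
# On a live system: has_crashkernel /proc/cmdline
has_crashkernel() {
    grep -q 'crashkernel=' "$1"
}
```

Pair it with `systemctl is-active kdump`; both must pass before you can trust that the next panic will leave a dump behind.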
### Fans Screaming
1. High fan speed usually means:
- High CPU/inlet temp (check iDRAC thermal readings)
- Non-Dell/HP drive in a certified slot (firmware doesn't recognize it, so it ramps fans as a precaution)
- Missing blanking panel (hot air recirculation)
- Failed fan (other fans compensate at higher speed)
2. Check temps:

   ```shell
   ipmitool sensor list | grep -i temp
   # Or via iDRAC web UI: System > Temperatures
   ```

3. Check fan status:

   ```shell
   ipmitool sensor list | grep -i fan
   # Fans showing 0 RPM have failed
   ```
4. Common fixes:
- Clean dust from intake (compressed air)
- Replace blanking panels in empty bays
- For non-certified drives: some vendors have IPMI commands to override fan policy
- Replace failed fan module (most are hot-swap)
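Picking the failed fan out of a long sensor list is easier with a filter. A sketch assuming `ipmitool sensor` pipe-separated output with the sensor name in field 1 and the reading in field 2; the layout varies by BMC, so verify against yours first:

```shell
# Print the names of fans reading 0 RPM (failed) from ipmitool sensor
# output on stdin. Assumes pipe-separated fields: name | reading | ...
dead_fans() {
    awk -F'|' 'tolower($1) ~ /fan/ && $2 ~ /^[[:space:]]*0(\.0*)?[[:space:]]*$/ {print $1}'
}

# Example:
#   ipmitool sensor list | dead_fans
```

Note that fans reporting "na" (not populated) are skipped rather than flagged, which is usually what you want in a chassis with empty fan bays.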
## Gotchas & War Stories
**Firmware version skew kills clusters** - If you have 20 identical servers but 5 different BIOS versions, you'll get inconsistent behavior under load. Standardize firmware before deploying workloads.
**The cable you didn't label** - You'll spend 2 hours tracing a cable that would have taken 5 seconds to read a label. Label both ends. Use a consistent scheme: RACK-U-PORT (e.g., R04-U22-P1).
**SMART doesn't always warn you** - SSDs can fail without SMART warnings (firmware bug, capacitor failure). Don't trust SMART alone; monitor for I/O latency spikes.
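Since SMART alone can miss a dying SSD, watching average I/O latency from `iostat -x` catches the spikes a failing drive produces. A sketch that locates the `await` column from the header rather than hardcoding its position; note that newer sysstat versions split it into `r_await`/`w_await`, so adjust the column name for your version:

```shell
# Flag devices whose average I/O latency exceeds a threshold (ms), from
# `iostat -x` output on stdin. Resolves the await column from the header
# line so it survives layout differences between sysstat versions.
high_await() {
    awk -v t="$1" '
        $1 == "Device" { for (i = 1; i <= NF; i++) if ($i == "await") c = i; next }
        c && NF >= c && $c + 0 > t { print $1, $c }
    '
}

# Example: alert on anything averaging over 50 ms
#   iostat -x 5 2 | high_await 50
```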
**iDRAC default passwords** - Dell ships iDRAC with root/calvin. Change it. Use an Ansible playbook to set credentials across your fleet.
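Before you have Ansible in place, the fleet-wide change can be scripted with remote racadm. A sketch only: the attribute path `iDRAC.Users.2.Password` applies to iDRAC7 and later (older generations use `racadm config -g cfgUserAdmin`), and passing passwords on the command line is itself a weakness, so treat this as a bootstrap step and move to a proper secrets workflow afterwards:

```shell
# Rotate the root password on every iDRAC listed in a hosts file
# (one IP per line). DRY_RUN=1 prints the commands instead of running
# them, so you can sanity-check before touching the fleet.
# Attribute path assumes iDRAC7 or later.
rotate_idrac() {
    hosts_file=$1
    new_pass=$2
    while IFS= read -r host; do
        set -- racadm -r "$host" -u root -p calvin \
            set iDRAC.Users.2.Password "$new_pass"
        if [ "${DRY_RUN:-0}" = "1" ]; then
            echo "$*"
        else
            "$@"
        fi
    done < "$hosts_file"
}
```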
**PXE boot order matters** - If you put PXE first in the boot order, a server that loses its OS disk will happily reinstall itself from your provisioning server. Some people want this; some don't. Know which camp you're in.
**Disk predictive failure != imminent failure** - A SMART predictive failure alert means "this disk will probably fail eventually." That could be weeks or months away. Don't panic, but do schedule replacement during the next maintenance window.
**Under the hood:** Dell iDRAC and HP iLO System Event Logs (SEL) have a fixed-size buffer (typically 512-2048 entries). Once full, new events are silently dropped until you clear the log. Export the SEL to a file (`ipmitool sel elist > /var/log/sel-backup.log`) before clearing. Best practice: ship SEL entries to your central syslog via SNMP traps or Redfish event subscriptions so you never lose hardware events.
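Clearing on every run throws away history; clearing only near capacity keeps the buffer useful between exports. A sketch that parses "Percent Used" from `ipmitool sel` info output; the field name is an assumption, since some BMCs report free space in bytes instead:

```shell
# Print the SEL fill percentage from `ipmitool sel` info output on stdin,
# or 0 if the field is absent. Drive backup-and-clear from it, e.g.:
#   [ "$(ipmitool sel | sel_pct_used)" -ge 80 ] && \
#       ipmitool sel elist > /var/log/sel-backup.log && ipmitool sel clear
sel_pct_used() {
    awk -F: '/Percent Used/ {gsub(/[ %]/, "", $2); print $2; found = 1}
             END {if (!found) print 0}'
}
```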
## Troubleshooting Commands Cheatsheet
```shell
# Hardware inventory
lshw -short                    # Full hardware summary
dmidecode -t system            # System manufacturer, model, serial
dmidecode -t memory            # DIMM layout, speeds, sizes
lspci                          # PCIe devices
lsblk                          # Block device tree

# Disk health
smartctl -a /dev/sda           # Full SMART report
smartctl -H /dev/sda           # Quick health check
smartctl -t short /dev/sda     # Run short self-test

# Storage controller
perccli /c0 show               # Dell PERC status
storcli /c0 show               # LSI controller status
megacli -LDInfo -Lall -aALL    # MegaRAID status

# Network
ethtool eth0                   # NIC settings, link status
ethtool -i eth0                # Driver info
ethtool -S eth0                # NIC statistics/counters
ip -s link show eth0           # Interface stats

# System logs
journalctl -p err -b           # Errors since boot
journalctl -k                  # Kernel messages
dmesg -T                       # Kernel ring buffer (human timestamps)
mcelog --client                # Machine check exceptions (hardware errors)

# IPMI (out-of-band from another machine)
ipmitool -H <iDRAC-IP> -U root -P <pass> chassis status
ipmitool -H <iDRAC-IP> -U root -P <pass> sel list       # System event log
ipmitool -H <iDRAC-IP> -U root -P <pass> sensor list    # All sensors

# Dell specific
racadm getconfig -g cfgServerInfo    # Server info
racadm getsel                        # System event log
racadm getsysinfo                    # System summary
```
## Quick Reference
- Cheatsheet: Datacenter