# Datacenter & Server Hardware - Street Ops
What experienced infrastructure engineers know that textbooks don't teach.
## Incident Runbooks
### Server Won't Boot
1. Check iDRAC/iLO - is the server powered on?
- If iDRAC unreachable: check network cable to dedicated iDRAC port
- If iDRAC shows "off": power cycle via iDRAC or physical button
2. Check POST codes / LCD panel
- Memory training failure: reseat DIMMs, check for amber LED on DIMM slot
- RAID controller init hang: check disk LEDs, look for degraded/failed array
- PXE timeout: network config or DHCP issue
3. Check the system event log (SEL) via iDRAC:

   ```shell
   racadm getsel           # Dell
   hpasmcli -s show sel    # HPE
   ```
4. Common fixes:
- Reseat RAM / PCIe cards
- Clear CMOS / reset BIOS to defaults
- Replace CMOS battery (common culprit when a server has sat powered off for months)
- Check PSU LEDs: both should be solid green
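When the SEL is long, a quick filter helps surface the hardware events that actually explain a no-boot. A minimal sketch, assuming a saved dump in the pipe-separated style `ipmitool sel elist` typically emits; field layout varies by BMC vendor, so treat the pattern as a starting point:

```shell
# Surface the SEL events that most often explain a no-boot:
# memory/ECC, power, voltage, and temperature events.
# Input: path to a saved SEL dump (e.g. ipmitool sel elist > sel.txt).
sel_triage() {
    grep -iE 'memory|ecc|power|voltage|temperature' "$1"
}
```

Point it at your SEL backup file, or pipe live output through it with `sel_triage /dev/stdin`.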
### RAID Degraded
1. Identify the failed disk:

   ```shell
   perccli /c0 show     # Dell PERC
   storcli /c0 show     # LSI/Broadcom
   # Look for "DG/VD" state: Dgrd = degraded
   ```

2. Find the physical disk:

   ```shell
   perccli /c0/e252/s0 show all       # Check each slot
   # LED blink to identify:
   perccli /c0/e252/s3 start locate
   ```

3. Check if a hot spare kicked in:

   ```shell
   perccli /c0 show     # Look for "GHS" (Global Hot Spare)
   ```
4. DO NOT:
- Reboot during rebuild (restarts the rebuild)
- Pull a second disk "to check" (kills the array)
- Ignore a degraded alert (another failure = data loss)
> **War story:** An engineer saw a RAID-5 degraded alert and pulled a second disk to "reseat it." The array collapsed instantly — RAID-5 tolerates exactly one drive failure. The data was unrecoverable. Rule: never touch a second disk in a degraded array. RAID-5 with large drives (>2TB) is especially risky because the rebuild time (hours to days) increases the window for a second failure. RAID-6 or RAID-10 are strongly preferred for drives over 2TB.
5. After replacement:
- Verify rebuild started automatically
- Monitor rebuild progress: `perccli /c0/v0 show`
- Typical rebuild time: 4-24 hours depending on array size
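For a rebuild measured in hours or days, poll progress from cron rather than eyeballing the CLI. A sketch that pulls the first percentage figure out of controller output; the exact perccli wording is an assumption (it varies by version), so check it against your controller before relying on it:

```shell
# Extract the rebuild percentage from controller output on stdin, so a
# cron job can alert when a rebuild stalls. Prints the first "NN%" figure
# without the sign. Exact perccli output wording is an assumption.
rebuild_pct() {
    grep -oE '[0-9]+%' | head -n 1 | tr -d '%'
}

# Example (hypothetical output line):
#   perccli /c0/e252/s3 show rebuild | rebuild_pct
```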
### NIC Flapping
1. Check link status:

   ```shell
   ethtool eth0            # Look for "Link detected: yes/no"
   ip -s link show eth0    # Check error counters
   dmesg | grep -i eth0    # Link up/down messages
   ```

2. Check for errors:

   ```shell
   ethtool -S eth0 | grep -i err    # NIC error counters
   # rx_crc_errors: bad cable or port
   # rx_missed_errors: NIC can't keep up (check ring buffer)
   ```
3. Common causes:
- Bad cable (swap it; most common cause)
- Bad SFP/transceiver (swap it)
- Speed/duplex mismatch (`ethtool -s eth0 speed 1000 duplex full`)
- Switch port issue (check switch logs)
4. Nuclear option:

   ```shell
   ethtool -r eth0    # Restart autonegotiation
   ip link set eth0 down && ip link set eth0 up
   ```
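To tell an occasional blip from a genuinely flapping link, count the down events over time. A sketch that greps saved `dmesg` output; the "Link is Down" wording matches common Intel/Broadcom drivers but not all of them, so the pattern is an assumption:

```shell
# Count link-down events for one interface from dmesg text on stdin.
# "Link is Down" is what e1000e/ixgbe-style drivers log; other drivers
# word it differently, so adjust the pattern for your hardware.
count_flaps() {
    grep -ic "$1.*Link is Down"
}

# Example:
#   dmesg | count_flaps eth0
```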
### Kernel Panic
1. Check iDRAC console for panic message (often scrolls past on physical console)
2. Check /var/log/journal (if persistent journal enabled):

   ```shell
   journalctl -b -1       # Previous boot
   journalctl -k -b -1    # Kernel messages from previous boot
   ```
3. Common causes:
- Bad RAM: run memtest86+ (boot from USB)
- Disk I/O errors: check dmesg for "I/O error" or "Buffer I/O error"
- Driver issues: recent kernel update? Check if rollback fixes it
- OOM killer went nuclear: check /var/log/messages for "Out of memory"
4. Enable kdump for future panics:

   ```shell
   systemctl enable kdump
   # Captures a crash dump for analysis on the next panic
   ```
Remember the kernel-panic diagnosis mnemonic **R-D-D-O**: RAM (memtest86+), Disk I/O errors (dmesg), Drivers (recent kernel update?), OOM killer (check /var/log/messages). Work through them in that order; on production servers, hardware faults are the most common root cause of kernel panics.
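kdump silently does nothing if the kernel never reserved crash memory at boot, so verify the `crashkernel=` parameter as well as the service. A minimal check, written against a file argument so it can be pointed at `/proc/cmdline`:

```shell
# Succeed only if the kernel command line reserves crash memory.
# On a live system: has_crashkernel /proc/cmdline
has_crashkernel() {
    grep -q 'crashkernel=' "$1"
}
```

Pair it with `systemctl is-active kdump`; both must pass before you can trust that the next panic will leave a dump behind.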
### Fans Screaming
1. High fan speed usually means:
- High CPU/inlet temp (check iDRAC thermal readings)
- Non-Dell/HP drive in a certified slot (firmware doesn't recognize it, so it ramps fans as a precaution)
- Missing blanking panel (hot air recirculation)
- Failed fan (other fans compensate at higher speed)
2. Check temps:

   ```shell
   ipmitool sensor list | grep -i temp
   # Or via iDRAC web UI: System > Temperatures
   ```

3. Check fan status:

   ```shell
   ipmitool sensor list | grep -i fan
   # Fans showing 0 RPM have failed
   ```
4. Common fixes:
- Clean dust from intake (compressed air)
- Replace blanking panels in empty bays
- For non-certified drives: some vendors have IPMI commands to override fan policy
- Replace failed fan module (most are hot-swap)
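Picking the failed fan out of a long sensor list is easier with a filter. A sketch assuming `ipmitool sensor` pipe-separated output with the sensor name in field 1 and the reading in field 2; the layout varies by BMC, so verify against yours first:

```shell
# Print the names of fans reading 0 RPM (failed) from ipmitool sensor
# output on stdin. Assumes pipe-separated fields: name | reading | ...
dead_fans() {
    awk -F'|' 'tolower($1) ~ /fan/ && $2 ~ /^[[:space:]]*0(\.0*)?[[:space:]]*$/ {print $1}'
}

# Example:
#   ipmitool sensor list | dead_fans
```

Note that fans reporting "na" (not populated) are skipped rather than flagged, which is usually what you want in a chassis with empty fan bays.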
## Gotchas & War Stories
**Firmware version skew kills clusters** - If you have 20 identical servers but 5 different BIOS versions, you'll get inconsistent behavior under load. Standardize firmware before deploying workloads.
**The cable you didn't label** - You'll spend 2 hours tracing a cable that would have taken 5 seconds to read a label. Label both ends. Use a consistent scheme: RACK-U-PORT (e.g., R04-U22-P1).
**SMART doesn't always warn you** - SSDs can fail without SMART warnings (firmware bug, capacitor failure). Don't trust SMART alone; monitor for I/O latency spikes.
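Since SMART alone can miss a dying SSD, watching average I/O latency from `iostat -x` catches the spikes a failing drive produces. A sketch that locates the `await` column from the header rather than hardcoding its position; note that newer sysstat versions split it into `r_await`/`w_await`, so adjust the column name for your version:

```shell
# Flag devices whose average I/O latency exceeds a threshold (ms), from
# `iostat -x` output on stdin. Resolves the await column from the header
# line so it survives layout differences between sysstat versions.
high_await() {
    awk -v t="$1" '
        $1 == "Device" { for (i = 1; i <= NF; i++) if ($i == "await") c = i; next }
        c && NF >= c && $c + 0 > t { print $1, $c }
    '
}

# Example: alert on anything averaging over 50 ms
#   iostat -x 5 2 | high_await 50
```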
**iDRAC default passwords** - Dell ships iDRAC with root/calvin. Change it. Use an Ansible playbook to set credentials across your fleet.
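Before you have Ansible in place, the fleet-wide change can be scripted with remote racadm. A sketch only: the attribute path `iDRAC.Users.2.Password` applies to iDRAC7 and later (older generations use `racadm config -g cfgUserAdmin`), and passing passwords on the command line is itself a weakness, so treat this as a bootstrap step and move to a proper secrets workflow afterwards:

```shell
# Rotate the root password on every iDRAC listed in a hosts file
# (one IP per line). DRY_RUN=1 prints the commands instead of running
# them, so you can sanity-check before touching the fleet.
# Attribute path assumes iDRAC7 or later.
rotate_idrac() {
    hosts_file=$1
    new_pass=$2
    while IFS= read -r host; do
        set -- racadm -r "$host" -u root -p calvin \
            set iDRAC.Users.2.Password "$new_pass"
        if [ "${DRY_RUN:-0}" = "1" ]; then
            echo "$*"
        else
            "$@"
        fi
    done < "$hosts_file"
}
```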
**PXE boot order matters** - If you put PXE first in the boot order, a server that loses its OS disk will happily reinstall itself from your provisioning server. Some people want this; some don't. Know which camp you're in.
**Disk predictive failure != imminent failure** - A SMART predictive failure alert means "this disk will probably fail eventually." That could be weeks or months away. Don't panic, but do schedule replacement during the next maintenance window.
**Under the hood:** Dell iDRAC and HP iLO System Event Logs (SEL) have a fixed-size buffer (typically 512-2048 entries). Once full, new events are silently dropped until you clear the log. Export the SEL to a file (`ipmitool sel elist > /var/log/sel-backup.log`) before clearing. Best practice: ship SEL entries to your central syslog via SNMP traps or Redfish event subscriptions so you never lose hardware events.
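Clearing on every run throws away history; clearing only near capacity keeps the buffer useful between exports. A sketch that parses "Percent Used" from `ipmitool sel` info output; the field name is an assumption, since some BMCs report free space in bytes instead:

```shell
# Print the SEL fill percentage from `ipmitool sel` info output on stdin,
# or 0 if the field is absent. Drive backup-and-clear from it, e.g.:
#   [ "$(ipmitool sel | sel_pct_used)" -ge 80 ] && \
#       ipmitool sel elist > /var/log/sel-backup.log && ipmitool sel clear
sel_pct_used() {
    awk -F: '/Percent Used/ {gsub(/[ %]/, "", $2); print $2; found = 1}
             END {if (!found) print 0}'
}
```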
## Troubleshooting Commands Cheatsheet
```shell
# Hardware inventory
lshw -short                    # Full hardware summary
dmidecode -t system            # System manufacturer, model, serial
dmidecode -t memory            # DIMM layout, speeds, sizes
lspci                          # PCIe devices
lsblk                          # Block device tree

# Disk health
smartctl -a /dev/sda           # Full SMART report
smartctl -H /dev/sda           # Quick health check
smartctl -t short /dev/sda     # Run short self-test

# Storage controller
perccli /c0 show               # Dell PERC status
storcli /c0 show               # LSI controller status
megacli -LDInfo -Lall -aALL    # MegaRAID status

# Network
ethtool eth0                   # NIC settings, link status
ethtool -i eth0                # Driver info
ethtool -S eth0                # NIC statistics/counters
ip -s link show eth0           # Interface stats

# System logs
journalctl -p err -b           # Errors since boot
journalctl -k                  # Kernel messages
dmesg -T                       # Kernel ring buffer (human timestamps)
mcelog --client                # Machine check exceptions (hardware errors)

# IPMI (out-of-band from another machine)
ipmitool -H <iDRAC-IP> -U root -P <pass> chassis status
ipmitool -H <iDRAC-IP> -U root -P <pass> sel list       # System event log
ipmitool -H <iDRAC-IP> -U root -P <pass> sensor list    # All sensors

# Dell specific
racadm getconfig -g cfgServerInfo    # Server info
racadm getsel                        # System event log
racadm getsysinfo                    # System summary
```
## Quick Reference
- Cheatsheet: Datacenter