# Datacenter Operations Cheat Sheet
## Server Boot Sequence (POST)

Name origin: POST stands for "Power-On Self-Test" — the BIOS/UEFI firmware's built-in diagnostic that runs before any operating system loads. The beep codes you hear (or don't) during POST are the motherboard's way of reporting hardware status when the display may not even be functional yet. BMC (Baseboard Management Controller) is the separate, always-on microcontroller that gives you out-of-band access even when the main OS is down.
Power on → BIOS/UEFI POST → Memory check → Disk detection →
Boot loader (GRUB) → Kernel → init/systemd → Services
```shell
# Check boot messages
dmesg | head -50         # Kernel boot log
journalctl -b            # Full boot log
systemd-analyze          # Boot timing
systemd-analyze blame    # Slow services
```
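As a quick triage step, the worst offender can be pulled from `systemd-analyze blame`-style output, which is sorted slowest-first. The sample text below stands in for the live command (the service names and timings are invented); on a real host you would pipe `systemd-analyze blame` instead.

```shell
# Sketch: extract the slowest boot-time service from blame-style output.
# blame_output imitates `systemd-analyze blame`; values are illustrative.
blame_output="12.480s NetworkManager-wait-online.service
 3.210s docker.service
 1.050s sshd.service"
slowest=$(printf '%s\n' "$blame_output" | head -1 | awk '{print $2}')
echo "Slowest service: $slowest"
```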
## IPMI / BMC (Out-of-Band Management)
```shell
# Remote power control
ipmitool -I lanplus -H bmc-ip -U admin -P pass power status
ipmitool -I lanplus -H bmc-ip -U admin -P pass power cycle
ipmitool -I lanplus -H bmc-ip -U admin -P pass power on

# Serial-over-LAN (remote console)
ipmitool -I lanplus -H bmc-ip -U admin -P pass sol activate

# Sensor readings
ipmitool -I lanplus -H bmc-ip -U admin -P pass sensor list
ipmitool -I lanplus -H bmc-ip -U admin -P pass sel elist   # Event log
```
Remember: IPMI/BMC credentials are often factory defaults (admin/admin, ADMIN/ADMIN, root/calvin for Dell iDRAC). These are a major security risk if left unchanged — the BMC has full hardware control, including power, console, and firmware flashing. Always change BMC passwords and isolate BMC interfaces on a separate management VLAN.
## RAID Levels
| Level | Min Disks | Fault Tolerance | Use Case |
|---|---|---|---|
| 0 | 2 | None (striping) | Temp/scratch data |
| 1 | 2 | 1 disk (mirror) | OS / boot drives |
| 5 | 3 | 1 disk (parity) | General storage |
| 6 | 4 | 2 disks (dual parity) | Large arrays |
| 10 | 4 | 1 per mirror pair | Databases, high I/O |
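The capacity cost of each level is simple arithmetic over the disk count. The sketch below assumes a hypothetical array of 6 identical 4 TB disks; the numbers are illustrative, not a recommendation.

```shell
# Sketch: usable capacity per RAID level for n identical disks of size_tb TB.
# n=6 and size_tb=4 are invented example values.
n=6
size_tb=4
raid0=$(( n * size_tb ))          # striping, no redundancy
raid5=$(( (n - 1) * size_tb ))    # one disk's worth of parity
raid6=$(( (n - 2) * size_tb ))    # two disks' worth of parity
raid10=$(( n / 2 * size_tb ))     # half the disks hold mirrors
echo "RAID 0: ${raid0} TB, RAID 5: ${raid5} TB, RAID 6: ${raid6} TB, RAID 10: ${raid10} TB"
```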
```shell
# Software RAID (mdadm)
cat /proc/mdstat                          # RAID status
mdadm --detail /dev/md0                   # Array details
mdadm --manage /dev/md0 --add /dev/sdb1   # Add spare

# Hardware RAID (varies by vendor)
megacli -LDInfo -Lall -aALL               # MegaRAID
storcli /c0 show                          # StorCLI
ssacli ctrl all show config               # HP Smart Array
```
Remember the RAID mnemonics: 0 = zero redundancy (striping), 1 = one mirror, 5 = single parity (survives 1 disk loss), 6 = dual parity (survives 2), 10 = RAID 1+0 (mirrors, then stripes across them). RAID 5 is risky on large drives because a second disk failure during rebuild (which can take days on 4 TB+ drives) means total data loss. Prefer RAID 6 or RAID 10 for large arrays.
## Disk Health (SMART)
```shell
smartctl -a /dev/sda         # Full SMART data
smartctl -H /dev/sda         # Quick health check
smartctl -t short /dev/sda   # Run short self-test

# Key SMART attributes to watch:
#   5 Reallocated_Sector_Ct    # Bad sectors (critical)
# 187 Reported_Uncorrect       # Read errors
# 197 Current_Pending_Sector   # Sectors waiting to be remapped
# 198 Offline_Uncorrectable    # Offline bad sectors
# 194 Temperature_Celsius      # Drive temperature
```
Debug clue: SMART attribute #5 (Reallocated_Sector_Ct) is the single most important predictor of imminent disk failure. Any non-zero value means the drive is actively remapping bad sectors. A rising count means the drive is dying — replace it proactively before it takes your data with it.
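A minimal check for attribute 5 can be scripted. Since the real command needs a physical drive, the sample line below imitates one row of `smartctl -A` output with invented values; on a real host you would feed it `smartctl -A /dev/sda` instead.

```shell
# Sketch: flag a drive whose Reallocated_Sector_Ct (attribute 5) is non-zero.
# sample_line imitates a `smartctl -A` table row; the values are made up.
sample_line="  5 Reallocated_Sector_Ct   0x0033   095   095   036    Pre-fail  Always       -       12"
raw=$(printf '%s\n' "$sample_line" | awk '$1 == 5 { print $NF }')
if [ "$raw" -gt 0 ] 2>/dev/null; then
  echo "WARNING: $raw reallocated sectors - plan a replacement"
fi
```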
## PXE Boot
Server power on → DHCP request (with PXE option) →
DHCP reply (IP + TFTP server) → Download bootloader →
Download kernel + initrd → Boot into installer/image
Requirements:

- DHCP server (option 66/67, or next-server in ISC dhcpd)
- TFTP server (pxelinux.0, kernel, initrd)
- HTTP server (kickstart/preseed/cloud-init configs)
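On a small lab network, the DHCP and TFTP requirements above can both be covered by dnsmasq. The snippet below is a hedged sketch with placeholder IPs, hostname, and paths (and it writes to /tmp only for illustration), not a tested production config.

```shell
# Sketch: minimal dnsmasq DHCP+TFTP config for PXE. All addresses and paths
# are placeholders; a real deployment would live in /etc/dnsmasq.d/.
cat > /tmp/dnsmasq-pxe.conf <<'EOF'
dhcp-range=192.168.1.100,192.168.1.200,12h
# option 67 (boot file) + option 66 (TFTP server), in dnsmasq form
dhcp-boot=pxelinux.0,tftphost,192.168.1.10
enable-tftp
# tftp-root must contain pxelinux.0, kernel, and initrd
tftp-root=/srv/tftp
EOF
echo "Wrote /tmp/dnsmasq-pxe.conf"
```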
## Network Cabling
| Type | Max Distance | Speed | Use |
|---|---|---|---|
| Cat5e | 100m | 1 Gbps | Legacy |
| Cat6 | 55m (100m at 1 Gbps) | 10 Gbps | Standard |
| Cat6a | 100m | 10 Gbps | Recommended |
| OM3 fiber | 300m | 10 Gbps | Inter-rack |
| OM4 fiber | 400m | 10-40 Gbps | Inter-rack |
| Single-mode | 10+ km | 100+ Gbps | Between buildings |
## Power & Cooling
Power:
UPS → PDU → Server PSU (redundant A+B feeds)
Calculate: watts per rack, PUE (Power Usage Effectiveness)
PUE = Total Facility Power / IT Equipment Power
Good PUE: 1.2-1.4
Cooling:
Hot aisle / Cold aisle containment
Monitor inlet temperature (target: 18-27°C / 64-80°F)
Humidity: 40-60% RH
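The PUE formula above is plain division over two power readings. The facility numbers below are invented for illustration only.

```shell
# Sketch: PUE from the formula above (illustrative numbers, not a real site).
total_facility_kw=130
it_equipment_kw=100
pue=$(awk -v t="$total_facility_kw" -v i="$it_equipment_kw" \
  'BEGIN { printf "%.2f", t / i }')
echo "PUE: $pue"   # 1.30 lands inside the "good" 1.2-1.4 band
```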
## Capacity Planning
Per-server metrics to track:
- CPU utilization (avg + peak)
- Memory usage (committed vs available)
- Disk I/O (IOPS + throughput)
- Network throughput
- Storage growth rate
Planning formula:
Headroom = 100% - current_utilization
Months until full = headroom / monthly_growth_rate
Order lead time: 4-12 weeks for hardware
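The planning formula above works out to a couple of lines of arithmetic; the utilization and growth figures below are invented for illustration.

```shell
# Sketch: months-until-full from the planning formula (illustrative numbers).
current_utilization=70   # percent of capacity in use today
monthly_growth_rate=5    # percentage points of growth per month
headroom=$(( 100 - current_utilization ))
months_until_full=$(( headroom / monthly_growth_rate ))
echo "Headroom: ${headroom}%  Months until full: ${months_until_full}"
# With a 4-12 week hardware lead time, the order must go in well before then.
```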
## Hardware Troubleshooting
| Symptom | Check |
|---|---|
| No POST | PSU, RAM seating, BMC event log |
| Degraded RAID | RAID controller, failed disk, rebuild status |
| Thermal shutdown | Fan failure, dust, ambient temp |
| Network flapping | Cable, SFP/transceiver, switch port |
| Random reboots | PSU, RAM ECC errors, kernel panic logs |
| Slow disk I/O | SMART errors, RAID rebuild, disk saturation |
```shell
# Quick hardware check
dmidecode -t system   # System info
dmidecode -t memory   # RAM details
lspci                 # PCI devices
lscpu                 # CPU info
edac-util -s          # ECC memory errors
mcelog --client       # Machine check exceptions
```