Datacenter Operations Cheat Sheet

Name origin: POST stands for "Power-On Self-Test" — the BIOS/UEFI firmware's built-in diagnostic that runs before any operating system loads. The beep codes you hear (or don't) during POST are the motherboard's way of reporting hardware status when the display may not even be functional yet. BMC (Baseboard Management Controller) is the separate, always-on microcontroller that gives you out-of-band access even when the main OS is down.

Server Boot Sequence (POST)

Power on → BIOS/UEFI POST → Memory check → Disk detection →
Boot loader (GRUB) → Kernel → init/systemd → Services
# Check boot messages
dmesg | head -50                 # Kernel boot log
journalctl -b                    # Full boot log
systemd-analyze                  # Boot timing
systemd-analyze blame            # Slow services

IPMI / BMC (Out-of-Band Management)

# Remote power control
ipmitool -I lanplus -H bmc-ip -U admin -P pass power status
ipmitool -I lanplus -H bmc-ip -U admin -P pass power cycle
ipmitool -I lanplus -H bmc-ip -U admin -P pass power on

# Serial-over-LAN (remote console)
ipmitool -I lanplus -H bmc-ip -U admin -P pass sol activate

# Sensor readings
ipmitool -I lanplus -H bmc-ip -U admin -P pass sensor list
ipmitool -I lanplus -H bmc-ip -U admin -P pass sel elist  # Event log

Remember: IPMI/BMC credentials are often factory defaults (admin/admin, ADMIN/ADMIN, root/calvin for Dell iDRAC). These are a major security risk if left unchanged — BMC has full hardware control including power, console, and firmware flash. Always change BMC passwords and isolate BMC interfaces on a separate management VLAN.
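The default credentials listed above can be screened for in automation before a host goes into production. A minimal sketch; `is_default_cred` is an illustrative helper, not a standard tool, and the list covers only the defaults named above:

```shell
# Hypothetical helper: flag known factory-default BMC credentials.
is_default_cred() {  # usage: is_default_cred <user> <password>
  case "$1:$2" in
    admin:admin|ADMIN:ADMIN|root:calvin) return 0 ;;  # common factory defaults
    *) return 1 ;;
  esac
}

is_default_cred root calvin && echo "factory default in use -- change it"
```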

RAID Levels

Level  Min Disks  Fault Tolerance        Use Case
0      2          None (striping)        Temp/scratch data
1      2          1 disk (mirror)        OS / boot drives
5      3          1 disk (parity)        General storage
6      4          2 disks (dual parity)  Large arrays
10     4          1 per mirror pair      Databases, high I/O
# Software RAID (mdadm)
cat /proc/mdstat                 # RAID status
mdadm --detail /dev/md0          # Array details
mdadm --manage /dev/md0 --add /dev/sdb1  # Add spare

# Hardware RAID (varies by vendor)
megacli -LDInfo -Lall -aALL      # MegaRAID
storcli /c0 show                 # StorCLI
ssacli ctrl all show config      # HP SmartArray

Remember the RAID-level mnemonics: 0 = Zero redundancy (striping), 1 = One mirror, 5 = Single parity (survives 1 disk loss), 6 = Dual parity (survives 2), 10 = RAID 1+0 (mirrors, then stripes). RAID 5 is risky on large drives: a second disk failure during a rebuild (which can take days on 4 TB+ drives) means total data loss. Prefer RAID 6 or RAID 10 for large arrays.
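Usable capacity follows directly from the table above. A small calculator sketch, assuming all disks are equal-sized (a real array is limited by its smallest member); `raid_usable` is a hypothetical helper, not a standard command:

```shell
# Hypothetical helper: usable capacity (in TB) for N equal disks at a RAID level.
raid_usable() {  # usage: raid_usable <level> <num_disks> <disk_size_tb>
  local level=$1 n=$2 size=$3
  case $level in
    0)  echo $(( n * size )) ;;         # striping: full capacity, no redundancy
    1)  echo "$size" ;;                 # mirror: capacity of a single disk
    5)  echo $(( (n - 1) * size )) ;;   # one disk's worth lost to parity
    6)  echo $(( (n - 2) * size )) ;;   # two disks' worth lost to parity
    10) echo $(( n / 2 * size )) ;;     # half lost to mirroring
    *)  echo "unknown RAID level: $level" >&2; return 1 ;;
  esac
}

raid_usable 6 8 4    # 8x 4TB in RAID 6 -> 24 (TB usable)
```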

Disk Health (SMART)

smartctl -a /dev/sda              # Full SMART data
smartctl -H /dev/sda              # Quick health check
smartctl -t short /dev/sda        # Run short self-test

# Key SMART attributes to watch:
# 5   Reallocated_Sector_Ct    # Bad sectors (critical)
# 187 Reported_Uncorrectable   # Read errors
# 197 Current_Pending_Sector   # Sectors waiting remap
# 198 Offline_Uncorrectable    # Offline bad sectors
# 194 Temperature_Celsius      # Drive temperature

Debug clue: SMART attribute #5 (Reallocated_Sector_Ct) is the single most important predictor of imminent disk failure. Any non-zero value means the drive is actively remapping bad sectors. A rising count means the drive is dying — replace it proactively before it takes your data with it.
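Attribute #5 is easy to pull out of `smartctl -A` output with awk. The sample line below is fabricated for illustration; on a real host, pipe `smartctl -A /dev/sda` into the same filter instead of the echo:

```shell
# Fabricated sample row in `smartctl -A` format (ID, name, flags, values, raw).
sample='  5 Reallocated_Sector_Ct   0x0033   100   100   036    Pre-fail  Always       -       12'

# Match on the attribute name (field 2) and print the raw value (last field).
realloc=$(echo "$sample" | awk '$2 == "Reallocated_Sector_Ct" {print $NF}')
echo "$realloc"   # 12 -> nonzero: the drive is remapping, plan a replacement
```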

PXE Boot

Server power on → DHCP request (with PXE option) →
DHCP reply (IP + TFTP server) → Download bootloader →
Download kernel + initrd → Boot into installer/image

Requirements:
  - DHCP server (option 66/67 or ISC dhcpd next-server)
  - TFTP server (pxelinux.0, kernel, initrd)
  - HTTP server (kickstart/preseed/cloud-init configs)
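On a small network, dnsmasq can cover both the DHCP and TFTP requirements in one daemon (the HTTP server for kickstart/preseed configs still runs separately). A sketch of the relevant config; addresses, range, and paths are example values:

```
# /etc/dnsmasq.conf -- DHCP + TFTP for PXE (example values)
dhcp-range=192.168.10.100,192.168.10.200,12h
dhcp-boot=pxelinux.0,pxeserver,192.168.10.1   # option 67 filename + option 66 server
enable-tftp
tftp-root=/srv/tftp                           # must hold pxelinux.0, kernel, initrd
```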

Network Cabling

Type         Max Distance  Speed       Use
Cat5e        100 m         1 Gbps      Legacy
Cat6         55 m          10 Gbps     Standard (100 m at 1 Gbps)
Cat6a        100 m         10 Gbps     Recommended
OM3 fiber    300 m         10 Gbps     Inter-rack
OM4 fiber    400 m         10-40 Gbps  Inter-rack
Single-mode  10+ km        100+ Gbps   Between buildings

Power & Cooling

Power:
  UPS → PDU → Server PSU (redundant A+B feeds)
  Calculate: watts per rack, PUE (Power Usage Effectiveness)
  PUE = Total Facility Power / IT Equipment Power
  Good PUE: 1.2-1.4

Cooling:
  Hot aisle / Cold aisle containment
  Monitor inlet temperature (target: 18-27°C / 64-80°F)
  Humidity: 40-60% RH
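The PUE formula above is a straight division; a quick check with awk, since shell integer arithmetic would truncate the ratio. The kW figures are made-up sample numbers, not measurements:

```shell
total_kw=600   # total facility power draw (sample number)
it_kw=450      # IT equipment power only (sample number)

# PUE = Total Facility Power / IT Equipment Power
awk -v t="$total_kw" -v i="$it_kw" 'BEGIN { printf "PUE = %.2f\n", t / i }'
# PUE = 1.33 -> within the "good" 1.2-1.4 range
```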

Capacity Planning

Per-server metrics to track:
  - CPU utilization (avg + peak)
  - Memory usage (committed vs available)
  - Disk I/O (IOPS + throughput)
  - Network throughput
  - Storage growth rate

Planning formula:
  Headroom = 100% - current_utilization
  Months until full = headroom / monthly_growth_rate
  Order lead time: 4-12 weeks for hardware
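The planning formula above can be run the same way; the utilization and growth numbers are samples, with utilization in percent and growth in percentage points per month:

```shell
current_util=72     # percent of capacity in use today (sample number)
monthly_growth=3    # percentage points added per month (sample number)

# Months until full = (100% - current_utilization) / monthly_growth_rate
awk -v u="$current_util" -v g="$monthly_growth" \
    'BEGIN { printf "%.1f months until full\n", (100 - u) / g }'
# 9.3 months until full -- subtract the 4-12 week lead time when deciding to order
```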

Hardware Troubleshooting

Symptom           Check
No POST           PSU, RAM seating, BMC event log
Degraded RAID     RAID controller, failed disk, rebuild status
Thermal shutdown  Fan failure, dust, ambient temp
Network flapping  Cable, SFP/transceiver, switch port
Random reboots    PSU, RAM ECC errors, kernel panic logs
Slow disk I/O     SMART errors, RAID rebuild, disk saturation
# Quick hardware check
dmidecode -t system              # System info
dmidecode -t memory              # RAM details
lspci                            # PCI devices
lscpu                            # CPU info
edac-util -s                     # ECC memory errors
mcelog --client                  # Machine check exceptions