# Datacenter Operations Cheat Sheet
## Server Boot Sequence (POST)

Name origin: POST stands for "Power-On Self-Test" — the BIOS/UEFI firmware's built-in diagnostic that runs before any operating system loads. The beep codes you hear (or don't) during POST are the motherboard's way of reporting hardware status when the display may not even be functional yet. BMC (Baseboard Management Controller) is the separate, always-on microcontroller that gives you out-of-band access even when the main OS is down.
Power on → BIOS/UEFI POST → Memory check → Disk detection →
Boot loader (GRUB) → Kernel → init/systemd → Services
```shell
# Check boot messages
dmesg | head -50         # Kernel boot log
journalctl -b            # Full boot log
systemd-analyze          # Boot timing
systemd-analyze blame    # Slow services
```
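As a quick triage step, the worst offender can be pulled from `systemd-analyze blame`-style output, which is sorted slowest-first. The sample text below stands in for the live command (the service names and timings are invented); on a real host you would pipe `systemd-analyze blame` instead.

```shell
# Sketch: extract the slowest boot-time service from blame-style output.
# blame_output imitates `systemd-analyze blame`; values are illustrative.
blame_output="12.480s NetworkManager-wait-online.service
 3.210s docker.service
 1.050s sshd.service"
slowest=$(printf '%s\n' "$blame_output" | head -1 | awk '{print $2}')
echo "Slowest service: $slowest"
```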
## IPMI / BMC (Out-of-Band Management)
```shell
# Remote power control
ipmitool -I lanplus -H bmc-ip -U admin -P pass power status
ipmitool -I lanplus -H bmc-ip -U admin -P pass power cycle
ipmitool -I lanplus -H bmc-ip -U admin -P pass power on

# Serial-over-LAN (remote console)
ipmitool -I lanplus -H bmc-ip -U admin -P pass sol activate

# Sensor readings
ipmitool -I lanplus -H bmc-ip -U admin -P pass sensor list
ipmitool -I lanplus -H bmc-ip -U admin -P pass sel elist   # Event log
```
Remember: IPMI/BMC credentials are often factory defaults (admin/admin, ADMIN/ADMIN, root/calvin for Dell iDRAC). These are a major security risk if left unchanged — the BMC has full hardware control, including power, console, and firmware flashing. Always change BMC passwords and isolate BMC interfaces on a separate management VLAN.
## RAID Levels
| Level | Min Disks | Fault Tolerance | Use Case |
|---|---|---|---|
| 0 | 2 | None (striping) | Temp/scratch data |
| 1 | 2 | 1 disk (mirror) | OS / boot drives |
| 5 | 3 | 1 disk (parity) | General storage |
| 6 | 4 | 2 disks (dual parity) | Large arrays |
| 10 | 4 | 1 per mirror pair | Databases, high I/O |
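The capacity cost of each level is simple arithmetic over the disk count. The sketch below assumes a hypothetical array of 6 identical 4 TB disks; the numbers are illustrative, not a recommendation.

```shell
# Sketch: usable capacity per RAID level for n identical disks of size_tb TB.
# n=6 and size_tb=4 are invented example values.
n=6
size_tb=4
raid0=$(( n * size_tb ))          # striping, no redundancy
raid5=$(( (n - 1) * size_tb ))    # one disk's worth of parity
raid6=$(( (n - 2) * size_tb ))    # two disks' worth of parity
raid10=$(( n / 2 * size_tb ))     # half the disks hold mirrors
echo "RAID 0: ${raid0} TB, RAID 5: ${raid5} TB, RAID 6: ${raid6} TB, RAID 10: ${raid10} TB"
```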
```shell
# Software RAID (mdadm)
cat /proc/mdstat                          # RAID status
mdadm --detail /dev/md0                   # Array details
mdadm --manage /dev/md0 --add /dev/sdb1   # Add spare

# Hardware RAID (varies by vendor)
megacli -LDInfo -Lall -aALL               # MegaRAID
storcli /c0 show                          # StorCLI
ssacli ctrl all show config               # HP Smart Array
```
Remember the RAID mnemonics: 0 = zero redundancy (striping), 1 = one mirror, 5 = single parity (survives 1 disk loss), 6 = dual parity (survives 2), 10 = RAID 1+0 (mirrors, then stripes across them). RAID 5 is risky on large drives because a second disk failure during rebuild (which can take days on 4 TB+ drives) means total data loss. Prefer RAID 6 or RAID 10 for large arrays.
## Disk Health (SMART)
```shell
smartctl -a /dev/sda         # Full SMART data
smartctl -H /dev/sda         # Quick health check
smartctl -t short /dev/sda   # Run short self-test

# Key SMART attributes to watch:
#   5 Reallocated_Sector_Ct    # Bad sectors (critical)
# 187 Reported_Uncorrect       # Read errors
# 197 Current_Pending_Sector   # Sectors waiting to be remapped
# 198 Offline_Uncorrectable    # Offline bad sectors
# 194 Temperature_Celsius      # Drive temperature
```
Debug clue: SMART attribute #5 (Reallocated_Sector_Ct) is the single most important predictor of imminent disk failure. Any non-zero value means the drive is actively remapping bad sectors. A rising count means the drive is dying — replace it proactively before it takes your data with it.
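A minimal check for attribute 5 can be scripted. Since the real command needs a physical drive, the sample line below imitates one row of `smartctl -A` output with invented values; on a real host you would feed it `smartctl -A /dev/sda` instead.

```shell
# Sketch: flag a drive whose Reallocated_Sector_Ct (attribute 5) is non-zero.
# sample_line imitates a `smartctl -A` table row; the values are made up.
sample_line="  5 Reallocated_Sector_Ct   0x0033   095   095   036    Pre-fail  Always       -       12"
raw=$(printf '%s\n' "$sample_line" | awk '$1 == 5 { print $NF }')
if [ "$raw" -gt 0 ] 2>/dev/null; then
  echo "WARNING: $raw reallocated sectors - plan a replacement"
fi
```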
## PXE Boot
Server power on → DHCP request (with PXE option) →
DHCP reply (IP + TFTP server) → Download bootloader →
Download kernel + initrd → Boot into installer/image
Requirements:

- DHCP server (option 66/67, or next-server in ISC dhcpd)
- TFTP server (pxelinux.0, kernel, initrd)
- HTTP server (kickstart/preseed/cloud-init configs)
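On a small lab network, the DHCP and TFTP requirements above can both be covered by dnsmasq. The snippet below is a hedged sketch with placeholder IPs, hostname, and paths (and it writes to /tmp only for illustration), not a tested production config.

```shell
# Sketch: minimal dnsmasq DHCP+TFTP config for PXE. All addresses and paths
# are placeholders; a real deployment would live in /etc/dnsmasq.d/.
cat > /tmp/dnsmasq-pxe.conf <<'EOF'
dhcp-range=192.168.1.100,192.168.1.200,12h
# option 67 (boot file) + option 66 (TFTP server), in dnsmasq form
dhcp-boot=pxelinux.0,tftphost,192.168.1.10
enable-tftp
# tftp-root must contain pxelinux.0, kernel, and initrd
tftp-root=/srv/tftp
EOF
echo "Wrote /tmp/dnsmasq-pxe.conf"
```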
## Network Cabling
| Type | Max Distance | Speed | Use |
|---|---|---|---|
| Cat5e | 100m | 1 Gbps | Legacy |
| Cat6 | 55m (100m at 1 Gbps) | 10 Gbps | Standard |
| Cat6a | 100m | 10 Gbps | Recommended |
| OM3 fiber | 300m | 10 Gbps | Inter-rack |
| OM4 fiber | 400m | 10-40 Gbps | Inter-rack |
| Single-mode | 10+ km | 100+ Gbps | Between buildings |
## Power & Cooling
Power:
UPS → PDU → Server PSU (redundant A+B feeds)
Calculate: watts per rack, PUE (Power Usage Effectiveness)
PUE = Total Facility Power / IT Equipment Power
Good PUE: 1.2-1.4
Cooling:
Hot aisle / Cold aisle containment
Monitor inlet temperature (target: 18-27°C / 64-80°F)
Humidity: 40-60% RH
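The PUE formula above is plain division over two power readings. The facility numbers below are invented for illustration only.

```shell
# Sketch: PUE from the formula above (illustrative numbers, not a real site).
total_facility_kw=130
it_equipment_kw=100
pue=$(awk -v t="$total_facility_kw" -v i="$it_equipment_kw" \
  'BEGIN { printf "%.2f", t / i }')
echo "PUE: $pue"   # 1.30 lands inside the "good" 1.2-1.4 band
```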
## Capacity Planning
Per-server metrics to track:
- CPU utilization (avg + peak)
- Memory usage (committed vs available)
- Disk I/O (IOPS + throughput)
- Network throughput
- Storage growth rate
Planning formula:
Headroom = 100% - current_utilization
Months until full = headroom / monthly_growth_rate
Order lead time: 4-12 weeks for hardware
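The planning formula above works out to a couple of lines of arithmetic; the utilization and growth figures below are invented for illustration.

```shell
# Sketch: months-until-full from the planning formula (illustrative numbers).
current_utilization=70   # percent of capacity in use today
monthly_growth_rate=5    # percentage points of growth per month
headroom=$(( 100 - current_utilization ))
months_until_full=$(( headroom / monthly_growth_rate ))
echo "Headroom: ${headroom}%  Months until full: ${months_until_full}"
# With a 4-12 week hardware lead time, the order must go in well before then.
```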
## Hardware Troubleshooting
| Symptom | Check |
|---|---|
| No POST | PSU, RAM seating, BMC event log |
| Degraded RAID | RAID controller, failed disk, rebuild status |
| Thermal shutdown | Fan failure, dust, ambient temp |
| Network flapping | Cable, SFP/transceiver, switch port |
| Random reboots | PSU, RAM ECC errors, kernel panic logs |
| Slow disk I/O | SMART errors, RAID rebuild, disk saturation |
```shell
# Quick hardware check
dmidecode -t system   # System info
dmidecode -t memory   # RAM details
lspci                 # PCI devices
lscpu                 # CPU info
edac-util -s          # ECC memory errors
mcelog --client       # Machine check exceptions
```