
Portal | Level: L1: Foundations | Topics: Rack & Stack, Out-of-Band Management | Domain: Datacenter & Hardware

Datacenter & Hardware Drills

Remember: The physical troubleshooting order: Power (is it plugged in and on?) -> POST (does the BIOS initialize?) -> Network (can you reach the BMC/OS?) -> Storage (are disks healthy?). Mnemonic: "PPNS" — always start at the power cable before touching software. 80% of "server down" tickets in a datacenter are physical: loose cables, failed PSUs, dead fans.

Fun fact: IPMI (Intelligent Platform Management Interface) operates on a completely separate processor (the BMC) with its own network stack, independent of the server's main CPU and OS. This is why you can power-cycle a server with a frozen kernel — the BMC is a separate embedded computer on the motherboard.

Drill 1: Server Won't POST

Difficulty: Medium

Q: You rack a new server and it won't POST (no display, no beep codes). Walk through the physical troubleshooting.

Answer
1. Check power
   - Power cable seated? PSU switch on?
   - PDU showing output? Try a different outlet.
   - PSU LED on? If not, suspect dead PSU.

2. Check display
   - Monitor connected to the right port (not add-in GPU vs BMC)?
   - Try IPMI/iLO/iDRAC remote console.

3. Reseat components
   - RAM: reseat all DIMMs, try one DIMM at a time
   - CPU: check for bent pins on the CPU (PGA, older AMD) or in the socket (LGA: Intel and modern AMD), and for socket debris
   - PCIe cards: remove all add-in cards, try bare minimum

4. Listen for beep codes
   - No RAM: continuous beeping
   - Bad CPU: no beeps, no POST
   - Check vendor manual for specific beep patterns

5. BMC/IPMI
   - Connect to BMC network port
   - Access web UI or: ipmitool -I lanplus -H <bmc-ip> -U admin chassis status
   - Check hardware event log

Drill 2: RAID Levels

Difficulty: Easy

Q: Explain RAID 0, 1, 5, 6, and 10. Which do you use for databases?

Answer

| RAID | Min Disks | Fault Tolerance | Space Efficiency | Use Case |
|------|-----------|-----------------|------------------|----------|
| 0 | 2 | None (stripe) | 100% | Temp data, cache |
| 1 | 2 | 1 disk (mirror) | 50% | Boot drives, small DBs |
| 5 | 3 | 1 disk (parity) | (n-1)/n | Read-heavy workloads |
| 6 | 4 | 2 disks (dual parity) | (n-2)/n | Large arrays |
| 10 | 4 | 1 per mirror pair | 50% | Databases, high IOPS |

**For databases**: RAID 10 (mirrors + stripes). Best write performance, and it survives multiple disk failures (one per mirror pair).

**Never for databases**: RAID 5 — write penalty from parity calculation, slow rebuild on large disks.
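The space-efficiency column reduces to simple arithmetic. A minimal sketch (`raid_usable` is a hypothetical helper, not a standard tool; whole-TB integer math for simplicity):

```shell
#!/bin/sh
# raid_usable LEVEL NDISKS DISK_TB: usable capacity in TB for a RAID level.
raid_usable() {
  level=$1; n=$2; size=$3
  case "$level" in
    0)  echo $(( n * size )) ;;        # stripe: all capacity, no redundancy
    1)  echo $(( n * size / 2 )) ;;    # mirror: half
    5)  echo $(( (n - 1) * size )) ;;  # one disk's worth of parity
    6)  echo $(( (n - 2) * size )) ;;  # two disks' worth of parity
    10) echo $(( n * size / 2 )) ;;    # striped mirrors: half
    *)  echo "unknown RAID level: $level" >&2; return 1 ;;
  esac
}

raid_usable 5 4 2    # 4x 2TB in RAID 5  -> 6 TB usable
raid_usable 10 8 2   # 8x 2TB in RAID 10 -> 8 TB usable
```

Note how RAID 5's relative parity cost shrinks as the array grows, while RAID 1/10 stay at 50% regardless of disk count.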

Drill 3: Network Cabling

Difficulty: Easy

Q: A server intermittently drops packets. NIC shows CRC errors and frame errors. What's likely wrong?

Answer
Physical layer issues:
1. Bad cable → try a known-good cable
2. Bad SFP/transceiver → swap it
3. Dirty fiber connector → clean with a lint-free wipe
4. Cable too long → check max distance:
   - Cat6: 100m for 1GbE (10GbE only to ~55m; use Cat6a for 100m at 10GbE)
   - OM3 fiber: 300m for 10GbE
   - OM4 fiber: 400m for 10GbE
   - Single-mode: 10+ km
5. Duplex mismatch → both ends should auto-negotiate, or be forced to the same setting

Check:
  ethtool eth0         # Speed, duplex, link status
  ethtool -S eth0      # Error counters
  ip -s link show eth0 # Packet stats
CRC/frame errors almost always = physical layer. Don't look at software until you've ruled out cables and optics.
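A quick way to confirm a live physical-layer problem is to check whether the error counter is still climbing, not just non-zero. A sketch of the delta logic (`crc_rising` is a made-up helper; the `ethtool` sampling in the comments assumes your NIC exposes an `rx_crc_errors` counter, which varies by driver):

```shell
#!/bin/sh
# crc_rising BEFORE AFTER: succeed (and warn) if the CRC error counter
# grew between two samples; fail if it is stable.
crc_rising() {
  before=$1; after=$2
  if [ "$after" -gt "$before" ]; then
    echo "CRC errors rising: +$(( after - before )) since last sample, suspect cable/optic"
    return 0
  fi
  echo "CRC counter stable"
  return 1
}

# Usage sketch:
# a=$(ethtool -S eth0 | awk '/rx_crc_errors/ {print $2}')
# sleep 60
# b=$(ethtool -S eth0 | awk '/rx_crc_errors/ {print $2}')
# crc_rising "$a" "$b" && echo "replace the cable/SFP first"
```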

Drill 4: IPMI/BMC

Difficulty: Medium

Q: The OS is unresponsive on a remote server. You can't SSH in. How do you recover without physical access?

Answer
# Connect to BMC (out-of-band management)
# iLO (HP), iDRAC (Dell), IPMI (generic)

# Check power status
ipmitool -I lanplus -H <bmc-ip> -U admin -P pass chassis status

# Open remote console (KVM over IP)
# Use web UI: https://<bmc-ip>

# Force power cycle (last resort)
ipmitool -I lanplus -H <bmc-ip> -U admin -P pass chassis power cycle

# Check hardware event log
ipmitool -I lanplus -H <bmc-ip> -U admin -P pass sel list

# Check sensor readings
ipmitool -I lanplus -H <bmc-ip> -U admin -P pass sensor list
BMC provides:
- Remote console (KVM): see boot screen, BIOS, OS
- Power control: on, off, cycle, reset
- Hardware monitoring: temps, fans, voltages
- Event logs: hardware errors, boot events
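After a forced power cycle, poll until the host answers again rather than guessing when it's back. A minimal retry helper (a sketch; the `nc` SSH probe in the comment is an assumption about your environment):

```shell
#!/bin/sh
# wait_until ATTEMPTS CMD...: run CMD once per second until it succeeds
# or ATTEMPTS runs out; returns CMD's final success/failure.
wait_until() {
  attempts=$1; shift
  i=1
  while [ "$i" -le "$attempts" ]; do
    if "$@"; then
      return 0
    fi
    i=$(( i + 1 ))
    sleep 1
  done
  return 1
}

# After: ipmitool ... chassis power cycle
# wait_until 120 nc -z -w 2 <server> 22 && echo "SSH is back"
```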

Drill 5: Disk Health Monitoring

Difficulty: Easy

Q: How do you check if a disk is about to fail before it actually fails?

Answer
# SMART health check
smartctl -a /dev/sda

# Key attributes to watch:
# Reallocated_Sector_Ct  — bad sectors remapped (rising = dying disk)
# Current_Pending_Sector — sectors waiting to be remapped
# Offline_Uncorrectable  — sectors that can't be read
# UDMA_CRC_Error_Count   — cable/connection issues

# Quick health test
smartctl -H /dev/sda
# PASSED = OK, FAILED = replace immediately

# NVMe health
nvme smart-log /dev/nvme0n1
# Check: percentage_used, media_errors, critical_warning

# RAID controller status (Dell PERC)
megacli -LDInfo -Lall -aAll    # Logical drive status
megacli -PDList -aAll          # Physical drive status
Set up Prometheus `smartctl_exporter` and alert when:
- `Reallocated_Sector_Ct > 0` (warning)
- `Current_Pending_Sector > 0` (warning)
- SMART status = FAILED (critical)
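The key-attribute check can be scripted against `smartctl -A` output. A sketch (`smart_flags` is a hypothetical name; it assumes the usual attribute-table layout where RAW_VALUE is the last column):

```shell
#!/bin/sh
# smart_flags: read `smartctl -A` style lines on stdin and print any
# watched attribute whose RAW_VALUE (last column) is non-zero.
smart_flags() {
  awk '
    $2 == "Reallocated_Sector_Ct" ||
    $2 == "Current_Pending_Sector" ||
    $2 == "Offline_Uncorrectable" ||
    $2 == "UDMA_CRC_Error_Count" {
      if ($NF + 0 > 0) print "WARN", $2, "=", $NF
    }'
}

# Usage: smartctl -A /dev/sda | smart_flags
```

Empty output means the watched attributes are all zero; any `WARN` line is a disk to schedule for replacement.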

Drill 6: Thermal and Power

Difficulty: Medium

Q: A server keeps shutting down unexpectedly. IPMI logs show thermal events. What do you check?

Answer
# Check temperatures
ipmitool sensor list | grep -i temp
# CPU Temp: 95 C  ← TOO HIGH (max usually 85-100C)
# Inlet Temp: 35 C ← Room temperature at server intake
# Exhaust Temp: 55 C

# Check fans
ipmitool sensor list | grep -i fan
# Fan 1: 8500 RPM
# Fan 2: 0 RPM  ← DEAD FAN
Troubleshooting:
1. **Dead fan** — replace fan module (usually hot-swap)
2. **Blocked airflow** — check blanking panels, cable management, raised floor tiles
3. **High ambient temp** — check CRAC/CRAH units, hot/cold aisle containment
4. **Thermal paste dried out** — re-paste CPU heatsink (old servers)
5. **Dust** — clean filters and heatsinks

Prevention:
- Monitor inlet temperature (should be 18-27C per ASHRAE)
- Alert on fan failures
- Regular cleaning schedule
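The ASHRAE inlet band is easy to turn into an alert. A sketch (`inlet_check` is a hypothetical helper; whole degrees C, and the sensor name in the usage comment varies by vendor):

```shell
#!/bin/sh
# inlet_check TEMP_C: OK inside the ASHRAE recommended 18-27C band,
# ALERT (non-zero exit) outside it. Integer degrees only.
inlet_check() {
  t=$1
  if [ "$t" -ge 18 ] && [ "$t" -le 27 ]; then
    echo "OK: inlet ${t}C within 18-27C"
  else
    echo "ALERT: inlet ${t}C outside 18-27C"
    return 1
  fi
}

# Usage sketch, assuming the BMC exposes a sensor named "Inlet Temp":
# inlet_check "$(ipmitool sensor list | awk -F'|' '/Inlet Temp/ {print int($2)}')"
```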

Drill 7: PXE Boot

Difficulty: Medium

Q: Explain how PXE boot works for automated server provisioning.

Answer
Power on → NIC sends DHCP request with PXE option
DHCP server responds with:
  - IP address
  - TFTP server address (next-server)
  - Boot filename (pxelinux.0 or grubx64.efi)
NIC downloads bootloader via TFTP
Bootloader downloads kernel + initrd
Kernel boots, runs installer (kickstart/preseed/cloud-init)
Installer partitions disk, installs OS, configures network
Server reboots into installed OS
Configuration management (Ansible) runs first-boot setup
Modern alternative: **cloud-init** with custom images. Build a golden image with Packer, boot from it, cloud-init handles host-specific config.
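The DHCP/TFTP side of the flow above can be served from a single dnsmasq instance. A minimal config sketch (subnet, filenames, and paths are example values, not a prescribed layout):

```
# /etc/dnsmasq.d/pxe.conf — minimal PXE setup (example values)
dhcp-range=192.168.10.100,192.168.10.200,12h

# BIOS clients get pxelinux, UEFI x86-64 clients (client-arch 7) get GRUB
dhcp-match=set:efi64,option:client-arch,7
dhcp-boot=tag:efi64,grubx64.efi
dhcp-boot=tag:!efi64,pxelinux.0

# Built-in TFTP server delivering the bootloaders
enable-tftp
tftp-root=/srv/tftp
```

dnsmasq answers both the DHCP request (IP + next-server + boot filename) and the TFTP download that follows, which keeps a small provisioning network to one service.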

Drill 8: Capacity Planning

Difficulty: Hard

Q: Your cluster is at 70% capacity. Management asks when you need to buy more hardware. How do you forecast?

Answer
# Current utilization trend (CPU)
avg_over_time(
  (1 - rate(node_cpu_seconds_total{mode="idle"}[5m]))[30d:1d]
)

# Predict when you'll hit 85% (linear projection)
predict_linear(
  node_memory_MemAvailable_bytes[90d], 90 * 24 * 3600
)
Capacity planning framework:
1. Measure current usage (30-90 day window)
   - CPU, memory, disk, network per node
   - Include seasonal patterns (month-end, holidays)

2. Measure growth rate
   - New services deployed per quarter
   - Traffic growth (requests/s trend)
   - Storage growth (GB/month)

3. Calculate runway
   - At current growth rate, when do you hit 85% threshold?
   - Include burst capacity (2x for incidents)

4. Lead time
   - Hardware procurement: 4-12 weeks
   - Rack and stack: 1-2 weeks
   - OS/K8s setup: 1-3 days

5. Order when: current_date + lead_time > exhaustion_date
Rule of thumb: Start procurement when you're 6 months from capacity at current growth rate.
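Steps 3 and 5 reduce to simple arithmetic. A sketch (`runway_months` is a hypothetical helper; growth is linear, in whole percentage points per month, and the result rounds down):

```shell
#!/bin/sh
# runway_months CURRENT_PCT THRESHOLD_PCT GROWTH_PPM: whole months until
# utilization crosses the threshold at a linear growth rate.
runway_months() {
  current=$1; threshold=$2; growth=$3
  if [ "$growth" -le 0 ]; then
    echo "no growth: no exhaustion date" >&2
    return 1
  fi
  echo $(( (threshold - current) / growth ))
}

# 70% now, 85% threshold, +3 points/month -> 5 months of runway.
# Against 4-12 weeks procurement plus 1-2 weeks rack-and-stack,
# that runway says start the order process now.
runway_months 70 85 3
```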
