Portal | Level: L1: Foundations | Topics: Rack & Stack, Out-of-Band Management | Domain: Datacenter & Hardware
Datacenter & Hardware Drills¶
Remember: the physical troubleshooting order: Power (is it plugged in and on?) -> POST (does the BIOS initialize?) -> Network (can you reach the BMC/OS?) -> Storage (are disks healthy?). Mnemonic: "PPNS" — always start at the power cable before touching software. The majority of "server down" tickets in a datacenter turn out to be physical: loose cables, failed PSUs, dead fans.
Fun fact: IPMI (Intelligent Platform Management Interface) operates on a completely separate processor (the BMC) with its own network stack, independent of the server's main CPU and OS. This is why you can power-cycle a server with a frozen kernel — the BMC is a separate embedded computer on the motherboard.
Drill 1: Server Won't POST¶
Difficulty: Medium
Q: You rack a new server and it won't POST (no display, no beep codes). Walk through the physical troubleshooting.
Answer
1. Check power
- Power cable seated? PSU switch on?
- PDU showing output? Try a different outlet.
- PSU LED on? If not, suspect dead PSU.
2. Check display
- Monitor connected to the right port (onboard VGA/BMC output vs. an add-in GPU)?
- Try IPMI/iLO/iDRAC remote console.
3. Reseat components
- RAM: reseat all DIMMs, try one DIMM at a time
- CPU: check for bent pins on the CPU (PGA packages, older AMD) or in the socket (LGA, Intel and modern AMD), and for socket debris
- PCIe cards: remove all add-in cards, try bare minimum
4. Listen for beep codes
- No RAM: continuous beeping
- Bad CPU: no beeps, no POST
- Check vendor manual for specific beep patterns
5. BMC/IPMI
- Connect to BMC network port
- Access web UI or: ipmitool -I lanplus -H <bmc-ip> -U admin chassis status
- Check hardware event log
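The BMC check in step 5 is scriptable. A minimal sketch, assuming the common ipmitool output format ("System Power         : on"); the BMC address and credentials are placeholders:

```shell
# Extract the power state from `ipmitool chassis status` output.
# Captures the raw text as an argument so the parsing is testable offline.
check_power() {
  # $1 = raw "chassis status" output
  echo "$1" | awk -F': *' '/^System Power/ {print $2}'
}

# Usage against a real BMC:
#   status=$(ipmitool -I lanplus -H <bmc-ip> -U admin -P pass chassis status)
#   [ "$(check_power "$status")" = "on" ] || echo "server is powered off"
```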
Drill 2: RAID Levels¶
Difficulty: Easy
Q: Explain RAID 0, 1, 5, 6, and 10. Which do you use for databases?
Answer
| RAID | Min Disks | Fault Tolerance | Space Efficiency | Use Case |
|------|-----------|-----------------|------------------|----------|
| 0 | 2 | None (stripe) | 100% | Temp data, cache |
| 1 | 2 | 1 disk (mirror) | 50% | Boot drives, small DBs |
| 5 | 3 | 1 disk (parity) | (n-1)/n | Read-heavy workloads |
| 6 | 4 | 2 disks (dual parity) | (n-2)/n | Large arrays |
| 10 | 4 | 1 per mirror pair | 50% | Databases, high IOPS |
**For databases**: RAID 10 (mirrors + stripes). Best write performance, survives multiple disk failures (at most one per mirror pair).
**Never for databases**: RAID 5 — write penalty from parity calculation, slow rebuild on large disks.
Drill 3: Network Cabling¶
Difficulty: Easy
Q: A server intermittently drops packets. NIC shows CRC errors and frame errors. What's likely wrong?
Answer
Physical layer issues:
1. Bad cable — try a known-good cable
2. Bad SFP/transceiver — swap it
3. Dirty fiber connector — clean with a lint-free wipe
4. Cable too long — check max distance:
- Cat6: 100m for 1GbE, but only ~55m for 10GbE (use Cat6a for 100m at 10GbE)
- OM3 fiber: 300m for 10GbE
- OM4 fiber: 400m for 10GbE
- Single-mode: 10+ km
5. Duplex mismatch — both ends should be auto-negotiate or forced to same
Check:
ethtool eth0 # Speed, duplex, link status
ethtool -S eth0 # Error counters
ip -s link show eth0 # Packet stats
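`ethtool -S` prints dozens of counters; most are zero and irrelevant. A sketch for surfacing only the nonzero error/drop counters — note that counter names vary by NIC driver, so the keyword pattern here is an assumption:

```shell
# Filter `ethtool -S` output down to nonzero error-like counters.
# Input lines look like "     rx_crc_errors: 17".
errors_only() {
  awk -F': *' '$2 + 0 > 0 && $1 ~ /err|drop|crc/ {
    gsub(/^ +/, "", $1)        # strip leading indentation
    print $1 "=" $2
  }'
}

# Usage:
#   ethtool -S eth0 | errors_only
```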
Drill 4: IPMI/BMC¶
Difficulty: Medium
Q: The OS is unresponsive on a remote server. You can't SSH in. How do you recover without physical access?
Answer
# Connect to BMC (out-of-band management)
# iLO (HP), iDRAC (Dell), IPMI (generic)
# Check power status
ipmitool -I lanplus -H <bmc-ip> -U admin -P pass chassis status
# Open remote console (KVM over IP)
# Use web UI: https://<bmc-ip>
# Force power cycle (last resort)
ipmitool -I lanplus -H <bmc-ip> -U admin -P pass chassis power cycle
# Check hardware event log
ipmitool -I lanplus -H <bmc-ip> -U admin -P pass sel list
# Check sensor readings
ipmitool -I lanplus -H <bmc-ip> -U admin -P pass sensor list
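On a busy server the SEL from `sel list` can run to hundreds of entries. A sketch for pulling out the likely-actionable rows — the keyword list is an assumption, since SEL wording varies by vendor:

```shell
# Keep only SEL rows that mention critical/failure conditions.
sel_critical() {
  grep -Ei 'critical|fail|fault|non-recoverable'
}

# Usage:
#   ipmitool -I lanplus -H <bmc-ip> -U admin -P pass sel list | sel_critical
```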
Drill 5: Disk Health Monitoring¶
Difficulty: Easy
Q: How do you check if a disk is about to fail before it actually fails?
Answer
# SMART health check
smartctl -a /dev/sda
# Key attributes to watch:
# Reallocated_Sector_Ct — bad sectors remapped (rising = dying disk)
# Current_Pending_Sector — sectors waiting to be remapped
# Offline_Uncorrectable — sectors that can't be read
# UDMA_CRC_Error_Count — cable/connection issues
# Quick health test
smartctl -H /dev/sda
# PASSED = OK, FAILED = replace immediately
# NVMe health
nvme smart-log /dev/nvme0n1
# Check: percentage_used, media_errors, critical_warning
# RAID controller status (LSI/Broadcom MegaRAID, including Dell PERC)
megacli -LDInfo -Lall -aAll # Logical drive status
megacli -PDList -aAll # Physical drive status
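The key SMART attributes above can be checked in one pass. A sketch that flags the critical attributes whenever their raw value (last column of smartmontools' standard attribute table) is nonzero:

```shell
# Flag dying-disk SMART attributes from `smartctl -A` output.
# Attribute lines look like:
#   5 Reallocated_Sector_Ct 0x0033 100 100 010 Pre-fail Always - 12
smart_flags() {
  awk '$2 ~ /Reallocated_Sector_Ct|Current_Pending_Sector|Offline_Uncorrectable/ \
       && $NF + 0 > 0 {print $2 " raw=" $NF}'
}

# Usage:
#   smartctl -A /dev/sda | smart_flags
```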
Drill 6: Thermal and Power¶
Difficulty: Medium
Q: A server keeps shutting down unexpectedly. IPMI logs show thermal events. What do you check?
Answer
Troubleshooting:
1. **Dead fan** — replace fan module (usually hot-swap)
2. **Blocked airflow** — check blanking panels, cable management, raised floor tiles
3. **High ambient temp** — check CRAC/CRAH units, hot/cold aisle containment
4. **Thermal paste dried out** — re-paste CPU heatsink (old servers)
5. **Dust** — clean filters and heatsinks
Prevention:
- Monitor inlet temperature (should be 18-27C per ASHRAE)
- Alert on fan failures
- Regular cleaning schedule
Drill 7: PXE Boot¶
Difficulty: Medium
Q: Explain how PXE boot works for automated server provisioning.
Answer
Power on → NIC sends DHCP request with PXE option
↓
DHCP server responds with:
- IP address
- TFTP server address (next-server)
- Boot filename (pxelinux.0 or grubx64.efi)
↓
NIC downloads bootloader via TFTP
↓
Bootloader downloads kernel + initrd
↓
Kernel boots, runs installer (kickstart/preseed/cloud-init)
↓
Installer partitions disk, installs OS, configures network
↓
Server reboots into installed OS
↓
Configuration management (Ansible) runs first-boot setup
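The DHCP/TFTP side of the flow above can be sketched with dnsmasq, which serves both roles in one daemon. The interface, address range, and TFTP root below are placeholder examples:

```ini
# dnsmasq as combined DHCP + TFTP server for PXE (sketch)
interface=eth0
dhcp-range=10.0.0.100,10.0.0.200,12h
enable-tftp
tftp-root=/srv/tftp
# Hand BIOS clients pxelinux and UEFI clients GRUB,
# keyed on the client-architecture DHCP option
dhcp-match=set:efi,option:client-arch,7
dhcp-boot=tag:efi,grubx64.efi
dhcp-boot=tag:!efi,pxelinux.0
```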
Drill 8: Capacity Planning¶
Difficulty: Hard
Q: Your cluster is at 70% capacity. Management asks when you need to buy more hardware. How do you forecast?
Answer
# Current utilization trend (CPU)
avg_over_time(
(1 - rate(node_cpu_seconds_total{mode="idle"}[5m]))[30d:1d]
)
# Linear projection of free memory 90 days ahead
# (predicted value approaching 0 means exhaustion within the window)
predict_linear(
node_memory_MemAvailable_bytes[90d], 90 * 24 * 3600
)
1. Measure current usage (30-90 day window)
- CPU, memory, disk, network per node
- Include seasonal patterns (month-end, holidays)
2. Measure growth rate
- New services deployed per quarter
- Traffic growth (requests/s trend)
- Storage growth (GB/month)
3. Calculate runway
- At current growth rate, when do you hit 85% threshold?
- Include burst capacity (2x for incidents)
4. Lead time
- Hardware procurement: 4-12 weeks
- Rack and stack: 1-2 weeks
- OS/K8s setup: 1-3 days
5. Order when: current_date + lead_time > exhaustion_date
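Step 5's runway arithmetic can be sketched as a one-liner. All numbers here are illustrative examples, not measured data:

```shell
# Months of runway until the utilization threshold, assuming linear growth.
runway_months() {
  # $1 = current utilization %, $2 = growth in points/month, $3 = threshold %
  awk -v cur="$1" -v grow="$2" -v thr="$3" 'BEGIN { print (thr - cur) / grow }'
}

# Example: 70% today, +3 points/month, 85% threshold -> 5 months of runway.
# Against ~12 weeks procurement + 2 weeks rack & stack (~3.5 months of lead
# time), the purchase order needs to go out within about 1.5 months.
```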
Wiki Navigation¶
Prerequisites¶
- Datacenter & Server Hardware (Topic Pack, L1)
Related Content¶
- Datacenter & Server Hardware (Topic Pack, L1) — Out-of-Band Management, Rack & Stack
- Skillcheck: Datacenter (Assessment, L1) — Out-of-Band Management, Rack & Stack
- Bare-Metal Provisioning (Topic Pack, L2) — Out-of-Band Management
- Case Study: BMC Clock Skew Cert Failure (Case Study, L2) — Out-of-Band Management
- Case Study: Cable Management Wrong Port (Case Study, L1) — Rack & Stack
- Case Study: Link Flaps Bad Optic (Case Study, L1) — Rack & Stack
- Case Study: Rack PDU Overload Alert (Case Study, L1) — Rack & Stack
- Case Study: Serial Console Garbled (Case Study, L1) — Out-of-Band Management
- Case Study: Server Remote Console Lag (Case Study, L1) — Out-of-Band Management
- Case Study: iDRAC Unreachable OS Up (Case Study, L1) — Out-of-Band Management
Pages that link here¶
- BMC Clock Skew - Certificate Failure
- Bare-Metal Provisioning
- Datacenter & Server Hardware
- Datacenter Ops Domain
- Datacenter Skillcheck
- Drills
- Link Flaps - Bad Optic
- PDU Reporting Overload Warning
- Remote KVM/Console Extremely Laggy
- Serial-over-LAN Output Garbled
- Server Cabled to Wrong Switch Port / Wrong VLAN
- iDRAC Unreachable, OS Up