Server Hardware

10 cards: 🟢 3 easy | 🟡 4 medium | 🔴 3 hard

🟢 Easy (3)

1. What command gives a quick hardware inventory on Linux?

lshw -short. It prints a concise table of hardware paths covering every class (CPU, memory, network, disk, etc.). Filter by class: lshw -class memory, lshw -class network. Output the full hardware tree as JSON: lshw -json.

Remember: "RAID protects against disk failure, not data corruption." RAID is not a backup.

Example: RAID 1 = mirror (2 disks), RAID 5 = striping with parity (min 3 disks), RAID 10 = mirror + stripe (min 4 disks).

Gotcha: lshw requires root for full output. Without root, it hides some details. Use `sudo lshw -short` for the complete picture.

2. How do you read hardware information from BIOS/UEFI tables on Linux?

dmidecode reads SMBIOS/DMI data. Key types: dmidecode -t system (manufacturer, model, serial), dmidecode -t memory (DIMM details), dmidecode -t processor (CPU), dmidecode -t bios (BIOS version). Quick serial: dmidecode -s system-serial-number.

Remember: "ECC RAM detects and corrects single-bit errors." Servers use ECC; desktops usually don't.

Fun fact: Google research found that about 8% of DIMMs have at least one correctable error per year.

Name origin: DMI = Desktop Management Interface. SMBIOS = System Management BIOS. Both are BIOS/UEFI tables describing hardware to the OS.
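As a rough illustration of working with the DIMM details that dmidecode -t memory reports, the sketch below condenses each "Memory Device" record into one line. The function name dimm_summary is hypothetical, and the parsing assumes the usual tab-indented "Size:" and "Locator:" fields; field layout can vary between dmidecode versions.

```shell
# Sketch: summarize DIMM slots from `dmidecode -t memory` output on stdin.
# dimm_summary is a hypothetical helper name, not part of dmidecode.
dimm_summary() {
    awk -F': ' '
        /^Memory Device$/           { in_dev = 1; size = ""; loc = ""; next }
        in_dev && /^[ \t]+Size:/    { size = $2 }   # e.g. "16 GB" or "No Module Installed"
        in_dev && /^[ \t]+Locator:/ { loc = $2 }    # slot label; "Bank Locator:" is not matched
        in_dev && /^[ \t]*$/        { print loc ": " size; in_dev = 0 }
        END                         { if (in_dev) print loc ": " size }
    '
}

# Usage (needs root):
#   sudo dmidecode -t memory | dimm_summary
```

Empty slots typically show up with a size of "No Module Installed", which makes it easy to see how many slots are free.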

3. How do you check for hardware errors in the kernel message buffer?

dmesg -T -l err,crit,alert,emerg shows only error-level and above messages with human-readable timestamps. Filter for hardware: dmesg | grep -i "error\|fault\|fail\|hardware\|mce". Watch live: dmesg -Tw.

Remember: "IPMI/BMC = out-of-band management." Access the server even when the OS is down. iDRAC (Dell), iLO (HPE), XClarity Controller, formerly IMM (Lenovo).

Gotcha: BMC has its own IP and web interface — secure it! Default passwords are a common attack vector.

Debug clue: MCE (Machine Check Exception) in dmesg = CPU/memory hardware error. I/O errors = disk/controller. Link down = NIC/cable.
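The debug clue above can be turned into a coarse first-pass filter. This is only a sketch: classify_kmsg is a hypothetical name, and the patterns are simple keyword matches that will produce some false positives, so treat the labels as hints rather than a diagnosis.

```shell
# Sketch: tag kernel log lines by likely faulty component.
# classify_kmsg is a hypothetical helper; patterns are deliberately coarse.
classify_kmsg() {
    awk '
        { line = tolower($0) }
        line ~ /mce|machine check/      { print "cpu/memory: " $0; next }
        line ~ /i\/o error/             { print "disk/controller: " $0; next }
        line ~ /link is down|link down/ { print "nic/cable: " $0; next }
    '
}

# Usage:
#   dmesg -T | classify_kmsg
```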

🟡 Medium (4)

1. What is ECC memory and how do you check for memory errors on Linux?

ECC (Error-Correcting Code) memory detects and corrects single-bit errors (correctable, CE) and detects but cannot correct double-bit errors (uncorrectable, UE; these usually cause a kernel panic). Check with: edac-util -s (EDAC driver status), edac-util -v (per-DIMM error counts). A rising CE count on one DIMM indicates it is failing.

Remember: "Hot-swap = replace without downtime." Drives, power supplies, and fans are typically hot-swappable in servers. CPUs and RAM are not.

Number anchor: Google research found 8% of DIMMs experience at least one correctable error per year. ECC catches these silently.
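When edac-util is not installed, the same counters can usually be read straight from the kernel's EDAC sysfs tree. The sketch below assumes the EDAC driver for the platform is loaded; edac_counts is a hypothetical name, and the optional argument exists only so the function can be pointed at a different directory.

```shell
# Sketch: print correctable (ce) and uncorrectable (ue) error counts
# per memory controller from sysfs. edac_counts is a hypothetical helper.
edac_counts() {
    root=${1:-/sys/devices/system/edac/mc}
    for mc in "$root"/mc*; do
        [ -r "$mc/ce_count" ] || continue   # skip if no controller is exposed
        printf '%s ce=%s ue=%s\n' "${mc##*/}" \
            "$(cat "$mc/ce_count")" "$(cat "$mc/ue_count")"
    done
}

# Usage:
#   edac_counts
```

Running it periodically and comparing the ce values is one way to spot a DIMM whose correctable error count keeps climbing.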

2. What SMART attributes indicate a failing disk?

Reallocated Sector Count (bad sectors remapped — increasing means drive is dying), Current Pending Sector Count (sectors waiting to be remapped), and Uncorrectable Error Count (read errors that could not be recovered). Check with: smartctl -a /dev/sda. Run self-test: smartctl -t short /dev/sda.

Remember: "Reallocated + Pending + Uncorrectable = the three horsemen." Any non-zero value warrants investigation.
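A minimal sketch of checking the three attributes automatically, assuming the ATA attribute table printed by smartctl -A, where the attribute ID is the first column and the raw value is the tenth. IDs 5, 197, and 198 are the usual ones for reallocated, pending, and offline-uncorrectable sectors; smart_flags is a hypothetical name, and NVMe drives report health in a different format that this does not handle.

```shell
# Sketch: print any of SMART attributes 5, 197, 198 with a non-zero raw value.
# smart_flags is a hypothetical helper; expects `smartctl -A` table on stdin.
smart_flags() {
    awk '$1 == 5 || $1 == 197 || $1 == 198 {
        if ($10 + 0 > 0) print $2 "=" $10   # $2 = attribute name, $10 = raw value
    }'
}

# Usage (needs root):
#   sudo smartctl -A /dev/sda | smart_flags
```

No output means all three raw values are zero.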

3. How do you diagnose NIC hardware problems?

ethtool eth0 (link status, speed, duplex), ethtool -i eth0 (driver, firmware), ethtool -S eth0 | grep -E "error|drop|miss|crc" (error counters). Increasing CRC errors indicate cable or hardware issues. Link flapping and rx/tx drops suggest a failing NIC.

Debug clue: Increasing CRC errors = bad cable or connector. rx_missed_errors = NIC ring buffer overflow (enlarge the RX ring with ethtool -G eth0 rx N, up to the maximum reported by ethtool -g eth0).
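To cut ethtool -S output down to the counters that matter, a small filter can drop everything that is zero. This is a sketch: nic_error_counters is a hypothetical name, and counter names are driver-specific, so the keyword match may miss or over-match on some NICs.

```shell
# Sketch: show only non-zero error-like counters from `ethtool -S` on stdin.
# nic_error_counters is a hypothetical helper.
nic_error_counters() {
    awk -F': *' '
        tolower($1) ~ /err|drop|miss|crc/ && $2 + 0 > 0 {
            gsub(/^[ \t]+/, "", $1)   # strip the leading indentation
            print $1 "=" $2
        }
    '
}

# Usage:
#   ethtool -S eth0 | nic_error_counters
```

Running it twice a few minutes apart shows whether a counter is still increasing or is left over from an old incident.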

4. What is NUMA and why does it matter for server performance?

NUMA (Non-Uniform Memory Access) means each CPU socket has local memory that is faster to access than remote memory (other socket's memory). Applications should be pinned to use local memory. Check topology: numactl --hardware or lscpu | grep NUMA. Misaligned NUMA access causes latency.

Name origin: NUMA = Non-Uniform Memory Access. The alternative is UMA (Uniform Memory Access), which doesn't scale beyond ~4 sockets.
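For a quick look at which CPUs belong to which node without numactl installed, the topology is also exposed in sysfs. The sketch below uses a hypothetical function name, numa_nodes, and takes an optional directory argument purely so it can be pointed elsewhere.

```shell
# Sketch: list NUMA nodes and their CPU ranges from sysfs.
# numa_nodes is a hypothetical helper.
numa_nodes() {
    root=${1:-/sys/devices/system/node}
    for node in "$root"/node[0-9]*; do
        [ -r "$node/cpulist" ] || continue
        printf '%s cpus=%s\n' "${node##*/}" "$(cat "$node/cpulist")"
    done
}

# Usage:
#   numa_nodes
```

Pinning a process to one node's CPUs and memory could then look like numactl --cpunodebind=0 --membind=0 ./app, where ./app stands in for the real workload.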

🔴 Hard (3)

1. What is an MCE and how do you investigate one?

MCE (Machine Check Exception) is a CPU-reported hardware error. Uncorrectable MCEs cause kernel panics. Investigate: journalctl | grep -i "mce\|machine check", mcelog --client (if mcelog daemon runs). Common causes: failing DIMMs, CPU cache errors, overheating. Check edac-util for memory-related MCEs.

Name origin: MCE = Machine Check Exception. CPUs report internal hardware errors via this mechanism. Uncorrectable MCEs crash the system.

2. What causes CPU thermal throttling and how do you detect it?

Throttling occurs when the CPU exceeds its thermal limits. Common causes: failed fans, blocked airflow, high ambient temperature (for example after a CRAC failure), and dust buildup. Detect: sensors (lm-sensors), ipmitool sensor list | grep -i temp, cpupower frequency-info (check whether the frequency is reduced). Check dmesg for thermal throttling messages.

Number anchor: Most server CPUs throttle at 95-100°C. A common rule of thumb is that each 10°C rise above the design temperature roughly doubles the component failure rate.
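On Intel x86 systems the kernel usually keeps a per-CPU count of throttle events in sysfs, which gives a direct answer to "has this box throttled since boot?". The sketch below assumes that thermal_throttle directory exists (it may be absent on other platforms); throttle_counts is a hypothetical name.

```shell
# Sketch: report CPUs with a non-zero thermal throttle event count.
# throttle_counts is a hypothetical helper; the sysfs path is x86/Intel-specific.
throttle_counts() {
    root=${1:-/sys/devices/system/cpu}
    for f in "$root"/cpu[0-9]*/thermal_throttle/core_throttle_count; do
        [ -r "$f" ] || continue
        count=$(cat "$f")
        [ "$count" -gt 0 ] || continue
        cpu=${f#"$root"/}                      # e.g. cpu3/thermal_throttle/...
        printf '%s throttled %s times\n' "${cpu%%/*}" "$count"
    done
}

# Usage:
#   throttle_counts
```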

3. What is the recommended workflow for diagnosing a suspected hardware issue?

1. Check dmesg -T -l err,crit (MCE, I/O errors, link down). 2. Check ipmitool sel elist (BMC events). 3. Check smartctl -a /dev/sdX (disk health). 4. Check edac-util -v (memory error counts). 5. Check ethtool -S eth0 (NIC errors). 6. Check ipmitool sensor list (thermal, voltage, fan). 7. Check vendor BMC UI (iDRAC/iLO) for detailed diagnostics.

Remember: "Outside-in: system-wide logs (kernel, BMC) → device-specific tools → vendor diagnostics." Start with the broadest view and narrow down.
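The command-line steps of this workflow could be wrapped in a small script so that nothing gets skipped under pressure. The following is a sketch under assumptions: run_step and hw_triage are hypothetical names, the defaults /dev/sda and eth0 are placeholders for the real disk and interface, and most of the commands need root.

```shell
# Sketch: run each triage step if its tool is installed, otherwise say so.
# run_step and hw_triage are hypothetical helpers.
run_step() {
    label=$1; shift
    if command -v "$1" >/dev/null 2>&1; then
        printf '== %s ==\n' "$label"
        "$@"
    else
        printf '== %s == skipped (%s not installed)\n' "$label" "$1"
    fi
}

hw_triage() {
    disk=${1:-/dev/sda}   # placeholder default
    nic=${2:-eth0}        # placeholder default
    run_step "kernel errors" dmesg -T -l err,crit
    run_step "BMC event log" ipmitool sel elist
    run_step "disk health"   smartctl -a "$disk"
    run_step "memory errors" edac-util -v
    run_step "NIC counters"  ethtool -S "$nic"
    run_step "sensors"       ipmitool sensor list
}

# Usage (as root):
#   hw_triage /dev/sdb eno1
```

The vendor BMC UI step is left out because it is a manual check.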