Server Hardware Footguns

Mistakes that cause hardware failures, data loss, or extended outages on bare-metal servers.


1. Ignoring rising correctable ECC error counts

You see a few correctable ECC errors in edac-util and dismiss them as noise. The error rate doubles each week. Eventually the DIMM throws an uncorrectable error, causing a kernel panic. Production goes down with no warning.

Fix: Alert on correctable ECC error rates. Any DIMM with a rising trend should be scheduled for replacement. One correctable error per month is normal. Ten per day is a dying DIMM.

Under the hood: Correctable ECC errors fix single-bit flips in real time with no application impact. Uncorrectable (multi-bit) errors trigger a Machine Check Exception (MCE), which causes an immediate kernel panic. The transition from "occasional correctable" to "uncorrectable" is often sudden. Dell's threshold: 24+ correctable errors in 24 hours = defective DIMM, schedule replacement.
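The trend check can be automated. The sketch below assumes a one-line-per-row summary format like edac-util --report=ce produces; the counts are made-up sample data, and the threshold follows the Dell guideline above.

```shell
# check_ce_counts THRESHOLD
# Reads edac-util-style corrected-error lines on stdin and prints an
# ALERT for any row whose count exceeds the threshold.
check_ce_counts() {
  awk -v t="$1" '{ if ($(NF-2) + 0 > t) print "ALERT:", $1, $2, $3, $(NF-2), "CEs" }'
}

# Illustrative sample input, not real server output.
# In production: edac-util --report=ce | check_ce_counts 24
printf '%s\n' \
  "mc0: csrow0: ch0: 3 Corrected Errors" \
  "mc0: csrow1: ch0: 31 Corrected Errors" | check_ce_counts 24
```

Run from cron once a day and you get the rising-trend alert the fix describes instead of a surprise MCE.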


2. Running SMART checks only after a failure

A drive dies. You check SMART on the remaining drives and find two more with high reallocated sector counts. They were dying for months but nobody looked.

Fix: Run smartctl -H /dev/sdX for overall health and smartctl -A /dev/sdX for the attribute table on all drives weekly via cron. Alert on Reallocated_Sector_Ct > 0, Current_Pending_Sector > 0, or Offline_Uncorrectable > 0. Automate with smartmontools.

Remember the three SMART death indicators: Reallocated_Sector_Ct (bad sectors already moved to the spare area), Current_Pending_Sector (sectors waiting for a read retry or reallocation), and Offline_Uncorrectable (sectors that failed offline testing). Any non-zero value on these three means the drive is actively failing.
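The weekly check is easy to script. This sketch assumes smartctl -A's table layout, where the raw value is the last column; the rows below are invented sample data, not real drive output.

```shell
# check_smart_attrs
# Scans smartctl -A style output on stdin and flags the three death
# indicators whenever their raw value (last column) is non-zero.
check_smart_attrs() {
  awk '/Reallocated_Sector_Ct|Current_Pending_Sector|Offline_Uncorrectable/ {
         if ($NF + 0 > 0) print "FAILING:", $2, "raw =", $NF
       }'
}

# Illustrative sample rows.
# In production: smartctl -A /dev/sdX | check_smart_attrs
printf '%s\n' \
  "  5 Reallocated_Sector_Ct   0x0033 100 100 010 Pre-fail Always - 0" \
  "197 Current_Pending_Sector  0x0012 100 100 000 Old_age  Always - 8" | check_smart_attrs
```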


3. Pulling a drive from a degraded RAID array

The RAID array is degraded — one drive failed. You pull a second drive to "check it." Now two drives are missing. RAID1 or RAID5 is down; RAID6 is on its last legs. If you pulled the wrong drive, the array is destroyed.

Fix: Never pull a drive from a degraded array without confirming the failed slot. Use storcli or megacli to identify the failed drive by slot number and LED indicator. Replace only the failed drive.
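Confirming the slot can be scripted against the controller's drive table. The column order here (EID:Slt, DID, State) is assumed from storcli /c0 show output and may differ by controller or version; the rows are made-up samples.

```shell
# find_failed_slots
# Parses storcli-style drive rows (EID:Slt DID State ...) on stdin and
# prints any slot whose state is not "Onln" (online).
find_failed_slots() {
  awk '$3 != "Onln" { print "FAILED SLOT:", $1 }'
}

# Illustrative sample rows, not real controller output:
printf '%s\n' \
  "32:0 6 Onln" \
  "32:1 7 Offln" | find_failed_slots
```

Once the slot is confirmed, blink its locate LED (e.g. storcli /c0/e32/s1 start locate) so there is no guesswork before you touch anything.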


4. Mixing DIMM speeds or sizes within a channel

You add a 16GB DIMM to a server that has 32GB DIMMs. The memory controller down-clocks all DIMMs to match the slowest one. Performance drops 20%. Worse, some servers will not POST with mismatched configurations.

Fix: All DIMMs in a server should be identical: same capacity, speed, rank, and manufacturer. Check the vendor's memory population guide before ordering.
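A uniformity check can be run after any memory change. This sketch assumes dmidecode's indented "Size:" and "Speed:" line format; the values below are sample data, not real output.

```shell
# check_dimm_uniformity
# Reads "dmidecode -t memory" style output on stdin and reports whether
# all populated DIMMs share a single Size and Speed value.
check_dimm_uniformity() {
  awk -F': ' '
    /^\tSize: [0-9]/  { if (!($2 in sz)) { sz[$2]; nsz++ } }
    /^\tSpeed: [0-9]/ { if (!($2 in sp)) { sp[$2]; nsp++ } }
    END { if (nsz > 1 || nsp > 1) print "MIXED DIMMs"; else print "UNIFORM" }'
}

# Illustrative sample, not real dmidecode output:
printf '\tSize: 32 GB\n\tSpeed: 3200 MT/s\n\tSize: 16 GB\n\tSpeed: 3200 MT/s\n' \
  | check_dimm_uniformity
```

This catches the mixed-capacity case; rank and manufacturer need the same treatment against the vendor's population guide.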


5. Not monitoring inlet temperature

The CRAC unit in your row fails overnight. Inlet temperature climbs. CPUs throttle, then servers start thermal shutdowns. Your monitoring caught the rising application latency, but by then half the rack was already overheating.

Fix: Monitor inlet temperature per server via IPMI. Alert at 35C warning, 40C critical. Check with ipmitool sensor list | grep -i "inlet temp".
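The thresholds above can be wired into a classifier. This sketch assumes the pipe-separated "ipmitool sensor" layout with the reading in column 2, which varies by BMC; the sample line is invented.

```shell
# inlet_temp_status
# Classifies the inlet reading from "ipmitool sensor" style output on
# stdin using the 35C warning / 40C critical thresholds above.
inlet_temp_status() {
  awk -F'|' '/Inlet/ {
    t = $2 + 0
    if      (t >= 40) print "CRITICAL:", t
    else if (t >= 35) print "WARNING:", t
    else              print "OK:", t
  }'
}

# Illustrative sample line, not real sensor output:
echo "Inlet Temp | 37.000 | degrees C | ok" | inlet_temp_status
```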


6. Replacing a failed NIC without checking firmware

You swap a failed NIC with a spare. The spare has firmware from three years ago. It does not support the VXLAN offload your overlay network depends on. Packets drop silently. Debugging takes hours because the link shows "up."

Fix: After any NIC replacement, verify firmware version with ethtool -i ethX. Match firmware to the fleet standard. Update before putting the server back in service.
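The firmware comparison can be part of the return-to-service checklist. This sketch assumes the "firmware-version:" line format of ethtool -i; the driver name and version are illustrative samples, and your fleet standard goes in the argument.

```shell
# check_nic_fw EXPECTED_VERSION
# Compares the firmware-version line of "ethtool -i" style output on
# stdin against the fleet-standard version.
check_nic_fw() {
  fw=$(awk -F': ' '/^firmware-version/ { print $2 }')
  if [ "$fw" = "$1" ]; then echo "OK: $fw"
  else echo "MISMATCH: have $fw want $1"; fi
}

# Illustrative sample, not real ethtool output:
printf 'driver: mlx5_core\nfirmware-version: 16.35.2000\n' | check_nic_fw 16.35.2000
```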


7. Trusting lshw output over physical inspection

lshw shows 64GB of RAM. But one DIMM is not seated properly and is detected only intermittently. The server works fine until a vibration unseats it completely: the hardware sees a failing memory module, while the OS either panics outright or just sees memory vanish and starts OOM-killing.

Fix: After any physical work (rack moves, component swaps), verify dmidecode -t memory shows all expected DIMMs. Cross-reference with the vendor spec.
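Counting populated slots is the simplest version of that verification. This sketch assumes dmidecode's convention that empty slots read "Size: No Module Installed"; the lines are sample data.

```shell
# count_dimms
# Counts populated DIMM slots in "dmidecode -t memory" style output on
# stdin, skipping slots that report no module.
count_dimms() {
  awk '/^\tSize:/ { if ($0 !~ /No Module Installed/) n++ } END { print n + 0 }'
}

# Illustrative sample, not real dmidecode output:
printf '\tSize: 32 GB\n\tSize: No Module Installed\n\tSize: 32 GB\n' | count_dimms
```

Compare the count against the expected population for that chassis and alert on any shortfall.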


8. Dismissing CRC errors as normal

ethtool -S eth0 shows increasing CRC errors. You assume it is "normal." The cause is a bad cable or a failing transceiver. One day the link flaps during peak traffic. The failover is not fast enough. Users see dropped connections.

Fix: CRC errors are never normal. Check the cable, clean the transceiver, try a different switch port. Zero CRC errors is the only acceptable count.

Debug clue: ethtool -S eth0 | grep -i 'crc\|error\|drop\|miss' gives you the full picture. Rising rx_crc_errors = bad cable or dirty transceiver. Rising rx_missed_errors = ring buffer overflow (increase with ethtool -G). Rising tx_errors = driver or firmware issue.
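Catching the rise means comparing two samples, not eyeballing one. This sketch assumes the stat is named rx_crc_errors, which is itself an assumption since NICs vary (rx_crc_errors, rx_crc_errors_phy, and so on); the counter values are made up.

```shell
# get_crc
# Extracts the rx_crc_errors counter from "ethtool -S" style output on
# stdin.
get_crc() { awk -F': ' '/rx_crc_errors/ { print $2 + 0 }' ; }

# Compare two samples taken some interval apart (values are invented):
prev=$(echo "     rx_crc_errors: 12" | get_crc)
curr=$(echo "     rx_crc_errors: 40" | get_crc)
[ "$curr" -gt "$prev" ] && echo "ALERT: CRC errors rising ($prev -> $curr)"
```

In practice the previous sample lives in a state file and the comparison runs from cron every few minutes.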


9. No out-of-band management network

Your server crashes and is unresponsive to SSH. You have no BMC/IPMI access because the management port was never cabled. Someone has to physically walk to the datacenter to press the power button or pull a crash cart.

Fix: Cable every server's BMC port to the management network at rack time. Configure IPMI IP, credentials, and VLAN. Verify with ipmitool -I lanplus -H <bmc-ip> power status before putting the server in service.
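Part of that rack-time verification can be automated before the ipmitool power-status test. This sketch assumes the "IP Address" line format of ipmitool lan print; the address shown is an invented sample.

```shell
# check_bmc_ip
# Verifies that "ipmitool lan print" style output on stdin shows a
# usable BMC IP (unconfigured BMCs typically report 0.0.0.0).
check_bmc_ip() {
  ip=$(awk -F': ' '/^IP Address *:/ { print $2 }')
  case "$ip" in
    ""|0.0.0.0) echo "UNCONFIGURED BMC" ;;
    *)          echo "BMC IP: $ip" ;;
  esac
}

# Illustrative sample, not real BMC output:
printf 'IP Address Source  : Static Address\nIP Address         : 10.0.0.50\n' | check_bmc_ip
```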


10. Skipping POST diagnostics after a hardware change

You add a PCIe card and boot straight into the OS. POST tried to warn you about a resource conflict but you had the console redirected and missed it. The card works intermittently, causing random I/O errors.

Fix: After any hardware change, watch POST output via BMC console or SOL. Check ipmitool sel elist for POST errors. Verify dmesg shows clean device initialization.
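The SEL check is easy to fold into the post-change script. The keyword matching below is a heuristic sketch, since real SEL wording varies by vendor, and the entries shown are invented samples.

```shell
# post_sel_errors
# Heuristic scan of "ipmitool sel elist" style lines on stdin for
# anything that looks like a POST or hardware error.
post_sel_errors() {
  grep -iE 'error|fail|uncorrectable' || echo "SEL clean"
}

# Illustrative sample entries, not real SEL output:
printf '%s\n' \
  "1 | 01/01/2025 | 10:00:00 | System Boot Initiated | Initiated by power up" \
  "2 | 01/01/2025 | 10:00:05 | Memory | Uncorrectable ECC | Asserted" | post_sel_errors
```

Anything it prints deserves a look at the BMC console output before the server goes back into rotation.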