Server Hardware - Street-Level Ops
Real-world hardware diagnostics and triage for production servers.
Quick hardware inventory
# One-line system identification
dmidecode -s system-manufacturer && dmidecode -s system-product-name && dmidecode -s system-serial-number
# Dell Inc.
# PowerEdge R750
# ABC1234
# Concise hardware summary
lshw -short
# H/W path Device Class Description
# /0/0 memory 64GiB System Memory
# /0/0/0 memory 32GiB DIMM DDR4 3200 MHz
# /0/0/1 memory 32GiB DIMM DDR4 3200 MHz
# /0/100/1f.2 storage SATA Controller
# /0/100/3/0 eth0 network Ethernet Controller X710
Diagnose a suspected bad DIMM
# Check ECC error counts
edac-util -s
# mc0: 0 Uncorrectable Errors, 14 Correctable Errors
# Per-DIMM breakdown
edac-util -l
# mc0: csrow0: ch0: 14 Correctable Errors
# mc0: csrow0: ch1: 0 Correctable Errors
# Cross-reference with physical slot (dmidecode prints Size above Locator, hence -B3)
dmidecode -t memory | grep -B3 -A5 "Locator: DIMM_A1"
# Locator: DIMM_A1
# Bank Locator: Not Specified
# Type: DDR4
# Size: 32 GB
# Manufacturer: Samsung
# Check kernel logs for MCE (Machine Check Exception)
journalctl | grep -i "mce\|machine check\|hardware error" | tail -10
dmesg | grep -i "edac\|ecc\|mce"
Under the hood: Correctable ECC errors (CEs) are single-bit flips that the memory controller fixes transparently. A few CEs per year per DIMM is normal (cosmic rays). But a sudden spike, such as dozens of CEs from the same DIMM within hours, often precedes an uncorrectable error (UE). A UE raises a machine check that typically panics the kernel and takes the server down. Replace the DIMM proactively when the CE rate spikes.
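A minimal sketch of tracking the CE rate described above, assuming the kernel EDAC driver is loaded and exposes per-controller counters under `/sys/devices/system/edac/mc/mc*/ce_count`. The function names and the default threshold of 24 are hypothetical; choose a threshold that matches your own baseline.

```shell
# Sum correctable-error counters across all memory controllers.
# The root directory is a parameter so the function can be pointed at a test tree.
# The subshell body "( ... )" keeps the helper variables local.
edac_ce_total() (
    root="${1:-/sys/devices/system/edac/mc}"
    total=0
    for f in "$root"/mc*/ce_count; do
        [ -r "$f" ] || continue      # nothing matched the glob, or file unreadable
        read -r n < "$f"
        total=$((total + n))
    done
    echo "$total"
)

# Succeed when the CE count grew by more than the threshold since the
# previous snapshot. The default of 24 is an illustrative assumption.
ce_spiked() {
    [ $(( $2 - $1 )) -gt "${3:-24}" ]
}

# Example (e.g. hourly from cron, previous total kept in a state file):
#   prev=$(cat /var/tmp/ce.prev 2>/dev/null || echo 0)
#   curr=$(edac_ce_total); echo "$curr" > /var/tmp/ce.prev
#   ce_spiked "$prev" "$curr" && echo "CE spike: $prev -> $curr"
```

Comparing snapshots rather than absolute counts matters because the counters only reset on reboot or driver reload, so an old DIMM with a slow, steady trickle would otherwise alert forever.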
NIC error investigation
# Link status and speed
ethtool eth0
# Speed: 25000Mb/s
# Duplex: Full
# Link detected: yes
# Error counters — the important ones
ethtool -S eth0 | grep -E "error|drop|miss|crc|fifo"
# rx_crc_errors: 0
# rx_missed_errors: 0
# tx_errors: 0
# rx_dropped: 342 ← investigate (ring buffer overflow or NIC filter)
# rx_fifo_errors: 0
# Driver and firmware version
ethtool -i eth0
# driver: i40e
# version: 2.23.17
# firmware-version: 9.20 0x8000d95e 22.0.9
# Check for link flapping in dmesg
dmesg -T | grep -i "eth0.*link\|eth0.*up\|eth0.*down"
# [Thu Mar 14 02:15:01 2024] i40e: eth0 NIC Link is Down
# [Thu Mar 14 02:15:03 2024] i40e: eth0 NIC Link is Up, 25 Gbps
One-liner: `ethtool -S eth0 | grep -E "error|drop|miss|crc|fifo"` is the single most useful NIC diagnostic command. Non-zero `rx_crc_errors` = physical layer (cable/SFP). Non-zero `rx_dropped` = kernel/driver (ring buffer too small). Non-zero `tx_errors` = usually duplex mismatch or driver bug.
Debug clue: `rx_dropped` on a NIC usually means the kernel ring buffer overflowed: packets arrived faster than the CPU could process them. Check the hardware maximum with `ethtool -g eth0`, then increase the ring buffer: `ethtool -G eth0 rx 4096`. If `rx_crc_errors` is non-zero, it points to a physical layer problem: bad cable, failing SFP, or duplex mismatch.
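The counter-to-layer mapping above can be sketched as a small filter over `ethtool -S` output. The function name `nic_triage` and the hint wording are hypothetical, and counter names vary by driver, so treat the patterns as a starting point rather than a complete list.

```shell
# Read `ethtool -S <iface>` output on stdin and print only the non-zero
# error counters, each tagged with the layer to look at first.
nic_triage() {
    awk -F': *' '
        { sub(/^[ \t]+/, "", $1) }      # strip leading indentation
        $2 + 0 == 0 { next }            # skip zero counters and the header line
        $1 ~ /crc/            { print $1 "=" $2 " -> physical layer (cable/SFP)"; next }
        $1 ~ /drop|miss|fifo/ { print $1 "=" $2 " -> ring buffer or kernel"; next }
        $1 ~ /error/          { print $1 "=" $2 " -> driver or link settings"; next }
    '
}

# Usage:
#   ethtool -S eth0 | nic_triage
```

Counters that are non-zero but match none of the patterns (packet and byte totals, for example) are dropped, so empty output means no error counter has moved.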
Disk health check
# SMART overall health
smartctl -H /dev/sda
# SMART overall-health self-assessment test result: PASSED
# Key SMART attributes
smartctl -A /dev/sda | grep -E "Reallocated|Pending|Uncorrect|Temperature"
# 5 Reallocated_Sector_Ct 0 ← good (bad sectors remapped)
# 197 Current_Pending_Sector 0 ← good (sectors awaiting remap)
# 198 Offline_Uncorrectable 0 ← good (unrecoverable read errors)
# 194 Temperature_Celsius 34 ← normal
# Run a short self-test
smartctl -t short /dev/sda
# Test will complete after about 2 minutes
smartctl -l selftest /dev/sda
# Num Test Status Remaining LifeTime
# 1 Short offline Completed without error 0% 12345
# NVMe health
nvme smart-log /dev/nvme0
# critical_warning : 0
# temperature : 35 C
# percentage_used : 3% (100% = rated endurance reached)
# media_errors : 0
RAID status check
# MegaRAID / Dell PERC
storcli /c0 show
# Controller = 0
# Status = Optimal
storcli /c0/vall show
# DG/VD TYPE State Access Consist Cache Size
# 0/0 RAID1 Optimal RW Yes RWBD 446.625 GB
# Check for degraded or rebuilding drives
storcli /c0/eall/sall show | grep -E "Online|Rebuild|Failed"
# 252:0 Online
# 252:1 Online
# Software RAID (mdadm)
cat /proc/mdstat
# md0 : active raid1 sda1[0] sdb1[1]
# 1953513472 blocks super 1.2 [2/2] [UU] ← UU = both drives healthy
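For monitoring, the `[UU]` status map can be checked mechanically: an underscore in place of a `U` marks a missing or failed member. A minimal sketch, where the function name `md_degraded` is hypothetical and the file path is a parameter so it can be tested against a saved copy:

```shell
# Print the name of every md array whose status map contains "_",
# i.e. at least one member is missing or failed.
md_degraded() {
    awk '
        /^md[^ ]* *:/        { name = $1 }     # remember the current array
        /\[[U_]*_[U_]*\]/    { print name }    # matches [U_], [_U], [UU_U], ...
    ' "${1:-/proc/mdstat}"
}

# Usage:
#   [ -z "$(md_degraded)" ] || echo "degraded arrays: $(md_degraded)"
```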
Remember: SMART attributes mnemonic: R-P-U — Reallocated (ID 5, bad sectors already remapped), Pending (ID 197, sectors waiting for remap), Uncorrectable (ID 198, failed reads). Any non-zero value on these three is a warning to start planning a disk replacement. Do not wait for the drive to fail completely.
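The R-P-U rule can be sketched as a filter over `smartctl -A` output. This assumes the ATA attribute table layout, where the raw value is the tenth column; NVMe and SAS devices report health differently and are not covered. The function name `smart_rpu` is hypothetical.

```shell
# Read `smartctl -A <dev>` output on stdin and print any of attributes
# 5, 197, 198 whose raw value is non-zero.
# Exit status is 1 when at least one of them is non-zero, 0 otherwise.
smart_rpu() {
    awk '
        ($1 == 5 || $1 == 197 || $1 == 198) && $10 + 0 > 0 {
            print $1, $2, $10
            bad = 1
        }
        END { exit bad }
    '
}

# Usage:
#   smartctl -A /dev/sda | smart_rpu || echo "plan a replacement for /dev/sda"
```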
Thermal throttling investigation
# CPU temperatures
sensors
# coretemp-isa-0000
# Core 0: +72.0°C (high = +85.0°C, crit = +100.0°C)
# Core 1: +70.0°C
# Check if CPU is throttled
grep -i mhz /proc/cpuinfo | head -4
# cpu MHz : 1200.000 ← should be ~3000+ if not throttled
# Check CPU frequency governor
cpupower frequency-info
# current CPU frequency: 1.20 GHz ← throttled!
> **Default trap:** Many Linux distributions ship with the `powersave` CPU governor, which throttles CPU frequency to save power. On a production server, this adds latency under load. Set the governor to `performance`: `cpupower frequency-set -g performance`. Make it persistent via a systemd unit or tuned profile (`tuned-adm profile throughput-performance`).
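To confirm the governor on every core rather than trusting a single reading, the cpufreq sysfs files can be read directly. A minimal sketch, assuming the standard `/sys/devices/system/cpu/cpuN/cpufreq/scaling_governor` layout; the function name is hypothetical.

```shell
# List CPUs whose scaling governor is not "performance".
# The sysfs root is a parameter so the function can be pointed at a test tree.
# The subshell body "( ... )" keeps the helper variables local.
non_performance_cpus() (
    root="${1:-/sys/devices/system/cpu}"
    for f in "$root"/cpu[0-9]*/cpufreq/scaling_governor; do
        [ -r "$f" ] || continue      # no cpufreq support, or glob did not match
        read -r gov < "$f"
        cpu="${f#"$root"/}"          # cpu3/cpufreq/scaling_governor
        cpu="${cpu%%/*}"             # cpu3
        [ "$gov" = "performance" ] || echo "$cpu $gov"
    done
)

# Usage:
#   non_performance_cpus        # no output means every core is on "performance"
```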
# IPMI sensors for inlet temp
ipmitool sensor list | grep -i temp
# Inlet Temp | 42.000 | degrees C | cr ← critical!
# Check fan status
ipmitool sensor list | grep -i fan
# Fan1 | 16800.000 | RPM | ok (fans maxed out)
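On a chassis with dozens of sensors it can help to show only the ones that are not healthy. The sketch below assumes the pipe-separated layout of `ipmitool sensor list` shown above, with the status in the fourth column; the function name is hypothetical, and treating `ns`/`na` as "no reading" rather than a fault is an assumption.

```shell
# Read `ipmitool sensor list` output on stdin and print sensors whose
# status column is anything other than ok (or "no reading").
ipmi_not_ok() {
    awk -F'|' '
        { for (i = 1; i <= NF; i++) gsub(/^[ \t]+|[ \t]+$/, "", $i) }   # trim each field
        NF >= 4 && $4 != "ok" && $4 != "ns" && $4 != "na" {
            print $1 ": " $2 " " $3 " [" $4 "]"
        }
    '
}

# Usage:
#   ipmitool sensor list | ipmi_not_ok
```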
System Event Log analysis
# Recent hardware events
ipmitool sel elist | tail -20
# 1 | 03/14/2024 | Memory #0x01 | Correctable ECC
# 5 | 03/14/2024 | Temperature | Upper Critical - going high
# 8 | 03/14/2024 | Drive Slot 2 | Drive Fault
# Export SEL to file before clearing; clear only if the export succeeded
ipmitool sel elist > /var/log/sel-$(hostname)-$(date +%F).log && ipmitool sel clear
Kernel hardware error log
# All hardware-related kernel messages
dmesg -T -l err,crit,alert,emerg
# [Thu Mar 14 02:15:33] mce: [Hardware Error]: CPU 4: Machine Check Exception
# [Thu Mar 14 02:15:33] EDAC MC0: 1 CE memory read error on CPU_SrcID#0
# Live-watch for new hardware errors
dmesg -Tw | grep -iE "error|fault|fail|hardware|mce|edac"
Remote diagnostics via BMC
# Full sensor dump to a file for comparison
# -E reads the password from the IPMI_PASSWORD environment variable, keeping it out of the process list and shell history
ipmitool -I lanplus -H 10.0.10.5 -U admin -E sdr list full > /tmp/sensors-$(date +%F).txt
# Remote serial console (see POST, GRUB, kernel boot)
ipmitool -I lanplus -H 10.0.10.5 -U admin -E sol activate
# Chassis status (power, LED, intrusion)
ipmitool chassis status
# System Power : on
# Main Power Fault : false
# Chassis Intrusion : inactive
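The remote commands above repeat the same connection flags, so a small wrapper keeps them consistent. This is a sketch under stated assumptions: `bmc` and `BMC_USER` are hypothetical names, and the password is expected in `IPMI_PASSWORD`, which ipmitool reads when given `-E`.

```shell
# Run any ipmitool subcommand against a BMC over the encrypted lanplus
# interface. The subshell body keeps "host" out of the caller's scope.
bmc() (
    host="$1"; shift
    ipmitool -I lanplus -H "$host" -U "${BMC_USER:?set BMC_USER first}" -E "$@"
)

# Usage (IPMI_PASSWORD must already be exported):
#   export BMC_USER=admin
#   bmc 10.0.10.5 chassis status
#   bmc 10.0.10.5 sol activate
```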
Remember: IPMI transport mnemonic: `-I lanplus` = encrypted (RMCP+, use this), `-I lan` = unencrypted (legacy, avoid). Always use `lanplus` for remote IPMI to prevent credentials from being sent in plaintext. Better yet, put all BMC/iDRAC interfaces on a dedicated out-of-band management VLAN that is not routable from the general network.
Gotcha: `dmidecode` requires root and reads the DMI/SMBIOS table from firmware. In virtual machines, it reports the hypervisor's emulated hardware, not physical hardware. Use `systemd-detect-virt` first to check if you are on bare metal before trusting `dmidecode` output for inventory purposes.
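A minimal sketch of that guard for an inventory script. `systemd-detect-virt` prints `none` (and exits non-zero) when it detects no virtualization; the `--vm` flag limits detection to hypervisors, so a container on bare metal still counts as physical. The function name `dmi_is_physical` is hypothetical.

```shell
# Decide whether DMI data describes physical hardware, given the string
# printed by `systemd-detect-virt --vm` ("none" means no hypervisor).
dmi_is_physical() {
    [ "$1" = "none" ]
}

# Usage (dmidecode needs root):
#   virt=$(systemd-detect-virt --vm || true)
#   if dmi_is_physical "$virt"; then
#       dmidecode -s system-serial-number
#   else
#       echo "hypervisor detected ($virt): DMI data is emulated" >&2
#   fi
```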