Server Hardware - Street-Level Ops

Real-world hardware diagnostics and triage for production servers.

Quick hardware inventory

# One-line system identification
dmidecode -s system-manufacturer && dmidecode -s system-product-name && dmidecode -s system-serial-number
# Dell Inc.
# PowerEdge R750
# ABC1234

# Concise hardware summary
lshw -short
# H/W path        Device      Class       Description
# /0/0                        memory      64GiB System Memory
# /0/0/0                      memory      32GiB DIMM DDR4 3200 MHz
# /0/0/1                      memory      32GiB DIMM DDR4 3200 MHz
# /0/100/1f.2                 storage     SATA Controller
# /0/100/3/0       eth0       network     Ethernet Controller X710

Diagnose a suspected bad DIMM

# Check ECC error counts
edac-util -s
# mc0: 0 Uncorrectable Errors, 14 Correctable Errors

# Per-DIMM breakdown
edac-util -l
# mc0: csrow0: ch0: 14 Correctable Errors
# mc0: csrow0: ch1: 0 Correctable Errors

# Cross-reference with physical slot
dmidecode -t memory | grep -A5 "Locator: DIMM_A1"
# Locator: DIMM_A1
# Bank Locator: Not Specified
# Type: DDR4
# Size: 32 GB
# Manufacturer: Samsung

# Check kernel logs for MCE (Machine Check Exception)
journalctl -k | grep -iE "mce|machine check|hardware error" | tail -10
dmesg | grep -iE "edac|ecc|mce"

Under the hood: Correctable ECC errors (CEs) are single-bit flips that the memory controller fixes transparently. A few CEs per year per DIMM is normal (cosmic rays). But a sudden spike — dozens of CEs from the same DIMM in hours — predicts an imminent uncorrectable error (UE). A UE crashes the server. Replace the DIMM proactively when CE rate spikes.
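The CE-rate heuristic above can be turned into a cron-driven check. This is a minimal sketch: the threshold, state-file path, and the assumption that edac-util prints lines like "14 Correctable Errors" should all be adapted to your fleet.

```shell
#!/bin/sh
# Sketch: run hourly from cron; warns when the correctable-error count
# jumps by THRESHOLD or more since the previous run.
THRESHOLD=10
STATE=/var/tmp/ecc-ce.last

# Sum correctable errors across all memory controllers (0 if none/missing)
cur=$(edac-util -s 2>/dev/null |
      grep -oE '[0-9]+ Correctable' | awk '{s+=$1} END {print s+0}')

prev=$(cat "$STATE" 2>/dev/null || echo 0)
echo "$cur" > "$STATE"

if [ $(( cur - prev )) -ge "$THRESHOLD" ]; then
    logger -p daemon.warning "ECC CE spike: $(( cur - prev )) new correctable errors since last run"
fi
```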

NIC error investigation

# Link status and speed
ethtool eth0
# Speed: 25000Mb/s
# Duplex: Full
# Link detected: yes

# Error counters — the important ones
ethtool -S eth0 | grep -E "error|drop|miss|crc|fifo"
# rx_crc_errors: 0
# rx_missed_errors: 0
# tx_errors: 0
# rx_dropped: 342        ← investigate (ring buffer overflow or NIC filter)
# rx_fifo_errors: 0

# Driver and firmware version
ethtool -i eth0
# driver: i40e
# version: 2.23.17
# firmware-version: 9.20 0x8000d95e 22.0.9

# Check for link flapping in dmesg
dmesg -T | grep -iE "eth0.*(link|up|down)"
# [Thu Mar 14 02:15:01 2024] i40e: eth0 NIC Link is Down
# [Thu Mar 14 02:15:03 2024] i40e: eth0 NIC Link is Up, 25 Gbps

One-liner: ethtool -S eth0 | grep -E "error|drop|miss|crc|fifo" is the single most useful NIC diagnostic command. Non-zero rx_crc_errors = physical layer (cable/SFP). Non-zero rx_dropped = kernel/driver (ring buffer too small). Non-zero tx_errors = usually duplex mismatch or driver bug.

Debug clue: rx_dropped on a NIC usually means the kernel ring buffer overflowed — packets arrived faster than the CPU could process them. Increase the ring buffer: ethtool -G eth0 rx 4096. If rx_crc_errors is non-zero, it points to a physical layer problem: bad cable, failing SFP, or duplex mismatch.
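The ring-buffer fix above can be scripted so the requested size never exceeds what the hardware supports. A sketch, assuming the interface name and the 4096 target are placeholders for your NIC:

```shell
# Sketch: bump the RX ring, capped at the hardware maximum from `ethtool -g`.
IFACE=eth0
TARGET=4096

max=$(ethtool -g "$IFACE" 2>/dev/null |
      awk '/Pre-set maximums/ {p=1} p && $1 == "RX:" {print $2; exit}')

if [ -n "$max" ]; then
    if [ "$TARGET" -gt "$max" ]; then TARGET=$max; fi
    ethtool -G "$IFACE" rx "$TARGET"
fi
```

Note that ethtool -G does not survive a reboot; persist the setting with a udev rule, a NetworkManager/systemd-networkd hook, or your config management.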

Disk health check

# SMART overall health
smartctl -H /dev/sda
# SMART overall-health self-assessment test result: PASSED

# Key SMART attributes
smartctl -A /dev/sda | grep -E "Reallocated|Pending|Uncorrect|Temperature"
# 5   Reallocated_Sector_Ct   0       ← good (bad sectors remapped)
# 197 Current_Pending_Sector  0       ← good (sectors awaiting remap)
# 198 Offline_Uncorrectable   0       ← good (unrecoverable read errors)
# 194 Temperature_Celsius     34      ← normal

# Run a short self-test
smartctl -t short /dev/sda
# Test will complete after about 2 minutes

smartctl -l selftest /dev/sda
# Num  Test              Status         Remaining  LifeTime
# 1    Short offline     Completed without error   0%   12345

# NVMe health
nvme smart-log /dev/nvme0
# critical_warning                        : 0
# temperature                             : 35 C
# percentage_used                         : 3% (100% = rated endurance reached)
# media_errors                            : 0

RAID status check

# MegaRAID / Dell PERC
storcli /c0 show
# Controller = 0
# Status = Optimal

storcli /c0/vall show
# DG/VD TYPE  State   Access Consist Cache  Size
# 0/0   RAID1 Optimal RW     Yes     RWBD   446.625 GB

# Check for degraded or rebuilding drives
storcli /c0/eall/sall show | grep -E "Online|Rebuild|Failed"
# 252:0  Online
# 252:1  Online

# Software RAID (mdadm)
cat /proc/mdstat
# md0 : active raid1 sda1[0] sdb1[1]
#       1953513472 blocks super 1.2 [2/2] [UU]     ← UU = both drives healthy
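A degraded md array shows "_" in place of "U" in that status string, which makes it easy to alert on. A minimal sketch suitable for cron:

```shell
# Sketch: warn when any md member has dropped out ("_" in the [UU] field).
if grep -q '\[[U_]*_[U_]*\]' /proc/mdstat 2>/dev/null; then
    echo "WARN: degraded md array on $(hostname)" >&2
fi
```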

Remember: SMART attributes mnemonic: R-P-U — Reallocated (ID 5, bad sectors already remapped), Pending (ID 197, sectors waiting for remap), Uncorrectable (ID 198, failed reads). Any non-zero value on these three is a warning to start planning a disk replacement. Do not wait for the drive to fail completely.
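The R-P-U check can be looped across all disks. A sketch, assuming SATA/SAS devices matching /dev/sd? and the standard smartmontools attribute table (raw value in column 10); vendors vary, so verify against your drives:

```shell
# Sketch: flag any disk with non-zero Reallocated/Pending/Uncorrectable
# raw values (SMART IDs 5, 197, 198).
for disk in /dev/sd?; do
    [ -b "$disk" ] || continue
    bad=$(smartctl -A "$disk" 2>/dev/null |
          awk '$1 == 5 || $1 == 197 || $1 == 198 {sum += $10} END {print sum + 0}')
    if [ "$bad" -gt 0 ]; then
        echo "WARN: $disk R-P-U total is $bad; plan a replacement" >&2
    fi
done
```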

Thermal throttling investigation

# CPU temperatures
sensors
# coretemp-isa-0000
# Core 0:        +72.0°C  (high = +85.0°C, crit = +100.0°C)
# Core 1:        +70.0°C

# Check if CPU is throttled
grep -i mhz /proc/cpuinfo | head -4
# cpu MHz         : 1200.000    ← should be ~3000+ if not throttled

# Check CPU frequency governor
cpupower frequency-info
# current CPU frequency: 1.20 GHz    ← throttled!

Default trap: Many Linux distributions ship with the powersave CPU governor, which throttles CPU frequency to save power. On a production server, this adds latency under load. Set the governor to performance: cpupower frequency-set -g performance. Make it persistent via a systemd unit or tuned profile (tuned-adm profile throughput-performance).
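One way to make the governor persistent is a one-shot systemd unit. This is a sketch: the unit name and the cpupower path are illustrative, and a tuned profile is the cleaner option where tuned is installed.

```shell
# Sketch: pin the performance governor at every boot via systemd.
cat > /etc/systemd/system/cpu-performance.service <<'EOF'
[Unit]
Description=Set CPU frequency governor to performance

[Service]
Type=oneshot
ExecStart=/usr/bin/cpupower frequency-set -g performance

[Install]
WantedBy=multi-user.target
EOF

systemctl daemon-reload
systemctl enable --now cpu-performance.service
```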

# IPMI sensors for inlet temp
ipmitool sensor list | grep -i temp
# Inlet Temp       | 42.000     | degrees C  | cr    ← critical!

# Check fan status
ipmitool sensor list | grep -i fan
# Fan1             | 16800.000  | RPM        | ok    (fans maxed out)

System Event Log analysis

# Recent hardware events
ipmitool sel elist | tail -20
#    1 | 03/14/2024 | Memory #0x01 | Correctable ECC
#    5 | 03/14/2024 | Temperature  | Upper Critical - going high
#    8 | 03/14/2024 | Drive Slot 2 | Drive Fault

# Export SEL to file before clearing
ipmitool sel elist > /var/log/sel-$(hostname)-$(date +%F).log
ipmitool sel clear
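The export-then-clear step is safer when the clear only runs after the export verifiably captured entries. A sketch, with the path mirroring the example above:

```shell
# Sketch: never clear the SEL unless the export wrote a non-empty file.
out="/var/log/sel-$(hostname)-$(date +%F).log"
if ipmitool sel elist > "$out" 2>/dev/null && [ -s "$out" ]; then
    ipmitool sel clear
else
    echo "SEL export failed or empty; not clearing" >&2
fi
```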

Kernel hardware error log

# All hardware-related kernel messages
dmesg -T -l err,crit,alert,emerg
# [Thu Mar 14 02:15:33] mce: [Hardware Error]: CPU 4: Machine Check Exception
# [Thu Mar 14 02:15:33] EDAC MC0: 1 CE memory read error on CPU_SrcID#0

# Live-watch for new hardware errors
dmesg -Tw | grep -iE "error|fault|fail|hardware|mce|edac"

Remote diagnostics via BMC

# Full sensor dump to a file for comparison
ipmitool -I lanplus -H 10.0.10.5 -U admin -P secret sdr list full > /tmp/sensors-$(date +%F).txt

# Remote serial console (see POST, GRUB, kernel boot)
ipmitool -I lanplus -H 10.0.10.5 -U admin -P secret sol activate

# Chassis status (power, LED, intrusion)
ipmitool chassis status
# System Power         : on
# Main Power Fault     : false
# Chassis Intrusion    : inactive

Remember: IPMI transport mnemonic: -I lanplus = encrypted (RMCP+, use this), -I lan = unencrypted (legacy, avoid). Always use lanplus for remote IPMI to prevent credentials from being sent in plaintext. Better yet, put all BMC/iDRAC interfaces on a dedicated out-of-band management VLAN that is not routable from the general network.
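Beyond lanplus, avoid putting the password on the command line at all, since -P makes it visible in ps output and shell history. ipmitool's -E flag reads the password from the IPMI_PASSWORD environment variable instead (sketch; host and credentials mirror the examples above, and the block is guarded so it is copy-paste safe):

```shell
# Sketch: pass the BMC password via the environment, not argv.
export IPMI_PASSWORD='secret'   # better: load from a root-only credentials file
if command -v ipmitool >/dev/null 2>&1; then
    ipmitool -I lanplus -H 10.0.10.5 -U admin -E chassis status
fi
```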

Gotcha: dmidecode requires root and reads the DMI/SMBIOS table from firmware. In virtual machines, it reports the hypervisor's emulated hardware — not physical hardware. Use systemd-detect-virt first to check if you are on bare metal before trusting dmidecode output for inventory purposes.
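That check can be wired into inventory scripts directly. A sketch, relying on systemd-detect-virt printing "none" on bare metal:

```shell
# Sketch: collect DMI inventory only when actually on bare metal.
virt=$(systemd-detect-virt 2>/dev/null || true)
if [ "$virt" = "none" ]; then
    dmidecode -s system-serial-number
else
    echo "virt='$virt': DMI data reflects emulated hardware, skipping" >&2
fi
```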