Server Hardware - Street-Level Ops

Real-world hardware diagnostics and triage for production servers.

Quick hardware inventory

# One-line system identification
dmidecode -s system-manufacturer && dmidecode -s system-product-name && dmidecode -s system-serial-number
# Dell Inc.
# PowerEdge R750
# ABC1234

# Concise hardware summary
lshw -short
# H/W path        Device      Class       Description
# /0/0                        memory      64GiB System Memory
# /0/0/0                      memory      32GiB DIMM DDR4 3200 MHz
# /0/0/1                      memory      32GiB DIMM DDR4 3200 MHz
# /0/100/1f.2                 storage     SATA Controller
# /0/100/3/0       eth0       network     Ethernet Controller X710

Diagnose a suspected bad DIMM

# Check ECC error counts
edac-util -s
# mc0: 0 Uncorrectable Errors, 14 Correctable Errors

# Per-DIMM breakdown
edac-util -l
# mc0: csrow0: ch0: 14 Correctable Errors
# mc0: csrow0: ch1: 0 Correctable Errors

# Cross-reference with physical slot
dmidecode -t memory | grep -A5 "Locator: DIMM_A1"
# Locator: DIMM_A1
# Bank Locator: Not Specified
# Type: DDR4
# Size: 32 GB
# Manufacturer: Samsung

# Check kernel logs for MCE (Machine Check Exception)
journalctl -k | grep -iE "mce|machine check|hardware error" | tail -10
dmesg | grep -iE "edac|ecc|mce"

Under the hood: Correctable ECC errors (CEs) are single-bit flips that the memory controller fixes transparently. A few CEs per year per DIMM is normal (cosmic rays). But a sudden spike — dozens of CEs from the same DIMM in hours — predicts an imminent uncorrectable error (UE). A UE crashes the server. Replace the DIMM proactively when CE rate spikes.
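The CE-rate heuristic above can be turned into a cron-driven check. This is a minimal sketch: the threshold, state-file path, and the assumption that edac-util prints lines like "14 Correctable Errors" should all be adapted to your fleet.

```shell
#!/bin/sh
# Sketch: run hourly from cron; warns when the correctable-error count
# jumps by THRESHOLD or more since the previous run.
THRESHOLD=10
STATE=/var/tmp/ecc-ce.last

# Sum correctable errors across all memory controllers (0 if none/missing)
cur=$(edac-util -s 2>/dev/null |
      grep -oE '[0-9]+ Correctable' | awk '{s+=$1} END {print s+0}')

prev=$(cat "$STATE" 2>/dev/null || echo 0)
echo "$cur" > "$STATE"

if [ $(( cur - prev )) -ge "$THRESHOLD" ]; then
    logger -p daemon.warning "ECC CE spike: $(( cur - prev )) new correctable errors since last run"
fi
```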

NIC error investigation

# Link status and speed
ethtool eth0
# Speed: 25000Mb/s
# Duplex: Full
# Link detected: yes

# Error counters — the important ones
ethtool -S eth0 | grep -E "error|drop|miss|crc|fifo"
# rx_crc_errors: 0
# rx_missed_errors: 0
# tx_errors: 0
# rx_dropped: 342        ← investigate (ring buffer overflow or NIC filter)
# rx_fifo_errors: 0

# Driver and firmware version
ethtool -i eth0
# driver: i40e
# version: 2.23.17
# firmware-version: 9.20 0x8000d95e 22.0.9

# Check for link flapping in dmesg
dmesg -T | grep -iE "eth0.*(link|up|down)"
# [Thu Mar 14 02:15:01 2024] i40e: eth0 NIC Link is Down
# [Thu Mar 14 02:15:03 2024] i40e: eth0 NIC Link is Up, 25 Gbps

One-liner: ethtool -S eth0 | grep -E "error|drop|miss|crc|fifo" is the single most useful NIC diagnostic command. Non-zero rx_crc_errors = physical layer (cable/SFP). Non-zero rx_dropped = kernel/driver (ring buffer too small). Non-zero tx_errors = usually duplex mismatch or driver bug.

Debug clue: rx_dropped on a NIC usually means the kernel ring buffer overflowed — packets arrived faster than the CPU could process them. Increase the ring buffer: ethtool -G eth0 rx 4096. If rx_crc_errors is non-zero, it points to a physical layer problem: bad cable, failing SFP, or duplex mismatch.
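The ring-buffer fix above can be scripted so the requested size never exceeds what the hardware supports. A sketch, assuming the interface name and the 4096 target are placeholders for your NIC:

```shell
# Sketch: bump the RX ring, capped at the hardware maximum from `ethtool -g`.
IFACE=eth0
TARGET=4096

max=$(ethtool -g "$IFACE" 2>/dev/null |
      awk '/Pre-set maximums/ {p=1} p && $1 == "RX:" {print $2; exit}')

if [ -n "$max" ]; then
    if [ "$TARGET" -gt "$max" ]; then TARGET=$max; fi
    ethtool -G "$IFACE" rx "$TARGET"
fi
```

Note that ethtool -G does not survive a reboot; persist the setting with a udev rule, a NetworkManager/systemd-networkd hook, or your config management.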

Disk health check

# SMART overall health
smartctl -H /dev/sda
# SMART overall-health self-assessment test result: PASSED

# Key SMART attributes
smartctl -A /dev/sda | grep -E "Reallocated|Pending|Uncorrect|Temperature"
# 5   Reallocated_Sector_Ct   0       ← good (bad sectors remapped)
# 197 Current_Pending_Sector  0       ← good (sectors awaiting remap)
# 198 Offline_Uncorrectable   0       ← good (unrecoverable read errors)
# 194 Temperature_Celsius     34      ← normal

# Run a short self-test
smartctl -t short /dev/sda
# Test will complete after about 2 minutes

smartctl -l selftest /dev/sda
# Num  Test              Status         Remaining  LifeTime
# 1    Short offline     Completed without error   0%   12345

# NVMe health
nvme smart-log /dev/nvme0
# critical_warning                        : 0
# temperature                             : 35 C
# percentage_used                         : 3% (100% = rated endurance reached)
# media_errors                            : 0

RAID status check

# MegaRAID / Dell PERC
storcli /c0 show
# Controller = 0
# Status = Optimal

storcli /c0/vall show
# DG/VD TYPE  State   Access Consist Cache  Size
# 0/0   RAID1 Optimal RW     Yes     RWBD   446.625 GB

# Check for degraded or rebuilding drives
storcli /c0/eall/sall show | grep -E "Online|Rebuild|Failed"
# 252:0  Online
# 252:1  Online

# Software RAID (mdadm)
cat /proc/mdstat
# md0 : active raid1 sda1[0] sdb1[1]
#       1953513472 blocks super 1.2 [2/2] [UU]     ← UU = both drives healthy
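A degraded md array shows "_" in place of "U" in that status string, which makes it easy to alert on. A minimal sketch suitable for cron:

```shell
# Sketch: warn when any md member has dropped out ("_" in the [UU] field).
if grep -q '\[[U_]*_[U_]*\]' /proc/mdstat 2>/dev/null; then
    echo "WARN: degraded md array on $(hostname)" >&2
fi
```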

Remember: SMART attributes mnemonic: R-P-U — Reallocated (ID 5, bad sectors already remapped), Pending (ID 197, sectors waiting for remap), Uncorrectable (ID 198, failed reads). Any non-zero value on these three is a warning to start planning a disk replacement. Do not wait for the drive to fail completely.
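The R-P-U check can be looped across all disks. A sketch, assuming SATA/SAS devices matching /dev/sd? and the standard smartmontools attribute table (raw value in column 10); vendors vary, so verify against your drives:

```shell
# Sketch: flag any disk with non-zero Reallocated/Pending/Uncorrectable
# raw values (SMART IDs 5, 197, 198).
for disk in /dev/sd?; do
    [ -b "$disk" ] || continue
    bad=$(smartctl -A "$disk" 2>/dev/null |
          awk '$1 == 5 || $1 == 197 || $1 == 198 {sum += $10} END {print sum + 0}')
    if [ "$bad" -gt 0 ]; then
        echo "WARN: $disk R-P-U total is $bad; plan a replacement" >&2
    fi
done
```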

Thermal throttling investigation

# CPU temperatures
sensors
# coretemp-isa-0000
# Core 0:        +72.0°C  (high = +85.0°C, crit = +100.0°C)
# Core 1:        +70.0°C

# Check if CPU is throttled
grep -i mhz /proc/cpuinfo | head -4
# cpu MHz         : 1200.000    ← should be ~3000+ if not throttled

# Check CPU frequency governor
cpupower frequency-info
# current CPU frequency: 1.20 GHz    ← throttled!

Default trap: Many Linux distributions ship with the powersave CPU governor, which throttles CPU frequency to save power. On a production server, this adds latency under load. Set the governor to performance: cpupower frequency-set -g performance. Make it persistent via a systemd unit or tuned profile (tuned-adm profile throughput-performance).
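One way to make the governor persistent is a one-shot systemd unit. This is a sketch: the unit name and the cpupower path are illustrative, and a tuned profile is the cleaner option where tuned is installed.

```shell
# Sketch: pin the performance governor at every boot via systemd.
cat > /etc/systemd/system/cpu-performance.service <<'EOF'
[Unit]
Description=Set CPU frequency governor to performance

[Service]
Type=oneshot
ExecStart=/usr/bin/cpupower frequency-set -g performance

[Install]
WantedBy=multi-user.target
EOF

systemctl daemon-reload
systemctl enable --now cpu-performance.service
```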

# IPMI sensors for inlet temp
ipmitool sensor list | grep -i temp
# Inlet Temp       | 42.000     | degrees C  | cr    ← critical!

# Check fan status
ipmitool sensor list | grep -i fan
# Fan1             | 16800.000  | RPM        | ok    (fans maxed out)

System Event Log analysis

# Recent hardware events
ipmitool sel elist | tail -20
#    1 | 03/14/2024 | Memory #0x01 | Correctable ECC
#    5 | 03/14/2024 | Temperature  | Upper Critical - going high
#    8 | 03/14/2024 | Drive Slot 2 | Drive Fault

# Export SEL to file before clearing
ipmitool sel elist > /var/log/sel-$(hostname)-$(date +%F).log
ipmitool sel clear
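The export-then-clear step is safer when the clear only runs after the export verifiably captured entries. A sketch, with the path mirroring the example above:

```shell
# Sketch: never clear the SEL unless the export wrote a non-empty file.
out="/var/log/sel-$(hostname)-$(date +%F).log"
if ipmitool sel elist > "$out" 2>/dev/null && [ -s "$out" ]; then
    ipmitool sel clear
else
    echo "SEL export failed or empty; not clearing" >&2
fi
```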

Kernel hardware error log

# All hardware-related kernel messages
dmesg -T -l err,crit,alert,emerg
# [Thu Mar 14 02:15:33] mce: [Hardware Error]: CPU 4: Machine Check Exception
# [Thu Mar 14 02:15:33] EDAC MC0: 1 CE memory read error on CPU_SrcID#0

# Live-watch for new hardware errors
dmesg -Tw | grep -iE "error|fault|fail|hardware|mce|edac"

Remote diagnostics via BMC

# Full sensor dump to a file for comparison
ipmitool -I lanplus -H 10.0.10.5 -U admin -P secret sdr list full > /tmp/sensors-$(date +%F).txt

# Remote serial console (see POST, GRUB, kernel boot)
ipmitool -I lanplus -H 10.0.10.5 -U admin -P secret sol activate

# Chassis status (power, LED, intrusion)
ipmitool chassis status
# System Power         : on
# Main Power Fault     : false
# Chassis Intrusion    : inactive

Remember: IPMI transport mnemonic: -I lanplus = encrypted (RMCP+, use this), -I lan = unencrypted (legacy, avoid). Always use lanplus for remote IPMI to prevent credentials from being sent in plaintext. Better yet, put all BMC/iDRAC interfaces on a dedicated out-of-band management VLAN that is not routable from the general network.
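Beyond lanplus, avoid putting the password on the command line at all, since -P makes it visible in ps output and shell history. ipmitool's -E flag reads the password from the IPMI_PASSWORD environment variable instead (sketch; host and credentials mirror the examples above, and the block is guarded so it is copy-paste safe):

```shell
# Sketch: pass the BMC password via the environment, not argv.
export IPMI_PASSWORD='secret'   # better: load from a root-only credentials file
if command -v ipmitool >/dev/null 2>&1; then
    ipmitool -I lanplus -H 10.0.10.5 -U admin -E chassis status
fi
```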

Gotcha: dmidecode requires root and reads the DMI/SMBIOS table from firmware. In virtual machines, it reports the hypervisor's emulated hardware — not physical hardware. Use systemd-detect-virt first to check if you are on bare metal before trusting dmidecode output for inventory purposes.
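That check can be wired into inventory scripts directly. A sketch, relying on systemd-detect-virt printing "none" on bare metal:

```shell
# Sketch: collect DMI inventory only when actually on bare metal.
virt=$(systemd-detect-virt 2>/dev/null || true)
if [ "$virt" = "none" ]; then
    dmidecode -s system-serial-number
else
    echo "virt='$virt': DMI data reflects emulated hardware, skipping" >&2
fi
```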