Server Hardware - Primer

Why This Matters

Software runs on hardware. When a server panics, drops packets, or corrupts data, the root cause is often physical: a failing DIMM, a degraded NIC, a thermal throttling CPU. Ops engineers need to identify hardware issues quickly, understand diagnostic tools, and know when a component is dying before it takes down production.

Server Components

CPU

Modern servers use 1-2 sockets with multi-core processors. Key specs:

  - Cores/Threads: 16-64+ cores per socket, 2 threads per core (SMT/HT)
  - TDP: Thermal Design Power (wattage, affects cooling)
  - Cache: L1 (per-core), L2 (per-core), L3 (shared)
  - NUMA: Non-Uniform Memory Access (each socket has local memory)

# CPU info
lscpu
grep -m1 "model name" /proc/cpuinfo

# NUMA topology
numactl --hardware
lscpu | grep NUMA

# CPU frequency and throttling
grep MHz /proc/cpuinfo
cpupower frequency-info

Name origin: NUMA stands for Non-Uniform Memory Access. In a multi-socket server, each CPU socket has its own local memory bank. Accessing local memory takes ~80ns, but accessing the other socket's memory (remote) takes ~130ns — a 60% penalty. The numactl --hardware command shows the topology. Processes that accidentally allocate memory across NUMA nodes see unpredictable latency spikes. This is why database tuning guides always mention NUMA pinning.
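
As a sketch, the node count can be parsed out of `numactl --hardware` output for scripting (the helper name is illustrative, not a standard tool; the pinning command at the end and the `./server` binary are placeholders):

```shell
# Hypothetical helper: count NUMA nodes from `numactl --hardware` output,
# whose first line looks like "available: 2 nodes (0-1)"
numa_node_count() {
  printf '%s\n' "$1" | awk '/^available:/ {print $2; exit}'
}

sample='available: 2 nodes (0-1)'
numa_node_count "$sample"   # prints 2

# To avoid remote-memory penalties, pin a process and its allocations to one node:
#   numactl --cpunodebind=0 --membind=0 ./server
```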

Memory (DIMMs)

Server memory is ECC (Error-Correcting Code) DDR4/DDR5 in DIMM form factor.

ECC Memory: Detects and corrects single-bit errors. Logs correctable errors (CE). Uncorrectable errors (UE) cause machine check exceptions (MCE) — usually a kernel panic.

# Memory info
dmidecode -t memory | grep -E "Size|Type|Speed|Manufacturer"

# Total memory
free -h

# ECC error counts
edac-util -s              # EDAC summary
edac-util -l              # per-DIMM errors
cat /sys/devices/system/edac/mc/mc0/csrow0/ce_count

# Machine check exceptions
mcelog --client           # if mcelog daemon is running
journalctl | grep -i "mce\|machine check"

DIMM failure symptoms:

  - Correctable ECC errors increasing over time
  - Random kernel panics (MCE)
  - Processes killed by OOM (bad DIMM reduces usable memory)
  - mcelog or edac-util reporting errors
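
As a rough sketch, the per-row `ce_count` files shown above can be summed across all memory controllers to get one number to trend over time (assumes the standard EDAC sysfs layout; prints 0 on machines without EDAC support):

```shell
# Sum correctable-error counts across every controller/row the kernel exposes.
# Unreadable or missing paths are skipped rather than treated as errors.
total_ce=0
for f in /sys/devices/system/edac/mc/mc*/csrow*/ce_count; do
  [ -r "$f" ] || continue
  total_ce=$(( total_ce + $(cat "$f") ))
done
echo "total correctable ECC errors: $total_ce"
```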

NIC (Network Interface Card)

# List NICs
lspci | grep -i ethernet
ip link show

# NIC details
ethtool eth0              # link status, speed, duplex
ethtool -i eth0           # driver, firmware version
ethtool -S eth0           # error counters

# Check for errors
ethtool -S eth0 | grep -E "error|drop|miss|crc"

NIC failure symptoms: increasing CRC errors, link flapping, rx/tx drops.
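
Those counters can be totaled with a small helper (hypothetical function; it takes the `ethtool -S` text as an argument so it can be tried on captured output):

```shell
# Sum every counter whose name mentions error/drop/crc in `ethtool -S` style
# "    name: value" output; prints 0 when nothing matches.
nic_error_total() {
  printf '%s\n' "$1" | awk -F': ' '/error|drop|crc/ {sum += $2} END {print sum + 0}'
}

sample='     rx_crc_errors: 3
     tx_dropped: 2
     rx_packets: 912345'
nic_error_total "$sample"   # prints 5
```

Trending this total (e.g. sampling it every minute) distinguishes a one-off burst from a NIC that is actively degrading.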

HBA (Host Bus Adapter) and RAID Controllers

HBAs connect servers to SAN storage. RAID controllers manage local disk arrays.

# List storage controllers
lspci | grep -i "raid\|storage\|scsi\|sas"

# MegaRAID (Dell PERC, many vendors)
megacli -LDInfo -Lall -aALL        # logical drives
megacli -PDList -aALL              # physical disks
storcli /c0 show                   # modern replacement

# Check disk health
smartctl -a /dev/sda               # SMART data
smartctl -H /dev/sda               # health status

Disk Drives

# List block devices
lsblk
fdisk -l

# SMART monitoring
smartctl -a /dev/sda
smartctl -t short /dev/sda         # run short self-test
smartctl -l selftest /dev/sda      # view test results

# NVMe drives
nvme list
nvme smart-log /dev/nvme0

Key SMART attributes to watch:

  - Reallocated Sector Count: bad sectors remapped (increasing = drive dying)
  - Current Pending Sector Count: sectors waiting to be remapped
  - Uncorrectable Error Count: read errors that could not be recovered
  - Temperature: overheating shortens drive life
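
A hedged sketch for pulling one attribute's raw value out of `smartctl -A` output (the function name is illustrative; it assumes the usual ATA attribute table, where the attribute name is column 2 and RAW_VALUE is the last column):

```shell
smart_raw() {
  # $1 = `smartctl -A` output, $2 = attribute name; prints the RAW_VALUE column
  printf '%s\n' "$1" | awk -v name="$2" '$2 == name {print $NF; exit}'
}

sample='  5 Reallocated_Sector_Ct 0x0033 100 100 010 Pre-fail Always - 0
194 Temperature_Celsius 0x0022 036 045 000 Old_age Always - 36'
smart_raw "$sample" Reallocated_Sector_Ct   # prints 0
smart_raw "$sample" Temperature_Celsius     # prints 36
```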

Diagnostic Tools

lshw — Hardware Lister

# Full inventory
lshw -short

# Specific class
lshw -class memory
lshw -class network
lshw -class disk
lshw -class processor

# JSON output (for automation)
lshw -json

dmidecode — SMBIOS/DMI Data

Reads hardware information from BIOS/UEFI tables:

# Full dump
dmidecode

# Specific type
dmidecode -t system          # manufacturer, model, serial
dmidecode -t baseboard       # motherboard info
dmidecode -t memory          # DIMM details
dmidecode -t processor       # CPU details
dmidecode -t bios            # BIOS version, date

# Quick serial number
dmidecode -s system-serial-number

dmesg — Kernel Messages

Hardware errors appear in the kernel ring buffer:

# Recent hardware messages
dmesg -T | tail -50

# Filter for errors
dmesg -T -l err,crit,alert,emerg

# Hardware-specific
dmesg | grep -i "error\|fault\|fail\|hardware\|mce"
dmesg | grep -i "eth0\|nvme\|sda"

# Watch live
dmesg -Tw

Under the hood: IPMI (Intelligent Platform Management Interface) was standardized by Intel, HP, NEC, and Dell in 1998. The BMC (Baseboard Management Controller) is a separate processor on the server motherboard that operates independently of the main CPU and OS. It is always powered on (even when the server is off), connected to its own network port, and runs its own firmware. Vendor-specific implementations include Dell iDRAC, HPE iLO, Lenovo XClarity, and Supermicro IPMI. These let you power cycle a server, access the console, and read sensors — all without the OS being up.
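
A sketch of talking to a BMC over its network port via the `lanplus` interface. `bmc.example.com` and the username are placeholders, and the wrapper echoes the command rather than running it, since this sketch has no real BMC to reach (drop the `echo` and add `-P <password>`, or `-E` to read it from the environment, for real use):

```shell
BMC_HOST=bmc.example.com   # placeholder BMC address
BMC_USER=admin             # placeholder credential

# Builds the ipmitool invocation for a remote BMC; echo-only in this sketch
ipmi() { echo ipmitool -I lanplus -H "$BMC_HOST" -U "$BMC_USER" "$@"; }

ipmi chassis power status
ipmi sel elist
```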

ipmitool — Hardware Monitoring

# Temperature sensors
ipmitool sensor list | grep -i temp

# Fan speeds
ipmitool sensor list | grep -i fan

# All sensor readings
ipmitool sdr list full

# System Event Log (hardware events)
ipmitool sel elist

# Chassis status
ipmitool chassis status

Hardware Diagnostics Workflow

When you suspect a hardware issue:

  1. Check dmesg: dmesg -T -l err,crit — look for MCE, I/O errors, link down
  2. Check SEL: ipmitool sel elist — BMC-logged events
  3. Check SMART: smartctl -a /dev/sdX — disk health
  4. Check ECC: edac-util -s — memory errors
  5. Check NIC: ethtool -S eth0 — network errors
  6. Check sensors: ipmitool sensor list — thermal, voltage, fan
  7. Check vendor logs: iDRAC/iLO web UI for detailed diagnostics
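
The steps above can be strung together in a rough triage script (a sketch; it notes any tool that is not installed rather than failing on it):

```shell
# Run each diagnostic step if its tool exists; report what was skipped otherwise.
step() {
  tool=$1
  if command -v "$tool" >/dev/null 2>&1; then "$@"; else echo "skip: $tool not installed"; fi
}

step dmesg -T -l err,crit | tail -n 20
step ipmitool sel elist | tail -n 20
step edac-util -s
step smartctl -H /dev/sda
step ethtool -S eth0 | grep -Ei 'error|drop|crc' || true
```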

Thermal Management

# CPU temperature
sensors                      # lm-sensors package
cat /sys/class/thermal/thermal_zone0/temp

# IPMI temperatures
ipmitool sensor list | grep -i temp

Thermal throttling: when the CPU exceeds thermal limits, it reduces frequency. Causes:

  - Failed fans
  - Blocked airflow (cables, missing blanking panels)
  - Ambient temperature too high (CRAC failure)
  - Dust buildup
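
Note that the sysfs `temp` files report millidegrees Celsius, so a small conversion helper is handy (a sketch; which thermal zones exist varies by platform, and some machines expose none):

```shell
# Convert sysfs millidegrees to whole degrees Celsius
millideg_to_c() { echo $(( $1 / 1000 )); }

# Print every thermal zone the kernel exposes, if any
for z in /sys/class/thermal/thermal_zone*; do
  [ -r "$z/temp" ] || continue
  printf '%s: %s C\n' "$(cat "$z/type")" "$(millideg_to_c "$(cat "$z/temp")")"
done
```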

Gotcha: SMART data is not always reliable for predicting imminent drive failure. Google's 2007 study "Failure Trends in a Large Disk Drive Population" found that 36% of failed drives had zero SMART warnings beforehand. However, drives with even one reallocated sector are 14x more likely to fail within 60 days. Treat any SMART attribute crossing a threshold as an urgent replacement signal, but do not assume clean SMART data means the drive is safe.
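
That 14x figure suggests a simple policy sketch: flag any drive with a nonzero reallocated-sector raw value for replacement (hypothetical helper; takes `smartctl -A` output as text and assumes the usual ATA attribute columns):

```shell
smart_verdict() {
  # $1 = `smartctl -A` output; prints REPLACE if any sectors were reallocated
  n=$(printf '%s\n' "$1" | awk '$2 == "Reallocated_Sector_Ct" {print $NF; exit}')
  if [ "${n:-0}" -gt 0 ]; then echo REPLACE; else echo OK; fi
}

smart_verdict '  5 Reallocated_Sector_Ct 0x0033 100 100 010 Pre-fail Always - 12'   # prints REPLACE
```

Per the gotcha above, an OK verdict here is weak evidence; the REPLACE verdict is the actionable one.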

Debug clue: When a server intermittently reboots with no kernel panic in dmesg, check the IPMI System Event Log (ipmitool sel elist). Hardware-initiated reboots (power supply glitch, thermal shutdown, watchdog timeout) are logged by the BMC but invisible to the OS since the OS was not running when the event occurred.

Quick Reference

Task                 Command
-------------------  -------------------------
Hardware inventory   lshw -short
System info          dmidecode -t system
Memory details       dmidecode -t memory
CPU info             lscpu
NIC status           ethtool eth0
Disk health          smartctl -a /dev/sda
ECC errors           edac-util -s
Kernel errors        dmesg -T -l err,crit
Sensor readings      ipmitool sensor list
Event log            ipmitool sel elist
NVMe health          nvme smart-log /dev/nvme0
