Server Hardware - Primer

Why This Matters

Software runs on hardware. When a server panics, drops packets, or corrupts data, the root cause is often physical: a failing DIMM, a degraded NIC, a thermal throttling CPU. Ops engineers need to identify hardware issues quickly, understand diagnostic tools, and know when a component is dying before it takes down production.

Server Components

CPU

Modern servers use 1-2 sockets with multi-core processors. Key specs:

  - Cores/Threads: 16-64+ cores per socket, 2 threads per core (SMT/HT)
  - TDP: Thermal Design Power (wattage, affects cooling)
  - Cache: L1 (per-core), L2 (per-core), L3 (shared)
  - NUMA: Non-Uniform Memory Access (each socket has local memory)

# CPU info
lscpu
grep -m1 "model name" /proc/cpuinfo

# NUMA topology
numactl --hardware
lscpu | grep NUMA

# CPU frequency and throttling
grep MHz /proc/cpuinfo
cpupower frequency-info

Name origin: NUMA stands for Non-Uniform Memory Access. In a multi-socket server, each CPU socket has its own local memory bank. Accessing local memory takes ~80ns, but accessing the other socket's memory (remote) takes ~130ns — a 60% penalty. The numactl --hardware command shows the topology. Processes that accidentally allocate memory across NUMA nodes see unpredictable latency spikes. This is why database tuning guides always mention NUMA pinning.
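
As a sketch, the node count can be parsed out of `numactl --hardware` output for scripting (the helper name is illustrative, not a standard tool; the pinning command at the end and the `./server` binary are placeholders):

```shell
# Hypothetical helper: count NUMA nodes from `numactl --hardware` output,
# whose first line looks like "available: 2 nodes (0-1)"
numa_node_count() {
  printf '%s\n' "$1" | awk '/^available:/ {print $2; exit}'
}

sample='available: 2 nodes (0-1)'
numa_node_count "$sample"   # prints 2

# To avoid remote-memory penalties, pin a process and its allocations to one node:
#   numactl --cpunodebind=0 --membind=0 ./server
```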

Memory (DIMMs)

Server memory is ECC (Error-Correcting Code) DDR4/DDR5 in DIMM form factor.

ECC Memory: Detects and corrects single-bit errors. Logs correctable errors (CE). Uncorrectable errors (UE) cause machine check exceptions (MCE) — usually a kernel panic.

# Memory info
dmidecode -t memory | grep -E "Size|Type|Speed|Manufacturer"

# Total memory
free -h

# ECC error counts
edac-util -s              # EDAC summary
edac-util -l              # per-DIMM errors
cat /sys/devices/system/edac/mc/mc0/csrow0/ce_count

# Machine check exceptions
mcelog --client           # if mcelog daemon is running
journalctl | grep -i "mce\|machine check"

DIMM failure symptoms:

  - Correctable ECC errors increasing over time
  - Random kernel panics (MCE)
  - Processes killed by OOM (bad DIMM reduces usable memory)
  - mcelog or edac-util reporting errors
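
As a rough sketch, the per-row `ce_count` files shown above can be summed across all memory controllers to get one number to trend over time (assumes the standard EDAC sysfs layout; prints 0 on machines without EDAC support):

```shell
# Sum correctable-error counts across every controller/row the kernel exposes.
# Unreadable or missing paths are skipped rather than treated as errors.
total_ce=0
for f in /sys/devices/system/edac/mc/mc*/csrow*/ce_count; do
  [ -r "$f" ] || continue
  total_ce=$(( total_ce + $(cat "$f") ))
done
echo "total correctable ECC errors: $total_ce"
```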

NIC (Network Interface Card)

# List NICs
lspci | grep -i ethernet
ip link show

# NIC details
ethtool eth0              # link status, speed, duplex
ethtool -i eth0           # driver, firmware version
ethtool -S eth0           # error counters

# Check for errors
ethtool -S eth0 | grep -E "error|drop|miss|crc"

NIC failure symptoms: increasing CRC errors, link flapping, rx/tx drops.
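
Those counters can be totaled with a small helper (hypothetical function; it takes the `ethtool -S` text as an argument so it can be tried on captured output):

```shell
# Sum every counter whose name mentions error/drop/crc in `ethtool -S` style
# "    name: value" output; prints 0 when nothing matches.
nic_error_total() {
  printf '%s\n' "$1" | awk -F': ' '/error|drop|crc/ {sum += $2} END {print sum + 0}'
}

sample='     rx_crc_errors: 3
     tx_dropped: 2
     rx_packets: 912345'
nic_error_total "$sample"   # prints 5
```

Trending this total (e.g. sampling it every minute) distinguishes a one-off burst from a NIC that is actively degrading.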

HBA (Host Bus Adapter) and RAID Controllers

HBAs connect servers to SAN storage. RAID controllers manage local disk arrays.

# List storage controllers
lspci | grep -i "raid\|storage\|scsi\|sas"

# MegaRAID (Dell PERC, many vendors)
megacli -LDInfo -Lall -aALL        # logical drives
megacli -PDList -aALL              # physical disks
storcli /c0 show                   # modern replacement

# Check disk health
smartctl -a /dev/sda               # SMART data
smartctl -H /dev/sda               # health status

Disk Drives

# List block devices
lsblk
fdisk -l

# SMART monitoring
smartctl -a /dev/sda
smartctl -t short /dev/sda         # run short self-test
smartctl -l selftest /dev/sda      # view test results

# NVMe drives
nvme list
nvme smart-log /dev/nvme0

Key SMART attributes to watch:

  - Reallocated Sector Count: bad sectors remapped (increasing = drive dying)
  - Current Pending Sector Count: sectors waiting to be remapped
  - Uncorrectable Error Count: read errors that could not be recovered
  - Temperature: overheating shortens drive life
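
A hedged sketch for pulling one attribute's raw value out of `smartctl -A` output (the function name is illustrative; it assumes the usual ATA attribute table, where the attribute name is column 2 and RAW_VALUE is the last column):

```shell
smart_raw() {
  # $1 = `smartctl -A` output, $2 = attribute name; prints the RAW_VALUE column
  printf '%s\n' "$1" | awk -v name="$2" '$2 == name {print $NF; exit}'
}

sample='  5 Reallocated_Sector_Ct 0x0033 100 100 010 Pre-fail Always - 0
194 Temperature_Celsius 0x0022 036 045 000 Old_age Always - 36'
smart_raw "$sample" Reallocated_Sector_Ct   # prints 0
smart_raw "$sample" Temperature_Celsius     # prints 36
```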

Diagnostic Tools

lshw — Hardware Lister

# Full inventory
lshw -short

# Specific class
lshw -class memory
lshw -class network
lshw -class disk
lshw -class processor

# JSON output (for automation)
lshw -json

dmidecode — SMBIOS/DMI Data

Reads hardware information from BIOS/UEFI tables:

# Full dump
dmidecode

# Specific type
dmidecode -t system          # manufacturer, model, serial
dmidecode -t baseboard       # motherboard info
dmidecode -t memory          # DIMM details
dmidecode -t processor       # CPU details
dmidecode -t bios            # BIOS version, date

# Quick serial number
dmidecode -s system-serial-number

dmesg — Kernel Messages

Hardware errors appear in the kernel ring buffer:

# Recent hardware messages
dmesg -T | tail -50

# Filter for errors
dmesg -T -l err,crit,alert,emerg

# Hardware-specific
dmesg | grep -i "error\|fault\|fail\|hardware\|mce"
dmesg | grep -i "eth0\|nvme\|sda"

# Watch live
dmesg -Tw

Under the hood: IPMI (Intelligent Platform Management Interface) was standardized by Intel, HP, NEC, and Dell in 1998. The BMC (Baseboard Management Controller) is a separate processor on the server motherboard that operates independently of the main CPU and OS. It is always powered on (even when the server is off), connected to its own network port, and runs its own firmware. Vendor-specific implementations include Dell iDRAC, HPE iLO, Lenovo XClarity, and Supermicro IPMI. These let you power cycle a server, access the console, and read sensors — all without the OS being up.
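
A sketch of talking to a BMC over its network port via the `lanplus` interface. `bmc.example.com` and the username are placeholders, and the wrapper echoes the command rather than running it, since this sketch has no real BMC to reach (drop the `echo` and add `-P <password>`, or `-E` to read it from the environment, for real use):

```shell
BMC_HOST=bmc.example.com   # placeholder BMC address
BMC_USER=admin             # placeholder credential

# Builds the ipmitool invocation for a remote BMC; echo-only in this sketch
ipmi() { echo ipmitool -I lanplus -H "$BMC_HOST" -U "$BMC_USER" "$@"; }

ipmi chassis power status
ipmi sel elist
```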

ipmitool — Hardware Monitoring

# Temperature sensors
ipmitool sensor list | grep -i temp

# Fan speeds
ipmitool sensor list | grep -i fan

# All sensor readings
ipmitool sdr list full

# System Event Log (hardware events)
ipmitool sel elist

# Chassis status
ipmitool chassis status

Hardware Diagnostics Workflow

When you suspect a hardware issue:

  1. Check dmesg: dmesg -T -l err,crit — look for MCE, I/O errors, link down
  2. Check SEL: ipmitool sel elist — BMC-logged events
  3. Check SMART: smartctl -a /dev/sdX — disk health
  4. Check ECC: edac-util -s — memory errors
  5. Check NIC: ethtool -S eth0 — network errors
  6. Check sensors: ipmitool sensor list — thermal, voltage, fan
  7. Check vendor logs: iDRAC/iLO web UI for detailed diagnostics
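
The steps above can be strung together in a rough triage script (a sketch; it notes any tool that is not installed rather than failing on it):

```shell
# Run each diagnostic step if its tool exists; report what was skipped otherwise.
step() {
  tool=$1
  if command -v "$tool" >/dev/null 2>&1; then "$@"; else echo "skip: $tool not installed"; fi
}

step dmesg -T -l err,crit | tail -n 20
step ipmitool sel elist | tail -n 20
step edac-util -s
step smartctl -H /dev/sda
step ethtool -S eth0 | grep -Ei 'error|drop|crc' || true
```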

Thermal Management

# CPU temperature
sensors                      # lm-sensors package
cat /sys/class/thermal/thermal_zone0/temp

# IPMI temperatures
ipmitool sensor list | grep -i temp

Thermal throttling: when the CPU exceeds thermal limits, it reduces frequency. Causes:

  - Failed fans
  - Blocked airflow (cables, missing blanking panels)
  - Ambient temperature too high (CRAC failure)
  - Dust buildup
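
Note that the sysfs `temp` files report millidegrees Celsius, so a small conversion helper is handy (a sketch; which thermal zones exist varies by platform, and some machines expose none):

```shell
# Convert sysfs millidegrees to whole degrees Celsius
millideg_to_c() { echo $(( $1 / 1000 )); }

# Print every thermal zone the kernel exposes, if any
for z in /sys/class/thermal/thermal_zone*; do
  [ -r "$z/temp" ] || continue
  printf '%s: %s C\n' "$(cat "$z/type")" "$(millideg_to_c "$(cat "$z/temp")")"
done
```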

Gotcha: SMART data is not always reliable for predicting imminent drive failure. Google's 2007 study "Failure Trends in a Large Disk Drive Population" found that 36% of failed drives had zero SMART warnings beforehand. However, drives with even one reallocated sector are 14x more likely to fail within 60 days. Treat any SMART attribute crossing a threshold as an urgent replacement signal, but do not assume clean SMART data means the drive is safe.
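
That 14x figure suggests a simple policy sketch: flag any drive with a nonzero reallocated-sector raw value for replacement (hypothetical helper; takes `smartctl -A` output as text and assumes the usual ATA attribute columns):

```shell
smart_verdict() {
  # $1 = `smartctl -A` output; prints REPLACE if any sectors were reallocated
  n=$(printf '%s\n' "$1" | awk '$2 == "Reallocated_Sector_Ct" {print $NF; exit}')
  if [ "${n:-0}" -gt 0 ]; then echo REPLACE; else echo OK; fi
}

smart_verdict '  5 Reallocated_Sector_Ct 0x0033 100 100 010 Pre-fail Always - 12'   # prints REPLACE
```

Per the gotcha above, an OK verdict here is weak evidence; the REPLACE verdict is the actionable one.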

Debug clue: When a server intermittently reboots with no kernel panic in dmesg, check the IPMI System Event Log (ipmitool sel elist). Hardware-initiated reboots (power supply glitch, thermal shutdown, watchdog timeout) are logged by the BMC but invisible to the OS since the OS was not running when the event occurred.

Quick Reference

Task                 Command
-------------------  -------------------------
Hardware inventory   lshw -short
System info          dmidecode -t system
Memory details       dmidecode -t memory
CPU info             lscpu
NIC status           ethtool eth0
Disk health          smartctl -a /dev/sda
ECC errors           edac-util -s
Kernel errors        dmesg -T -l err,crit
Sensor readings      ipmitool sensor list
Event log            ipmitool sel elist
NVMe health          nvme smart-log /dev/nvme0
