Server Hardware: When the Blinky Lights Matter
- lesson
- server-hardware
- ipmi/bmc
- redfish-api
- smart-monitoring
- ecc-memory
- raid
- thermal-management
- power-supplies
- hardware-diagnostics
- pxe-boot
- server-lifecycle
Topics: server hardware, IPMI/BMC, Redfish API, SMART monitoring, ECC memory, RAID, thermal management, power supplies, hardware diagnostics, PXE boot, server lifecycle
Level: L1–L2 (Foundations to Operations)
Time: 75–90 minutes
Prerequisites: None (everything is explained from scratch)
The Mission¶
It's 11:47pm on a Tuesday. Your pager fires: web-prod-07 is intermittently dropping
connections. Customers are seeing timeouts. The application team says "the code hasn't
changed." The network team says "our switches are fine." It's your server.
You're at home. The server is in a datacenter 300 miles away. Nobody is on-site. You have one tool that still works when everything else is broken: the BMC — a tiny computer living inside the server, always awake, always listening, connected to its own management network.
You're going to diagnose this remotely. Along the way, you'll learn what's actually inside a server, how the components talk to each other, how to interrogate them from your couch, and why a single bad memory stick can ruin your week.
Part 1: Two Computers, One Chassis¶
Before you can diagnose anything, you need to understand the most important fact about enterprise servers that nobody explains clearly: every server is actually two computers.
┌─────────────────────────────────────────────────┐
│ The Server You Know │
│ ┌───────┐ ┌───────┐ ┌──────┐ ┌──────┐ │
│ │ CPU │ │ RAM │ │ NIC │ │ Disk │ │
│ └───┬───┘ └───┬───┘ └──┬───┘ └──┬───┘ │
│ └──────────┴─────────┴─────────┘ │
│ │ │
│ ┌──────────────────┴────────────────────────┐ │
│ │ BMC (Baseboard Management Controller) │ │
│ │ - Its own ARM CPU │ │
│ │ - Its own RAM (256MB–1GB) │ │
│ │ - Its own NIC (dedicated or shared) │ │
│ │ - Its own flash storage │ │
│ │ - Always on — even when the server is │ │
│ │ "off" — as long as AC power is present │ │
│ └───────────────────────────────────────────┘ │
└─────────────────────────────────────────────────┘
The BMC is powered by the 5V standby rail from the power supply. Plug a server into a wall outlet and the BMC boots — even if you never press the power button. It has its own IP address, its own login, and it can see everything the main system cannot report about itself.
Name Origin: BMC stands for Baseboard Management Controller. "Baseboard" is an old term for the motherboard — the base board that everything plugs into. The BMC is a controller chip soldered directly onto it. Every vendor wraps their BMC in a branded name: Dell calls theirs iDRAC (Integrated Dell Remote Access Controller), HPE calls theirs iLO (Integrated Lights-Out — because you can manage the server with the datacenter lights off), Supermicro just calls it "IPMI BMC." They all speak the same protocol underneath.
Name Origin: IPMI stands for Intelligent Platform Management Interface. Intel published the first spec in 1998. "Intelligent" because the BMC can make decisions autonomously (throttle fans, log events, even shut down the server to prevent damage). "Platform Management" because it manages the physical platform, not the software running on it.
The protocol you'll use tonight¶
IPMI messages travel over RMCP+ (Remote Management Control Protocol Plus) on UDP port
623. The + means encryption — the original RMCP had none. The CLI tool is ipmitool,
and it works with every vendor's BMC.
# The pattern you'll type a hundred times in your career:
ipmitool -I lanplus -H <bmc-ip> -U <user> -P <password> <command>
| Flag | Meaning |
|---|---|
| -I lanplus | Use IPMI 2.0 with encryption (always use this, never -I lan) |
| -H | BMC IP address (on the management network, not the server's production IP) |
| -U | Username (default: root on Dell, ADMIN on Supermicro, Administrator on HPE) |
| -P | Password (default: calvin on Dell — yes, really) |
Gotcha: The `-P` flag puts your password in the process list, visible to anyone who runs `ps aux`. For scripts, use `-E` (reads from the `IPMI_PASSWORD` environment variable) or `-f` (reads from a file). For tonight's emergency, `-P` is fine — you're the only one on this laptop.
Part 2: First Contact — Is This Server Even Alive?¶
You pull up a terminal and type:
ipmitool -I lanplus -H $BMC -U admin -P $PASS power status
# Chassis Power is on
OK. The server is on. That rules out a power failure. Let's get a quick health summary.
ipmitool -I lanplus -H $BMC -U admin -P $PASS chassis status
# System Power : on
# Power Overload : false
# Main Power Fault : false
# Cooling/Fan Fault : false
# Drive Fault : false
# Front-Panel Lockout : inactive
# Chassis Intrusion : inactive
No screaming red flags. But chassis status is a surface check — it shows what the BMC
considers a current fault, not what happened twenty minutes ago.
For that, you need the event log.
Part 3: Reading the Black Box — The System Event Log¶
Every BMC maintains a System Event Log (SEL) — a circular buffer of hardware events stored in non-volatile flash. Temperature spikes, fan failures, power supply glitches, ECC memory errors — all recorded here, even events the OS never saw.
ipmitool -I lanplus -H $BMC -U admin -P $PASS sel elist
You stare at the output:
1 | 03/18/2026 | 23:15:01 | Memory #0x20 | Correctable ECC | Asserted
2 | 03/18/2026 | 23:15:01 | Memory #0x20 | Correctable ECC | Asserted
3 | 03/18/2026 | 23:15:02 | Memory #0x20 | Correctable ECC | Asserted
4 | 03/18/2026 | 23:15:02 | Memory #0x20 | Correctable ECC | Asserted
5 | 03/18/2026 | 23:15:03 | Memory #0x20 | Correctable ECC | Asserted
...
18 | 03/18/2026 | 23:31:17 | Memory #0x20 | Correctable ECC | Asserted
19 | 03/18/2026 | 23:31:17 | Memory #0x20 | Correctable ECC | Asserted
20 | 03/18/2026 | 23:42:44 | Memory #0x20 | Correctable ECC | Asserted
Twenty events in 27 minutes. All correctable ECC errors. All from the same memory sensor.
Your heart rate picks up. You know what this means.
Under the Hood: ECC stands for Error-Correcting Code. Server RAM uses ECC — each 64-bit word has 8 extra bits that form a Hamming code, allowing the memory controller to detect and correct single-bit errors on the fly, and detect (but not correct) double-bit errors. A correctable error (CE) is a single-bit flip that was silently fixed. Your application never noticed. But a correctable error is a canary — the DIMM is degrading. Eventually, a double-bit flip happens. That's an uncorrectable error (UE), and it triggers a Machine Check Exception (MCE) — an immediate kernel panic.
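The correction machinery is invisible to software, but the core idea is easy to demo. Here's a toy sketch using a single parity bit; real ECC DIMMs use a Hamming SECDED code over each 64-bit word, which can also locate and fix the flipped bit, while plain parity can only detect it.

```shell
# Toy illustration of memory error detection via parity.
# (Real SECDED ECC adds 8 check bits per 64-bit word and can correct the flip.)
parity() {  # number of 1-bits in $1, mod 2
  bits=$1; p=0
  while [ "$bits" -gt 0 ]; do
    p=$(( p ^ (bits & 1) ))
    bits=$(( bits >> 1 ))
  done
  echo "$p"
}

word=181                    # 10110101 in binary: the data as written (five 1-bits, parity 1)
stored=$(parity "$word")
flipped=$(( word ^ 8 ))     # a cosmic ray flips bit 3 in storage
if [ "$(parity "$flipped")" -ne "$stored" ]; then
  echo "single-bit error detected"
fi
# single-bit error detected
```

A second simultaneous flip would restore the parity and slip through undetected, which is exactly why ECC uses the stronger Hamming code.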
Trivia: Google published a landmark study in 2009 on DRAM errors across their fleet. They found that about 8% of DIMMs experience at least one correctable error per year. Servers with one correctable error were 13–228 times more likely to see another one. Memory errors are not rare events — they're a fact of life at scale.
How many errors are too many?¶
A handful of CEs per year per DIMM is normal — cosmic rays, voltage transients, the universe being the universe. But a burst of CEs from a single DIMM — dozens in minutes — means the silicon is failing. Dell's guideline: 24 or more correctable errors in 24 hours from the same DIMM = schedule replacement.
You have 20 in 27 minutes. This DIMM is dying.
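If you've captured a SEL listing, a quick grep tells you whether you're past a replacement threshold. A minimal sketch, using a three-line sample in place of live `ipmitool sel elist` output (a real script would also bound the time window):

```shell
# Count correctable ECC events in a captured SEL listing and compare
# against a replacement threshold (24-in-24h is Dell's guideline).
sel='1 | 03/18/2026 | 23:15:01 | Memory #0x20 | Correctable ECC | Asserted
2 | 03/18/2026 | 23:15:02 | Memory #0x20 | Correctable ECC | Asserted
3 | 03/18/2026 | 23:42:44 | Memory #0x20 | Correctable ECC | Asserted'

count=$(printf '%s\n' "$sel" | grep -c 'Correctable ECC')
threshold=24
if [ "$count" -ge "$threshold" ]; then
  echo "schedule DIMM replacement ($count CEs)"
else
  echo "$count CEs logged; keep watching the rate"
fi
# 3 CEs logged; keep watching the rate
```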
Flashcard Check #1¶
| Question | Answer |
|---|---|
| What does ECC stand for? | Error-Correcting Code |
| What's the difference between a correctable (CE) and uncorrectable (UE) error? | CE = single-bit flip, silently fixed. UE = multi-bit flip, triggers MCE/kernel panic. |
| Where do you find memory error events on a server? | ipmitool sel elist (BMC event log) and edac-util -s (OS-level) |
| Why does a burst of CEs matter if each one is "corrected"? | It signals the DIMM is degrading. An uncorrectable error (kernel panic) is likely imminent. |
Part 4: Finding the Bad DIMM¶
You know something in memory is failing. Now you need to know which stick, so you can tell the datacenter team exactly what to replace.
From the OS side (if you can still SSH in)¶
# EDAC — Error Detection And Correction subsystem in the Linux kernel
edac-util -s
# mc0: 0 Uncorrectable Errors, 47 Correctable Errors
edac-util -l
# mc0: csrow0: ch0: 47 Correctable Errors ← all from one channel
# mc0: csrow0: ch1: 0 Correctable Errors
# mc0: csrow1: ch0: 0 Correctable Errors
# mc0: csrow1: ch1: 0 Correctable Errors
mc0: csrow0: ch0 — memory controller 0, chip-select row 0, channel 0. But which physical
slot is that?
# Cross-reference with dmidecode
dmidecode -t memory | grep -A 10 "Locator: DIMM_A1"
# Locator: DIMM_A1
# Bank Locator: P0_Node0_Channel0_Dimm0
# Type: DDR4
# Size: 32 GB
# Speed: 3200 MT/s
# Manufacturer: Samsung
# Serial Number: 3A2B4C5D
# Part Number: M393A4K40DB3-CWE
Under the Hood: `dmidecode` reads the SMBIOS (System Management BIOS) tables — a data structure in firmware that describes every physical component. The `-t memory` flag pulls the type 17 (Memory Device) records, which include the physical slot label, capacity, speed, manufacturer, and serial number. This is how you go from "channel 0, row 0" to "the Samsung DIMM in slot A1."
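The cross-reference is easy to script. A sketch that pulls one slot's identifying fields out of `dmidecode -t memory`-style text with awk; the sample here is trimmed from this lesson's output rather than read from live firmware (on a real box you'd pipe `sudo dmidecode -t memory` in instead):

```shell
# Extract Size / Serial / Part for one DIMM slot from dmidecode-style output.
sample='  Locator: DIMM_A1
  Bank Locator: P0_Node0_Channel0_Dimm0
  Size: 32 GB
  Serial Number: 3A2B4C5D
  Part Number: M393A4K40DB3-CWE
  Locator: DIMM_A2
  Size: 32 GB
  Serial Number: 9F8E7D6C'

printf '%s\n' "$sample" | awk -v slot="DIMM_A1" '
  $1 == "Locator:" { found = ($2 == slot) }            # entering a new DIMM record
  found && /Size:|Serial Number:|Part Number:/ { print }
'
#   Size: 32 GB
#   Serial Number: 3A2B4C5D
#   Part Number: M393A4K40DB3-CWE
```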
Now you can email the datacenter team: "Replace the 32GB DDR4 in slot DIMM_A1, serial 3A2B4C5D, on web-prod-07 in rack E-42, unit 15."
But why is it dropping connections?¶
A correctable ECC error is fixed transparently — the application should not notice. So why are connections dropping?
Here's the thing: correcting an ECC error takes time. Not much — a few hundred nanoseconds. But on many platforms, each corrected error also fires a System Management Interrupt (SMI) so firmware can log it, and an SMI briefly halts every core. When errors burst at a high rate, these micro-stalls accumulate. Network packets sit in receive buffers too long. TCP retransmits. Applications see timeouts.
It's not a crash. It's death by a thousand paper cuts.
War Story: A team at a major hosting provider spent three days debugging intermittent latency spikes on a cluster of database servers. Application profiling showed nothing. Network traces showed retransmits but no packet loss at the switch. They finally checked
`edac-util` and found one server generating 500+ correctable ECC errors per hour from a single DIMM. The memory controller's correction overhead was adding 2–5ms of jitter to every memory-intensive operation. Replacing a $40 DIMM fixed a problem that had consumed $15,000 in engineering time. The lesson: check the hardware before blaming the software. The SEL would have shown this on day one.
Part 5: What Else Could Be Wrong? — The Full Sensor Sweep¶
You've found the smoking gun, but good triage means checking everything. A DIMM failure could be a symptom of a deeper problem — overheating, a bad power rail, a failing motherboard.
# Full sensor dump
ipmitool -I lanplus -H $BMC -U admin -P $PASS sdr list
# Inlet Temp | 24 degrees C | ok
# Exhaust Temp | 38 degrees C | ok
# CPU1 Temp | 62 degrees C | ok
# CPU2 Temp | 59 degrees C | ok
# Fan1 | 8400 RPM | ok
# Fan2 | 8520 RPM | ok
# Fan3 | 8280 RPM | ok
# Fan4 | 8640 RPM | ok
# PSU1 Status | 0x01 | ok
# PSU2 Status | 0x01 | ok
# DIMM PG | 0x01 | ok
# VCORE | 0.88 Volts | ok
# 12V Rail | 12.13 Volts | ok
# 3.3V Rail | 3.32 Volts | ok
Everything looks normal. Temps are fine, fans are spinning, both PSUs are healthy, voltage rails are within spec. This isn't a thermal or power problem — it's an isolated DIMM failure.
Sensor types at a glance¶
| Type | What It Monitors | "Oh no" Reading |
|---|---|---|
| Temperature | CPU, inlet air, exhaust, DIMMs | Inlet > 35C, CPU > 90C |
| Fan | Fan RPM | 0 RPM = dead fan |
| Voltage | CPU Vcore, 3.3V, 5V, 12V rails | Outside +/- 5% of nominal |
| Power Supply | PSU presence and health | Status = critical or absent |
| Memory | ECC errors, DIMM presence | Correctable errors spiking |
| Drive | Disk fault indicator | Any fault = imminent failure |
Zooming in on a single sensor¶
ipmitool -I lanplus -H $BMC -U admin -P $PASS sensor get "Inlet Temp"
# Sensor ID : Inlet Temp (0x4)
# Sensor Type (Analog) : Temperature
# Sensor Reading : 24 (+/- 0) degrees C
# Status : ok
# Upper Non-Critical : 42.000
# Upper Critical : 47.000
Those thresholds matter. When the reading crosses "Upper Non-Critical" (42C), the BMC logs a warning. Cross "Upper Critical" (47C) and it logs a critical alert — and may start throttling CPUs or ramping fans to maximum.
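The BMC's decision logic is just a threshold ladder. A minimal sketch, using the Inlet Temp thresholds from the reading above:

```shell
# Classify a temperature reading the way the BMC does, using the
# Inlet Temp thresholds from `sensor get` (42 C non-critical, 47 C critical).
classify() {
  reading=$1; upper_nc=42; upper_crit=47
  if [ "$reading" -ge "$upper_crit" ]; then echo "critical"
  elif [ "$reading" -ge "$upper_nc" ]; then echo "warning"
  else echo "ok"
  fi
}

classify 24   # prints: ok (tonight's actual reading)
classify 44   # prints: warning (BMC logs an event)
classify 50   # prints: critical (BMC logs, ramps fans, may throttle CPUs)
```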
Part 6: The Modern Way — Redfish API¶
Everything you just did with ipmitool — you can also do with curl. Redfish is the modern
REST-based replacement for IPMI, and every server shipped since ~2017 speaks it.
Name Origin: "Redfish" was deliberately chosen by the DMTF (Distributed Management Task Force) as an approachable name — a break from the alphabet soup of IPMI, SMASH, and WS-Management. The first spec was published in August 2015.
Why Redfish over IPMI?¶
| | IPMI | Redfish |
|---|---|---|
| Transport | UDP 623, binary protocol | HTTPS (TCP 443), JSON |
| Auth | RAKP — leaks password hashes by design | Basic Auth + session tokens + TLS |
| Extensibility | Vendor OEM commands (undocumented blobs) | JSON with schemas; OEM extensions are readable |
| Scriptability | Parse ipmitool text output (fragile) | curl + jq (rock solid) |
Let's redo the diagnosis with curl¶
BMC=10.0.10.47
CREDS="admin:password"
# Power state
curl -sk -u $CREDS https://$BMC/redfish/v1/Systems/System.Embedded.1 \
| jq '.PowerState'
# "On"
# System health summary
curl -sk -u $CREDS https://$BMC/redfish/v1/Systems/System.Embedded.1 \
| jq '{
Model, SerialNumber, PowerState,
Health: .Status.Health,
CPUs: .ProcessorSummary.Count,
RAM_GB: .MemorySummary.TotalSystemMemoryGiB
}'
# {
# "Model": "PowerEdge R750",
# "SerialNumber": "ABC1234",
# "PowerState": "On",
# "Health": "Warning",
# "CPUs": 2,
# "RAM_GB": 256
# }
"Health: Warning" — Redfish is already telling you something is wrong. Let's pull the SEL:
# Recent SEL events that aren't OK
curl -sk -u $CREDS \
https://$BMC/redfish/v1/Managers/iDRAC.Embedded.1/LogServices/Sel/Entries \
| jq '[.Members[] | select(.Severity != "OK")]
| sort_by(.Created) | reverse | .[:5]
| .[] | {Created, Message, Severity}'
# (newest entry shown; the filter returns the five most recent)
# {
#   "Created": "2026-03-18T23:42:44+00:00",
#   "Message": "A correctable memory error was detected on DIMM_A1.",
#   "Severity": "Warning"
# }
There it is. Same story, but now you get a DIMM slot name directly in the JSON — no
cross-referencing with dmidecode needed.
Temperature and power via Redfish¶
# Thermal health
curl -sk -u $CREDS \
https://$BMC/redfish/v1/Chassis/System.Embedded.1/Thermal \
| jq '[.Temperatures[] | {Name, ReadingCelsius, Health: .Status.Health}]'
# Power consumption
curl -sk -u $CREDS \
https://$BMC/redfish/v1/Chassis/System.Embedded.1/Power \
| jq '{
Watts: .PowerControl[0].PowerConsumedWatts,
PSUs: [.PowerSupplies[] | {Name, Health: .Status.Health}]
}'
Gotcha: Redfish URIs differ by vendor. Dell uses `/redfish/v1/Systems/System.Embedded.1`. HPE uses `/redfish/v1/Systems/1`. Never hardcode URIs in automation — always discover from the service root:
curl -sk -u $CREDS https://$BMC/redfish/v1/Systems | jq '.Members[]."@odata.id"'
Flashcard Check #2¶
| Question | Answer |
|---|---|
| What transport does Redfish use? | HTTPS (TCP 443) with JSON payloads |
| What transport does IPMI use? | RMCP/RMCP+ over UDP 623 |
| Why is IPMI's RAKP authentication dangerous? | It returns a password hash to any unauthenticated client — crackable offline (CVE-2013-4786) |
| How do you discover the correct System URI in Redfish? | GET /redfish/v1/Systems and read .Members[0]."@odata.id" |
Part 7: The Server's Serial Console — SOL¶
What if you can't SSH in at all? The server is up (BMC says power is on) but the OS is unreachable. You need to see what's on the screen.
Serial-over-LAN (SOL) tunnels the server's serial console through the BMC to your terminal. You see POST messages, GRUB menus, kernel boot output, kernel panics — everything.
ipmitool -I lanplus -H $BMC -U admin -P $PASS sol activate
You're now looking at the server's console. If the OS is at a login prompt, you can type credentials. If the kernel is panicking, you see the panic message. If it's stuck in GRUB, you can choose a different kernel.
To disconnect: type ~. (tilde, then period). If you're connected through SSH, use ~~.
to avoid triggering SSH's own escape.
Gotcha: Only one SOL session at a time. If you get "SOL session already active," a stale session exists. Kill it first:
ipmitool -I lanplus -H $BMC -U admin -P $PASS sol deactivate
Then reconnect.
SOL has no Redfish equivalent. This is the main reason ipmitool persists in modern
environments — when you need an interactive console, SOL is it.
Part 8: The Physical Layer You Can't Ignore¶
Now that the immediate crisis is under control (DIMM identified, replacement scheduled, monitoring alert set), let's zoom out. What's actually in that server, and how does it all fit together?
Form Factors¶
| Form Factor | Height | Typical Use | Trade-offs |
|---|---|---|---|
| 1U | 1.75 inches | Web servers, compute nodes | Fewer drive bays, limited PCIe, louder fans |
| 2U | 3.5 inches | General purpose, storage | Good balance of density and expandability |
| 4U | 7 inches | GPU servers, dense storage | Lots of room, lots of power, lots of heat |
| Blade | Varies | High-density compute | Shared chassis, power, networking — complex |
Trivia: A 1U server's fans move 80–120 CFM (cubic feet per minute) of air. A full rack of 42 such servers pushes 3,300–5,000 CFM — roughly equivalent to a whole-house fan. At full speed, server fans produce over 70 dB. Datacenter workers wear hearing protection during extended rack work.
CPU and NUMA — Why Memory Location Matters¶
Modern servers have 1–2 CPU sockets. Each socket has its own local memory bank. This architecture is called NUMA (Non-Uniform Memory Access).
┌─────────────────┐ ┌─────────────────┐
│ CPU Socket 0 │ │ CPU Socket 1 │
│ 16-64 cores │ │ 16-64 cores │
└────────┬────────┘ └────────┬────────┘
│ │
┌──────┴──────┐ ┌──────┴──────┐
│ Local RAM │ │ Local RAM │
│ ~80ns │ │ ~80ns │
│ (DIMM_A1-A8)│ │ (DIMM_B1-B8)│
└─────────────┘ └─────────────┘
│ Interconnect │
└───────────────────────┘
~130ns cross-socket
Accessing local memory: ~80 nanoseconds. Accessing the other socket's memory: ~130ns — a 60% penalty. If a database process on CPU 0 accidentally allocates all its memory on CPU 1's bank, every memory access pays that penalty. This is why database tuning guides always mention "NUMA pinning."
# See your NUMA topology
numactl --hardware
# available: 2 nodes (0-1)
# node 0 cpus: 0 1 2 3 4 5 6 7 16 17 18 19 20 21 22 23
# node 0 size: 128000 MB
# node 1 cpus: 8 9 10 11 12 13 14 15 24 25 26 27 28 29 30 31
# node 1 size: 128000 MB
# node distances:
# node 0 1
# 0: 10 21
# 1: 21 10
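The distance matrix is in relative units: 10 means local, 21 means remote. Plugging in the approximate latencies from the diagram shows where the "60% penalty" figure comes from:

```shell
# Back-of-envelope cross-socket penalty, using the approximate
# latencies from the NUMA diagram (80 ns local, 130 ns remote).
awk 'BEGIN {
  local_ns = 80; remote_ns = 130
  printf "cross-socket penalty: %.1f%%\n", (remote_ns - local_ns) / local_ns * 100
}'
# cross-socket penalty: 62.5%
```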
Interview Bridge: "Explain NUMA and why it matters" is a common infrastructure interview question. The answer: memory access time depends on which CPU socket is accessing which memory bank. Performance-sensitive applications should be pinned to a single NUMA node using
numactl --cpunodebind=0 --membind=0 <command>.
Disks — SAS vs SATA vs NVMe¶
| | SATA | SAS | NVMe |
|---|---|---|---|
| Interface | SATA bus | SAS bus | PCIe lanes (direct to CPU) |
| Max throughput | ~600 MB/s | ~1,200 MB/s (12Gbps SAS) | ~7,000+ MB/s |
| Typical use | Bulk storage, cold data | Enterprise spinning disks, reliable SSDs | Primary storage, databases, anything fast |
| Hot-swap | Yes (with backplane) | Yes (with backplane) | Yes (U.2/U.3 form factors) |
| RAID controller needed? | Yes for hardware RAID | Yes for hardware RAID | No — NVMe talks directly to CPU |
Under the Hood: NVMe (Non-Volatile Memory Express), introduced in 2011, bypasses the traditional storage controller entirely. SATA and SAS drives talk to a RAID controller or HBA, which talks to the CPU over PCIe. NVMe drives connect directly to PCIe lanes, eliminating the controller bottleneck — one of the largest interface-level performance jumps in storage history.
SMART — Your Disks Are Talking to You¶
Every disk tracks its own health using SMART (Self-Monitoring, Analysis and Reporting Technology). Three attributes predict failure:
smartctl -A /dev/sda | grep -E "Reallocated|Pending|Uncorrect"
# 5 Reallocated_Sector_Ct 0 ← bad sectors remapped (>0 = warning)
# 197 Current_Pending_Sector 0 ← sectors awaiting remap (>0 = warning)
# 198 Offline_Uncorrectable 0 ← unrecoverable read errors (>0 = replace now)
Remember: The SMART death trio: R-P-U. Reallocated (bad sectors already moved to spare area), Pending (sectors waiting for a retry or remap), Uncorrectable (sectors that failed and can't be fixed). Any non-zero on these three = start planning a replacement.
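Checking the trio is a one-liner. A sketch that flags any non-zero R-P-U attribute; the three sample lines stand in for real `smartctl -A` output (which has more columns, but the raw value is still the last field):

```shell
# Flag any non-zero member of the SMART R-P-U trio.
# Sample lines mimic `smartctl -A` with columns trimmed to id, name, raw value.
printf '%s\n' \
  '  5 Reallocated_Sector_Ct   12' \
  '197 Current_Pending_Sector   0' \
  '198 Offline_Uncorrectable    0' \
| awk '/Reallocated|Pending|Uncorrect/ && $NF > 0 { print "REPLACE-SOON: " $2 }'
# REPLACE-SOON: Reallocated_Sector_Ct
```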
# NVMe drives have their own health metric
nvme smart-log /dev/nvme0
# critical_warning : 0
# temperature : 35 C
# percentage_used : 3% ← 100% = rated endurance reached
# media_errors : 0
Trivia: Google's 2007 study "Failure Trends in a Large Disk Drive Population" found that 36% of failed drives had zero SMART warnings beforehand. But drives with even one reallocated sector were 14x more likely to fail within 60 days. SMART isn't a crystal ball — but when it does warn you, listen.
RAID — Because Disks Will Fail¶
| Level | Minimum Disks | Survives | Use Case |
|---|---|---|---|
| RAID 0 | 2 | 0 failures (striping only) | Temp data, scratch |
| RAID 1 | 2 | 1 failure (mirror) | OS boot drives |
| RAID 5 | 3 | 1 failure (striping + parity) | Read-heavy general purpose |
| RAID 6 | 4 | 2 failures (double parity) | Large arrays, safety margin |
| RAID 10 | 4 | 1 per mirror pair | Databases, write-heavy |
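The capacity cost of each level follows directly from the table. A quick sketch, with per-disk sizes in GB:

```shell
# Usable capacity per RAID level: usable <level> <disk-count> <per-disk-GB>
usable() {
  case $1 in
    0)  echo $(( $2 * $3 )) ;;          # stripe: all capacity, no protection
    1)  echo "$3" ;;                    # mirror: one disk's worth
    5)  echo $(( ($2 - 1) * $3 )) ;;    # one disk of parity
    6)  echo $(( ($2 - 2) * $3 )) ;;    # two disks of parity
    10) echo $(( $2 / 2 * $3 )) ;;      # half the disks hold mirrors
  esac
}

usable 5 4 1000    # prints: 3000
usable 6 4 1000    # prints: 2000
usable 10 4 1000   # prints: 2000
```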
# Check RAID status (MegaRAID / Dell PERC)
storcli /c0/vall show
# DG/VD TYPE State Access Consist Cache Size
# 0/0 RAID1 Optimal RW Yes RWBD 446.625 GB
# Software RAID
cat /proc/mdstat
# md0 : active raid1 sda1[0] sdb1[1]
# 1953513472 blocks [2/2] [UU] ← UU = both drives healthy
Gotcha: RAID is not a backup. A deleted file on RAID 10 is deleted across all mirrors simultaneously. RAID protects against hardware disk failure. It does NOT protect against accidental deletion, corruption, ransomware, or controller failure.
Part 9: Power — The Boring Thing That Ruins Everything¶
Redundant Power Supplies¶
Enterprise servers have two PSUs connected to separate PDUs (Power Distribution Units) fed by separate circuits — the A-feed and B-feed. If one circuit trips, the other keeps the server running.
# Check PSU status via IPMI
ipmitool -I lanplus -H $BMC -U admin -P $PASS sdr type "Power Supply"
# PSU1 Status | 0x01 | ok
# PSU2 Status | 0x01 | ok
# Power consumption
ipmitool -I lanplus -H $BMC -U admin -P $PASS dcmi power reading
# Instantaneous power reading: 287 Watts
# Minimum during sampling period: 245 Watts
# Maximum during sampling period: 342 Watts
# Average power reading: 278 Watts
Mental Model: Think of redundant PSUs like a two-lane bridge. Traffic flows through both lanes normally. If one lane closes, all traffic can still cross on the remaining lane — as long as the remaining lane has enough capacity. A 750W PSU in a server drawing 287W has plenty of headroom. A 500W PSU in a server drawing 450W at peak... doesn't.
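You can sanity-check that headroom with the numbers from `dcmi power reading`. The 750W rating below is an assumption for illustration; read yours off the PSU label or the FRU data:

```shell
# Can one PSU carry the whole load if its partner fails?
psu_watts=750     # assumed rating for this example; check your PSU's label
peak_draw=342     # peak from `dcmi power reading` above
if [ "$peak_draw" -lt "$psu_watts" ]; then
  echo "OK: single-PSU headroom is $(( psu_watts - peak_draw )) W"
else
  echo "DANGER: one PSU cannot carry the peak load alone"
fi
# OK: single-PSU headroom is 408 W
```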
Trivia: Server PSUs are rated by 80 Plus efficiency certification: Bronze (82–85%), Gold (87–90%), Platinum (90–94%), Titanium (91–96%). The difference between basic 80 Plus and Titanium at 50% load is about 16 percentage points. Across thousands of servers, that's millions of dollars per year in electricity.
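The back-of-envelope math behind that claim, with assumed numbers (300W average load, $0.10/kWh, 1,000 servers, 85% vs 96% efficiency at this load point):

```shell
# Annual electricity saved by higher PSU efficiency; all inputs are
# illustrative assumptions, not measured fleet data.
awk 'BEGIN {
  load = 300; hours = 8760; rate = 0.10; servers = 1000
  bronze_wall   = load / 0.85     # wall draw = load / efficiency
  titanium_wall = load / 0.96
  saved_kwh = (bronze_wall - titanium_wall) * hours * servers / 1000
  printf "saved: %.0f kWh/yr, about $%.0f\n", saved_kwh, saved_kwh * rate
}'
# saved: 354265 kWh/yr, about $35426
```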
What the Power Restore Policy means¶
ipmitool -I lanplus -H $BMC -U admin -P $PASS chassis status | grep "Power Restore"
# Power Restore Policy : always-on
This determines what happens after a power outage: - always-on — server auto-boots when power returns (datacenter default) - previous — returns to whatever state it was in before the outage - always-off — stays off until someone manually powers it on
Part 10: Hardware Inventory — dmidecode and lshw¶
Two tools you'll reach for every time you need to know what's physically in a server.
dmidecode — Reads BIOS/UEFI firmware tables¶
# System identification (what is this machine?)
dmidecode -t system
# Manufacturer: Dell Inc.
# Product Name: PowerEdge R750
# Serial Number: ABC1234
# All memory DIMMs
dmidecode -t memory | grep -E "Locator:|Size:|Speed:|Manufacturer:" | head -16
# Locator: DIMM_A1
# Size: 32 GB
# Speed: 3200 MT/s
# Manufacturer: Samsung
# Locator: DIMM_A2
# Size: 32 GB
# ...
# BIOS version (critical for firmware compliance)
dmidecode -t bios
# Version: 2.19.1
# Release Date: 01/15/2026
# One-liner: get the serial number
dmidecode -s system-serial-number
# ABC1234
lshw — Hardware lister¶
# Concise hardware summary
lshw -short
# H/W path Device Class Description
# /0/0 memory 256GiB System Memory
# /0/0/0 memory 32GiB DIMM DDR4 3200 MHz
# /0/100/1f.2 storage SATA Controller
# /0/100/3/0 eno1 network Ethernet Controller X710
# Filter by class
lshw -class network
lshw -class disk
lshw -class memory
Gotcha: `dmidecode` requires root and reads firmware tables. In virtual machines, it reports the hypervisor's emulated hardware, not physical hardware. Run `systemd-detect-virt` first to check if you're on bare metal before trusting `dmidecode` for inventory.
Part 11: Thermal Management — When the Data Center Gets Hot¶
What if the problem had been thermal? Here's how you'd know.
# CPU temperatures via lm-sensors
sensors
# coretemp-isa-0000
# Core 0: +62.0°C (high = +85.0°C, crit = +100.0°C)
# Core 1: +60.0°C
# IPMI inlet and exhaust temps
ipmitool sensor list | grep -i temp
# Inlet Temp | 24.000 | degrees C | ok
# Exhaust Temp | 38.000 | degrees C | ok
# CPU1 Temp | 62.000 | degrees C | ok
Thermal throttling is when the CPU reduces its clock speed to prevent overheating. Your 3.0 GHz processor suddenly runs at 1.2 GHz. Applications slow to a crawl.
Common causes:
1. Dead fan — ipmitool sensor list | grep -i fan shows 0 RPM
2. Blocked airflow — missing blanking panels in the rack, cables obstructing the intake
3. CRAC failure — the Computer Room Air Conditioner serving your row is down
4. Dust buildup — especially on air filters and heat sinks
# Check if CPU is actually throttling
cat /proc/cpuinfo | grep -i mhz | head -2
# cpu MHz : 1200.000 ← should be ~3000 if not throttled
# Check frequency governor
cpupower frequency-info
# current CPU frequency: 1.20 GHz ← throttled
Gotcha: Many Linux distributions ship with the `powersave` CPU governor, which intentionally runs CPUs at low frequency. On a production server, this adds latency. Set `performance` mode: `cpupower frequency-set -g performance`. Make it permanent with `tuned-adm profile throughput-performance`.
Part 12: The Server Lifecycle¶
A server doesn't just appear in a rack. Here's the full journey from purchase order to recycling bin.
Procurement → Receive & Asset Tag → Rack & Cable → BIOS/BMC Config
→ PXE Boot → OS Install → Configuration Management → Burn-in Test
→ Production → Monitoring → Maintenance Windows
→ Decommission → Data Wipe → Disposal/Recycle
PXE Boot — The Network Bootstrap¶
When a server first boots, it has no operating system. PXE (Preboot eXecution Environment, pronounced "pixie") lets the NIC firmware download a bootloader over the network:
Server powers on → NIC PXE ROM → DHCP request → DHCP server replies with:
- IP address
- TFTP server address
- Boot filename (pxelinux.0 or GRUB)
→ NIC downloads bootloader via TFTP → Bootloader downloads kernel + initramfs
→ Kernel boots → Installer runs → Kickstart/Preseed automates answers
→ OS installed
You can trigger PXE boot remotely:
# Via IPMI
ipmitool -I lanplus -H $BMC -U admin -P $PASS chassis bootdev pxe
ipmitool -I lanplus -H $BMC -U admin -P $PASS power cycle
# Via Redfish
curl -sk -u $CREDS \
-X PATCH https://$BMC/redfish/v1/Systems/System.Embedded.1 \
-H 'Content-Type: application/json' \
-d '{"Boot": {"BootSourceOverrideTarget": "Pxe", "BootSourceOverrideEnabled": "Once"}}'
Name Origin: PXE stands for Preboot eXecution Environment. Developed by Intel in 1999. The NIC firmware contains a mini DHCP client and TFTP downloader — just enough to bootstrap a full OS installer without any local storage.
Gotcha: `chassis bootdev pxe` is a one-shot override by default — it applies to the next boot only, then reverts to the BIOS boot order. If you accidentally add `options=persistent`, the server will PXE boot on every reboot and could re-image itself.
Flashcard Check #3¶
| Question | Answer |
|---|---|
| What does PXE stand for? | Preboot eXecution Environment (pronounced "pixie") |
| What protocol does PXE use to download the bootloader? | TFTP (Trivial File Transfer Protocol) |
| What does chassis bootdev pxe do? | Sets the server to PXE boot on its next reboot only (one-shot override) |
| What happens if you add options=persistent to a bootdev command? | The boot override persists across all reboots until explicitly changed |
Part 13: Wrapping Up the Incident¶
Let's come back to web-prod-07. Here's what you did and what happens next:
- Confirmed the server was up — power status, chassis status
- Found the smoking gun — SEL full of correctable ECC errors from one DIMM
- Identified the exact DIMM — edac-util + dmidecode = DIMM_A1, with slot and serial number
- Checked for secondary problems — sensors all nominal, no thermal or power issues
- Filed a ticket — the datacenter team will replace the DIMM during the next maintenance window (DIMMs aren't hot-swappable; the server has to come down briefly)
The server stays in production. The errors are correctable, so data integrity is maintained. But you add monitoring: if CE rate exceeds 10/hour on any DIMM, page the on-call. If an uncorrectable error appears, evacuate workload immediately.
# Quick monitoring check you can add to cron
# (in `edac-util -s` output the correctable count is the third field from the end)
edac-util -s | awk '/Correctable/ && $(NF-2) > 0 {print "WARNING: " $0}'
Exercises¶
Exercise 1: Read the room (Quick win — 2 minutes)¶
If you have access to any Linux machine (bare metal, VM, or even WSL):
lshw -short 2>/dev/null || echo "Install with: apt install lshw / dnf install lshw"
dmidecode -t system 2>/dev/null || echo "Requires root: sudo dmidecode -t system"
What manufacturer and model are you running on? If it's a VM, what does systemd-detect-virt
say?
Exercise 2: Check your disks (5 minutes)¶
# Install smartmontools if needed: apt install smartmontools / dnf install smartmontools
sudo smartctl -H /dev/sda # or /dev/nvme0
sudo smartctl -A /dev/sda | grep -E "Reallocated|Pending|Uncorrect"
Are any of the R-P-U trio non-zero?
What if I'm on a VM?
Virtual disks don't have real SMART data. You'll see `SMART support is: Unavailable` or the emulated controller won't report real attributes. This exercise is most useful on bare metal or on machines with physical drives passed through.
Exercise 3: Build the triage ladder (10 minutes)¶
You get a page: "server X is unreachable." You have BMC access. Write out the triage steps in order, including the exact commands. Don't look back at the lesson — write from memory, then compare.
Suggested answer
1. ipmitool power status → Is it on?
2. ipmitool chassis status → Any fault flags (fan, power, drive)?
3. ipmitool sel elist | tail -20 → Recent hardware events?
4. ipmitool sdr type Temperature → Thermal issue?
5. ipmitool sdr type Fan → Dead fan?
6. ipmitool sdr type "Power Supply" → PSU failure?
7. ipmitool sol activate → What's on the console?
Exercise 4: Redfish exploration (15 minutes)¶
If you have a Redfish-capable BMC (iDRAC9, iLO5, or newer):
BMC=<your-bmc-ip>
CREDS="admin:password"
# 1. Discover the service root
curl -sk -u $CREDS https://$BMC/redfish/v1/ | jq .
# 2. Find the system URI dynamically
curl -sk -u $CREDS https://$BMC/redfish/v1/Systems | jq '.Members[]."@odata.id"'
# 3. Pull system health
curl -sk -u $CREDS https://$BMC/redfish/v1/Systems/<your-system-id> \
| jq '{Model, Health: .Status.Health, PowerState}'
# 4. Check for any non-OK SEL events
curl -sk -u $CREDS \
https://$BMC/redfish/v1/Managers/<your-manager-id>/LogServices/Sel/Entries \
| jq '[.Members[] | select(.Severity != "OK")] | length'
No BMC available?
The DMTF maintains a Redfish mockup server for testing. It won't have real sensor data, but the API structure is identical. Check the DMTF's [Redfish Mockup Server](https://github.com/DMTF/Redfish-Mockup-Server) on GitHub.
Cheat Sheet¶
IPMI Quick Reference¶
| Task | Command |
|---|---|
| Power status | ipmitool -I lanplus -H $BMC -U $USER -E power status |
| Graceful shutdown | ipmitool ... power soft |
| Hard power cycle | ipmitool ... power cycle |
| All sensors | ipmitool ... sdr list |
| Temperature only | ipmitool ... sdr type Temperature |
| Fan status | ipmitool ... sdr type Fan |
| Event log | ipmitool ... sel elist |
| SEL capacity | ipmitool ... sel info |
| Clear SEL | ipmitool ... sel clear (archive first!) |
| Serial console | ipmitool ... sol activate (disconnect: ~.) |
| PXE boot next | ipmitool ... chassis bootdev pxe |
| Blink chassis LED | ipmitool ... chassis identify 30 |
| BMC info | ipmitool ... mc info |
| BMC reset | ipmitool ... mc reset cold |
| Power draw | ipmitool ... dcmi power reading |
Hardware Diagnostics Quick Reference¶
| Task | Command |
|---|---|
| Hardware inventory | lshw -short |
| System serial number | dmidecode -s system-serial-number |
| Memory DIMMs | dmidecode -t memory |
| ECC errors | edac-util -s / edac-util -l |
| Disk health | smartctl -H /dev/sda |
| SMART death trio | smartctl -A /dev/sda \| grep -E "Reallocated\|Pending\|Uncorrect" |
| NVMe health | nvme smart-log /dev/nvme0 |
| NIC errors | ethtool -S eth0 \| grep -E "error\|drop\|crc" |
| Kernel hw errors | dmesg -T -l err,crit |
| NUMA topology | numactl --hardware |
| CPU throttling | cpupower frequency-info |
| RAID status | storcli /c0/vall show or cat /proc/mdstat |
Redfish Patterns¶
# Setup
BMC=10.0.10.5; CREDS="admin:pass"
# Discover
curl -sk -u $CREDS https://$BMC/redfish/v1/Systems | jq '.Members[]."@odata.id"'
# Health
curl -sk -u $CREDS https://$BMC/redfish/v1/Systems/<id> | jq '.Status.Health'
# Power cycle
curl -sk -u $CREDS -X POST \
https://$BMC/redfish/v1/Systems/<id>/Actions/ComputerSystem.Reset \
-H 'Content-Type: application/json' -d '{"ResetType": "ForceRestart"}'
# SEL (non-OK only)
curl -sk -u $CREDS https://$BMC/.../LogServices/Sel/Entries \
| jq '[.Members[] | select(.Severity != "OK")]'
Takeaways¶
- Every server is two computers. The BMC is always on, always reachable, and can tell you what the OS cannot. Cable it, configure it, monitor it.
- The SEL is your black box. When a server misbehaves, `ipmitool sel elist` is your first command — before SSH, before logs, before blaming the application.
- Correctable ECC errors are canaries, not noise. A burst of CEs predicts an uncorrectable error (kernel panic). Monitor the rate, not just the count.
- SMART R-P-U: the three horsemen. Reallocated, Pending, Uncorrectable — any non-zero value means the disk is actively dying.
- Redfish is the future; IPMI persists. Use Redfish for automation and fleet management. Use `ipmitool` for SOL and legacy hardware. Know both.
- Check the hardware before blaming the software. A $40 DIMM replacement can save $15,000 in debugging time. The physical layer is always a suspect.
Related Lessons¶
- What Happens When You Press Power — traces the full boot sequence from power button to login prompt
- PXE Boot: From Network to Running Server — deep dive on the network bootstrap process
- RAID: Why Your Disks Will Fail — disk redundancy, rebuild strategies, when RAID isn't enough
- The Disk That Filled Up — storage diagnostics from a different angle
- The Backup Nobody Tested — why RAID is not a backup, and what is