Server Hardware: When the Blinky Lights Matter
- lesson
- server-hardware
- ipmi/bmc
- redfish-api
- smart-monitoring
- ecc-memory
- raid
- thermal-management
- power-supplies
- hardware-diagnostics
- pxe-boot
- server-lifecycle
Topics: server hardware, IPMI/BMC, Redfish API, SMART monitoring, ECC memory, RAID, thermal management, power supplies, hardware diagnostics, PXE boot, server lifecycle
Level: L1–L2 (Foundations to Operations)
Time: 75–90 minutes
Prerequisites: None (everything is explained from scratch)
The Mission¶
It's 11:47pm on a Tuesday. Your pager fires: web-prod-07 is intermittently dropping
connections. Customers are seeing timeouts. The application team says "the code hasn't
changed." The network team says "our switches are fine." It's your server.
You're at home. The server is in a datacenter 300 miles away. Nobody is on-site. You have one tool that still works when everything else is broken: the BMC — a tiny computer living inside the server, always awake, always listening, connected to its own management network.
You're going to diagnose this remotely. Along the way, you'll learn what's actually inside a server, how the components talk to each other, how to interrogate them from your couch, and why a single bad memory stick can ruin your week.
Part 1: Two Computers, One Chassis¶
Before you can diagnose anything, you need to understand the most important fact about enterprise servers that nobody explains clearly: every server is actually two computers.
┌─────────────────────────────────────────────────┐
│ The Server You Know │
│ ┌───────┐ ┌───────┐ ┌──────┐ ┌──────┐ │
│ │ CPU │ │ RAM │ │ NIC │ │ Disk │ │
│ └───┬───┘ └───┬───┘ └──┬───┘ └──┬───┘ │
│ └──────────┴─────────┴─────────┘ │
│ │ │
│ ┌──────────────────┴────────────────────────┐ │
│ │ BMC (Baseboard Management Controller) │ │
│ │ - Its own ARM CPU │ │
│ │ - Its own RAM (256MB–1GB) │ │
│ │ - Its own NIC (dedicated or shared) │ │
│ │ - Its own flash storage │ │
│ │ - Always on — even when the server is │ │
│ │ "off" — as long as AC power is present │ │
│ └───────────────────────────────────────────┘ │
└─────────────────────────────────────────────────┘
The BMC is powered by the 5V standby rail from the power supply. Plug a server into a wall outlet and the BMC boots — even if you never press the power button. It has its own IP address, its own login, and it can see everything the main system cannot report about itself.
Name Origin: BMC stands for Baseboard Management Controller. "Baseboard" is an old term for the motherboard — the base board that everything plugs into. The BMC is a controller chip soldered directly onto it. Every vendor wraps their BMC in a branded name: Dell calls theirs iDRAC (Integrated Dell Remote Access Controller), HPE calls theirs iLO (Integrated Lights-Out — because you can manage the server with the datacenter lights off), Supermicro just calls it "IPMI BMC." They all speak the same protocol underneath.
Name Origin: IPMI stands for Intelligent Platform Management Interface. Intel published the first spec in 1998. "Intelligent" because the BMC can make decisions autonomously (throttle fans, log events, even shut down the server to prevent damage). "Platform Management" because it manages the physical platform, not the software running on it.
The protocol you'll use tonight¶
IPMI messages travel over RMCP+ (Remote Management Control Protocol Plus) on UDP port
623. The + means encryption — the original RMCP had none. The CLI tool is ipmitool,
and it works with every vendor's BMC.
# The pattern you'll type a hundred times in your career:
ipmitool -I lanplus -H <bmc-ip> -U <user> -P <password> <command>
| Flag | Meaning |
|---|---|
| -I lanplus | Use IPMI 2.0 with encryption (always use this, never -I lan) |
| -H | BMC IP address (on the management network, not the server's production IP) |
| -U | Username (default: root on Dell, ADMIN on Supermicro, Administrator on HPE) |
| -P | Password (default: calvin on Dell — yes, really) |
Gotcha: The `-P` flag puts your password in the process list, visible to anyone who runs `ps aux`. For scripts, use `-E` (reads from the `IPMI_PASSWORD` environment variable) or `-f` (reads from a file). For tonight's emergency, `-P` is fine — you're the only one on this laptop.
Part 2: First Contact — Is This Server Even Alive?¶
You pull up a terminal and type:
ipmitool -I lanplus -H $BMC -U admin -P $PASS power status
# Chassis Power is on
OK. The server is on. That rules out a power failure. Let's get a quick health summary.
ipmitool -I lanplus -H $BMC -U admin -P $PASS chassis status
# System Power : on
# Power Overload : false
# Main Power Fault : false
# Cooling/Fan Fault : false
# Drive Fault : false
# Front-Panel Lockout : inactive
# Chassis Intrusion : inactive
No screaming red flags. But chassis status is a surface check — it shows what the BMC
considers a current fault, not what happened twenty minutes ago.
For that, you need the event log.
Part 3: Reading the Black Box — The System Event Log¶
Every BMC maintains a System Event Log (SEL) — a circular buffer of hardware events stored in non-volatile flash. Temperature spikes, fan failures, power supply glitches, ECC memory errors — all recorded here, even events the OS never saw.
ipmitool -I lanplus -H $BMC -U admin -P $PASS sel elist
You stare at the output:
1 | 03/18/2026 | 23:15:01 | Memory #0x20 | Correctable ECC | Asserted
2 | 03/18/2026 | 23:15:01 | Memory #0x20 | Correctable ECC | Asserted
3 | 03/18/2026 | 23:15:02 | Memory #0x20 | Correctable ECC | Asserted
4 | 03/18/2026 | 23:15:02 | Memory #0x20 | Correctable ECC | Asserted
5 | 03/18/2026 | 23:15:03 | Memory #0x20 | Correctable ECC | Asserted
...
18 | 03/18/2026 | 23:31:17 | Memory #0x20 | Correctable ECC | Asserted
19 | 03/18/2026 | 23:31:17 | Memory #0x20 | Correctable ECC | Asserted
20 | 03/18/2026 | 23:42:44 | Memory #0x20 | Correctable ECC | Asserted
Twenty events in 27 minutes. All correctable ECC errors. All from the same memory sensor.
Your heart rate picks up. You know what this means.
Under the Hood: ECC stands for Error-Correcting Code. Server RAM uses ECC — each 64-bit word has 8 extra bits that form a Hamming code, allowing the memory controller to detect and correct single-bit errors on the fly, and detect (but not correct) double-bit errors. A correctable error (CE) is a single-bit flip that was silently fixed. Your application never noticed. But a correctable error is a canary — the DIMM is degrading. Eventually, a double-bit flip happens. That's an uncorrectable error (UE), and it triggers a Machine Check Exception (MCE) — an immediate kernel panic.
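The correction machinery is invisible to software, but the core idea is easy to demo. Here's a toy sketch using a single parity bit; real ECC DIMMs use a Hamming SECDED code over each 64-bit word, which can also locate and fix the flipped bit, while plain parity can only detect it.

```shell
# Toy illustration of memory error detection via parity.
# (Real SECDED ECC adds 8 check bits per 64-bit word and can correct the flip.)
parity() {  # number of 1-bits in $1, mod 2
  bits=$1; p=0
  while [ "$bits" -gt 0 ]; do
    p=$(( p ^ (bits & 1) ))
    bits=$(( bits >> 1 ))
  done
  echo "$p"
}

word=181                    # 10110101 in binary: the data as written (five 1-bits, parity 1)
stored=$(parity "$word")
flipped=$(( word ^ 8 ))     # a cosmic ray flips bit 3 in storage
if [ "$(parity "$flipped")" -ne "$stored" ]; then
  echo "single-bit error detected"
fi
# single-bit error detected
```

A second simultaneous flip would restore the parity and slip through undetected, which is exactly why ECC uses the stronger Hamming code.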
Trivia: Google published a landmark study in 2009 on DRAM errors across their fleet. They found that about 8% of DIMMs experience at least one correctable error per year. Servers with one correctable error were 13–228 times more likely to see another one. Memory errors are not rare events — they're a fact of life at scale.
How many errors are too many?¶
A handful of CEs per year per DIMM is normal — cosmic rays, voltage transients, the universe being the universe. But a burst of CEs from a single DIMM — dozens in minutes — means the silicon is failing. Dell's guideline: 24 or more correctable errors in 24 hours from the same DIMM = schedule replacement.
You have 20 in 27 minutes. This DIMM is dying.
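If you've captured a SEL listing, a quick grep tells you whether you're past a replacement threshold. A minimal sketch, using a three-line sample in place of live `ipmitool sel elist` output (a real script would also bound the time window):

```shell
# Count correctable ECC events in a captured SEL listing and compare
# against a replacement threshold (24-in-24h is Dell's guideline).
sel='1 | 03/18/2026 | 23:15:01 | Memory #0x20 | Correctable ECC | Asserted
2 | 03/18/2026 | 23:15:02 | Memory #0x20 | Correctable ECC | Asserted
3 | 03/18/2026 | 23:42:44 | Memory #0x20 | Correctable ECC | Asserted'

count=$(printf '%s\n' "$sel" | grep -c 'Correctable ECC')
threshold=24
if [ "$count" -ge "$threshold" ]; then
  echo "schedule DIMM replacement ($count CEs)"
else
  echo "$count CEs logged; keep watching the rate"
fi
# 3 CEs logged; keep watching the rate
```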
Flashcard Check #1¶
| Question | Answer |
|---|---|
| What does ECC stand for? | Error-Correcting Code |
| What's the difference between a correctable (CE) and uncorrectable (UE) error? | CE = single-bit flip, silently fixed. UE = multi-bit flip, triggers MCE/kernel panic. |
| Where do you find memory error events on a server? | ipmitool sel elist (BMC event log) and edac-util -s (OS-level) |
| Why does a burst of CEs matter if each one is "corrected"? | It signals the DIMM is degrading. An uncorrectable error (kernel panic) is likely imminent. |
Part 4: Finding the Bad DIMM¶
You know something in memory is failing. Now you need to know which stick, so you can tell the datacenter team exactly what to replace.
From the OS side (if you can still SSH in)¶
# EDAC — Error Detection And Correction subsystem in the Linux kernel
edac-util -s
# mc0: 0 Uncorrectable Errors, 47 Correctable Errors
edac-util -l
# mc0: csrow0: ch0: 47 Correctable Errors ← all from one channel
# mc0: csrow0: ch1: 0 Correctable Errors
# mc0: csrow1: ch0: 0 Correctable Errors
# mc0: csrow1: ch1: 0 Correctable Errors
mc0: csrow0: ch0 — memory controller 0, chip-select row 0, channel 0. But which physical
slot is that?
# Cross-reference with dmidecode
dmidecode -t memory | grep -A 10 "Locator: DIMM_A1"
# Locator: DIMM_A1
# Bank Locator: P0_Node0_Channel0_Dimm0
# Type: DDR4
# Size: 32 GB
# Speed: 3200 MT/s
# Manufacturer: Samsung
# Serial Number: 3A2B4C5D
# Part Number: M393A4K40DB3-CWE
Under the Hood: `dmidecode` reads the SMBIOS (System Management BIOS) tables — a data structure in firmware that describes every physical component. The `-t memory` flag pulls the type 17 (Memory Device) records, which include the physical slot label, capacity, speed, manufacturer, and serial number. This is how you go from "channel 0, row 0" to "the Samsung DIMM in slot A1."
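The cross-reference is easy to script. A sketch that pulls one slot's identifying fields out of `dmidecode -t memory`-style text with awk; the sample here is trimmed from this lesson's output rather than read from live firmware (on a real box you'd pipe `sudo dmidecode -t memory` in instead):

```shell
# Extract Size / Serial / Part for one DIMM slot from dmidecode-style output.
sample='  Locator: DIMM_A1
  Bank Locator: P0_Node0_Channel0_Dimm0
  Size: 32 GB
  Serial Number: 3A2B4C5D
  Part Number: M393A4K40DB3-CWE
  Locator: DIMM_A2
  Size: 32 GB
  Serial Number: 9F8E7D6C'

printf '%s\n' "$sample" | awk -v slot="DIMM_A1" '
  $1 == "Locator:" { found = ($2 == slot) }            # entering a new DIMM record
  found && /Size:|Serial Number:|Part Number:/ { print }
'
#   Size: 32 GB
#   Serial Number: 3A2B4C5D
#   Part Number: M393A4K40DB3-CWE
```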
Now you can email the datacenter team: "Replace the 32GB DDR4 in slot DIMM_A1, serial 3A2B4C5D, on web-prod-07 in rack E-42, unit 15."
But why is it dropping connections?¶
A correctable ECC error is fixed transparently — the application should not notice. So why are connections dropping?
Here's the thing: correcting an ECC error takes time. Not much — a few hundred nanoseconds. But on many platforms, each corrected error also fires a System Management Interrupt (SMI) so firmware can log it, and an SMI briefly halts every core. When errors burst at a high rate, these micro-stalls accumulate. Network packets sit in receive buffers too long. TCP retransmits. Applications see timeouts.
It's not a crash. It's death by a thousand paper cuts.
War Story: A team at a major hosting provider spent three days debugging intermittent latency spikes on a cluster of database servers. Application profiling showed nothing. Network traces showed retransmits but no packet loss at the switch. They finally checked
`edac-util` and found one server generating 500+ correctable ECC errors per hour from a single DIMM. The memory controller's correction overhead was adding 2–5ms of jitter to every memory-intensive operation. Replacing a $40 DIMM fixed a problem that had consumed $15,000 in engineering time. The lesson: check the hardware before blaming the software. The SEL would have shown this on day one.
Part 5: What Else Could Be Wrong? — The Full Sensor Sweep¶
You've found the smoking gun, but good triage means checking everything. A DIMM failure could be a symptom of a deeper problem — overheating, a bad power rail, a failing motherboard.
# Full sensor dump
ipmitool -I lanplus -H $BMC -U admin -P $PASS sdr list
# Inlet Temp | 24 degrees C | ok
# Exhaust Temp | 38 degrees C | ok
# CPU1 Temp | 62 degrees C | ok
# CPU2 Temp | 59 degrees C | ok
# Fan1 | 8400 RPM | ok
# Fan2 | 8520 RPM | ok
# Fan3 | 8280 RPM | ok
# Fan4 | 8640 RPM | ok
# PSU1 Status | 0x01 | ok
# PSU2 Status | 0x01 | ok
# DIMM PG | 0x01 | ok
# VCORE | 0.88 Volts | ok
# 12V Rail | 12.13 Volts | ok
# 3.3V Rail | 3.32 Volts | ok
Everything looks normal. Temps are fine, fans are spinning, both PSUs are healthy, voltage rails are within spec. This isn't a thermal or power problem — it's an isolated DIMM failure.
Sensor types at a glance¶
| Type | What It Monitors | "Oh no" Reading |
|---|---|---|
| Temperature | CPU, inlet air, exhaust, DIMMs | Inlet > 35C, CPU > 90C |
| Fan | Fan RPM | 0 RPM = dead fan |
| Voltage | CPU Vcore, 3.3V, 5V, 12V rails | Outside +/- 5% of nominal |
| Power Supply | PSU presence and health | Status = critical or absent |
| Memory | ECC errors, DIMM presence | Correctable errors spiking |
| Drive | Disk fault indicator | Any fault = imminent failure |
Zooming in on a single sensor¶
ipmitool -I lanplus -H $BMC -U admin -P $PASS sensor get "Inlet Temp"
# Sensor ID : Inlet Temp (0x4)
# Sensor Type (Analog) : Temperature
# Sensor Reading : 24 (+/- 0) degrees C
# Status : ok
# Upper Non-Critical : 42.000
# Upper Critical : 47.000
Those thresholds matter. When the reading crosses "Upper Non-Critical" (42C), the BMC logs a warning. Cross "Upper Critical" (47C) and it logs a critical alert — and may start throttling CPUs or ramping fans to maximum.
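The BMC's decision logic is just a threshold ladder. A minimal sketch, using the Inlet Temp thresholds from the reading above:

```shell
# Classify a temperature reading the way the BMC does, using the
# Inlet Temp thresholds from `sensor get` (42 C non-critical, 47 C critical).
classify() {
  reading=$1; upper_nc=42; upper_crit=47
  if [ "$reading" -ge "$upper_crit" ]; then echo "critical"
  elif [ "$reading" -ge "$upper_nc" ]; then echo "warning"
  else echo "ok"
  fi
}

classify 24   # prints: ok (tonight's actual reading)
classify 44   # prints: warning (BMC logs an event)
classify 50   # prints: critical (BMC logs, ramps fans, may throttle CPUs)
```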
Part 6: The Modern Way — Redfish API¶
Everything you just did with ipmitool — you can also do with curl. Redfish is the modern
REST-based replacement for IPMI, and every server shipped since ~2017 speaks it.
Name Origin: "Redfish" was deliberately chosen by the DMTF (Distributed Management Task Force) as an approachable name — a break from the alphabet soup of IPMI, SMASH, and WS-Management. The first spec was published in August 2015.
Why Redfish over IPMI?¶
| | IPMI | Redfish |
|---|---|---|
| Transport | UDP 623, binary protocol | HTTPS (TCP 443), JSON |
| Auth | RAKP — leaks password hashes by design | Basic Auth + session tokens + TLS |
| Extensibility | Vendor OEM commands (undocumented blobs) | JSON with schemas; OEM extensions are readable |
| Scriptability | Parse ipmitool text output (fragile) | curl + jq (rock solid) |
Let's redo the diagnosis with curl¶
BMC=10.0.10.47
CREDS="admin:password"
# Power state
curl -sk -u $CREDS https://$BMC/redfish/v1/Systems/System.Embedded.1 \
| jq '.PowerState'
# "On"
# System health summary
curl -sk -u $CREDS https://$BMC/redfish/v1/Systems/System.Embedded.1 \
| jq '{
Model, SerialNumber, PowerState,
Health: .Status.Health,
CPUs: .ProcessorSummary.Count,
RAM_GB: .MemorySummary.TotalSystemMemoryGiB
}'
# {
# "Model": "PowerEdge R750",
# "SerialNumber": "ABC1234",
# "PowerState": "On",
# "Health": "Warning",
# "CPUs": 2,
# "RAM_GB": 256
# }
"Health: Warning" — Redfish is already telling you something is wrong. Let's pull the SEL:
# Recent SEL events that aren't OK
curl -sk -u $CREDS \
https://$BMC/redfish/v1/Managers/iDRAC.Embedded.1/LogServices/Sel/Entries \
| jq '[.Members[] | select(.Severity != "OK")]
| sort_by(.Created) | reverse | .[:5]
| .[] | {Created, Message, Severity}'
# (newest entry shown; the filter returns the five most recent)
# {
#   "Created": "2026-03-18T23:42:44+00:00",
#   "Message": "A correctable memory error was detected on DIMM_A1.",
#   "Severity": "Warning"
# }
There it is. Same story, but now you get a DIMM slot name directly in the JSON — no
cross-referencing with dmidecode needed.
Temperature and power via Redfish¶
# Thermal health
curl -sk -u $CREDS \
https://$BMC/redfish/v1/Chassis/System.Embedded.1/Thermal \
| jq '[.Temperatures[] | {Name, ReadingCelsius, Health: .Status.Health}]'
# Power consumption
curl -sk -u $CREDS \
https://$BMC/redfish/v1/Chassis/System.Embedded.1/Power \
| jq '{
Watts: .PowerControl[0].PowerConsumedWatts,
PSUs: [.PowerSupplies[] | {Name, Health: .Status.Health}]
}'
Gotcha: Redfish URIs differ by vendor. Dell uses `/redfish/v1/Systems/System.Embedded.1`. HPE uses `/redfish/v1/Systems/1`. Never hardcode URIs in automation — always discover from the service root:
curl -sk -u $CREDS https://$BMC/redfish/v1/Systems | jq '.Members[]."@odata.id"'
Flashcard Check #2¶
| Question | Answer |
|---|---|
| What transport does Redfish use? | HTTPS (TCP 443) with JSON payloads |
| What transport does IPMI use? | RMCP/RMCP+ over UDP 623 |
| Why is IPMI's RAKP authentication dangerous? | It returns a password hash to any unauthenticated client — crackable offline (CVE-2013-4786) |
| How do you discover the correct System URI in Redfish? | GET /redfish/v1/Systems and read .Members[0]."@odata.id" |
Part 7: The Server's Serial Console — SOL¶
What if you can't SSH in at all? The server is up (BMC says power is on) but the OS is unreachable. You need to see what's on the screen.
Serial-over-LAN (SOL) tunnels the server's serial console through the BMC to your terminal. You see POST messages, GRUB menus, kernel boot output, kernel panics — everything.
ipmitool -I lanplus -H $BMC -U admin -P $PASS sol activate
You're now looking at the server's console. If the OS is at a login prompt, you can type credentials. If the kernel is panicking, you see the panic message. If it's stuck in GRUB, you can choose a different kernel.
To disconnect: type ~. (tilde, then period). If you're connected through SSH, use ~~.
to avoid triggering SSH's own escape.
Gotcha: Only one SOL session at a time. If you get "SOL session already active," a stale session exists. Kill it first:
ipmitool -I lanplus -H $BMC -U admin -P $PASS sol deactivate
Then reconnect.
SOL has no Redfish equivalent. This is the main reason ipmitool persists in modern
environments — when you need an interactive console, SOL is it.
Part 8: The Physical Layer You Can't Ignore¶
Now that the immediate crisis is under control (DIMM identified, replacement scheduled, monitoring alert set), let's zoom out. What's actually in that server, and how does it all fit together?
Form Factors¶
| Form Factor | Height | Typical Use | Trade-offs |
|---|---|---|---|
| 1U | 1.75 inches | Web servers, compute nodes | Fewer drive bays, limited PCIe, louder fans |
| 2U | 3.5 inches | General purpose, storage | Good balance of density and expandability |
| 4U | 7 inches | GPU servers, dense storage | Lots of room, lots of power, lots of heat |
| Blade | Varies | High-density compute | Shared chassis, power, networking — complex |
Trivia: A 1U server's fans move 80–120 CFM (cubic feet per minute) of air. A full rack of 42 such servers pushes 3,300–5,000 CFM — roughly equivalent to a whole-house fan. At full speed, server fans produce over 70 dB. Datacenter workers wear hearing protection during extended rack work.
CPU and NUMA — Why Memory Location Matters¶
Modern servers have 1–2 CPU sockets. Each socket has its own local memory bank. This architecture is called NUMA (Non-Uniform Memory Access).
┌─────────────────┐ ┌─────────────────┐
│ CPU Socket 0 │ │ CPU Socket 1 │
│ 16-64 cores │ │ 16-64 cores │
└────────┬────────┘ └────────┬────────┘
│ │
┌──────┴──────┐ ┌──────┴──────┐
│ Local RAM │ │ Local RAM │
│ ~80ns │ │ ~80ns │
│ (DIMM_A1-A8)│ │ (DIMM_B1-B8)│
└─────────────┘ └─────────────┘
│ Interconnect │
└───────────────────────┘
~130ns cross-socket
Accessing local memory: ~80 nanoseconds. Accessing the other socket's memory: ~130ns — a 60% penalty. If a database process on CPU 0 accidentally allocates all its memory on CPU 1's bank, every memory access pays that penalty. This is why database tuning guides always mention "NUMA pinning."
# See your NUMA topology
numactl --hardware
# available: 2 nodes (0-1)
# node 0 cpus: 0 1 2 3 4 5 6 7 16 17 18 19 20 21 22 23
# node 0 size: 128000 MB
# node 1 cpus: 8 9 10 11 12 13 14 15 24 25 26 27 28 29 30 31
# node 1 size: 128000 MB
# node distances:
# node 0 1
# 0: 10 21
# 1: 21 10
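The distance matrix is in relative units: 10 means local, 21 means remote. Plugging in the approximate latencies from the diagram shows where the "60% penalty" figure comes from:

```shell
# Back-of-envelope cross-socket penalty, using the approximate
# latencies from the NUMA diagram (80 ns local, 130 ns remote).
awk 'BEGIN {
  local_ns = 80; remote_ns = 130
  printf "cross-socket penalty: %.1f%%\n", (remote_ns - local_ns) / local_ns * 100
}'
# cross-socket penalty: 62.5%
```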
Interview Bridge: "Explain NUMA and why it matters" is a common infrastructure interview question. The answer: memory access time depends on which CPU socket is accessing which memory bank. Performance-sensitive applications should be pinned to a single NUMA node using
numactl --cpunodebind=0 --membind=0 <command>.
Disks — SAS vs SATA vs NVMe¶
| | SATA | SAS | NVMe |
|---|---|---|---|
| Interface | SATA bus | SAS bus | PCIe lanes (direct to CPU) |
| Max throughput | ~600 MB/s | ~1,200 MB/s (12Gbps SAS) | ~7,000+ MB/s |
| Typical use | Bulk storage, cold data | Enterprise spinning disks, reliable SSDs | Primary storage, databases, anything fast |
| Hot-swap | Yes (with backplane) | Yes (with backplane) | Yes (U.2/U.3 form factors) |
| RAID controller needed? | Yes for hardware RAID | Yes for hardware RAID | No — NVMe talks directly to CPU |
Under the Hood: NVMe (Non-Volatile Memory Express), introduced in 2011, bypasses the traditional storage controller entirely. SATA and SAS drives talk to a RAID controller or HBA, which talks to the CPU over PCIe. NVMe drives connect directly to PCIe lanes, eliminating the controller bottleneck — one of the largest interface-level performance jumps in storage history.
SMART — Your Disks Are Talking to You¶
Every disk tracks its own health using SMART (Self-Monitoring, Analysis and Reporting Technology). Three attributes predict failure:
smartctl -A /dev/sda | grep -E "Reallocated|Pending|Uncorrect"
# 5 Reallocated_Sector_Ct 0 ← bad sectors remapped (>0 = warning)
# 197 Current_Pending_Sector 0 ← sectors awaiting remap (>0 = warning)
# 198 Offline_Uncorrectable 0 ← unrecoverable read errors (>0 = replace now)
Remember: The SMART death trio: R-P-U. Reallocated (bad sectors already moved to spare area), Pending (sectors waiting for a retry or remap), Uncorrectable (sectors that failed and can't be fixed). Any non-zero on these three = start planning a replacement.
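Checking the trio is a one-liner. A sketch that flags any non-zero R-P-U attribute; the three sample lines stand in for real `smartctl -A` output (which has more columns, but the raw value is still the last field):

```shell
# Flag any non-zero member of the SMART R-P-U trio.
# Sample lines mimic `smartctl -A` with columns trimmed to id, name, raw value.
printf '%s\n' \
  '  5 Reallocated_Sector_Ct   12' \
  '197 Current_Pending_Sector   0' \
  '198 Offline_Uncorrectable    0' \
| awk '/Reallocated|Pending|Uncorrect/ && $NF > 0 { print "REPLACE-SOON: " $2 }'
# REPLACE-SOON: Reallocated_Sector_Ct
```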
# NVMe drives have their own health metric
nvme smart-log /dev/nvme0
# critical_warning : 0
# temperature : 35 C
# percentage_used : 3% ← 100% = rated endurance reached
# media_errors : 0
Trivia: Google's 2007 study "Failure Trends in a Large Disk Drive Population" found that 36% of failed drives had zero SMART warnings beforehand. But drives with even one reallocated sector were 14x more likely to fail within 60 days. SMART isn't a crystal ball — but when it does warn you, listen.
RAID — Because Disks Will Fail¶
| Level | Minimum Disks | Survives | Use Case |
|---|---|---|---|
| RAID 0 | 2 | 0 failures (striping only) | Temp data, scratch |
| RAID 1 | 2 | 1 failure (mirror) | OS boot drives |
| RAID 5 | 3 | 1 failure (striping + parity) | Read-heavy general purpose |
| RAID 6 | 4 | 2 failures (double parity) | Large arrays, safety margin |
| RAID 10 | 4 | 1 per mirror pair | Databases, write-heavy |
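The capacity cost of each level follows directly from the table. A quick sketch, with per-disk sizes in GB:

```shell
# Usable capacity per RAID level: usable <level> <disk-count> <per-disk-GB>
usable() {
  case $1 in
    0)  echo $(( $2 * $3 )) ;;          # stripe: all capacity, no protection
    1)  echo "$3" ;;                    # mirror: one disk's worth
    5)  echo $(( ($2 - 1) * $3 )) ;;    # one disk of parity
    6)  echo $(( ($2 - 2) * $3 )) ;;    # two disks of parity
    10) echo $(( $2 / 2 * $3 )) ;;      # half the disks hold mirrors
  esac
}

usable 5 4 1000    # prints: 3000
usable 6 4 1000    # prints: 2000
usable 10 4 1000   # prints: 2000
```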
# Check RAID status (MegaRAID / Dell PERC)
storcli /c0/vall show
# DG/VD TYPE State Access Consist Cache Size
# 0/0 RAID1 Optimal RW Yes RWBD 446.625 GB
# Software RAID
cat /proc/mdstat
# md0 : active raid1 sda1[0] sdb1[1]
# 1953513472 blocks [2/2] [UU] ← UU = both drives healthy
Gotcha: RAID is not a backup. A deleted file on RAID 10 is deleted across all mirrors simultaneously. RAID protects against hardware disk failure. It does NOT protect against accidental deletion, corruption, ransomware, or controller failure.
Part 9: Power — The Boring Thing That Ruins Everything¶
Redundant Power Supplies¶
Enterprise servers have two PSUs connected to separate PDUs (Power Distribution Units) fed by separate circuits — the A-feed and B-feed. If one circuit trips, the other keeps the server running.
# Check PSU status via IPMI
ipmitool -I lanplus -H $BMC -U admin -P $PASS sdr type "Power Supply"
# PSU1 Status | 0x01 | ok
# PSU2 Status | 0x01 | ok
# Power consumption
ipmitool -I lanplus -H $BMC -U admin -P $PASS dcmi power reading
# Instantaneous power reading: 287 Watts
# Minimum during sampling period: 245 Watts
# Maximum during sampling period: 342 Watts
# Average power reading: 278 Watts
Mental Model: Think of redundant PSUs like a two-lane bridge. Traffic flows through both lanes normally. If one lane closes, all traffic can still cross on the remaining lane — as long as the remaining lane has enough capacity. A 750W PSU in a server drawing 287W has plenty of headroom. A 500W PSU in a server drawing 450W at peak... doesn't.
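You can sanity-check that headroom with the numbers from `dcmi power reading`. The 750W rating below is an assumption for illustration; read yours off the PSU label or the FRU data:

```shell
# Can one PSU carry the whole load if its partner fails?
psu_watts=750     # assumed rating for this example; check your PSU's label
peak_draw=342     # peak from `dcmi power reading` above
if [ "$peak_draw" -lt "$psu_watts" ]; then
  echo "OK: single-PSU headroom is $(( psu_watts - peak_draw )) W"
else
  echo "DANGER: one PSU cannot carry the peak load alone"
fi
# OK: single-PSU headroom is 408 W
```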
Trivia: Server PSUs are rated by 80 Plus efficiency certification: Bronze (82–85%), Gold (87–90%), Platinum (90–94%), Titanium (91–96%). The difference between basic 80 Plus and Titanium at 50% load is about 16 percentage points. Across thousands of servers, that's millions of dollars per year in electricity.
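The back-of-envelope math behind that claim, with assumed numbers (300W average load, $0.10/kWh, 1,000 servers, 85% vs 96% efficiency at this load point):

```shell
# Annual electricity saved by higher PSU efficiency; all inputs are
# illustrative assumptions, not measured fleet data.
awk 'BEGIN {
  load = 300; hours = 8760; rate = 0.10; servers = 1000
  bronze_wall   = load / 0.85     # wall draw = load / efficiency
  titanium_wall = load / 0.96
  saved_kwh = (bronze_wall - titanium_wall) * hours * servers / 1000
  printf "saved: %.0f kWh/yr, about $%.0f\n", saved_kwh, saved_kwh * rate
}'
# saved: 354265 kWh/yr, about $35426
```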
What the Power Restore Policy means¶
ipmitool -I lanplus -H $BMC -U admin -P $PASS chassis status | grep "Power Restore"
# Power Restore Policy : always-on
This determines what happens after a power outage: - always-on — server auto-boots when power returns (datacenter default) - previous — returns to whatever state it was in before the outage - always-off — stays off until someone manually powers it on
Part 10: Hardware Inventory — dmidecode and lshw¶
Two tools you'll reach for every time you need to know what's physically in a server.
dmidecode — Reads BIOS/UEFI firmware tables¶
# System identification (what is this machine?)
dmidecode -t system
# Manufacturer: Dell Inc.
# Product Name: PowerEdge R750
# Serial Number: ABC1234
# All memory DIMMs
dmidecode -t memory | grep -E "Locator:|Size:|Speed:|Manufacturer:" | head -16
# Locator: DIMM_A1
# Size: 32 GB
# Speed: 3200 MT/s
# Manufacturer: Samsung
# Locator: DIMM_A2
# Size: 32 GB
# ...
# BIOS version (critical for firmware compliance)
dmidecode -t bios
# Version: 2.19.1
# Release Date: 01/15/2026
# One-liner: get the serial number
dmidecode -s system-serial-number
# ABC1234
lshw — Hardware lister¶
# Concise hardware summary
lshw -short
# H/W path Device Class Description
# /0/0 memory 256GiB System Memory
# /0/0/0 memory 32GiB DIMM DDR4 3200 MHz
# /0/100/1f.2 storage SATA Controller
# /0/100/3/0 eno1 network Ethernet Controller X710
# Filter by class
lshw -class network
lshw -class disk
lshw -class memory
Gotcha: `dmidecode` requires root and reads firmware tables. In virtual machines, it reports the hypervisor's emulated hardware, not physical hardware. Run `systemd-detect-virt` first to check if you're on bare metal before trusting `dmidecode` for inventory.
Part 11: Thermal Management — When the Data Center Gets Hot¶
What if the problem had been thermal? Here's how you'd know.
# CPU temperatures via lm-sensors
sensors
# coretemp-isa-0000
# Core 0: +62.0°C (high = +85.0°C, crit = +100.0°C)
# Core 1: +60.0°C
# IPMI inlet and exhaust temps
ipmitool sensor list | grep -i temp
# Inlet Temp | 24.000 | degrees C | ok
# Exhaust Temp | 38.000 | degrees C | ok
# CPU1 Temp | 62.000 | degrees C | ok
Thermal throttling is when the CPU reduces its clock speed to prevent overheating. Your 3.0 GHz processor suddenly runs at 1.2 GHz. Applications slow to a crawl.
Common causes:
1. Dead fan — ipmitool sensor list | grep -i fan shows 0 RPM
2. Blocked airflow — missing blanking panels in the rack, cables obstructing the intake
3. CRAC failure — the Computer Room Air Conditioner serving your row is down
4. Dust buildup — especially on air filters and heat sinks
# Check if CPU is actually throttling
cat /proc/cpuinfo | grep -i mhz | head -2
# cpu MHz : 1200.000 ← should be ~3000 if not throttled
# Check frequency governor
cpupower frequency-info
# current CPU frequency: 1.20 GHz ← throttled
Gotcha: Many Linux distributions ship with the `powersave` CPU governor, which intentionally runs CPUs at low frequency. On a production server, this adds latency. Set `performance` mode: `cpupower frequency-set -g performance`. Make it permanent with `tuned-adm profile throughput-performance`.
Part 12: The Server Lifecycle¶
A server doesn't just appear in a rack. Here's the full journey from purchase order to recycling bin.
Procurement → Receive & Asset Tag → Rack & Cable → BIOS/BMC Config
→ PXE Boot → OS Install → Configuration Management → Burn-in Test
→ Production → Monitoring → Maintenance Windows
→ Decommission → Data Wipe → Disposal/Recycle
PXE Boot — The Network Bootstrap¶
When a server first boots, it has no operating system. PXE (Preboot eXecution Environment, pronounced "pixie") lets the NIC firmware download a bootloader over the network:
Server powers on → NIC PXE ROM → DHCP request → DHCP server replies with:
- IP address
- TFTP server address
- Boot filename (pxelinux.0 or GRUB)
→ NIC downloads bootloader via TFTP → Bootloader downloads kernel + initramfs
→ Kernel boots → Installer runs → Kickstart/Preseed automates answers
→ OS installed
You can trigger PXE boot remotely:
# Via IPMI
ipmitool -I lanplus -H $BMC -U admin -P $PASS chassis bootdev pxe
ipmitool -I lanplus -H $BMC -U admin -P $PASS power cycle
# Via Redfish
curl -sk -u $CREDS \
-X PATCH https://$BMC/redfish/v1/Systems/System.Embedded.1 \
-H 'Content-Type: application/json' \
-d '{"Boot": {"BootSourceOverrideTarget": "Pxe", "BootSourceOverrideEnabled": "Once"}}'
Name Origin: PXE stands for Preboot eXecution Environment. Developed by Intel in 1999. The NIC firmware contains a mini DHCP client and TFTP downloader — just enough to bootstrap a full OS installer without any local storage.
Gotcha: `chassis bootdev pxe` is a one-shot override by default — it applies to the next boot only, then reverts to the BIOS boot order. If you accidentally add `options=persistent`, the server will PXE boot on every reboot and could re-image itself.
Flashcard Check #3¶
| Question | Answer |
|---|---|
| What does PXE stand for? | Preboot eXecution Environment (pronounced "pixie") |
| What protocol does PXE use to download the bootloader? | TFTP (Trivial File Transfer Protocol) |
| What does chassis bootdev pxe do? | Sets the server to PXE boot on its next reboot only (one-shot override) |
| What happens if you add options=persistent to a bootdev command? | The boot override persists across all reboots until explicitly changed |
Part 13: Wrapping Up the Incident¶
Let's come back to web-prod-07. Here's what you did and what happens next:
- Confirmed the server was up — power status, chassis status
- Found the smoking gun — SEL full of correctable ECC errors from one DIMM
- Identified the exact DIMM — edac-util + dmidecode = DIMM_A1, with slot and serial number
- Checked for secondary problems — sensors all nominal, no thermal or power issues
- Filed a ticket — the datacenter team will replace the DIMM during the next maintenance window (DIMMs aren't hot-swappable; the server has to come down briefly)
The server stays in production. The errors are correctable, so data integrity is maintained. But you add monitoring: if CE rate exceeds 10/hour on any DIMM, page the on-call. If an uncorrectable error appears, evacuate workload immediately.
# Quick monitoring check you can add to cron
# (in `edac-util -s` output the correctable count is the third field from the end)
edac-util -s | awk '/Correctable/ && $(NF-2) > 0 {print "WARNING: " $0}'
Exercises¶
Exercise 1: Read the room (Quick win — 2 minutes)¶
If you have access to any Linux machine (bare metal, VM, or even WSL):
lshw -short 2>/dev/null || echo "Install with: apt install lshw / dnf install lshw"
dmidecode -t system 2>/dev/null || echo "Requires root: sudo dmidecode -t system"
What manufacturer and model are you running on? If it's a VM, what does systemd-detect-virt
say?
Exercise 2: Check your disks (5 minutes)¶
# Install smartmontools if needed: apt install smartmontools / dnf install smartmontools
sudo smartctl -H /dev/sda # or /dev/nvme0
sudo smartctl -A /dev/sda | grep -E "Reallocated|Pending|Uncorrect"
Are any of the R-P-U trio non-zero?
What if I'm on a VM?
Virtual disks don't have real SMART data. You'll see `SMART support is: Unavailable` or the emulated controller won't report real attributes. This exercise is most useful on bare metal or on machines with physical drives passed through.
Exercise 3: Build the triage ladder (10 minutes)¶
You get a page: "server X is unreachable." You have BMC access. Write out the triage steps in order, including the exact commands. Don't look back at the lesson — write from memory, then compare.
Suggested answer
1. ipmitool power status → Is it on?
2. ipmitool chassis status → Any fault flags (fan, power, drive)?
3. ipmitool sel elist | tail -20 → Recent hardware events?
4. ipmitool sdr type Temperature → Thermal issue?
5. ipmitool sdr type Fan → Dead fan?
6. ipmitool sdr type "Power Supply" → PSU failure?
7. ipmitool sol activate → What's on the console?
Exercise 4: Redfish exploration (15 minutes)¶
If you have a Redfish-capable BMC (iDRAC9, iLO5, or newer):
BMC=<your-bmc-ip>
CREDS="admin:password"
# 1. Discover the service root
curl -sk -u $CREDS https://$BMC/redfish/v1/ | jq .
# 2. Find the system URI dynamically
curl -sk -u $CREDS https://$BMC/redfish/v1/Systems | jq '.Members[]."@odata.id"'
# 3. Pull system health
curl -sk -u $CREDS https://$BMC/redfish/v1/Systems/<your-system-id> \
| jq '{Model, Health: .Status.Health, PowerState}'
# 4. Check for any non-OK SEL events
curl -sk -u $CREDS \
https://$BMC/redfish/v1/Managers/<your-manager-id>/LogServices/Sel/Entries \
| jq '[.Members[] | select(.Severity != "OK")] | length'
No BMC available?
The DMTF maintains a Redfish mockup server for testing. It won't have real sensor data, but the API structure is identical. Check the DMTF's [Redfish Mockup Server](https://github.com/DMTF/Redfish-Mockup-Server) on GitHub.
Cheat Sheet¶
IPMI Quick Reference¶
| Task | Command |
|---|---|
| Power status | ipmitool -I lanplus -H $BMC -U $USER -E power status |
| Graceful shutdown | ipmitool ... power soft |
| Hard power cycle | ipmitool ... power cycle |
| All sensors | ipmitool ... sdr list |
| Temperature only | ipmitool ... sdr type Temperature |
| Fan status | ipmitool ... sdr type Fan |
| Event log | ipmitool ... sel elist |
| SEL capacity | ipmitool ... sel info |
| Clear SEL | ipmitool ... sel clear (archive first!) |
| Serial console | ipmitool ... sol activate (disconnect: ~.) |
| PXE boot next | ipmitool ... chassis bootdev pxe |
| Blink chassis LED | ipmitool ... chassis identify 30 |
| BMC info | ipmitool ... mc info |
| BMC reset | ipmitool ... mc reset cold |
| Power draw | ipmitool ... dcmi power reading |
Hardware Diagnostics Quick Reference¶
| Task | Command |
|---|---|
| Hardware inventory | lshw -short |
| System serial number | dmidecode -s system-serial-number |
| Memory DIMMs | dmidecode -t memory |
| ECC errors | edac-util -s / edac-util -l |
| Disk health | smartctl -H /dev/sda |
| SMART death trio | smartctl -A /dev/sda \| grep -E "Reallocated\|Pending\|Uncorrect" |
| NVMe health | nvme smart-log /dev/nvme0 |
| NIC errors | ethtool -S eth0 \| grep -E "error\|drop\|crc" |
| Kernel hw errors | dmesg -T -l err,crit |
| NUMA topology | numactl --hardware |
| CPU throttling | cpupower frequency-info |
| RAID status | storcli /c0/vall show or cat /proc/mdstat |
Redfish Patterns¶
# Setup
BMC=10.0.10.5; CREDS="admin:pass"
# Discover
curl -sk -u $CREDS https://$BMC/redfish/v1/Systems | jq '.Members[]."@odata.id"'
# Health
curl -sk -u $CREDS https://$BMC/redfish/v1/Systems/<id> | jq '.Status.Health'
# Power cycle
curl -sk -u $CREDS -X POST \
https://$BMC/redfish/v1/Systems/<id>/Actions/ComputerSystem.Reset \
-H 'Content-Type: application/json' -d '{"ResetType": "ForceRestart"}'
# SEL (non-OK only)
curl -sk -u $CREDS https://$BMC/.../LogServices/Sel/Entries \
| jq '[.Members[] | select(.Severity != "OK")]'
Takeaways¶
- Every server is two computers. The BMC is always on, always reachable, and can tell you what the OS cannot. Cable it, configure it, monitor it.
- The SEL is your black box. When a server misbehaves, `ipmitool sel elist` is your first command — before SSH, before logs, before blaming the application.
- Correctable ECC errors are canaries, not noise. A burst of CEs predicts an uncorrectable error (kernel panic). Monitor the rate, not just the count.
- SMART R-P-U: the three horsemen. Reallocated, Pending, Uncorrectable — any non-zero value means the disk is actively dying.
- Redfish is the future; IPMI persists. Use Redfish for automation and fleet management. Use `ipmitool` for SOL and legacy hardware. Know both.
- Check the hardware before blaming the software. A $40 DIMM replacement can save $15,000 in debugging time. The physical layer is always a suspect.
Related Lessons¶
- What Happens When You Press Power — traces the full boot sequence from power button to login prompt
- PXE Boot: From Network to Running Server — deep dive on the network bootstrap process
- RAID: Why Your Disks Will Fail — disk redundancy, rebuild strategies, when RAID isn't enough
- The Disk That Filled Up — storage diagnostics from a different angle
- The Backup Nobody Tested — why RAID is not a backup, and what is