Portal | Level: L1: Foundations | Topics: Rack & Stack, Out-of-Band Management, RAID, Server Hardware | Domain: Datacenter & Hardware
Data Center & Dell Server Management - Skill Check¶
Mental model (bottom-up)¶
A data center is a physical building that provides power, cooling, and network to racks of servers. Everything in Kubernetes ultimately runs on physical hardware managed by out-of-band controllers (iDRAC/BMC). Most outages trace back to physical causes: power, cooling, cabling, or hardware failure.
Visual stack¶
[Kubernetes ] Pods, services, ingress, operators
|
[OS + Config Mgmt ] Linux, Ansible, cloud-init
|
[Out-of-Band Mgmt ] iDRAC / BMC — always-on, independent of OS
|
[Firmware ] BIOS/UEFI, PERC (RAID), NIC, iDRAC firmware
|
[Server Hardware ] CPU, DIMMs, disks (SSD/HDD), PSUs, fans
|
[Rack + Network ] Rails, switches (ToR), patch panels, cables
|
[Power + Cooling ] PDUs, UPS, breakers, CRAC/CRAH, containment
|
[Facility ] Building, utility power, generator, fire suppression
Glossary (jargon explained)¶
- BMC - Baseboard Management Controller; tiny computer on the motherboard for remote management
- iDRAC - Dell's BMC implementation (integrated Dell Remote Access Controller)
- Redfish - Modern REST API standard for server management (replaces IPMI)
- IPMI - Older binary protocol for BMC communication (UDP port 623)
- RACADM - Dell's CLI tool for iDRAC management
- SEL - System Event Log; hardware event history stored in BMC non-volatile memory
- SCP - Server Configuration Profile; XML/JSON export of entire server config
- PERC - PowerEdge RAID Controller; Dell's hardware RAID controller
- OME - OpenManage Enterprise; Dell's fleet management console
- PXE - Preboot Execution Environment; network boot protocol (DHCP + TFTP)
- PDU - Power Distribution Unit; distributes power to servers in a rack
- UPS - Uninterruptible Power Supply; battery backup for clean shutdown / generator start
- CRAC/CRAH - Computer Room Air Conditioner/Handler; data center cooling units
- ToR - Top-of-Rack; switch placement at the top of each rack
- EoR - End-of-Row; centralized switch placement at the row end
- DAC - Direct Attach Copper; short-range cable for server-to-switch (1-7m)
- ASHRAE - American Society of Heating, Refrigerating and Air-Conditioning Engineers; sets temperature/humidity guidelines for data centers
- CMDB - Configuration Management Database; asset/relationship tracking system
- ECC - Error-Correcting Code memory; detects and corrects single-bit errors
- NIST 800-88 - Standard for media sanitization (disk wipe procedures)
Core internals you should actually know¶
- iDRAC is always on — it has its own power, CPU, and NIC. It works even when the server OS is down or the server won't POST.
- Redfish is the modern API — HTTPS + JSON, self-describing. IPMI is legacy (binary, weak security). Default to Redfish for new automation.
- Dual power feeds (A+B) — each PSU connects to a different PDU on a different circuit. Losing one feed should not take down a server.
- Hot aisle / cold aisle — servers intake cool air from the front (cold aisle) and exhaust hot air from the rear (hot aisle). Blanking panels prevent hot air recirculation.
- SEL is the hardware black box — always check it first for hardware issues. It records events even across reboots and OS reinstalls.
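The SEL-first habit pairs naturally with Redfish. A minimal sketch of filtering a LogServices response for non-OK entries — the URI is the iDRAC9-style path (verify on your firmware), and a live query would be an authenticated HTTPS GET, omitted here:

```python
# iDRAC9-style Redfish SEL endpoint (assumption: verify against your firmware).
SEL_ENTRIES_URI = "/redfish/v1/Managers/iDRAC.Embedded.1/LogServices/Sel/Entries"

def non_ok_entries(payload: dict) -> list[tuple[str, str]]:
    """Return (severity, message) for every log entry not marked OK."""
    return [
        (e.get("Severity", "Unknown"), e.get("Message", ""))
        for e in payload.get("Members", [])
        if e.get("Severity") != "OK"
    ]

# Example payload, trimmed to the Redfish LogEntry fields used above:
sample = {
    "Members": [
        {"Severity": "OK", "Message": "The power supplies are redundant."},
        {"Severity": "Critical", "Message": "Power supply 2 is failed."},
    ]
}
print(non_ok_entries(sample))  # [('Critical', 'Power supply 2 is failed.')]
```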
Common failure modes (ops reality)¶
- Blinking amber LED with server down — usually PSU, memory, or CPU fault. Check SEL first.
- "Disk predicted failure" — SMART / PERC flagged it. Replace proactively; if there's a hot spare, rebuild starts automatically.
- ECC errors climbing — DIMM degrading. Correctable errors are fine temporarily, but plan replacement before it becomes uncorrectable.
- Thermal shutdown — check blanking panels, CRAC/CRAH status, blocked floor tiles.
- PXE boot fails — check DHCP (is the server getting an IP?), TFTP (is the bootloader serving?), boot order (is PXE first in BIOS?).
Phase 1: Dell Server Management (easy -> hard)¶
- What is iDRAC and why does it matter?
  It's Dell's BMC — an always-on management controller with its own NIC, independent of the host OS. It lets you remotely manage power, console, firmware, and sensors even when the server is powered off or crashed.
- How do you check a server's hardware health remotely?
  Read the System Event Log (SEL) via iDRAC: `racadm getsel` or the Redfish LogServices endpoint. Look for non-OK severity entries.
- What's the difference between RACADM and the Redfish API?
  RACADM is Dell's proprietary CLI (SSH or local). Redfish is the industry-standard REST API (HTTPS + JSON). Both configure iDRAC; Redfish is preferred for automation because it's scriptable with any HTTP client.
- What is a Server Configuration Profile (SCP)?
  An XML or JSON export of the entire server configuration (BIOS, RAID, NIC, iDRAC settings). You can export it from one server and import it to another for consistent configuration.
- Your monitoring shows a server's inlet temperature is 35C. What do you check?
  35C is above the ASHRAE-recommended range (18-27C) and at the limit of the A2 allowable range. Check: blanking panels (any gaps?), cold aisle containment (doors closed?), CRAC/CRAH units running?, blocked perforated floor tiles?, exhaust leaking in from a neighboring rack? Also check for failed fans in the server itself.
- What's the difference between UEFI and Legacy BIOS boot?
  UEFI uses GPT partition tables, supports Secure Boot, and has faster POST. Legacy BIOS uses MBR and is limited to 2TB boot disks. All new deployments should use UEFI.
- A server has dual PSUs. PSU 1 fails. What happens?
  The server keeps running on PSU 2 (redundant mode). iDRAC logs a critical SEL event and sends an alert. You hot-swap PSU 1 without downtime. The key: each PSU should be on a different power feed (A/B).
- What BIOS settings should you verify for a Kubernetes worker node?
  UEFI boot mode, virtualization (VT-x/AMD-V) enabled, Hyper-Threading enabled, NUMA optimization enabled, Performance system profile, Secure Boot enabled if the OS supports it, TPM 2.0 enabled, AC Power Recovery set to Last or On.
- Explain RAID 1, RAID 5, RAID 6, and RAID 10. When would you use each?
  RAID 1: mirror, 2 disks, 50% capacity, 1-disk fault tolerance — boot/OS drives. RAID 5: striping + parity, 3+ disks, (N-1) capacity, 1-disk tolerance — read-heavy general storage. RAID 6: double parity, 4+ disks, (N-2) capacity, 2-disk tolerance — large arrays. RAID 10: striped mirrors, 4+ disks, 50% capacity, 1 disk per mirror pair — databases, write-heavy workloads.
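The capacity and fault-tolerance arithmetic above can be captured in a small helper — a sketch that assumes equal-size disks and, for RAID 10, counts the worst-case guarantee of one disk per mirror pair:

```python
def raid_usable(level: str, disks: int, disk_tb: float) -> tuple[float, int]:
    """Usable capacity in TB and guaranteed disk fault tolerance."""
    if level == "1":          # mirror: capacity of one disk
        return disk_tb, disks - 1
    if level == "5":          # striping + single parity
        return (disks - 1) * disk_tb, 1
    if level == "6":          # double parity
        return (disks - 2) * disk_tb, 2
    if level == "10":         # striped mirrors; worst case one loss per pair
        return (disks // 2) * disk_tb, 1
    raise ValueError(f"unsupported RAID level: {level}")

print(raid_usable("5", 4, 8.0))   # (24.0, 1)
print(raid_usable("6", 6, 8.0))   # (32.0, 2)
```

Note how RAID 6 gives up one more disk of capacity in exchange for surviving a second failure — the trade that matters for large NL-SAS arrays.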
- A physical disk in a RAID 5 shows "Predicted Failure". Walk through the replacement procedure.
  1) Identify the failing disk: `perccli /c0/eall/sall show`. 2) Blink its LED: `perccli /c0/e252/s2 start locate`. 3) Check whether a hot spare is configured — if yes, the rebuild starts automatically onto the spare. 4) Physically replace the failing disk (hot-swap). 5) If no hot spare, configure the new disk as one: `perccli /c0/e252/s2 add hotsparedrive dgs=0`, or let it auto-rebuild. 6) Monitor the rebuild: `perccli /c0/eall/sall show rebuild`. Rebuild times: ~2-4 hours for a 1TB 10K SAS drive, 12-24+ hours for 8TB NL-SAS.
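Those rebuild-time figures fall out of simple arithmetic: whole-disk capacity divided by the sustained rebuild rate. A sketch — the 100 MB/s default is an illustrative assumption; real rates vary with controller load and array activity:

```python
def rebuild_hours(capacity_tb: float, rate_mb_s: float = 100) -> float:
    """Whole-disk capacity divided by sustained rebuild rate, in hours."""
    megabytes = capacity_tb * 1_000_000   # decimal TB, as drive vendors count
    return megabytes / rate_mb_s / 3600

print(round(rebuild_hours(1), 1))   # 2.8  -> inside the ~2-4h window for 1TB
print(round(rebuild_hours(8), 1))   # 22.2 -> inside the 12-24h+ window for 8TB
```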
Phase 2: Out-of-Band & Provisioning (easy -> hard)¶
- What is IPMI and why is it being replaced by Redfish?
  IPMI is the legacy protocol for BMC communication: binary, UDP port 623, weak authentication. Redfish is the modern replacement: HTTPS + JSON, TLS, a self-describing API. Redfish is scriptable, secure, and human-readable.
- What is PXE boot and what components are needed?
  PXE (Preboot Execution Environment) lets a server boot from the network. It requires: 1) a DHCP server (hands out an IP + the boot filename), 2) a TFTP server (serves the bootloader), 3) an HTTP server (serves the kernel, initrd, and kickstart/autoinstall files), 4) a PXE-capable NIC on the server (usually NIC1).
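The DHCP server has to hand out a boot filename that matches the client's boot mode; RFC 4578 defines the architecture codes (DHCP option 93) it can key on. A sketch — the filenames are typical distro defaults, an assumption, not universal:

```python
# Architecture codes from RFC 4578 (DHCP option 93) -> typical boot files.
BOOT_FILES = {
    0: "pxelinux.0",     # Intel x86PC (legacy BIOS)
    7: "grubx64.efi",    # EFI BC (commonly x86-64 UEFI)
    9: "grubx64.efi",    # EFI x86-64
    11: "grubaa64.efi",  # EFI ARM64
}

def boot_filename(client_arch: int) -> str:
    """Pick the bootloader filename for the reported client architecture."""
    if client_arch not in BOOT_FILES:
        raise ValueError(f"no boot file mapped for architecture {client_arch}")
    return BOOT_FILES[client_arch]

print(boot_filename(0), boot_filename(7))  # pxelinux.0 grubx64.efi
```

Serving `pxelinux.0` to a UEFI-mode server (or vice versa) is one of the most common PXE failures — the server gets an IP, requests the file, then hangs or errors.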
- Describe the end-to-end flow from racking a new server to it joining a Kubernetes cluster.
  1) Rack and cable (power A+B, iDRAC NIC, data NICs). 2) iDRAC gets an IP via DHCP or manual config. 3) Automation configures iDRAC (credentials, alerts, golden BIOS config, RAID). 4) Set one-time PXE boot + power cycle via Redfish. 5) PXE -> automated OS install (kickstart/autoinstall). 6) Post-install Ansible: hardening, monitoring agents, containerd, k3s/kubeadm join. 7) Node appears in the cluster; uncordon it. 8) Update the CMDB.
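Step 4 uses the standard Redfish boot-override mechanism. A minimal sketch of just the payloads — the URIs follow iDRAC's Redfish tree (verify on your model), and the authenticated HTTPS PATCH/POST calls themselves are omitted:

```python
# iDRAC-style Redfish URIs (assumption: single-system PowerEdge tree).
SYSTEM_URI = "/redfish/v1/Systems/System.Embedded.1"
RESET_URI = SYSTEM_URI + "/Actions/ComputerSystem.Reset"

def one_time_pxe() -> dict:
    """PATCH body for SYSTEM_URI: boot from PXE on the next boot only."""
    return {"Boot": {"BootSourceOverrideTarget": "Pxe",
                     "BootSourceOverrideEnabled": "Once"}}

def force_restart() -> dict:
    """POST body for RESET_URI: power-cycle the host."""
    return {"ResetType": "ForceRestart"}

print(one_time_pxe())
```

Because the override is `Once`, the server falls back to its normal boot order after the OS install completes — no cleanup step needed.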
- What is the Ansible `dellemc.openmanage` collection used for?
  It provides Ansible modules for automating iDRAC management: firmware updates, BIOS configuration, system inventory, power operations, SCP import/export. It lets you manage Dell server fleets as code.
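For example, exporting an SCP as a rollback reference might look like this — a hedged sketch using the collection's `idrac_server_config_profile` module; verify the argument names against the version you have installed:

```yaml
- hosts: localhost
  gather_facts: false
  tasks:
    - name: Export a Server Configuration Profile as a rollback reference
      dellemc.openmanage.idrac_server_config_profile:
        idrac_ip: "{{ idrac_ip }}"
        idrac_user: "{{ idrac_user }}"
        idrac_password: "{{ idrac_password }}"
        command: export
        share_name: /scp-exports        # local path or NFS/CIFS share
        scp_components: ALL
        job_wait: true
```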
- Compare MAAS, Foreman, and Ironic for bare-metal provisioning.
  MAAS (Canonical): best for Ubuntu-first shops; a cloud-like bare-metal model, simple setup, auto-discovery via PXE. Foreman + Katello: best for RHEL/enterprise; full lifecycle with Puppet/Ansible plus content management. Ironic: OpenStack's bare-metal service; best if you already run OpenStack, and the most complex. All three handle IPMI/Redfish power management and PXE provisioning.
- Your PXE boot fails — the server gets a DHCP IP but doesn't load the bootloader. What do you check?
  1) Is the DHCP response including the next-server (TFTP) and filename options? Check the DHCP logs. 2) Is the TFTP server running and reachable from the server's VLAN? 3) Is the boot filename correct for the boot mode? UEFI needs `grubx64.efi`; Legacy needs `pxelinux.0`. 4) Is a firewall blocking TFTP (UDP 69) or the TFTP data port range? 5) Is the file actually present in the TFTP root directory?
- How do you securely wipe disks for server decommissioning?
  Follow NIST 800-88. Clear: single-pass overwrite with random + zeros (`shred -n 1 -z /dev/sdX`). Purge: firmware-level secure erase (ATA Secure Erase for SSDs: `hdparm --security-erase`). Destroy: physical destruction for classified data. Document the method used and get sign-off.
Phase 3: Data Center & Rack Operations (easy -> hard)¶
- How many U is a standard rack? What does "2U server" mean?
  42U. One rack unit (1U) = 1.75 inches (44.45mm). A 2U server occupies two rack units vertically.
- Why do you need blanking panels in empty U-spaces?
  Without blanking panels, hot exhaust air from the rear recirculates through the gaps to the cold aisle (the server intake side). This creates hot spots and reduces cooling efficiency. A single 1U gap can raise inlet temps 2-3C for the servers above it.
- Explain the A+B power feed model for a rack.
  Two independent power paths, Feed A and Feed B, from separate circuits/UPS/panels: PDU A on one side of the rack, PDU B on the other. Each server's PSU 1 connects to PDU A, PSU 2 to PDU B. If the entire A feed fails (breaker, UPS, cable), every server stays up on the B feed. Never exceed 80% of a single circuit's capacity (the NEC continuous-load rule).
- What is a ToR (Top-of-Rack) switch topology and why is it preferred?
  Each rack has its own switch pair (redundant A/B) at the top. Servers connect to their local ToR switches with short cables (1-3m DAC), and the ToR switches uplink to the spine/aggregation layer. It's preferred because of short cable runs, easy management, per-rack fault isolation, and scaling by simply adding racks.
- You need to calculate the power budget for a new rack. Walk through it.
  1) Sum all equipment wattage (servers * watts each + switches + PDU overhead). 2) Divide by 2 for the per-feed draw (A and B each carry half). 3) Calculate circuit capacity: amps * voltage (e.g., 30A * 208V = 6,240W). 4) Apply the 80% rule: usable = 6,240 * 0.8 = 4,992W per feed. 5) Verify the per-feed draw is under 4,992W. Leave headroom for expansion.
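The walkthrough above reduces to a few lines of arithmetic. A sketch — the 8,000W rack total is a hypothetical figure for illustration:

```python
def per_feed_headroom_w(total_watts: float, amps: float, volts: float) -> float:
    """Watts of headroom per feed after the 80% continuous-load rule."""
    per_feed_draw = total_watts / 2     # step 2: A and B each carry half
    usable = amps * volts * 0.8         # steps 3-4: 80% of circuit capacity
    return usable - per_feed_draw

# Hypothetical rack drawing 8,000W total, on the 30A/208V circuit above:
print(round(per_feed_headroom_w(8000, 30, 208)))  # 992 W spare per feed
```

A negative result means the rack overloads a feed during a single-feed failure, when one PDU must carry the whole load.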
- What's the difference between Cat6a, DAC, and fiber OM4? When do you use each?
  Cat6a: copper, 10Gbps up to 100m — management network and moderate-speed connections. DAC (Direct Attach Copper): twinax, 10-100Gbps, 1-7m only — the cheapest, lowest-latency option for server-to-ToR. Fiber OM4: multimode, up to 100Gbps at up to 150m — inter-rack and backbone links where distance or speed exceeds what DAC can do.
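The decision collapses to a rule of thumb on distance and speed. A sketch based only on the ranges above (real cable plant decisions also weigh cost, optics inventory, and port types):

```python
def pick_media(distance_m: float, speed_gbps: float) -> str:
    """Rule-of-thumb media choice from the ranges above."""
    if speed_gbps >= 10 and distance_m <= 7:
        return "DAC"        # cheapest, lowest latency: server-to-ToR
    if speed_gbps <= 10 and distance_m <= 100:
        return "Cat6a"      # copper for management / moderate speed
    if distance_m <= 150:
        return "OM4 fiber"  # inter-rack and backbone links
    raise ValueError("beyond OM4 multimode reach; consider single-mode fiber")

print(pick_media(2, 25), pick_media(60, 10), pick_media(120, 100))
# DAC Cat6a OM4 fiber
```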
- What ASHRAE class are Dell PowerEdge servers typically rated for? What does that mean practically?
  ASHRAE A2: recommended operating range 18-27C (64-81F), allowable up to 35C (95F). Practically: keep the cold aisle at 20-25C. Going above 27C won't kill servers immediately (the allowable range extends higher), but it reduces component lifespan and triggers fan speed increases that raise noise and power consumption.
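For monitoring thresholds, the A2 bands translate directly into a classifier. A sketch — the 10C allowable low bound comes from the A2 specification, not from the text above:

```python
def inlet_temp_status(celsius: float) -> str:
    """Classify an inlet temperature against the ASHRAE A2 bands."""
    if 18 <= celsius <= 27:
        return "recommended"
    if 10 <= celsius <= 35:      # A2 allowable band (10C low bound assumed)
        return "allowable"
    return "out of range"

print(inlet_temp_status(22), inlet_temp_status(35), inlet_temp_status(38))
# recommended allowable out of range
```

The 35C inlet from the earlier monitoring scenario lands at the very edge of "allowable" — technically in spec, but a clear signal to go check the aisle.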
- Describe a proper server decommissioning procedure from start to finish.
  1) Drain workloads: `kubectl drain`, remove from load balancers. 2) Remove from the cluster: `kubectl delete node`. 3) Remove monitoring/alerts, DNS records, DHCP reservations. 4) Wipe disks per NIST 800-88 (Clear or Purge depending on data classification). 5) Reset iDRAC to factory defaults: `racadm racresetcfg`. 6) Disconnect and label all cables; photograph. 7) Remove from the rack. 8) Remove asset tags. 9) Update the CMDB (status: Decommissioned, date, method). 10) Hand off for recycling/remarketing with a certificate of destruction if applicable.
Scenario-Based Questions (hardest)¶
- You're in the data center and a server won't POST. Front panel LED is blinking amber. What's your process?
  See training/interview-scenarios/11-server-wont-post.md for the full scenario. Summary: 1) Check iDRAC remotely (it's independent of host power) — read the SEL. 2) At the rack: check both PSU LEDs and both PDU feeds. 3) Address power first (the most common cause of no-POST). 4) After restoring power, try an iDRAC power-on. 5) Monitor POST via the virtual console. 6) Address secondary issues (ECC, thermal) after POST.
- Your fleet of 50 Dell servers needs a BIOS update. How do you approach this safely?
  1) Test the update on a non-production server first. 2) Export SCPs as rollback references. 3) Use OME compliance baselines to identify which servers need the update. 4) Schedule rolling updates: cordon + drain the k8s node, apply the BIOS update via iDRAC (requires a reboot), verify POST and health, uncordon. 5) Do 1-2 canary servers, watch them for an hour, then batch the rest. 6) Never update all servers simultaneously — maintain cluster quorum.
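The canary-then-batch rollout in steps 4-5 can be sketched as a batching generator — a sketch; the node names and batch size are illustrative, and in practice each batch would be cordoned, drained, updated, and verified before the next starts:

```python
def rolling_batches(servers: list[str], batch_size: int, canaries: int = 2):
    """Yield canaries one at a time, then the remainder in fixed batches."""
    for server in servers[:canaries]:
        yield [server]
    rest = servers[canaries:]
    for i in range(0, len(rest), batch_size):
        yield rest[i:i + batch_size]

fleet = [f"node{i:02d}" for i in range(6)]   # hypothetical node names
print(list(rolling_batches(fleet, batch_size=2)))
# [['node00'], ['node01'], ['node02', 'node03'], ['node04', 'node05']]
```

Keep `batch_size` well below the number of nodes the cluster can lose while maintaining quorum and capacity.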
- A disk in a RAID 5 with 4x 8TB NL-SAS drives fails. The hot spare kicks in and rebuild starts. 6 hours into the 18-hour rebuild, a second disk shows predicted failure. What do you do?
  This is a URE (Unrecoverable Read Error) risk scenario: with RAID 5 and large NL-SAS drives, the probability of hitting a read error during the rebuild is significant. 1) Do NOT pull the second disk — the array is already degraded, and losing another member means total data loss. 2) Gauge how imminent the second failure is (predicted-failure counters vs. hard read errors). 3) Verify the data is backed up (it should be). 4) If possible, copy critical data off the array now, while the rebuild runs. 5) For the future: this is exactly why RAID 6 or RAID 10 should be used for arrays with large (4TB+) drives — RAID 5's rebuild risk there is unacceptable.
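The "significant probability" claim can be made concrete. A back-of-envelope sketch — the 1e-15-errors-per-bit rate is a typical enterprise NL-SAS datasheet figure (an assumption; check your drive's spec), and the model treats read errors as independent:

```python
import math

def p_ure_during_rebuild(read_tb: float, ure_per_bit: float = 1e-15) -> float:
    """P(at least one URE while reading read_tb terabytes of surviving disks)."""
    bits = read_tb * 1e12 * 8                 # decimal TB -> bits
    return 1 - math.exp(-bits * ure_per_bit)  # Poisson approximation

# RAID 5 of 4x 8TB: the rebuild must read all 3 surviving disks (~24 TB).
print(round(p_ure_during_rebuild(24), 2))         # 0.17 at 1e-15/bit
print(round(p_ure_during_rebuild(24, 1e-14), 2))  # 0.85 at 1e-14/bit
```

Even at the better enterprise error rate, roughly a 1-in-6 chance of a rebuild-killing read error is why RAID 6's second parity disk pays for itself on big drives.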
- Your company is building a new data center room for 20 racks of servers. What physical infrastructure decisions do you need to make?
  Power: utility feed capacity, generator sizing, UPS sizing (20 racks * ~7kW = 140kW minimum), A+B distribution, circuit/breaker planning, PDU selection (metered at minimum).
  Cooling: total heat load = total power draw; CRAC/CRAH capacity with N+1 redundancy; hot/cold aisle containment design; raised floor vs. slab.
  Network: spine-leaf vs. core-agg topology, inter-rack fiber runs, management network VLAN design, ToR switch selection.
  Physical: rack layout (rows), aisle width (minimum 1.2m cold, 1m hot), weight load per tile/floor, fire suppression (clean agent, not water).
  Monitoring: environmental sensors (temp and humidity per rack), PDU power monitoring, leak detection, DCIM software.
Wiki Navigation¶
Related Content¶
- Datacenter & Server Hardware (Topic Pack, L1) — Out-of-Band Management, Rack & Stack, RAID
- Dell PowerEdge Servers (Topic Pack, L1) — Out-of-Band Management, RAID, Server Hardware
- Bare-Metal Provisioning (Topic Pack, L2) — Out-of-Band Management, Server Hardware
- Case Study: Cable Management Wrong Port (Case Study, L1) — Rack & Stack, Server Hardware
- Case Study: Database Replication Lag — Root Cause Is RAID Degradation (Case Study, L2) — RAID, Server Hardware
- Case Study: Link Flaps Bad Optic (Case Study, L1) — Rack & Stack, Server Hardware
- Case Study: Serial Console Garbled (Case Study, L1) — Out-of-Band Management, Server Hardware
- Case Study: Server Remote Console Lag (Case Study, L1) — Out-of-Band Management, Server Hardware
- Datacenter Drills (Drill, L1) — Out-of-Band Management, Rack & Stack
- Deep Dive: Dell Linux PowerEdge (deep_dive, L2) — Out-of-Band Management, Server Hardware
Pages that link here¶
- Bare-Metal Provisioning
- Bare-Metal Provisioning - Primer
- Datacenter & Hardware Drills
- Datacenter & Server Hardware
- Datacenter Ops Domain
- Dell PowerEdge Servers
- Dell PowerEdge on Linux - Deep Dive Guide
- IPMI and ipmitool
- Link Flaps - Bad Optic
- PDU Reporting Overload Warning
- Primer
- RAID Degraded Rebuild Latency
- RAID and Storage Internals