
Portal | Level: L2: Operations | Topics: Firmware / BIOS / UEFI, Server Hardware | Domain: Datacenter & Hardware

Scenario: Server Won't POST in the Data Center

The Prompt

"You arrive at the data center for a scheduled maintenance window. One of the Dell PowerEdge servers won't POST — the front panel is showing a blinking amber LED and there's nothing on the console. The server was running a Kubernetes worker node. Walk through your troubleshooting process."

Initial Report

Monitoring alert: "Node svr-a01-22 is NotReady. Last heartbeat 45 minutes ago. 12 pods unscheduled due to insufficient resources."

iDRAC alert (email): "System Event Log entry: Critical — Power Supply 1 input lost. Critical — System board inlet temp above upper critical threshold. Warning — Memory correctable ECC error rate exceeded on DIMM A1."

Constraints

  • Time pressure: The maintenance window is 4 hours. Other servers need firmware updates during this window. The cluster is running degraded — pods are pending.
  • Limited physical access: You're in the data center, but spare parts are in a locked cage that requires a facilities ticket (30-minute lead time). You have a laptop, console cables, and basic tools.
  • Team context: The on-call SRE already drained the node from the cluster before you arrived. Workloads have been rescheduled to other nodes, but the cluster is running tight on capacity.

Observable Evidence

  • Front panel: Blinking amber LED (critical fault). Power button unresponsive.
  • iDRAC: Reachable via network. Web console shows system powered off. SEL has three entries (see Initial Report).
  • Physical inspection: PSU 1 LED is off (no AC indicator). PSU 2 LED is solid green. The server is in rack A01, U22-23. PDU A outlet 9 shows 0W draw. PDU B outlet 9 shows normal standby draw (~15W).
  • Ambient conditions: Cold aisle temperature reads 24 °C (normal). The server above (U24-25) is running fine.

Expected Investigation Path

# 1. Remote triage via iDRAC (before walking to the rack)
# Check SEL for root cause clues
curl -sk -u root:password \
  https://idrac-a01-22/redfish/v1/Managers/iDRAC.Embedded.1/LogServices/Sel/Entries \
  | jq '.Members[] | select(.Severity != "OK") | {Created, Message, Severity}'

# Check power state
curl -sk -u root:password \
  https://idrac-a01-22/redfish/v1/Systems/System.Embedded.1 \
  | jq '{PowerState, Status: .Status.Health}'

# 2. Physical inspection at the rack
# Check both PSU LEDs (A and B)
# Check PDU outlet status (both feeds)
# Check power cable connections
# Check for loose cables, damage, or anything unusual

# 3. Isolate the PSU issue
# PSU 1 has no AC input. Check:
#   - PDU A outlet 9: is it on? -> Check PDU management interface
#   - Power cable from PDU A to PSU 1: seated? damaged?
#   - Try reseating the power cable
#   - Try a different PDU A outlet

# 4. After restoring power
# Try powering on via iDRAC
racadm -r idrac-a01-22 -u root -p password serveraction powerup

# Or via Redfish
curl -sk -u root:password \
  -X POST https://idrac-a01-22/redfish/v1/Systems/System.Embedded.1/Actions/ComputerSystem.Reset \
  -H 'Content-Type: application/json' \
  -d '{"ResetType": "On"}'
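Rather than refreshing the iDRAC console by hand, you can poll the same Redfish System resource until power-on is confirmed. A minimal sketch — the hostname and `root:password` credentials are the scenario's placeholders:

```shell
# Poll the Redfish System resource until PowerState reads "On".
# idrac-a01-22 and root:password are placeholders from this scenario.
until curl -sk -u root:password \
    https://idrac-a01-22/redfish/v1/Systems/System.Embedded.1 \
  | jq -e '.PowerState == "On"' >/dev/null; do
  echo "waiting for power-on..."
  sleep 5
done
```

`jq -e` sets its exit status from the boolean result, which is what lets the `until` loop terminate.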

# 5. Monitor POST progress via iDRAC virtual console
# Watch for memory training, PERC init, PXE/boot
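If the virtual console is slow to load, newer Redfish schemas also expose a BootProgress object on the System resource. Support varies by iDRAC firmware version, so treat this as a hedged alternative, not a guarantee:

```shell
# BootProgress.LastState steps through values such as
# "MemoryInitializationStarted" and eventually "OSRunning"
# (schema support varies by iDRAC firmware).
curl -sk -u root:password \
  https://idrac-a01-22/redfish/v1/Systems/System.Embedded.1 \
  | jq '{PowerState, BootProgress: .BootProgress.LastState}'
```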

# 6. Address the ECC error
# Check DIMM health after POST
racadm -r idrac-a01-22 -u root -p password getsensorinfo | grep -i dimm

# If DIMM A1 continues to show errors, plan replacement
# For now: the server will still POST and run with correctable ECC errors
# Schedule DIMM replacement during next maintenance window
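Per-DIMM health is also visible through the standard Redfish Memory collection. The `$expand` query parameter is an assumption here — it works on recent iDRAC9 firmware; on older firmware, fetch each member's `@odata.id` link individually:

```shell
# List every DIMM with its health state; DIMM A1 should stand out
# if it has degraded. $expand inlines each Memory member (iDRAC9
# firmware assumption; otherwise follow the Members[] links).
curl -sk -u root:password \
  'https://idrac-a01-22/redfish/v1/Systems/System.Embedded.1/Memory?$expand=*($levels=1)' \
  | jq '.Members[] | {Id, Health: .Status.Health}'
```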

# 7. Rejoin the cluster
kubectl uncordon svr-a01-22
kubectl get node svr-a01-22
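To confirm the node is genuinely schedulable again rather than eyeballing the `kubectl get node` output, you can block on the Ready condition — standard kubectl; the node name comes from this scenario:

```shell
# Wait up to 5 minutes for the kubelet to report Ready after uncordon.
kubectl wait --for=condition=Ready node/svr-a01-22 --timeout=300s

# Or assert Ready directly from the node object's conditions.
kubectl get node svr-a01-22 -o json \
  | jq -e '.status.conditions[] | select(.type == "Ready") | .status == "True"'
```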

Strong Answer

"First, I'd triage remotely before touching anything physical. The iDRAC is reachable, so I'd pull the SEL — that tells me exactly what happened. Here we have three events: PSU 1 lost input power, a thermal warning, and ECC memory errors. The blinking amber LED and the inability to POST both point to a power issue as the immediate blocker.

At the rack, I'd check both PSU LEDs. PSU 1 is dark — no AC. PSU 2 is green — it has power but the server is off. This means PSU 1 losing power may have caused an ungraceful shutdown, or the server firmware decided not to boot with a degraded power configuration. I'd trace the power cable from PSU 1 back to PDU A. Is the PDU outlet powered? Is the cable seated? Most likely this is a loose cable or a tripped PDU breaker.

Once I reseat or fix the power cable and verify both PSUs show AC input, I'd power the server on via iDRAC — either racadm serveraction powerup or the Redfish reset endpoint. Then I'd watch POST via the iDRAC virtual console.

The ECC error on DIMM A1 is a separate issue — it won't prevent POST, but it's a leading indicator of DIMM failure. I'd check if the error rate is climbing. If so, I'd request a replacement DIMM from Dell (it's under warranty) and schedule the swap during the next maintenance window. For now, the server will run fine with correctable ECC errors.

The thermal warning was likely a cascading effect — when PSU 1 lost power, the fans may have ramped down or the server powered off ungracefully. Once both PSUs are back and the server boots, inlet temps should normalize.

After the server boots and the OS is up, I'd verify the Kubernetes kubelet is running, then kubectl uncordon the node so pods can schedule back onto it. I'd watch the node for 10-15 minutes to make sure it's healthy before moving on to the other firmware updates in this maintenance window.

Finally, I'd document everything: record the SEL findings in the CMDB, note the DIMM A1 issue for follow-up, and file a ticket for the spare DIMM."

Common Traps

  • Jumping to physical action without remote triage — the iDRAC is reachable, use it first. The SEL tells you what happened before you walk to the rack.
  • Ignoring the ECC error because the PSU seems more urgent — the ECC error is a separate issue that needs tracking. Missing it means a future unplanned outage.
  • Not checking both power feeds — redundant power exists for a reason. Always verify both A and B.
  • Forgetting to uncordon the node — fixing the hardware is only half the job. The node needs to rejoin the cluster.
  • Not documenting — no CMDB update means the next person hits the same issue blind.
  • Panic-replacing hardware — ECC errors are correctable. The server will run. Plan the replacement, don't emergency-swap during a maintenance window that has other work scheduled.
  • Reference: training/library/guides/datacenter/dell-server-management.md — iDRAC, diagnostics, RAID
  • Reference: training/library/guides/datacenter/rack-operations.md — power distribution, thermal management
  • Reference: training/library/guides/datacenter/bare-metal-provisioning.md — reprovisioning if needed
  • Skillcheck: training/library/skillchecks/datacenter.skillcheck.md
