
Dell PowerEdge — Street-Level Ops

Quick Diagnosis Commands

# Get the service tag (identity for Dell support, warranty, parts)
dmidecode -s system-serial-number
# Or from iDRAC:
racadm get system.info.servicetag

# Check overall server health via RACADM
racadm getsensorinfo
# Or via Redfish API:
curl -sk -u admin:password \
  https://idrac-ip/redfish/v1/Systems/System.Embedded.1 \
  | jq '.Status.Health'

# List all physical disks and their status
perccli /c0/eall/sall show
# Or via RACADM:
racadm storage get pdisks -o

# Check for degraded RAID arrays
perccli /c0/vall show | grep -E 'Dgrd|Fail|Offline'

# Check iDRAC lifecycle logs for recent hardware events
racadm lclog view -s last -n 20
# Or Redfish:
curl -sk -u admin:password \
  https://idrac-ip/redfish/v1/Managers/iDRAC.Embedded.1/LogServices/Lclog/Entries \
  | jq '.Members[-5:] | .[].Message'

# Check PSU status (both should be OK)
racadm getsensorinfo   # look for the power-supply rows
# Or from OS:
ipmitool sdr type "Power Supply" 2>/dev/null

# Check memory errors (ECC correctable errors are warnings)
racadm hwinventory   # look for the DIMM entries
# Or from Linux:
edac-util -s 2>/dev/null || grep -i 'ce.*dimm\|ue.*dimm' /var/log/mcelog 2>/dev/null

# Check CPU and inlet temperatures
ipmitool sdr type Temperature 2>/dev/null
# Or:
racadm getsensorinfo | grep -i temp

> **Remember:** The service tag is the single most important identifier for Dell support. Memorize where to find it: `dmidecode -s system-serial-number` from the OS, the pull-out tag on the front bezel, or iDRAC. Every warranty lookup, parts order, and support call starts with this 7-character string.
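Service tags are short enough to mistype, and a wrong tag sends parts to the wrong machine. A minimal format check before pasting a tag into a ticket (the helper name is mine, not a Dell tool):

```shell
#!/bin/sh
# valid_service_tag: true if the argument looks like a Dell service tag
# (7 alphanumeric characters). Format check only; it cannot tell you
# whether the tag actually exists in Dell's systems.
valid_service_tag() {
  case "$1" in
    *[!A-Za-z0-9]*) return 1 ;;   # contains a non-alphanumeric character
  esac
  [ "${#1}" -eq 7 ]
}

# Usage sketch: validate what dmidecode reports before filing a ticket
# tag=$(dmidecode -s system-serial-number)
# valid_service_tag "$tag" || echo "suspicious tag: $tag" >&2
```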

# Get firmware versions for all components
racadm swinventory
# Compact Redfish version:
curl -sk -u admin:password \
  https://idrac-ip/redfish/v1/UpdateService/FirmwareInventory \
  | jq '.Members[] | .["@odata.id"]'
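For a quick fleet sweep, the same Redfish health query can be looped over a host list. The parsing helper below avoids jq so it also works from minimal jump boxes; the hostnames and credentials in the loop are placeholders, and the helper assumes the JSON arrives on one line (curl's default):

```shell
#!/bin/sh
# redfish_health: pull .Status.Health out of a compact Redfish Systems
# payload without jq. Assumes the JSON is on a single line.
redfish_health() {
  printf '%s\n' "$1" |
    sed -n 's/.*"Status"[^}]*"Health"[[:space:]]*:[[:space:]]*"\([^"]*\)".*/\1/p'
}

# Fleet sweep sketch (iDRAC hostnames and credentials are placeholders):
# for h in idrac-01 idrac-02 idrac-03; do
#   body=$(curl -sk -u admin:password \
#     "https://$h/redfish/v1/Systems/System.Embedded.1")
#   echo "$h: $(redfish_health "$body")"
# done
```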

Gotcha: Server Unresponsive — iDRAC Still Available

Symptom: Server does not respond to SSH or ping. Applications are down. But iDRAC web interface works.

Rule: iDRAC runs on a separate BMC processor with its own NIC. It works even when the OS is completely crashed, kernel-panicked, or hung.

Under the hood: iDRAC is a full ARM-based computer embedded in the server with its own RAM, storage, and network stack. It draws power from standby rails even when the server is "off." This is why it can power-cycle the server, update firmware, and mount virtual media with zero OS involvement.

# Step 1: Check server power state
racadm serveraction powerstatus

# Step 2: Open virtual console to see what the screen shows
# iDRAC web UI → Dashboard → Virtual Console
# Look for: kernel panic, filesystem errors, BIOS POST hang

# Step 3: Check lifecycle log for hardware events
racadm lclog view -s last -n 10
# Look for: memory errors, CPU thermal throttle, PSU failure

# Step 4: If OS is hung (not kernel panic), try NMI first
racadm serveraction powernmi
# This triggers a crash dump on Linux if kdump is configured

# Step 5: If truly unresponsive, force a reset
racadm serveraction hardreset
# This is a forced warm reset with no OS shutdown (like pressing the reset
# button). For a full power-off/on, use: racadm serveraction powercycle

# Step 6: If server does not come back after reset, check POST
# Watch virtual console for BIOS errors
# Common: failed DIMM detected, disk controller initialization failure
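Under pressure it is easy to jump straight to a hard reset and lose the crash dump. The escalation order above can be encoded as a tiny helper that maps the observed power state to the next racadm step; the function name and the matched state strings are assumptions, so verify them against the actual `racadm serveraction powerstatus` output on your iDRAC version:

```shell
#!/bin/sh
# next_action: map an observed power state to the next racadm step in the
# escalation above. The matched strings are assumptions -- check them
# against real "racadm serveraction powerstatus" output.
next_action() {
  case "$1" in
    *OFF*|*off*) echo "serveraction powerup" ;;        # server is off: just power on
    *ON*|*on*)   echo "serveraction powernmi" ;;       # hung OS: NMI first (crash dump)
    *)           echo "check virtual console first" ;; # unknown state: eyes on screen
  esac
}

# Usage sketch:
# state=$(racadm serveraction powerstatus)
# echo "suggested next step: racadm $(next_action "$state")"
```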

Gotcha: RAID Rebuild in Progress — Do Not Touch Anything

Symptom: A disk failed and was replaced. The RAID array is rebuilding. Performance is degraded.

Rule: During a RAID rebuild, the array is in its most vulnerable state. A second disk failure means data loss (on RAID 5). Do not add more I/O load, do not reboot, do not run firmware updates.

# Check rebuild progress
perccli /c0/vall show | grep -i 'rbld'
# Or:
perccli /c0/eall/sall show rebuild

# Monitor rebuild percentage
watch -n 30 'perccli /c0/eall/sall show rebuild 2>/dev/null | grep -i rbld'

# Estimated rebuild times (rough):
# 1TB HDD  → 2-4 hours
# 4TB HDD  → 8-16 hours
# 8TB HDD  → 16-24+ hours
# 960GB SSD → 30-60 minutes
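The HDD rows in the table boil down to a rule of thumb of roughly 2-4 hours per terabyte. A throwaway helper for planning a maintenance window; the ratio is an estimate, not a PERC specification, and real times depend on I/O load, RAID level, and the controller's rebuild-rate setting:

```shell
#!/bin/sh
# est_rebuild_hours: rough HDD rebuild window from capacity in TB, using
# the ~2-4 hours/TB rule of thumb from the table above.
est_rebuild_hours() {
  tb=$1
  echo "$((tb * 2))-$((tb * 4)) hours"
}

# Usage sketch:
# est_rebuild_hours 4    # -> 8-16 hours
```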

# During rebuild:
# - Do NOT reboot the server
# - Do NOT update PERC firmware
# - Do NOT add more disk I/O (skip backups if possible)
# - Do NOT remove any other disks from the enclosure
# - DO monitor for a second failure: set up alerts on disk health

# Check remaining disk health while rebuilding
perccli /c0/eall/sall show all | grep -E 'Media Error|Other Error|Pred'
# Predictive failure on another disk during rebuild = emergency

> **War story:** A RAID 5 rebuild on 8TB HDDs took 22 hours. During hour 18, a second disk threw a predictive failure warning. The admin had to decide: keep rebuilding and pray, or stop everything and copy data off first. They chose to finish the rebuild and got lucky. RAID 6 or RAID 10 would have survived the second failure. Use RAID 5 only for small or non-critical arrays.

Pattern: Remote OS Install via iDRAC Virtual Media

When you need to install an OS on a server with no local USB access:

# Step 1: Mount ISO via RACADM
racadm remoteimage -c -u admin -p password \
  -l //fileserver/share/ubuntu-22.04.iso

# Step 2: Set one-time boot to virtual CD
racadm set iDRAC.ServerBoot.BootOnce Enabled
racadm set iDRAC.ServerBoot.FirstBootDevice VCD-DVD

# Step 3: Reboot
racadm serveraction powercycle

# Step 4: Watch virtual console for installer
# iDRAC web UI → Virtual Console → Launch

# Alternative: Redfish virtual media mount
curl -sk -u admin:password -X POST \
  https://idrac-ip/redfish/v1/Managers/iDRAC.Embedded.1/VirtualMedia/CD/Actions/VirtualMedia.InsertMedia \
  -H 'Content-Type: application/json' \
  -d '{"Image": "http://fileserver/ubuntu-22.04.iso"}'
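A mount that dies mid-install is worse than one that never starts, so it is worth sanity-checking the image location before calling InsertMedia. A sketch; the accepted scheme list is an assumption, since iDRAC virtual-media protocol support (CIFS/NFS/HTTP/HTTPS) varies by generation and licence:

```shell
#!/bin/sh
# media_uri_ok: quick format check on a virtual-media image URI before
# handing it to racadm or Redfish. The scheme list is an assumption --
# confirm which protocols your iDRAC generation and licence support.
media_uri_ok() {
  case "$1" in
    http://*.iso|https://*.iso|//*/*.iso|nfs://*.iso) return 0 ;;
    *) return 1 ;;
  esac
}

# Usage sketch:
# media_uri_ok "http://fileserver/ubuntu-22.04.iso" || { echo "bad URI" >&2; exit 1; }
```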

Pattern: Fleet Firmware Update via Redfish

# Step 1: Check current BIOS version
curl -sk -u admin:password \
  https://idrac-ip/redfish/v1/Systems/System.Embedded.1 \
  | jq '.BiosVersion'

# Step 2: Upload firmware update
curl -sk -u admin:password -X POST \
  https://idrac-ip/redfish/v1/UpdateService/Actions/UpdateService.SimpleUpdate \
  -H 'Content-Type: application/json' \
  -d '{"ImageURI": "http://firmware-repo/BIOS_XXXXX.exe",
       "@Redfish.OperationApplyTime": "OnReset"}'

# Step 3: Schedule reboot to apply
curl -sk -u admin:password -X POST \
  https://idrac-ip/redfish/v1/Systems/System.Embedded.1/Actions/ComputerSystem.Reset \
  -H 'Content-Type: application/json' \
  -d '{"ResetType": "GracefulRestart"}'

# Step 4: Monitor job status
curl -sk -u admin:password \
  https://idrac-ip/redfish/v1/TaskService/Tasks \
  | jq '.Members[] | select(.TaskState != "Completed") | {Id, TaskState, PercentComplete}'
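Rather than re-running the Tasks query by hand, a small poll loop can wait for the update job to finish. The completion check below greps the raw JSON so it works without jq; it follows the query above, which assumes your iDRAC returns task state inline in the collection:

```shell
#!/bin/sh
# task_running: true if a Redfish TaskService payload still contains a
# task in the Running state. Crude string match on the raw JSON.
task_running() {
  printf '%s' "$1" | grep -q '"TaskState"[[:space:]]*:[[:space:]]*"Running"'
}

# Poll sketch (host and credentials are placeholders; add a timeout in real use):
# while body=$(curl -sk -u admin:password \
#     https://idrac-ip/redfish/v1/TaskService/Tasks) && task_running "$body"; do
#   echo "update still running..."
#   sleep 30
# done
```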

Pattern: Pre-Deployment Health Check Script

Run this before putting a new or repurposed server into production:

#!/bin/bash
# pre-deploy-check.sh — run on the server after OS install
echo "=== Service Tag ==="
dmidecode -s system-serial-number

echo "=== CPU ==="
lscpu | grep -E 'Model name|Socket|Core|Thread'

echo "=== Memory ==="
dmidecode -t memory | grep -E 'Size:|Locator:|Type:' | grep -v 'No Module'

echo "=== Disk Health ==="
perccli /c0/vall show 2>/dev/null || echo "No PERC controller (HBA mode?)"
lsblk -d -o NAME,SIZE,MODEL,ROTA

echo "=== Network ==="
ip -br link | grep -v lo
ethtool eth0 2>/dev/null | grep -E 'Speed|Link detected'  # NIC names vary (eno1, ens1f0, ...)

echo "=== PSU ==="
ipmitool sdr type "Power Supply" 2>/dev/null || echo "IPMI not available"

echo "=== Temperature ==="
ipmitool sdr type Temperature 2>/dev/null | head -5

echo "=== Firmware ==="
dmidecode -s bios-version
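One gap in the script above: it prints the DIMM list but never compares it against what was ordered. A small addition that sums installed memory; it assumes `dmidecode -t memory` reports sizes in GB (some older firmware reports MB), and the expected value is a placeholder:

```shell
#!/bin/sh
# total_mem_gb: sum the "Size: N GB" lines of `dmidecode -t memory`
# output read from stdin. Assumes GB units; older firmware may use MB.
total_mem_gb() {
  awk '/Size: [0-9]+ GB/ { total += $2 } END { print total + 0 }'
}

# Usage sketch: fail the pre-deploy check if memory doesn't match the order
# expected=256
# actual=$(dmidecode -t memory | total_mem_gb)
# [ "$actual" -eq "$expected" ] || echo "FAIL: ${actual}GB installed, ${expected}GB expected"
```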

Pattern: iDRAC Network Setup from Front Panel

When a new server has no iDRAC IP configured:

1. Connect a laptop to the iDRAC Direct micro-USB port (front panel)
2. The laptop gets an IP via USB RNDIS (typically 169.254.0.x range)
3. Browse to https://169.254.0.3 (the usual iDRAC Direct address — or the IP shown in BIOS iDRAC config)
4. Log in (iDRAC 9 ships with a unique default password printed on the pull-out information tag; legacy configurations may use root/calvin)
5. Set the dedicated NIC IP: iDRAC Settings → Network → IPv4
6. CHANGE THE DEFAULT PASSWORD immediately

7. Verify remote access from the management network

> **Gotcha:** Many iDRACs in the field still run the legacy default credentials `root/calvin`. These are widely known and actively scanned for on the internet. If your iDRAC is reachable from the network (even briefly) with default credentials, assume it has been compromised. Always change the password before connecting the management NIC to any network.
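Steps 5-6 can also be done from the laptop over the USB link with racadm, which scripts better than the web UI. A sketch; the attribute names are from iDRAC 9 and may differ on older generations, and the address-format helper is mine:

```shell
#!/bin/sh
# valid_ipv4: crude dotted-quad format check before pushing an address to
# the iDRAC. Format only -- does not validate octet range or subnet.
valid_ipv4() {
  case "$1" in
    *[!0-9.]*|.*|*.|*..*) return 1 ;;   # non-digit chars or empty octets
  esac
  [ "$(printf '%s' "$1" | tr -dc . | wc -c)" -eq 3 ]
}

# Sketch: set a static address over the iDRAC Direct USB link
# (attribute names per iDRAC 9 -- verify on older generations):
# ip=10.0.50.21
# valid_ipv4 "$ip" || exit 1
# racadm set iDRAC.IPv4.DHCPEnable 0
# racadm set iDRAC.IPv4.Address "$ip"
# racadm set iDRAC.IPv4.Netmask 255.255.255.0
# racadm set iDRAC.IPv4.Gateway 10.0.50.1
```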

Scenario: Disk LED Blinking Amber — Identifying the Failed Drive

# Step 1: Check which disk is failed
perccli /c0/eall/sall show | grep -E 'Fail|Flt|Offln|UBad'
# Example output row: 0:2   5  Flt  0  1.818 TB SAS  HDD
# The EID:Slt column "0:2" means: enclosure 0, slot 2

# Step 2: Blink the LED on the failed drive to physically locate it
perccli /c0/e0/s2 start locate
# The drive LED will blink — walk to the rack and find it

# Step 3: After replacing the drive
perccli /c0/e0/s2 stop locate

# Step 4: If the new disk is not auto-detected for rebuild
perccli /c0/e0/s2 set good
# Then start rebuild if needed:
perccli /c0/e0/s2 start rebuild
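The enclosure:slot pair from the `show` output has to be retyped into the `/cX/eY/sZ` command path each time, which is easy to fumble mid-incident. A tiny converter (the helper name is mine):

```shell
#!/bin/sh
# pd_path: turn a perccli "EID:Slt" pair such as "0:2" into the command
# path "/c0/e0/s2". Controller number defaults to 0 (second argument).
pd_path() {
  eid=${1%%:*}
  slot=${1##*:}
  echo "/c${2:-0}/e${eid}/s${slot}"
}

# Usage sketch:
# pd_path 0:2    # -> /c0/e0/s2
# perccli "$(pd_path 0:2)" start locate
```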

Useful One-Liners

# Warranty check (requires service tag)
# Visit: https://www.dell.com/support/home → enter service tag

# Collect full hardware inventory for CMDB
dmidecode | grep -A3 -E 'System Information|Processor|Memory Device' > hw_inventory.txt

# Check PERC battery/supercap health (write-cache depends on this)
perccli /c0/bbu show 2>/dev/null || perccli /c0/cv show 2>/dev/null
# (/cv is the CacheVault supercap on newer PERC generations)

> **Debug clue:** If RAID write performance suddenly drops by 10x, check the BBU/supercap. When the battery fails, the PERC controller silently disables write-back cache and falls back to write-through mode. Performance craters but data is safe. Replace the BBU to restore write-back performance.

# Check if server is in HBA mode (passthrough) vs RAID mode
perccli /c0 show personality 2>/dev/null
# Or check for virtual disks:
perccli /c0/vall show 2>/dev/null | grep -c "^[0-9]"
# 0 virtual disks + direct-attached disks = HBA mode

# Force iDRAC reset without rebooting the server
racadm racreset

# Export Server Configuration Profile (golden image for fleet)
racadm get -t xml -f server_config_profile.xml

# Check for predictive disk failures (SMART via PERC)
perccli /c0/eall/sall show all | grep -i 'predictive'

Default trap: New PowerEdge servers ship with the PERC controller in RAID mode, which hides individual disks behind virtual disks. If you plan to use software RAID (mdadm), ZFS, or Ceph, you need to switch the controller to HBA mode (passthrough) via iDRAC before OS install. Switching modes after the OS is installed destroys all virtual disks.
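Because the mode switch destroys virtual disks, any automation around it should refuse to run while VDs still exist. A guard sketch; the VD-row format is an assumption about `perccli /c0/vall show` output, and the personality-switch command itself varies by PERC model, so it is left as a placeholder comment:

```shell
#!/bin/sh
# vd_count: count virtual-disk rows in "perccli /c0/vall show" output read
# from stdin. Assumes VD rows start with a DG/VD pair like "0/0" -- verify
# against your perccli version's output format.
vd_count() {
  grep -cE '^[0-9]+/[0-9]+' || true
}

# Guard sketch: refuse to switch personality while any VD exists
# n=$(perccli /c0/vall show | vd_count)
# if [ "$n" -gt 0 ]; then
#   echo "refusing to switch modes: $n virtual disk(s) would be destroyed" >&2
#   exit 1
# fi
# (switch to HBA mode here -- exact command depends on PERC model/firmware)
```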