Redfish -- Footguns¶
Mistakes that break automation, lock out operators, and create security gaps. Each one has been seen in production.
1. Hardcoding System URIs Across Vendors¶
What happens: Your automation uses /redfish/v1/Systems/System.Embedded.1 (Dell) and silently fails or 404s on HPE servers that use /redfish/v1/Systems/1.
Why people do it: "It works on our Dell fleet" and nobody tests on other vendors.
Fix: Always discover the system URI from the Members collection:
```shell
# Correct: discover dynamically
SYSTEM_URI=$(curl -sk -u "$CREDS" "https://$BMC/redfish/v1/Systems" \
  | jq -r '.Members[0]."@odata.id"')

# Wrong: hardcoded
SYSTEM_URI="/redfish/v1/Systems/System.Embedded.1"
```
2. Ignoring HTTP Status Codes¶
What happens: Your script sends a PATCH to change a BIOS setting. The BMC returns 400 (bad request) or 405 (method not allowed). Your script doesn't check and reports success. The setting was never applied.
Why people do it: curl exits 0 on HTTP errors. People forget -f or don't check $? / -w "%{http_code}".
Fix: Always check the HTTP status code:
```shell
HTTP_CODE=$(curl -sk -u "$CREDS" -X PATCH \
  "https://$BMC$SYSTEM" \
  -H 'Content-Type: application/json' \
  -d '{"Boot": {"BootSourceOverrideTarget": "Pxe"}}' \
  -o /tmp/response.json -w "%{http_code}")

if [ "$HTTP_CODE" -ge 400 ]; then
  echo "FAILED ($HTTP_CODE): $(jq -r '.error.message // .error' /tmp/response.json)"
  exit 1
fi
```
3. Not Checking AllowableValues Before Acting¶
What happens: You POST a ResetType of GracefulRestart but the BMC only supports ForceRestart. You get a 400 error and can't power cycle the server during an incident.
Why people do it: They memorize one set of values from one vendor and assume they're universal.
Fix: Check AllowableValues first, especially in automation that targets mixed hardware:
```shell
curl -sk -u "$CREDS" "https://$BMC$SYSTEM" \
  | jq '.Actions."#ComputerSystem.Reset"."ResetType@Redfish.AllowableValues"'
```
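The query above can feed directly into the reset call. A minimal sketch, with a hypothetical preference order (graceful first, forced as fallback); the inlined `ALLOWED` array stands in for a real BMC's response, and the POST only fires when `$BMC` is set:

```shell
# Sample AllowableValues as returned by the jq query above (hypothetical BMC
# that does not support GracefulRestart)
ALLOWED='["On","ForceOff","ForceRestart","PushPowerButton"]'

# Prefer GracefulRestart, fall back to ForceRestart, else take whatever is first
RESET_TYPE=$(echo "$ALLOWED" | jq -r '
  if index("GracefulRestart") then "GracefulRestart"
  elif index("ForceRestart") then "ForceRestart"
  else .[0] end')
echo "$RESET_TYPE"   # prints ForceRestart for this sample

if [ -n "${BMC:-}" ]; then   # only POST when a BMC address is provided
  curl -sk -u "$CREDS" -X POST \
    "https://$BMC$SYSTEM/Actions/ComputerSystem.Reset" \
    -H 'Content-Type: application/json' \
    -d "{\"ResetType\": \"$RESET_TYPE\"}"
fi
```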
4. Session Exhaustion¶
What happens: Your monitoring script creates a new Redfish session every 60 seconds and never logs out. After a few hours, the BMC hits its session limit (typically 4-8 concurrent sessions). All new connections fail, including your web UI access.
Why people do it: Scripts that curl with basic auth work fine. Scripts that use sessions but crash before cleanup leak sessions.
Fix:
- Use Basic Auth for simple polling scripts (one request per invocation)
- If using sessions: always clean up in a trap handler
- Set session timeouts on the BMC (most default to 30 minutes)
```shell
# Cleanup trap: delete the session even if the script exits early
cleanup() { curl -sk -X DELETE -H "X-Auth-Token: $TOKEN" "https://$BMC$SESSION_URI" 2>/dev/null; }
trap cleanup EXIT
```
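For completeness, here is a sketch of the whole session lifecycle: log in via the standard `SessionService`, capture the token and session URI from the response headers, and register the cleanup trap immediately. `BMC_USER`/`BMC_PASS` are placeholder variable names, and the login only runs when `$BMC` is set:

```shell
# Extract a header value from a saved response-header file
header_value() { grep -i "^$1:" "$2" | awk '{print $2}' | tr -d '\r'; }

if [ -n "${BMC:-}" ]; then   # only log in when a BMC address is provided
  HDRS=$(mktemp)
  curl -sk -D "$HDRS" -o /dev/null \
    -X POST "https://$BMC/redfish/v1/SessionService/Sessions" \
    -H 'Content-Type: application/json' \
    -d "{\"UserName\": \"$BMC_USER\", \"Password\": \"$BMC_PASS\"}"
  TOKEN=$(header_value X-Auth-Token "$HDRS")
  SESSION_URI=$(header_value Location "$HDRS")

  # Register cleanup before doing any real work, so a crash still logs out
  cleanup() { curl -sk -X DELETE -H "X-Auth-Token: $TOKEN" "https://$BMC$SESSION_URI" >/dev/null 2>&1; }
  trap cleanup EXIT
fi
```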
5. Using -k (Insecure) in Production¶
What happens: Every script uses curl -k to skip TLS certificate verification. An attacker on the management VLAN performs a MITM attack and captures BMC credentials in plaintext.
Why people do it: BMCs ship with self-signed certificates. Getting -k working is faster than deploying proper certs.
Fix: Deploy certificates from your internal CA to every BMC. This is a one-time cost per server (automate it during provisioning):
```shell
# Use proper TLS in production
curl -s --cacert /etc/pki/bmc-ca.pem -u "$CREDS" "https://$BMC/redfish/v1/"

# Only use -k in lab/dev environments
```
6. Patching BIOS Settings Without Checking Current Values¶
What happens: You PATCH a BIOS setting (e.g., enabling SR-IOV). The BMC queues the change for next reboot. But the server is already in the middle of a BIOS update that requires its own reboot. The settings conflict and the server boots into an unexpected state.
Why people do it: "Just PATCH and forget" feels simpler than read-modify-write.
Fix: Always GET current settings before PATCH. Check for pending changes:
```shell
# Check for pending BIOS changes
curl -sk -u "$CREDS" "https://$BMC$SYSTEM/Bios/Settings" \
  | jq '.Attributes'

# If this returns a non-empty object, changes are already queued.
# Understand what they are before adding more.
```
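The read-before-write pattern can be sketched as below: count the queued attributes and only PATCH when nothing is pending. `SriovGlobalEnable` is a vendor-specific (Dell-style) attribute name used purely for illustration, and the live calls only run when `$BMC` is set:

```shell
# pending_count: number of queued BIOS attribute changes in a Bios/Settings payload
pending_count() { jq '.Attributes // {} | length'; }

if [ -n "${BMC:-}" ]; then   # live sketch; skipped when no BMC address is set
  PENDING=$(curl -sk -u "$CREDS" "https://$BMC$SYSTEM/Bios/Settings" | pending_count)
  if [ "$PENDING" -eq 0 ]; then
    # SriovGlobalEnable is a hypothetical, vendor-specific attribute name
    curl -sk -u "$CREDS" -X PATCH "https://$BMC$SYSTEM/Bios/Settings" \
      -H 'Content-Type: application/json' \
      -d '{"Attributes": {"SriovGlobalEnable": "Enabled"}}'
  else
    echo "Pending BIOS changes exist; review before patching" >&2
  fi
fi
```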
7. Clearing the SEL Without Exporting¶
What happens: Intermittent hardware errors. Someone clears the SEL to "clean up." Later, the server crashes. No event history to diagnose the root cause. The warranty claim needs the SEL. It's gone.
Why people do it: The SEL is full, alerts are noisy, and "clear" seems like the obvious fix.
Fix: Always export before clearing. Automate this as a cron job or monitoring hook:
```shell
# Export first, always
curl -sk -u "$CREDS" "https://$BMC$MANAGER/LogServices/Sel/Entries" \
  | jq '.Members[]' > "sel-$(date +%F).json"

# Then clear (some BMCs require an empty JSON body on action POSTs)
curl -sk -u "$CREDS" -X POST \
  -H 'Content-Type: application/json' -d '{}' \
  "https://$BMC$MANAGER/LogServices/Sel/Actions/LogService.ClearLog"
```
8. Running Firmware Updates Without Staging¶
What happens: You push a firmware update via SimpleUpdate to 50 servers simultaneously. The firmware repo gets hammered, transfers fail midway, and 12 servers are stuck in a partial update state. Some need AC power cycles to recover.
Why people do it: Redfish makes firmware updates look easy — one POST per server.
Fix:
- Stage firmware to a local repo (NFS, HTTP) close to the servers
- Update in waves (5-10 servers at a time, not the whole fleet)
- Monitor task completion before starting the next wave
- Have a rollback plan for each component
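The wave logic itself is a few lines of bash. A minimal sketch, where `HOSTS` is a hypothetical inventory list (in practice you'd read it from your CMDB or a hosts file):

```shell
# Roll out in waves of 5; HOSTS is a hypothetical 12-server inventory
WAVE=5
HOSTS=(bmc01 bmc02 bmc03 bmc04 bmc05 bmc06 bmc07 bmc08 bmc09 bmc10 bmc11 bmc12)

WAVES=0
for ((i = 0; i < ${#HOSTS[@]}; i += WAVE)); do
  batch=("${HOSTS[@]:i:WAVE}")
  WAVES=$((WAVES + 1))
  echo "Wave $WAVES: ${batch[*]}"
  # For each host in the batch: POST SimpleUpdate, record the task URI,
  # then poll every task to completion (see footgun 12) before continuing.
done
```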
9. Forgetting Redfish and IPMI Are Independent¶
What happens: You rotate passwords via Redfish AccountService. The IPMI user table is separate on some BMCs. ipmitool still works with the old password. You think the rotation is complete, but the old creds still have full IPMI access.
Why people do it: They assume Redfish and IPMI share the same user database. Some BMCs do, some don't.
Fix: After rotating via Redfish, verify IPMI access is also updated (or disabled):
```shell
# Verify IPMI with new creds
ipmitool -I lanplus -H "$BMC" -U admin -P newpass chassis status

# Better: disable IPMI-over-LAN entirely if Redfish is your primary interface
```
10. No Timeout on Redfish Requests¶
What happens: A BMC is slow or unresponsive. Your fleet script hangs on one server, blocking the entire run. Your provisioning pipeline stalls for hours.
Why people do it: curl has no overall timeout by default, so a hung connection blocks indefinitely.
Fix: Always set connect and max-time timeouts:
```shell
# Set a 10s connect timeout and a 30s total timeout
curl -sk --connect-timeout 10 --max-time 30 -u "$CREDS" \
  "https://$BMC$SYSTEM"
```
11. Using Redfish for SOL and Getting Nothing¶
What happens: Someone tries to find a Redfish endpoint for Serial-over-LAN. There isn't one. They waste time searching vendor docs for something that doesn't exist in the standard.
Why people do it: "Redfish replaces IPMI" — they assume it replaces everything.
Fix: Accept that SOL is still an IPMI/ipmitool operation (or use vendor-specific virtual console: iDRAC HTML5, iLO virtual serial port). Redfish has no standardized interactive console.
12. Treating All 2xx Responses as Success¶
What happens: You POST a firmware update. The BMC returns 202 (Accepted) with a task URI. Your script sees "2xx = success" and moves on. The actual update fails 5 minutes later because the firmware image was incompatible. Nobody checks the task.
Why people do it: Checking async task completion requires polling, which is more code.
Fix: For any operation that returns a task, poll the task to completion:
```shell
TASK_URI=$(curl -sk -u "$CREDS" -X POST \
  "https://$BMC/redfish/v1/UpdateService/Actions/UpdateService.SimpleUpdate" \
  -H 'Content-Type: application/json' \
  -d '{"ImageURI": "https://repo/firmware.exe"}' \
  -D - -o /dev/null | grep -i '^Location:' | awk '{print $2}' | tr -d '\r')

# Poll until complete
while true; do
  STATE=$(curl -sk -u "$CREDS" "https://$BMC$TASK_URI" | jq -r '.TaskState')
  case "$STATE" in
    Completed) echo "Success"; break ;;
    Exception|Killed|Cancelled) echo "FAILED"; exit 1 ;;
    *) sleep 30 ;;
  esac
done
```