IPMI and ipmitool Footguns
Mistakes that brick BMCs, lock out operators, cause data loss, and miss critical hardware events.
1. Running power cycle instead of power soft on a responsive server
The server is slow. A ticket says "unresponsive." You fire ipmitool power cycle without checking. The OS was alive — just under heavy I/O. You just caused a dirty shutdown: filesystem journal replay, database crash recovery, possible corruption. On a database server, this can mean 45 minutes of recovery and potential data loss.
Why people do it: Panic. "Server down" feels urgent. power cycle is the fastest path to "it's rebooting."
Fix: Always try power soft first (sends ACPI shutdown signal). Wait 60 seconds. Check SOL console for shutdown activity. Only hard-cycle after confirming the OS is truly hung. If you can't SSH in, SOL is your confirmation tool — not absence of SSH.
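That sequence can be sketched as a small helper. The lanplus transport, the admin user, the BMC variable, and the 60-second budget are all assumptions — adjust for your environment; the password comes from the IPMI_PASSWORD environment variable via -E:

```shell
# Hypothetical "soft first" helper. Session details are placeholders.
bmc() {
    ipmitool -I lanplus -H "$BMC" -U admin -E "$@"
}

soft_then_verify() {
    bmc power soft || return 1
    # Give the OS up to 60 seconds (12 x 5s) to shut down cleanly.
    for _ in 1 2 3 4 5 6 7 8 9 10 11 12; do
        sleep 5
        if bmc power status | grep -q "is off"; then
            echo "clean shutdown"
            return 0
        fi
    done
    echo "no clean shutdown; check the SOL console before hard-cycling" >&2
    return 1
}
```

If the function returns nonzero, that is your cue to open SOL — not to reach for power cycle.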
2. Forgetting to archive the SEL before clearing it
The SEL is full. You run ipmitool sel clear to free space. Two weeks later, a pattern of memory errors leads to a DIMM failure. You need the historical SEL to prove the DIMM was degrading and should have been replaced during the last maintenance window. The data is gone.
Why people do it: SEL clear is a quick fix for "SEL full" alerts. Nobody thinks about the forensic value of old events.
Fix: Always run ipmitool sel elist > /var/log/sel-archive/$(hostname)-$(date +%F).log before clearing. Automate it: a weekly cron job that archives and then clears. Ship SEL data to your centralized logging system.
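The archive-then-clear step can be sketched as a cron-ready function. The archive directory is an assumption, and this runs in-band (requires /dev/ipmi0):

```shell
# Archive the SEL to a dated file, then clear it — but refuse to
# clear if the archive step failed or produced an empty file.
archive_and_clear_sel() {
    dir="${1:-/var/log/sel-archive}"
    mkdir -p "$dir"
    out="$dir/$(hostname)-$(date +%F).log"
    if ipmitool sel elist > "$out" && [ -s "$out" ]; then
        ipmitool sel clear
    else
        echo "SEL archive failed; refusing to clear" >&2
        return 1
    fi
}
```

Refusing to clear when the archive failed is the important part: clearing must never outrun archiving.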
3. Setting persistent PXE boot instead of one-shot
You need to PXE boot a server for reinstall. You run ipmitool chassis bootdev pxe options=persistent. The install completes. You forget to reset the boot order. The server reboots for a kernel update next week and PXE boots again — wiping itself.
Why people do it: persistent feels safer than wondering if the one-shot will work. Some operators don't know there's a difference.
Fix: Use one-shot by default (chassis bootdev pxe without options=persistent). It reverts to BIOS boot order after one reboot. If you must use persistent, add a post-install step that resets boot order: ipmitool chassis bootdev disk or efibootmgr from the installed OS.
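A sketch of the one-shot flow, with a verification step before the reboot. The lanplus session, admin user, and -E password handling are assumptions; boot parameter 5 holds the boot flags, and on a one-shot setting its output should indicate the options apply to the next boot only:

```shell
# One-shot PXE: set, verify, then reset. Session details are placeholders.
pxe_once() {
    host="$1"
    ipmitool -I lanplus -H "$host" -U admin -E chassis bootdev pxe || return 1
    # Inspect boot parameter 5 (boot flags) before pulling the trigger:
    # look for "Force PXE" and a next-boot-only scope, not persistent.
    ipmitool -I lanplus -H "$host" -U admin -E chassis bootparam get 5
    ipmitool -I lanplus -H "$host" -U admin -E power reset
}
```

Note the deliberate absence of options=persistent anywhere in the function.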
4. Using cipher suite 0 (no authentication)
Some BMCs come with cipher suite 0 enabled, which means IPMI commands are accepted with no username or password. Anyone on the management network can power off your servers, read sensor data, or flash firmware.
Why it happens: Cipher suite 0 is sometimes enabled by default for compatibility with legacy management software. Nobody audits it.
Fix: Test every BMC: ipmitool -I lanplus -H $BMC -C 0 -U "" -P "" chassis status. If it works, disable cipher 0 immediately. Dell: racadm set iDRAC.IPMILan.CipherSuiteSelections 1,2,3,6,7,8,11,12. Add this check to your provisioning automation.
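The per-BMC test scales to a fleet with a simple loop. The hosts-file path is an assumption; a successful chassis status with -C 0 and empty credentials means that BMC is wide open:

```shell
# Flag every BMC in a hosts file that accepts cipher suite 0.
audit_cipher0() {
    while read -r bmc; do
        if ipmitool -I lanplus -H "$bmc" -C 0 -U "" -P "" chassis status >/dev/null 2>&1; then
            echo "VULNERABLE: $bmc accepts cipher suite 0"
        fi
    done < "${1:-/etc/bmc-hosts}"
}
```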
CVE: CVE-2013-4786 demonstrated that IPMI 2.0's RAKP authentication protocol leaks the password hash to any network attacker, regardless of cipher suite. The hash can be cracked offline. Combined with cipher suite 0, an attacker needs no credentials at all. IPMI management interfaces must never be exposed to untrusted networks — this is a protocol-level weakness, not a misconfiguration.
5. Cold-resetting the BMC during a firmware update
The BMC firmware update is taking longer than expected. You get impatient and run ipmitool mc reset cold to "speed things up." The BMC was mid-flash. Now it has a corrupted firmware image. The BMC is bricked. You need physical access to recover it (JTAG header, SPI flash programmer, or motherboard replacement).
Why people do it: Firmware updates on BMCs have no progress bar via IPMI. It looks like nothing is happening. Impatience wins.
Fix: Never interrupt a BMC firmware update. If you started one via the web UI, wait for it to complete (can take 10-15 minutes). Monitor via the vendor's update tool, not by poking the BMC with ipmitool. Most modern BMCs have dual firmware banks — the update writes to the backup bank and switches on reboot, so a failed update should be recoverable. But don't test that theory in production.
6. Not syncing the BMC clock
The BMC has its own real-time clock, independent of the host OS. Over months, it drifts. Your SEL timestamps say an event happened at 14:00, but the actual time was 14:47. You can't correlate SEL events with OS logs, network captures, or monitoring alerts. During an incident, this wastes critical minutes.
Why people do it: Nobody thinks about the BMC clock. It's set once during provisioning and never checked again.
Fix: Sync the BMC clock regularly. Add to your provisioning post-install: ipmitool sel time set "$(date -u '+%m/%d/%Y %H:%M:%S')". Run it periodically via cron or configuration management. Some BMCs support NTP — enable it if available. On Dell iDRAC, NTP settings live under the NTPConfigGroup: racadm set idrac.NTPConfigGroup.NTPEnable Enabled, then set idrac.NTPConfigGroup.NTP1 to your NTP server.
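For BMCs without NTP, a cron-able sketch that reports the drift before fixing it (in-band access assumed; GNU date is assumed for the drift arithmetic):

```shell
# Report BMC clock drift, then push the host's UTC time to the BMC.
sync_bmc_clock() {
    bmc_epoch=$(date -u -d "$(ipmitool sel time get)" +%s 2>/dev/null)
    host_epoch=$(date -u +%s)
    if [ -n "$bmc_epoch" ]; then
        drift=$((host_epoch - bmc_epoch))
        echo "BMC clock drift: ${drift}s"
    fi
    ipmitool sel time set "$(date -u '+%m/%d/%Y %H:%M:%S')"
}
```

Logging the drift value over time also tells you which BMCs have failing RTC batteries.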
7. Leaving the default IPMI user (root/calvin, ADMIN/ADMIN) active
You create a new admin user during provisioning. You set a strong password. But you never disable or change the password on the default user (ID 2). An attacker or misconfigured script uses the default credentials and has full BMC access.
Why people do it: "We created a new user, so we're fine." But the old user still works. IPMI doesn't enforce password policies or lock accounts after failed attempts.
Fix: After creating your operational user, either change the default user's password or disable it entirely:
# Option 1: disable default user
ipmitool user disable 2
# Option 2: change password to something random
ipmitool user set password 2 "$(openssl rand -base64 24)"
Add this to your provisioning automation. Audit with: ipmitool user list 1 across your fleet.
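A sketch of that audit. The column layout of user list output varies slightly between BMCs, so the awk parsing is an assumption to verify against your hardware:

```shell
# Print the account name sitting in user slot 2 (the common
# factory-default slot on Dell and Supermicro BMCs).
default_user_name() {
    ipmitool user list 1 | awk 'NR > 1 && $1 == 2 { print $2; exit }'
}

# Example policy check: warn if slot 2 still holds a factory name.
check_default_user() {
    name=$(default_user_name)
    case "$name" in
        root|ADMIN) echo "WARNING: factory-default user '$name' still present" ;;
        *) echo "ok" ;;
    esac
}
```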
8. Not loading IPMI kernel modules on the host OS
You SSH into a server and try ipmitool sensor list. It fails with "Could not open device at /dev/ipmi0" or "No such file or directory." You assume ipmitool is broken and start debugging the wrong thing. The real issue: the IPMI kernel modules aren't loaded.
Why people do it: Most Linux distros don't load ipmi_devintf and ipmi_si by default. If nobody added them to the provisioning kickstart, they're missing.
Fix: Add to your base image or kickstart %post:
modprobe ipmi_devintf
modprobe ipmi_si
echo "ipmi_devintf" >> /etc/modules-load.d/ipmi.conf
echo "ipmi_si" >> /etc/modules-load.d/ipmi.conf
Verify with ls /dev/ipmi0. If the modules load but the device doesn't appear, check BIOS — "IPMI over KCS" or "IPMI BMC" must be enabled.
9. Sending raw IPMI commands from Stack Overflow without understanding them
Supermicro fan speed control requires raw IPMI commands (e.g., ipmitool raw 0x30 0x70 0x66 0x01 0x00 0x64). You copy one from a forum post. It's for a different motherboard revision. Instead of setting fans to 100%, it disables automatic fan control entirely. CPUs thermal-throttle, then shut down on thermal trip. Or worse: the raw command writes to a BMC register that requires a reflash to undo.
Why people do it: Vendor documentation for raw IPMI commands is sparse or nonexistent. Forum posts are the only source. The commands look opaque, so people copy-paste without understanding.
Fix: Never run raw IPMI commands you don't fully understand. Verify the command against your specific motherboard model and BMC firmware version. Test on a lab machine first. If you must use raw commands, document what each byte means. Prefer vendor CLIs (racadm, sum, hponcfg) over raw IPMI — they validate inputs.
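Here is what "document what each byte means" looks like for the fan-duty example above. The byte meanings below are the community-documented layout for Supermicro X10/X11-era boards — an assumption, not vendor-verified; confirm them against your exact board and BMC firmware before running anything:

```shell
# Annotated wrapper around the raw command from the example above.
# Byte layout (assumed, Supermicro OEM):
#   0x30  NetFn: Supermicro OEM
#   0x70  Command: fan/PWM control family
#   0x66  Sub-command: fan duty cycle
#   0x01  Action: 0x00 = read, 0x01 = write
#   zone  0x00 = CPU zone, 0x01 = peripheral zone
#   pct   duty cycle, 0-100
set_fan_duty() {
    zone="$1"
    pct="$2"
    if [ "$pct" -lt 0 ] || [ "$pct" -gt 100 ]; then
        echo "duty cycle must be 0-100" >&2
        return 1
    fi
    ipmitool raw 0x30 0x70 0x66 0x01 "$zone" "$pct"
}
```

Even a thin wrapper like this beats a bare copy-paste: the bytes are named, and an out-of-range duty cycle is rejected before it reaches the BMC.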
10. Running ipmitool across the fleet with passwords in command arguments
You write a fleet management script: ipmitool -I lanplus -H $BMC -U admin -P MySecret123 power status. The password is visible in ps aux, shell history, CI logs, and the script itself.
Why people do it: It works. It's easy to script. The -P flag is the obvious choice.
Fix: Use the -E flag, which reads the password from the IPMI_PASSWORD environment variable:
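For example (the admin username and lanplus transport are assumptions):

```shell
# With -E the password never appears on the command line, so it is
# invisible in ps output, shell history, and CI logs.
export IPMI_PASSWORD='MySecret123'

bmc_status() {
    ipmitool -I lanplus -H "$1" -U admin -E power status
}
```

Usage: bmc_status 10.0.0.42 — only the hostname and username are visible to other processes.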
Or use -f to read from a file:
echo -n "MySecret123" > /etc/ipmi-creds/bmc-password
chmod 600 /etc/ipmi-creds/bmc-password
ipmitool -I lanplus -H $BMC -U admin -f /etc/ipmi-creds/bmc-password power status
Better yet: use a secrets manager (Vault, AWS Secrets Manager) and inject credentials at runtime.
11. Ignoring "SEL is full" alerts
The monitoring system fires a "BMC SEL is full" alert. You snooze it — "it's just logs." The SEL fills up. New events are dropped. A PSU fails silently. A fan dies silently. The next alert you get is a thermal shutdown.
Why people do it: SEL alerts feel like noise compared to application alerts. They don't page anyone. They're "infrastructure housekeeping."
Fix: Treat SEL full as a P3 alert — not urgent, but must be handled within 24 hours. Automate collection and clearing. If you're ignoring SEL alerts, your monitoring stack is misconfigured — route them to the infrastructure team's queue, not to /dev/null.
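A monitoring-check sketch for that routing. The 80% threshold is arbitrary, and the "Percent Used" field name in sel info output should be verified against your BMCs:

```shell
# Exit nonzero (alert) when the SEL is nearly full, in the style of a
# Nagios/Icinga check. Runs in-band via /dev/ipmi0.
check_sel_space() {
    threshold="${1:-80}"
    pct=$(ipmitool sel info | awk -F': *' '/Percent Used/ { print $2 }' | tr -dc '0-9')
    if [ -n "$pct" ] && [ "$pct" -ge "$threshold" ]; then
        echo "CRITICAL: SEL is ${pct}% full"
        return 2
    fi
    echo "OK: SEL ${pct:-0}% used"
}
```

Wire the CRITICAL state to the infrastructure team's queue so the alert triggers the archive-and-clear routine, not a snooze.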
12. Testing BMC changes on production servers
You want to change the BMC VLAN configuration. You run ipmitool lan set 1 vlan id 200 on a production server's BMC. The management switch port is on VLAN 100. The BMC is now unreachable. You need physical hands to fix it.
Why people do it: "It's just a network config change on the BMC, it won't affect the host." True — it won't affect the host. But it will make the BMC unreachable, which means you've lost your emergency access to that server.
Fix: Test BMC network changes on a lab server first. When changing VLAN or IP on production BMCs, verify you have physical or in-band access as a fallback. Better: run ipmitool lan set from the host OS (in-band via /dev/ipmi0), so even if you misconfigure the BMC network, you can still fix it via SSH.
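The in-band pattern can be sketched as a change-plus-verify function. Channel 1 and the settle delay are assumptions; lan print should echo the VLAN back under "802.1q VLAN ID":

```shell
# Change the BMC VLAN in-band (via /dev/ipmi0), so a mistake is still
# fixable over SSH, then read the setting back to confirm it took.
set_bmc_vlan() {
    chan="${1:?channel}"
    vlan="${2:?vlan id}"
    ipmitool lan set "$chan" vlan id "$vlan" || return 1
    sleep 2   # some BMCs apply LAN settings asynchronously
    ipmitool lan print "$chan" | grep -i "VLAN ID"
}
```

If the read-back shows the wrong VLAN, you are still SSHed into the host and can revert immediately — the whole point of doing this in-band.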