
Power & UPS Footguns

Mistakes that cause cascading power outages or data loss during power events.


1. Single PSU servers in production

Your server has one PSU. PDU-A trips a breaker. Every single-PSU server on that PDU goes down instantly, even though PDU-B is fine. You just lost half the tier because someone saved $200 per server on redundant power supplies.

Fix: All production servers must have 1+1 PSU redundancy. Each PSU connects to a different PDU. Verify with ipmitool sdr type "Power Supply".
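The PSU check above can be scripted. A minimal sketch, assuming ipmitool-style sensor output (the sample lines and PSU names are illustrative; on a real host, pipe the output of ipmitool sdr type "Power Supply" into the check):

```shell
# Count power supplies reporting "Presence detected" in ipmitool-style
# sdr output; a sample stands in for the real command here.
count_ok_psus() {
  grep -c 'Presence detected'
}
sample='PSU1 Status | 6Ch | ok | 10.1 | Presence detected
PSU2 Status | 6Dh | ok | 10.2 | Presence detected'
n=$(printf '%s\n' "$sample" | count_ok_psus)
if [ "$n" -ge 2 ]; then
  echo "PSU redundancy OK ($n supplies)"
else
  echo "WARNING: only $n healthy PSU(s) - no 1+1 redundancy"
fi
```

Run it per host from your config management tool and alert on any server reporting fewer than two healthy supplies.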


2. Both PSUs on the same PDU

Your server has dual PSUs, but both are plugged into PDU-A. When PDU-A loses power, both PSUs die simultaneously. Redundancy is zero.

Fix: Cable PSU-1 to PDU-A and PSU-2 to PDU-B. Verify by checking PDU outlet maps against server rear panel labels. Audit quarterly.


3. Not testing UPS battery health

Your UPS shows "online" for three years. The battery has degraded to 20% actual capacity. Power fails. Instead of 15 minutes of runtime, you get 90 seconds. The generator has not started yet. Servers crash ungracefully.

Fix: Schedule monthly UPS battery tests. Monitor upsc myups battery.charge and battery.runtime under load. Replace batteries per vendor schedule (typically every 3-5 years).
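A sketch of the runtime check, parsing upsc-style key/value output. The UPS name "myups", the sample readings, and the thresholds (below 90% charge or below 600 seconds of runtime) are assumptions to adapt to your hardware:

```shell
# Parse upsc-style output and alert on a degraded battery. On a real
# host the sample would come from: upsc myups
sample='battery.charge: 97
battery.runtime: 840'
charge=$(printf '%s\n' "$sample" | awk -F': ' '$1=="battery.charge"{print $2}')
runtime=$(printf '%s\n' "$sample" | awk -F': ' '$1=="battery.runtime"{print $2}')
# Thresholds are assumptions: <90% charge or <600 s runtime is suspect
if [ "$charge" -lt 90 ] || [ "$runtime" -lt 600 ]; then
  echo "ALERT: battery degraded (charge=${charge}%, runtime=${runtime}s)"
else
  echo "battery OK (charge=${charge}%, runtime=${runtime}s)"
fi
```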

War story: In 2023, a Google Cloud US-East5 outage lasted 6+ hours. Root cause: utility power loss triggered a "cascading failure" in the UPS system. The UPS batteries failed to sustain the load, and the generator handoff did not complete in time. Google worked with their UPS vendor to remediate battery system issues. Batteries degrade — a 3-year-old battery at 20% capacity gives 90 seconds instead of 15 minutes.


4. Shutdown order wrong — storage dies before servers

Power is failing. You panic and shut everything down simultaneously. The storage array powers off while servers are still writing. Filesystem corruption on every server. Hours of fsck and possible data loss.

Fix: Follow the order: applications, VMs, hypervisors, storage arrays, network switches. Storage goes down second-to-last because everything else depends on it. Script the sequence and test it.

Remember: the shutdown mnemonic is A-V-H-S-N (Apps, VMs, Hypervisors, Storage, Network). Startup is the reverse: N-S-H-V-A. Storage is always second-to-last down and second up because everything depends on it for persistent state.
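The A-V-H-S-N sequence can be sketched as a script. Host names are placeholders and the actual shutdown command is left commented out, so this is safe to dry-run:

```shell
# Walk the tiers in A-V-H-S-N order, recording the sequence as we go.
order=""
shutdown_tier() {
  tier=$1; shift
  order="$order$tier "              # record sequence for verification
  for host in "$@"; do
    echo "[$tier] stopping $host"
    # ssh "$host" 'sudo shutdown -h now'   # real action, commented out
  done
}
shutdown_tier apps        app1 app2
shutdown_tier vms         vm1 vm2
shutdown_tier hypervisors hv1 hv2
shutdown_tier storage     san1    # second-to-last: everything above needs it
shutdown_tier network     sw1     # last down, first up on restore
```

Keeping the sequence in one script (rather than in people's heads) is what makes it testable during a scheduled maintenance window.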


5. No automatic shutdown configured

UPS goes to battery. Nobody is awake. Battery drains to zero. Every server crashes without flushing buffers. Database journals are corrupt. Application state files are half-written.

Fix: Configure NUT or apcupsd for automatic graceful shutdown when battery reaches a threshold. Test it with a simulated power event. Verify with grep SHUTDOWNCMD /etc/nut/upsmon.conf.
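A minimal upsmon.conf fragment showing the relevant directives. The UPS name, credentials, and shutdown path are placeholders; verify the exact syntax against your NUT version's documentation:

```
# /etc/nut/upsmon.conf (fragment)
MONITOR myups@localhost 1 upsmon secretpass master
SHUTDOWNCMD "/sbin/shutdown -h +0"
# upsmon runs SHUTDOWNCMD when the UPS reports on-battery + low-battery (OB LB)
```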


6. Overloading a PDU circuit

You add four new servers to a rack without checking the PDU amperage. The circuit is at 95% capacity. One server spikes during boot and trips the breaker. Every server on that circuit goes down.

Fix: Never exceed 80% of rated circuit capacity. Monitor per-circuit amperage via SNMP. Check before adding servers: snmpwalk -v2c -c public pdu.dc.local <oid>.

Under the hood: NEC (National Electrical Code) 210.20 requires circuits to be derated to 80% for continuous loads (loads lasting 3 hours or more). A 30A circuit is rated for 24A continuous. Boot storms can spike draw 2-3x above steady state — four servers booting simultaneously after a power event draw significantly more than in normal operation.
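The 80% rule reduces to a one-line headroom check. In practice the rated and measured amps would come from the PDU over SNMP; the values here are examples, and integer amps keep the arithmetic in pure POSIX sh:

```shell
# Compare measured circuit draw against the NEC 80% continuous-load limit.
rated_amps=30
measured_amps=26
limit=$((rated_amps * 80 / 100))   # 30A circuit -> 24A continuous
if [ "$measured_amps" -gt "$limit" ]; then
  echo "OVER: ${measured_amps}A exceeds ${limit}A limit on a ${rated_amps}A circuit"
else
  echo "OK: ${measured_amps}A of ${limit}A continuous limit"
fi
```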


7. Generator not tested under load

Your generator starts monthly on a test schedule, but it runs at idle with no load. During a real outage, it starts fine but cannot handle the load. Voltage sags. UPS goes to bypass. Servers see dirty power and crash.

Fix: Run monthly generator load tests at 75-100% of rated capacity. Verify ATS transfer. Check voltage and frequency stability under load.


8. Ignoring UPS bypass mode

A UPS technician puts the UPS in bypass for maintenance and forgets to take it out. The UPS is now a pass-through — no battery protection. Power flickers. Servers crash because there is zero transfer time.

Fix: Monitor UPS mode. Alert on bypass: upsc myups ups.status should show OL, not BYPASS. Never leave bypass unattended. Create a post-maintenance checklist.
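A sketch of the status check. NUT can report bypass alongside other tokens (e.g. "OL BYPASS"), so bypass must be matched before OL. The status strings are samples; on a real host you would set status=$(upsc myups ups.status):

```shell
# Classify a NUT ups.status string for alerting.
ups_state() {
  case "$1" in
    *BYPASS*) echo bypass ;;      # check first: may appear as "OL BYPASS"
    OB*)      echo on-battery ;;
    OL*)      echo online ;;
    *)        echo unknown ;;
  esac
}
status="OL BYPASS"
echo "myups: $(ups_state "$status")"   # bypass -> page someone
```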


9. No monitoring on PDU inlet temperature

Cooling fails in the row. Inlet temperature climbs to 45C. Servers throttle, then shut down on thermal protection. You did not know because nobody was monitoring the PDU environmental sensor.

Fix: Monitor inlet temperature per rack. Alert at 35C warning, 40C critical. PDUs with environmental sensors report via SNMP. Feed into Prometheus or Nagios.
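The two thresholds map directly to an alert classifier. The reading here is an example; a real value would come from the PDU's environmental sensor OID via snmpget:

```shell
# Map an inlet temperature (integer Celsius) to the alert levels above:
# 35C warning, 40C critical.
temp_level() {
  if [ "$1" -ge 40 ]; then echo critical
  elif [ "$1" -ge 35 ]; then echo warning
  else echo ok
  fi
}
inlet_c=37
echo "rack inlet ${inlet_c}C: $(temp_level "$inlet_c")"
```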


10. Power event with no communication plan

UPS goes to battery. The on-call engineer sees the alert but does not know the shutdown procedure, who to call, or whether the generator is supposed to start. Precious minutes of battery runtime are wasted on confusion.

Fix: Document the power event runbook. Include: who to page, shutdown sequence, generator status check, escalation contacts for facilities. Post it physically in the datacenter and digitally in the on-call wiki.