Power & UPS - Street-Level Ops

Real-world power monitoring, UPS management, and emergency shutdown workflows.

Check UPS status — is it on battery?

# NUT (Network UPS Tools)
upsc myups ups.status
# OL                    (OL=online, OB=on battery, LB=low battery)

upsc myups battery.charge
# 100

upsc myups battery.runtime
# 1800                  (seconds — 30 minutes of runtime)

# APC UPS
apcaccess status
# STATUS   : ONLINE
# BCHARGE  : 100.0 Percent
# TIMELEFT : 30.0 Minutes
# LOADPCT  : 42.5 Percent

Remember: NUT status codes: OL = Online (utility power), OB = On Battery, LB = Low Battery (shutdown imminent). ups.status can carry several flags at once. When you see OB LB together, you have minutes or less before the UPS dies. Start the shutdown sequence immediately.
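Because ups.status is a space-separated list of flags, scripts should match whole tokens rather than compare the full string. A minimal sketch, assuming only the three flags above matter; the function name and the action labels are made up for illustration:

```shell
# Map a NUT ups.status value (space-separated flags) to a suggested action.
# ups_action and its output labels are illustrative, not part of NUT.
ups_action() {
    flags=" $1 "                      # pad so every flag is space-delimited
    case "$flags" in
        *" OB "*)
            case "$flags" in
                *" LB "*) echo "shutdown-now" ;;   # on battery AND low battery
                *)        echo "triage" ;;         # on battery, still has runtime
            esac ;;
        *" OL "*) echo "ok" ;;                     # on utility power
        *)        echo "unknown" ;;                # empty or unexpected status
    esac
}
```

Typical use would be ups_action "$(upsc myups ups.status)", with the result feeding an alert or a cron check.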

Read server power consumption

# IPMI power reading
ipmitool dcmi power reading
# Instantaneous power reading:  345 Watts
# Minimum power reading:        280 Watts
# Maximum power reading:        510 Watts
# Average power reading:        350 Watts

# PSU status
ipmitool sdr type "Power Supply"
# PS1 Status       | ok    | Presence detected
# PS2 Status       | ok    | Presence detected

Monitor PDU per-outlet power via SNMP

# Walk a Raritan PDU for outlet current readings
# (13742 is Raritan's enterprise OID; APC devices live under 318 and need different OIDs)
snmpwalk -v2c -c public pdu-a.dc.local .1.3.6.1.4.1.13742.6.5.4.3.1.4
# .1.4.1.1 = INTEGER: 120    (outlet 1: 1.20 amps)
# .1.4.1.2 = INTEGER: 85     (outlet 2: 0.85 amps)
# .1.4.1.3 = INTEGER: 0      (outlet 3: empty)

# Check total inlet current (everything this PDU feeds)
snmpget -v2c -c public pdu-a.dc.local .1.3.6.1.4.1.13742.6.5.2.3.1.4.1
# INTEGER: 3250              (32.50 amps total)
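The raw integers are scaled, so a small filter helps when eyeballing a walk. A sketch that converts each line to amps and adds them up, assuming the hundredths-of-an-amp scaling shown in the sample output above (check your device's MIB for the real scaling); outlet_amps is a hypothetical helper name:

```shell
# Convert "<oid> = <TYPE>: <raw>" lines from snmpwalk into amps plus a total.
# Assumes raw values are hundredths of an amp, as in the sample output above.
outlet_amps() {
    awk '$2 == "=" && $4 ~ /^[0-9]+$/ {
             printf "%s %.2f A\n", $1, $4 / 100   # per-outlet reading
             total_raw += $4                      # sum raw integers, scale once
         }
         END { printf "total %.2f A\n", total_raw / 100 }'
}
```

Pipe the snmpwalk command above into outlet_amps to get one line per outlet and a total at the end.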

UPS just went to battery — triage

# 1. Confirm UPS is on battery
upsc myups ups.status
# OB                    (on battery!)

# 2. Check remaining runtime
upsc myups battery.runtime
# 900                   (15 minutes)

# 3. Check input voltage (is utility truly out?)
upsc myups input.voltage
# 0.0                   (utility is dead)

# 4. Check if generator started
# (vendor-specific — often a relay contact or SNMP trap)
snmpget -v2c -c public ats-01.dc.local .1.3.6.1.4.1.318.1.1.8.5.1.2.0
# INTEGER: 2              (2 = source B / generator active)

# 5. If no generator — initiate graceful shutdown sequence
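Step 5 is a race against the battery: start shutting down while remaining runtime still covers the time the shutdown takes, plus a margin. A sketch of that comparison; the function name, the 120-second default margin, and any shutdown duration you pass in are assumptions to replace with numbers measured in a drill:

```shell
# Decide whether to start the graceful shutdown now.
#   $1 = battery.runtime in seconds (from upsc)
#   $2 = seconds your full shutdown sequence takes (measure it)
#   $3 = safety margin in seconds (assumed default: 120)
# Prints "shutdown" or "wait".
shutdown_decision() {
    runtime_s=${1%%.*}        # drop any fractional part
    needed_s=$2
    margin_s=${3:-120}
    if [ "$runtime_s" -le $((needed_s + margin_s)) ]; then
        echo "shutdown"
    else
        echo "wait"
    fi
}
```

For example, shutdown_decision "$(upsc myups battery.runtime)" 600 says "wait" at 900 seconds of runtime and flips to "shutdown" once runtime drops to 720 seconds or less.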

Graceful shutdown sequence

# Shutdown order: apps → VMs → hypervisors → storage → switches

# Step 1: Drain load balancer and stop applications
ssh web-01 'systemctl stop nginx'
ssh app-01 'systemctl stop myapp'

# Step 2: Shut down VMs
for vm in $(virsh list --name); do
    echo "Shutting down ${vm}..."
    virsh shutdown "${vm}"
done
sleep 60

# Step 3: Verify VMs are off (any name listed here is still running)
virsh list --name --state-running

# Step 4: Shut down hypervisors/bare metal
shutdown -h +1 "UPS battery low — emergency shutdown"

# Step 5: NUT automatic shutdown (preconfigured)
# /etc/nut/upsmon.conf:
# MONITOR myups@localhost 1 admin secret master
# (NUT 2.8 and later prefer "primary" in place of "master")
# SHUTDOWNCMD "/sbin/shutdown -h +0"
# FINALDELAY 5
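The fixed sleep 60 in step 2 either wastes battery or is too short for slow guests. A sketch of a polling wait that returns as soon as nothing is running or gives up after a timeout; wait_until_empty is a hypothetical helper, and the timeout values are yours to choose:

```shell
# Poll until a command prints nothing, or give up after a timeout.
#   $1 = timeout in seconds, $2 = poll interval in seconds, rest = command
# Returns 0 once the command's output is empty, 1 on timeout.
# Caveat: a command that fails outright also prints nothing, so confirm
# the command works before trusting a quick "all clear".
wait_until_empty() {
    timeout_s=$1
    interval_s=$2
    shift 2
    elapsed_s=0
    while [ -n "$("$@" 2>/dev/null)" ]; do
        if [ "$elapsed_s" -ge "$timeout_s" ]; then
            return 1
        fi
        sleep "$interval_s"
        elapsed_s=$((elapsed_s + interval_s))
    done
    return 0
}
```

In place of sleep 60, something like wait_until_empty 120 5 virsh list --name --state-running would wait up to two minutes and return non-zero if guests are still up, at which point you decide whether to force them off.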

Calculate rack power budget

# Count servers and their PSU ratings
ipmitool dcmi power reading          # per-server reading

# Quick rack audit
for host in rack12-{01..20}; do
    watts=$(ssh "$host" "ipmitool dcmi power reading 2>/dev/null | grep Instantaneous | awk '{print \$4}'" 2>/dev/null)
    echo "${host}: ${watts}W"
done
# rack12-01: 345W
# rack12-02: 280W
# ...
# Total: ~6,500W = 6.5 kW

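# Check PDU capacity (typically 30A @ 208V single-phase = 6.24 kW per circuit)
# At 80% safe load: 4.99 kW per circuit (NEC 80% continuous load rule)
# Two PDUs as redundant A/B feeds: budget the whole rack against ONE feed (4.99 kW),
# because the surviving PDU must carry everything if the other fails.
# The 6.5 kW rack above fits within the 9.98 kW combined, but not on a single feed.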
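The arithmetic in the comments above (amps times volts, derated to 80%) and the "Total" line of the audit are easy to script. A sketch assuming single-phase circuits and the "host: 345W" line format the audit loop prints; both function names are hypothetical:

```shell
# Usable continuous watts for one circuit: amps x volts x 0.8.
#   $1 = breaker amps, $2 = volts (single-phase assumed)
circuit_budget_w() {
    awk -v a="$1" -v v="$2" 'BEGIN { printf "%.0f\n", a * v * 0.8 }'
}

# Sum "host: 345W" lines (the audit loop's output) into total watts.
# Hosts that returned nothing ("host: W") count as 0.
sum_watts() {
    awk -F': *' '{ sub(/W$/, "", $2); total += $2 }
                 END { printf "%d\n", total }'
}
```

circuit_budget_w 30 208 prints 4992; pipe the audit loop into sum_watts and compare the result against the budget of whichever feed has to carry the load.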

Test NUT shutdown sequence (non-destructively)

# Verify upsmon is running and configured
systemctl status nut-monitor
upsc myups

# Dry-run: check what upsmon would do
grep SHUTDOWNCMD /etc/nut/upsmon.conf
# SHUTDOWNCMD "/sbin/shutdown -h +0"

# Test notification scripts
grep NOTIFYCMD /etc/nut/upsmon.conf
# NOTIFYCMD /usr/local/bin/notify-power-event.sh

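# Simulate an on-battery notification (test only, will NOT shut down).
# upsmon has no "notify" command: -c takes fsd, stop, or reload,
# and "upsmon -c fsd" triggers a REAL forced shutdown.
# Instead, run the notify script by hand the way upsmon would:
NOTIFYTYPE=ONBATT UPSNAME=myups@localhost \
    /usr/local/bin/notify-power-event.sh "UPS myups@localhost on battery"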
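For reference, upsmon runs NOTIFYCMD with the message as its first argument and sets NOTIFYTYPE and UPSNAME in the environment (the event also needs EXEC in its NOTIFYFLAG line). The contents of notify-power-event.sh are not shown on this page, so the following is only a guess at what such a script's body might look like, with hypothetical function names:

```shell
# Hypothetical body for a NOTIFYCMD script.
# upsmon passes the message as $1 and exports NOTIFYTYPE and UPSNAME.
format_power_event() {
    printf '[%s] %s: %s\n' "${NOTIFYTYPE:-UNKNOWN}" "${UPSNAME:-unknown-ups}" "$1"
}

handle_power_event() {
    line=$(format_power_event "$1")
    # Log to syslog; add your pager or chat hook here.
    logger -t power-event "$line"
    echo "$line"
}
```

Calling NOTIFYTYPE=ONBATT UPSNAME=myups@localhost handle_power_event "test event" by hand exercises the alert path without involving upsmon at all.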

War story: A datacenter tested generator failover monthly but never tested the ATS (Automatic Transfer Switch) under real load. When utility power failed, the ATS stuck in the transfer position and neither utility nor generator power reached the rack PDUs. 30 seconds of UPS runtime was not enough. Test the full power chain — UPS, ATS, and generator — under realistic load, not just idle.

Verify PSU redundancy

# Check both PSUs are present and healthy
# (sensor names vary by vendor; "PS1 Status" would not match a plain "psu" grep)
ipmitool sensor list | grep -iE "^ps[0-9]|psu"
# PS1 Status       | 0x1        | discrete | ok    | Presence detected
# PS2 Status       | 0x1        | discrete | ok    | Presence detected

# Dell iDRAC
racadm get system.power.redundancypolicy
# RedundancyPolicy = 1+1

# What happens if we lose one PSU? Check wattage headroom
ipmitool dcmi power reading | grep -i "maximum\|instantaneous"
# Instantaneous: 345W
# Maximum:       510W
# Single PSU capacity is typically 750-1100W — plenty of headroom
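The headroom check is a single subtraction: one PSU's rated capacity minus the server's peak draw. A sketch, with psu_loss_check as a hypothetical helper; take the PSU rating from the unit's label or spec sheet, not from the range quoted above:

```shell
# Can one PSU carry the server's peak draw on its own?
#   $1 = maximum power reading (W), $2 = rated capacity of ONE PSU (W)
psu_loss_check() {
    headroom=$(( $2 - $1 ))
    if [ "$headroom" -ge 0 ]; then
        echo "ok: ${headroom}W headroom on one PSU"
    else
        echo "overload: peak exceeds one PSU by $(( -headroom ))W"
    fi
}
```

With the sample reading above and a 750W supply, psu_loss_check 510 750 reports 240W of headroom.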

Check for thermal issues affecting power

# High temps = fans spin faster = higher power draw
ipmitool sensor list | grep -iE "temp|fan"
# Inlet Temp       | 28.000     | degrees C | ok
# CPU1 Temp        | 72.000     | degrees C | ok
# Fan1             | 9600.000   | RPM       | ok    (high RPM = hot)

# If inlet temp is high, check CRAC/CRAH units
# If CPU temp is high with normal inlet, check for blocked airflow

Debug clue: Fan RPM and power draw are correlated. When fans spin at maximum RPM, power consumption can jump 50-100W per server. If your rack hits PDU limits during a heat event, the likely cause is every server in the rack spinning its fans to max at once. Budget power for worst-case thermal load, not idle.
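To put a number on "worst-case thermal", add a per-server fan allowance to the measured rack total. A sketch using the figures from this page; worst_case_rack_w is a hypothetical helper, and the per-server allowance is whatever you measure on your own hardware:

```shell
# Worst-case rack draw: measured total plus a per-server fan allowance.
#   $1 = measured rack total (W)
#   $2 = number of servers
#   $3 = extra watts per server at max fan speed (50-100 per the note above)
worst_case_rack_w() {
    echo $(( $1 + $2 * $3 ))
}
```

For the 20-server, 6,500W rack from the audit, worst_case_rack_w 6500 20 100 prints 8500. Compare that figure, not the idle one, against the PDU budget.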

SNMP alerting for power events

# Test SNMP trap receiver for UPS events
snmptrap -v 2c -c public monitoring.dc.local '' \
    .1.3.6.1.4.1.318.0.5 \
    .1.3.6.1.4.1.318.0.5 s "UPS on battery"

# Prometheus NUT exporter check
# (port, path, and metric names differ between exporters; adjust to the one you run)
curl -s http://localhost:9199/metrics | grep nut_ups
# nut_ups_status{ups="myups"} 2     (2=OL, 3=OB)
# nut_battery_charge{ups="myups"} 100

Gotcha: SNMP community strings on PDUs and UPSes are almost always left at the default public (read) and private (write). An attacker on the management VLAN can use snmpset with community private to remotely power off individual PDU outlets. Change community strings and restrict SNMP to your monitoring subnet with ACLs on the device.
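A quick audit for leftover defaults: query sysDescr on each device with both default strings and report any that answer. A sketch for devices you own; the function names and the SNMPGET override variable are assumptions. An answer to "private" here only proves the string is accepted for reads, so confirm write access in the device's own configuration:

```shell
# Report devices that still answer to the default SNMP community strings.
# Reads hostnames from stdin, one per line. Only run against your own gear.
SNMPGET=${SNMPGET:-snmpget}          # override to stub out in tests

answers_to_community() {             # $1 = host, $2 = community
    # sysDescr.0; 2-second timeout, no retries, stdin detached
    "$SNMPGET" -v2c -c "$2" -t 2 -r 0 "$1" .1.3.6.1.2.1.1.1.0 \
        >/dev/null 2>&1 </dev/null
}

audit_default_communities() {
    while read -r host; do
        for community in public private; do
            if answers_to_community "$host" "$community"; then
                echo "$host: answers to default community '$community'"
            fi
        done
    done
}
```

Feed it a host list, for example printf 'pdu-a.dc.local\nats-01.dc.local\n' | audit_default_communities, and treat every line of output as a finding to fix.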