Incident Replay: BMC Clock Skew Causes Certificate Failure

Setup

  • System context: Fleet of 50 servers managed via Redfish/iDRAC. TLS certificates for BMC web interfaces are issued by an internal CA with strict validity windows.
  • Time: Monday 09:12 UTC
  • Your role: Infrastructure engineer / on-call SRE

Round 1: Alert Fires

[Pressure cue: "Monitoring fires — 12 servers showing 'BMC unreachable via HTTPS' alerts. Management tools cannot pull hardware health data."]

What you see: Redfish API calls to a batch of servers return SSL: CERTIFICATE_VERIFY_FAILED. The BMC web UI shows a certificate warning in the browser. Servers themselves are running fine — this is BMC-only.
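
One quick triage step from a jump host is to attempt a verified TLS handshake and inspect OpenSSL's verify message, which distinguishes validity-window failures from other certificate problems. A minimal Python sketch — the helper names are illustrative, and it assumes the internal CA is present in the default trust store:

```python
import socket
import ssl

# OpenSSL verify messages that indicate a validity-window (clock) problem.
TIME_ERRORS = ("certificate is not yet valid", "certificate has expired")

def is_time_related(verify_message: str) -> bool:
    """True if an OpenSSL verify failure message points at the validity window."""
    return any(t in (verify_message or "").lower() for t in TIME_ERRORS)

def probe_bmc_tls(host: str, port: int = 443, timeout: float = 5.0) -> str:
    """Attempt a verified TLS handshake against a BMC and classify the outcome
    as 'ok', 'clock-skew-likely', or 'other-tls-error'."""
    ctx = ssl.create_default_context()
    try:
        with socket.create_connection((host, port), timeout=timeout) as sock:
            with ctx.wrap_socket(sock, server_hostname=host):
                return "ok"
    except ssl.SSLCertVerificationError as exc:
        if is_time_related(exc.verify_message):
            return "clock-skew-likely"
        return "other-tls-error"
```

A "clock-skew-likely" result on several hosts at once is a strong hint that the certificates themselves are not the problem.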

Choose your action:

  • A) Regenerate and push new TLS certificates to all affected BMCs
  • B) Check the BMC system clock on one affected server via IPMI
  • C) Disable TLS verification in the monitoring tool to restore visibility
  • D) Restart the BMC/iDRAC service on each affected server

If you chose A:

[Result: New certs are issued but the BMC clock is still wrong — the new certs also fail validation because the BMC thinks the "not before" date is in the future. Wasted 20 minutes.]

If you chose B:

[Result: ipmitool sel time get shows the BMC clock is 3 days behind. The cert's "not before" timestamp is after the BMC's current time, causing the validation failure. Proceed to Round 2.]

If you chose C:

[Result: Monitoring resumes but you have lost certificate validation — a security regression. CISO flags this in the weekly review. Not a real fix.]

If you chose D:

[Result: BMC restarts but the clock is still wrong. Same cert error after restart. 5 minutes wasted per server.]

Round 2: First Triage Data

[Pressure cue: "The NOC is manually checking hardware health on 12 servers. They want an ETA for automated monitoring."]

What you see: BMC clocks on all 12 servers are running 2-4 days behind. The servers were racked in the same batch last week. Each BMC's NTP configuration points to an internal NTP server.
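
A fleet-wide clock reading like the one above can be gathered with a short sweep. A hedged sketch — the lanplus interface, password auth, and the helper names are assumptions about this fleet:

```python
import subprocess

def sel_time_cmd(host: str, user: str, password: str) -> list[str]:
    """Build the ipmitool invocation to read one BMC's clock."""
    return ["ipmitool", "-I", "lanplus", "-H", host,
            "-U", user, "-P", password, "sel", "time", "get"]

def sweep_bmc_clocks(hosts, user, password, timeout=15):
    """Return {host: clock-string-or-None} for a batch of BMCs.
    None marks hosts that timed out or returned an error."""
    results = {}
    for host in hosts:
        try:
            out = subprocess.run(sel_time_cmd(host, user, password),
                                 capture_output=True, text=True,
                                 timeout=timeout, check=True)
            results[host] = out.stdout.strip()
        except (subprocess.SubprocessError, OSError):
            results[host] = None
    return results
```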

Choose your action:

  • A) Manually set each BMC clock via ipmitool sel time set
  • B) Check if the BMC NTP server is reachable from the BMC management VLAN
  • C) Sync BMC time from the host OS clock via ipmitool mc selftest
  • D) Check the NTP server itself for issues

If you chose A:

[Result: Fixes the symptom on one server but does not prevent recurrence. You would need to do this for all 12 and it will drift again. Slow path.]

If you chose B:

[Result: An NTP query from a host on the BMC VLAN to the NTP server times out. A firewall rule change last Friday blocked UDP 123 egress from the BMC VLAN. Root cause identified. Proceed to Round 3.]

If you chose C:

[Result: ipmitool mc selftest runs a BMC self-test; it does not sync the clock. You were looking at the wrong command — 10 minutes lost reading documentation.]

If you chose D:

[Result: NTP server is healthy and serving correct time to other VLANs. This narrows the problem to the network path; you reach the same firewall finding after 8 extra minutes.]
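
The reachability check in option B can be scripted without ntpdate by hand-rolling a minimal SNTP query over UDP 123. A sketch, assuming RFC 4330 SNTP semantics; the function names are illustrative:

```python
import socket
import struct

NTP_TO_UNIX = 2208988800  # seconds between the NTP epoch (1900) and Unix epoch (1970)

def parse_sntp_reply(data: bytes) -> int:
    """Extract the transmit-timestamp seconds (bytes 40-43 of an SNTP reply)
    and convert them to a Unix timestamp."""
    secs = struct.unpack("!I", data[40:44])[0]
    return secs - NTP_TO_UNIX

def sntp_probe(server: str, timeout: float = 3.0):
    """Send a minimal SNTP v4 client request over UDP 123 and return the
    server's time as a Unix timestamp, or None on timeout."""
    request = b"\x23" + b"\x00" * 47  # first byte: LI=0, VN=4, Mode=3 (client)
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as s:
        s.settimeout(timeout)
        try:
            s.sendto(request, (server, 123))
            data, _ = s.recvfrom(48)
        except socket.timeout:
            return None
    return parse_sntp_reply(data)
```

Run from a host on the BMC VLAN, a None result against a server that answers from other VLANs points directly at a filtered path.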

Round 3: Root Cause Identification

[Pressure cue: "Change management confirms a firewall rule was pushed Friday at 17:00 that tightened BMC VLAN egress. NTP was collateral damage."]

What you see: Root cause: firewall change blocked UDP 123 from BMC management VLAN to internal NTP servers. BMC clocks drifted, causing TLS certificate time validation to fail.

Choose your action:

  • A) Add a firewall rule to allow UDP 123 from the BMC VLAN to the NTP servers
  • B) Roll back the entire Friday firewall change
  • C) Configure BMCs to use a different NTP server on the BMC VLAN
  • D) Switch BMC certs to longer validity windows to tolerate drift

If you chose A:

[Result: Firewall team adds the allow rule. BMCs sync within 2 minutes. Certificates validate again. Proceed to Round 4.]

If you chose B:

[Result: Rolling back the entire change re-opens other ports that were intentionally closed. Security team objects. Overly broad fix.]

If you chose C:

[Result: No NTP server exists on the BMC VLAN. You would need to deploy one — hours of work for a configuration problem.]

If you chose D:

[Result: Does not fix the current issue and weakens certificate security posture. Band-aid, not a fix.]

Round 4: Remediation

[Pressure cue: "Monitoring is green. Document and close."]

Actions:

  1. Verify all 12 BMC clocks are synced: for h in <hosts>; do ipmitool -H "$h" sel time get; done
  2. Confirm Redfish API calls succeed with TLS verification enabled
  3. Add BMC NTP connectivity to the firewall-change pre-check checklist
  4. Add a monitoring check for BMC clock skew (alert if drift > 60 seconds)
  5. File a change request documenting the BMC VLAN's NTP dependency
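
The skew check in step 4 could look like the following sketch. The MM/DD/YYYY HH:MM:SS timestamp format and UTC BMC clocks are assumptions — ipmitool output varies by firmware:

```python
import datetime

SKEW_THRESHOLD_S = 60  # alert threshold from the remediation checklist

def bmc_skew_seconds(sel_time_output: str,
                     now_utc: datetime.datetime) -> float:
    """Parse `ipmitool sel time get` output (assumed MM/DD/YYYY HH:MM:SS, UTC)
    and return the absolute drift from `now_utc` in seconds."""
    bmc_time = datetime.datetime.strptime(sel_time_output.strip(),
                                          "%m/%d/%Y %H:%M:%S")
    return abs((now_utc - bmc_time).total_seconds())

def skew_alert(sel_time_output: str, now_utc: datetime.datetime) -> bool:
    """True if the BMC clock drifts beyond the allowed threshold."""
    return bmc_skew_seconds(sel_time_output, now_utc) > SKEW_THRESHOLD_S
```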

Damage Report

  • Total downtime: 0 (servers were fine; BMC management was degraded)
  • Blast radius: 12 servers lost out-of-band management visibility for ~3 days
  • Optimal resolution time: 15 minutes (check clock -> find firewall block -> add rule -> verify)
  • If every wrong choice was made: 60+ minutes plus security regressions from disabling TLS

Cross-References