The Single Point of Failure

Category: The Hard Lesson
Domains: architecture, high-availability
Read time: ~5 min


Setting the Scene

We had a server called legacy-core-01. It was a Dell PowerEdge R730, 4 years old, running in rack 14 of our on-prem datacenter. It ran the LDAP directory, the internal DNS resolver, the NTP server, the IPAM database, and the deployment orchestrator. Five critical services on one box. "It's never gone down," our infrastructure lead said every time someone suggested we should build redundancy. The HA project kept getting deprioritized because the server kept running.

The uptime counter read 1,247 days. We were proud of that number. We shouldn't have been.

What Happened

On a Wednesday morning at 9:14 AM, the RAID controller on legacy-core-01 hit a firmware bug that simultaneously marked every drive in the RAID-6 array as failed. The server kernel-panicked and went dark. The iDRAC was still reachable, but the OS wouldn't boot — it couldn't find its root filesystem.

Within 30 seconds, everything started cascading. Internal DNS stopped resolving, so services that relied on DNS for service discovery began throwing connection errors. The deployment orchestrator was gone, so we couldn't push config changes. LDAP was down, so engineers couldn't SSH into servers — their authentication depended on LDAP groups. NTP was gone, so Kerberos tickets started failing on servers whose clocks drifted.
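That cascade is just a graph traversal: everything transitively downstream of the failed host is in the blast radius. A minimal sketch — the service names and dependency edges here are illustrative, not our real inventory:

```python
# Hypothetical dependency map: an edge A -> [B, ...] means
# "B depends on A", so A failing takes B down with it.
deps = {
    "legacy-core-01": ["dns", "ldap", "ntp", "ipam", "deploy"],
    "dns": ["service-discovery", "monitoring"],
    "ldap": ["ssh-auth", "dc-portal"],
    "ntp": ["kerberos"],
    "service-discovery": ["app-tier"],
}

def blast_radius(node):
    """Everything transitively downstream of a failed node."""
    seen, stack = set(), [node]
    while stack:
        for child in deps.get(stack.pop(), []):
            if child not in seen:
                seen.add(child)
                stack.append(child)
    return seen

print(sorted(blast_radius("legacy-core-01")))
```

Running this over even a toy map makes the point: one box at the root of the graph takes out eleven downstream services, including things (like the datacenter portal) that nobody thought of as depending on it.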

We couldn't even get into the datacenter management portal, because it authenticated against the same LDAP. Someone had to drive to the datacenter with a local admin password written on a sticky note in our team lead's desk drawer.

By 10:00 AM, we had lost authentication, DNS, time synchronization, IP address management, and deployment capabilities. We were managing infrastructure through console cables and static IPs like it was 2003.

The hardware team replaced the RAID controller at 2:30 PM. The array rebuilt from parity over the next three hours. We were fully operational at 9:47 PM — 12 hours and 33 minutes after the initial failure.

The Moment of Truth

During the outage, our CTO asked me for the business continuity plan for legacy-core-01. I said we didn't have one. He asked what the disaster recovery plan was. I said "fix the server." He asked what happens if the server is unrecoverable. I didn't have an answer.

The Aftermath

We spent the next quarter splitting those five services across redundant pairs. LDAP got a multi-master replica. DNS moved to three resolvers behind a VIP. NTP was distributed to four stratum-2 servers. The deployment orchestrator moved to a Kubernetes cluster. Total cost: about $45,000 in hardware and 6 weeks of engineering time. The 12-hour outage had cost an estimated $380,000 in lost productivity across 200 engineers.
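The cost comparison is back-of-the-envelope arithmetic. The $150/hour loaded engineer rate below is an assumption for illustration — the article only gives the $380K total — but it shows how little outage time it takes to pay for the redundancy:

```python
# Downtime cost vs. HA cost, sketched. LOADED_RATE is assumed;
# the engineer count, outage duration, and HA spend are from the article.
ENGINEERS = 200
OUTAGE_HOURS = 12 + 33 / 60      # 12 hours 33 minutes
LOADED_RATE = 150                # assumed $/engineer-hour
HA_COST = 45_000                 # redundant hardware + engineering

outage_cost = ENGINEERS * OUTAGE_HOURS * LOADED_RATE
cost_per_hour = ENGINEERS * LOADED_RATE

print(f"estimated outage cost: ${outage_cost:,.0f}")
print(f"HA spend recovered after {HA_COST / cost_per_hour:.1f} hours of avoided outage")
```

At these assumed rates the organization burns $30K per hour of outage, so the entire HA project pays for itself in the first 90 minutes of the next incident.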

The Lessons

  1. If it hasn't failed, you haven't been running it long enough: High uptime doesn't mean high reliability. It means you've been lucky, and luck is not an architecture.
  2. HA is insurance, not luxury: Redundancy feels expensive until you calculate the cost of the outage. Then it feels like the bargain of the century.
  3. The cost of downtime always exceeds the cost of redundancy: $45K in HA infrastructure vs. $380K in one outage. The math was never hard — we just never did it.

What I'd Do Differently

I'd run a "what if this server dies right now" exercise for every piece of infrastructure, starting with the oldest and most loaded. I'd catalog every service's single points of failure in a spreadsheet with columns for "blast radius" and "recovery time." Anything with a blast radius above 50 users or recovery time above 1 hour gets an HA project in the next quarter. No exceptions, no deprioritization.
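That catalog doesn't need to be fancy. A minimal sketch of the triage rule — the thresholds (50 users, 1 hour) are from the text above; the service data here is made up:

```python
# Hypothetical SPOF catalog: (service, blast_radius_users, recovery_hours).
# Anything over 50 users of blast radius or 1 hour of recovery time
# gets flagged for an HA project.
services = [
    ("internal-dns", 200, 12.5),
    ("wiki",          40,  0.5),
    ("ldap",         200, 12.5),
    ("build-cache",   60,  0.25),
]

needs_ha = [
    name for name, users, recovery_hours in services
    if users > 50 or recovery_hours > 1
]
print(needs_ha)  # the HA backlog for next quarter
```

The point of writing the rule down as code (or a spreadsheet formula) is that it removes the judgment call — and with it, the deprioritization.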

The Quote

"1,247 days of uptime wasn't a success metric. It was a measure of how long we'd been gambling."

Cross-References