The Datacenter Exit
Category: The Migration · Domains: datacenter, cloud-ops · Read time: ~5 min
Setting the Scene
I was the infrastructure lead at a media company when the landlord dropped the bomb: our datacenter lease wasn't being renewed. We had six months to vacate. The facility was a converted warehouse in New Jersey with 400 servers across 22 racks, 150 services, and a 10Gbps uplink. Some of those servers had been running since 2016. A few had handwritten labels. One just said "DO NOT TOUCH — FRANK."
Frank left in 2019.
We chose AWS as the destination. Our CTO gave me a team of four and said, "Make it happen." The project plan looked great in a slide deck: 30 services per month, phased cutover, parallel running. The reality was different.
What Happened
Month 1 — Inventory. We used nmap to scan every subnet and found 487 IP addresses responding, not the 400 servers in our asset spreadsheet. 87 mystery IPs. Some were IPMI interfaces. Some were network gear. Twelve were VMs on hypervisors nobody remembered provisioning. Three were physical servers in a rack labeled "decomm" that were still serving production traffic.
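The reconciliation step that surfaced those 87 mystery IPs can be sketched like this. The parsing assumes nmap's grepable output format (`-sn -oG -`); the sample scan lines and spreadsheet entries are hypothetical stand-ins:

```python
import re

# Sample of nmap grepable output (nmap -sn -oG - 10.0.0.0/24);
# in practice this would be read from the saved scan file.
scan_output = """\
Host: 10.0.0.5 ()\tStatus: Up
Host: 10.0.0.9 ()\tStatus: Up
Host: 10.0.0.41 ()\tStatus: Up
"""

# IPs from the asset spreadsheet (hypothetical sample).
spreadsheet = {"10.0.0.5", "10.0.0.9"}

responding = set(re.findall(r"Host: (\S+) .*Status: Up", scan_output))

mystery = responding - spreadsheet   # answering the network, absent from records
ghosts = spreadsheet - responding    # on the books, not answering

print(sorted(mystery))  # each one needs investigation before any cutover plan
```

In our case the `mystery` set held IPMI interfaces, network gear, phantom VMs, and three "decommissioned" servers still taking production traffic.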
Month 2 — We migrated the first 40 services — all stateless web frontends. Used AWS Application Migration Service (formerly CloudEndure) for the lift-and-shift. Tested each one with synthetic traffic, cut DNS, monitored for a week. This was the easy part and we knew it.
Month 3 — Dependencies started biting. Service A talked to Service B over a hardcoded 10.0.x.x IP that didn't exist in AWS VPC. We set up a site-to-site VPN with a Direct Connect backup for the transition period. Latency across the VPN was 8ms — fine for APIs, brutal for the service that did 200 sequential database queries per request. That service couldn't migrate until its database did.
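The latency math is worth making explicit, because it's the per-request cost that scales with sequential round trips, not the link speed:

```python
# Back-of-envelope cost of chatty traffic across the 8 ms VPN link.
vpn_rtt_ms = 8
queries_per_request = 200  # sequential, so each one pays the full round trip

added_latency_ms = vpn_rtt_ms * queries_per_request
print(added_latency_ms)  # 1600 ms of pure network wait added to every request
```

An 8 ms link that is "fine for APIs" adds 1.6 seconds to a request that makes 200 sequential queries, which is why that service had to wait for its database.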
Month 4 — The database tier. Two PostgreSQL clusters (primary-replica, 2TB each), one MySQL instance (600GB), one Redis cluster (128GB), and a Solr index that nobody wanted to own. We used AWS DMS for PostgreSQL with CDC replication. The MySQL migration was clean. The Solr index had to be rebuilt from scratch because the schema had been modified through the admin UI and nobody had exported the config.
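For the DMS runs, the table-mapping rules are the piece worth showing. This is a minimal sketch of the JSON DMS accepts for a selection rule (passed to the replication task as its table mappings); the `public` schema name is a hypothetical stand-in for whatever your source database uses:

```python
import json

# DMS table-mapping rules: replicate every table in the (hypothetical)
# "public" schema. A task created with migration type "full-load-and-cdc"
# does the initial copy and then streams ongoing changes via CDC.
table_mappings = {
    "rules": [
        {
            "rule-type": "selection",
            "rule-id": "1",
            "rule-name": "include-public",
            "object-locator": {"schema-name": "public", "table-name": "%"},
            "rule-action": "include",
        }
    ]
}

print(json.dumps(table_mappings, indent=2))
```

Keeping these rules in version control would have helped with the Solr problem too: the schema changes made through the admin UI were never exported anywhere, so the index had to be rebuilt from scratch.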
Month 5 — The long tail. A service that talked to a hardware security module (HSM) bolted into rack 14. A legacy print queue server that three clients still used. An FTP server receiving EDI files from a partner who refused to switch to SFTP. A custom SNMP monitoring box running Cacti. Each one was a negotiation: migrate, replace, retire, or get an exception.
Month 6 — Panic mode. 18 services still in the datacenter. We extended the lease by 45 days at 3x the monthly rate. The last service migrated on day 38 of the extension. It was Frank's server. It ran a Perl script that reconciled invoices. We rewrote it in Python in two days. We should have done that in month 1.
The Moment of Truth
Month 3, discovering that our "150 services" were actually 150 services plus 40 undocumented dependencies, 12 phantom VMs, and a hardware appliance that required a forklift to remove. The project plan assumed we knew what we had. We didn't. The inventory phase should have been month 1 AND month 2.
The Aftermath
We vacated the datacenter on day 43 of the 45-day extension. The decommissioning crew found two servers we'd missed — they were powered off and behind a cable management panel. Neither was running anything, thankfully. Our AWS bill was higher than expected (see: "The Cloud Bill Surprise"), but we no longer had a $45,000/month cage lease. Total migration cost including the lease extension, contractor help, and Direct Connect: about $280,000.
The Lessons
- Unknown dependencies are the real risk: You don't know what you have until you scan everything. Asset spreadsheets are always wrong. Network scans and traffic analysis tell the truth.
- Legacy services have no documentation: If the person who built it is gone, budget 3x the time. You'll spend more time understanding it than migrating it.
- Always have a "long tail" budget: The last 10% of services take 40% of the time and budget. Plan for a lease extension, a cleanup crew, and services that resist migration.
What I'd Do Differently
I'd spend the entire first month on discovery: nmap scans, traffic captures with tcpdump, dependency mapping with netstat output from every server. I'd build a dependency graph before writing a single migration plan. And I'd identify "Frank's server" equivalents on day 1 — the undocumented, single-person-knowledge services — and either rewrite them early or negotiate their retirement with stakeholders before the deadline pressure hits.
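That dependency graph can be built from not much more than established-connection pairs. A minimal sketch, assuming the netstat output has already been reduced to (server, remote IP) pairs — all hostnames and IPs here are hypothetical:

```python
from collections import defaultdict

# (local_server, remote_ip) pairs taken from ESTABLISHED TCP
# connections in each server's netstat output (hypothetical sample).
connections = [
    ("web-01", "10.0.3.17"),   # web frontend -> app tier
    ("web-01", "10.0.4.8"),    # web frontend -> ???
    ("app-01", "10.0.5.22"),   # app tier -> database
]

# Map known IPs back to service names; anything unmapped is a
# dependency the asset spreadsheet never knew about.
ip_to_service = {"10.0.3.17": "app-01", "10.0.5.22": "pg-primary"}

graph = defaultdict(set)
unknown = set()
for src, dst_ip in connections:
    dst = ip_to_service.get(dst_ip)
    if dst is None:
        unknown.add(dst_ip)   # flag for investigation before planning
    else:
        graph[src].add(dst)

print(dict(graph))  # migration ordering falls out of this graph
print(unknown)      # undocumented dependencies, found on day 1 instead of month 3
```

Walking the graph bottom-up (databases before the services that query them) gives a migration order; the `unknown` set is where the Frank's-server equivalents hide.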
The Quote
"The asset spreadsheet said 400 servers. The network said 487. The network was right."
Cross-References
- Topic Packs: Datacenter, Cloud Ops Basics, Legacy Archaeology
- Case Studies: Datacenter Ops