
The DNS Provider Switch

Category: The Migration · Domains: dns, networking · Read time: ~5 min


Setting the Scene

I was one of two SREs at a 200-person company running a SaaS product for real estate agents. Our DNS was on an old Dyn (now Oracle Cloud) account that had been set up by a co-founder who was no longer with the company. The account credentials were in a sticky note photo in a Slack DM from 2020. We were migrating to AWS Route 53 for tighter integration with our AWS infrastructure and because the Dyn contract renewal was 4x the previous price.

We had 6 public zones and what I thought was a simple migration. Export zones from Dyn, import to Route 53, update registrar nameservers. A weekend project.

What Happened

Friday 3:00 PM — I exported all 6 zones from Dyn as BIND zone files. Each one had 40-80 records: A records, CNAMEs, MX, TXT for SPF/DKIM, a few SRV records. I wrote a Python script to convert the BIND format into the change-batch JSON that aws route53 change-resource-record-sets expects, imported all 6 zones, and verified the record counts matched. Everything looked clean.
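The conversion step can be sketched as follows. This is not the original script, and the zone-file parsing here is deliberately naive — it assumes fully qualified names, one record per line, and no $ORIGIN/$TTL directives or multi-line records:

```python
import json
from collections import defaultdict

def bind_to_change_batch(zone_text: str) -> dict:
    """Convert simple BIND-format records to a Route 53 change batch.

    Simplified sketch: assumes fully qualified names, one record per
    line, and no $ORIGIN/$TTL directives or multi-line records.
    """
    # Route 53 wants ONE record set per (name, type) pair, so multiple
    # values (e.g. two A records for one name) must be grouped together.
    grouped = defaultdict(lambda: {"ttl": 300, "values": []})
    for line in zone_text.splitlines():
        line = line.split(";")[0].strip()   # drop comments and blanks
        if not line:
            continue
        name, ttl, _cls, rtype, rdata = line.split(None, 4)
        grouped[(name, rtype)]["ttl"] = int(ttl)
        grouped[(name, rtype)]["values"].append(rdata)

    changes = []
    for (name, rtype), rec in sorted(grouped.items()):
        changes.append({
            "Action": "UPSERT",
            "ResourceRecordSet": {
                "Name": name,
                "Type": rtype,
                "TTL": rec["ttl"],
                "ResourceRecords": [{"Value": v} for v in rec["values"]],
            },
        })
    return {"Changes": changes}

zone = """\
www.example.com.   3600 IN A     192.0.2.10
www.example.com.   3600 IN A     192.0.2.11
mail.example.com.  3600 IN MX    10 mx1.example.com.
"""
batch = bind_to_change_batch(zone)
print(json.dumps(batch, indent=2))
```

The output feeds into aws route53 change-resource-record-sets --change-batch file://batch.json. The grouping step is where a naive line-by-line conversion goes wrong: Route 53 rejects or mangles imports that submit the same name/type as two separate record sets, which is one reason "record counts matched" is a weaker check than it sounds.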

Friday 5:00 PM — I updated the nameservers at our registrar (Namecheap) to point to Route 53's NS records. DNS propagation would take up to 48 hours, but most resolvers would pick up the change within a few hours. I went home feeling good.

Saturday 9:00 AM — Slack messages from the engineering team. "Our staging environment is down." And "Can't reach the internal wiki." And "Jenkins is returning 503s." I checked — all public-facing services were fine. The staging environment, wiki, and Jenkins were all internal services.

Saturday 9:15 AM — I realized what had happened. We had a 7th zone: internal.ourcompany.com. It was a split-horizon DNS zone on Dyn that resolved internal service IPs (10.x.x.x addresses) for our office VPN and developer machines. It wasn't in any documentation. It wasn't in the migration plan. I didn't know it existed because I'd only looked at the zones in the Dyn web UI's "public zones" tab.

Saturday 9:30 AM — With the nameserver change, DNS queries for internal.ourcompany.com were now hitting Route 53, which had no such zone. NXDOMAIN for everything. Staging was at staging.internal.ourcompany.com. Jenkins was at ci.internal.ourcompany.com. The wiki was at wiki.internal.ourcompany.com. All dead.

Saturday 10:00 AM — I logged into the Dyn account (sticky note photo still in Slack, thankfully) and exported the internal zone. 94 records. All A records pointing to RFC 1918 addresses. I imported it to Route 53 as a public zone, which felt wrong but worked — the 10.x.x.x addresses would only resolve from machines that could reach those IPs anyway.

Saturday 10:30 AM — Internal services recovered as DNS propagated. But three developers had already hardcoded IP addresses into their /etc/hosts files as a workaround, which caused confusion for another week when those IPs changed during an unrelated maintenance.

Sunday — I used dig +trace and wrote a zone comparison script: query every record from both Dyn (via cached resolvers that still held the old NS) and Route 53, then diff the results. It found two more discrepancies — a CNAME I'd imported with the wrong TTL (300 instead of 3600) and an MX record with a missing trailing dot that Route 53 silently normalized but that Dyn had been serving broken.
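The core of that comparison script is a per-(name, type) diff. A minimal sketch of the logic, assuming the answers from each provider's nameservers have already been collected (e.g. with dig @old-ns and dig @new-ns) into dicts — the sample data below is illustrative, not the real zone:

```python
def diff_zones(old: dict, new: dict) -> list:
    """Compare per-record answers from two DNS providers.

    Keys are (name, rtype) tuples; values are (ttl, tuple of rdata).
    Returns human-readable discrepancy lines.
    """
    problems = []
    for key in sorted(set(old) | set(new)):
        if key not in new:
            problems.append(f"MISSING in new: {key}")
        elif key not in old:
            problems.append(f"EXTRA in new: {key}")
        elif old[key] != new[key]:
            problems.append(f"MISMATCH {key}: old={old[key]} new={new[key]}")
    return problems

old = {
    ("www.example.com.", "A"):         (3600, ("192.0.2.10",)),
    ("api.example.com.", "CNAME"):     (3600, ("www.example.com.",)),
    ("ci.internal.example.com.", "A"): (300,  ("10.0.4.12",)),
}
new = {
    ("www.example.com.", "A"):         (3600, ("192.0.2.10",)),
    ("api.example.com.", "CNAME"):     (300,  ("www.example.com.",)),  # wrong TTL
    # internal zone never imported, so its records are missing entirely
}
for p in diff_zones(old, new):
    print(p)
```

Comparing (ttl, rdata) tuples rather than just rdata is what catches the wrong-TTL class of bug; a diff on record names alone would have passed the CNAME above.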

The Moment of Truth

Saturday morning, staring at dig ci.internal.ourcompany.com returning NXDOMAIN and realizing there was an entire DNS zone I didn't know about. The internal zone had been set up three years ago by someone who'd since left. It wasn't in the Dyn dashboard's main view. It wasn't in our runbooks. 94 records, invisible until they broke.

The Aftermath

Everything stabilized by Monday. I documented all 7 zones in our infrastructure wiki with record counts, owners, and purpose. I set up a weekly cron that ran dig against every record in every zone and compared the results to a known-good baseline. Any drift generated a Slack alert. Six months later, it caught a developer who'd accidentally deleted an MX record through the Route 53 console.

The Lessons

  1. Audit ALL zones before migrating: Don't trust the UI. Don't trust the documentation. Export everything, including internal, private, and split-horizon zones. Use the provider's API to list all zones programmatically.
  2. Automated zone diffing: Before and after cutover, compare every record between old and new providers. dig both, diff the output. Do this for at least a week after migration.
  3. Shadow traffic before cutover: Point a test resolver at the new nameservers and run your full application test suite against it before changing the registrar. Catch missing zones in testing, not in production.

What I'd Do Differently

I'd use dyn-cli or the Dyn API to list every zone on the account, not rely on the web UI. I'd run a DNS discovery audit using netstat and application configs to find every hostname our infrastructure actually resolves — then cross-reference against the zone list. And I'd set up Route 53 health checks for every critical hostname before cutting over, so I'd get an alert within 60 seconds of a resolution failure instead of finding out from Slack.
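The config-side half of that discovery audit can be a simple extract-and-subtract: pull every hostname under your own domains out of raw config text, then subtract the zone export. A sketch — the domain list, regex, and sample configs below are illustrative, not from the real environment:

```python
import re

COMPANY_DOMAINS = ("ourcompany.com",)  # illustrative domain list

def hostnames_in_configs(texts: list) -> set:
    """Extract hostnames under our domains from raw config text."""
    pattern = re.compile(r"[A-Za-z0-9.-]+\.(?:%s)" % "|".join(
        re.escape(d) for d in COMPANY_DOMAINS))
    found = set()
    for text in texts:
        found.update(m.group(0).lower().rstrip(".")
                     for m in pattern.finditer(text))
    return found

configs = [
    "JENKINS_URL=https://ci.internal.ourcompany.com/",
    "wiki_host: wiki.internal.ourcompany.com",
    "api_base = https://api.ourcompany.com/v2",
]
# What the provider's export actually contained (public zones only).
exported_records = {"api.ourcompany.com", "www.ourcompany.com"}

used = hostnames_in_configs(configs)
unaccounted = used - exported_records
print(sorted(unaccounted))  # hostnames apps resolve that the export lacks
```

Anything in the unaccounted set is a hostname the infrastructure depends on that the migration plan doesn't cover — exactly the shape of the internal.ourcompany.com gap, and a check that takes minutes to run before touching the registrar.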

The Quote

"The zone that breaks your migration is the one that's not in the migration doc."

Cross-References