
DNS: The Eternal Enemy

Category: The Incident · Domains: dns, networking · Read time: ~5 min


Setting the Scene

I was a junior SRE at a healthcare SaaS company, about 500 employees. We were migrating our main patient portal from an aging datacenter to AWS. The migration had been planned for months — new infrastructure was tested, load-tested, security-reviewed. The cutover plan was simple: update DNS A records from the old datacenter IPs to the new ALB. We scheduled it for Saturday at 6 AM to minimize impact. I volunteered to run the migration because I wanted the experience. I got it.

What Happened

Saturday 6:00 AM — I update the A records in Route 53. TTL on the old records was 86400 seconds. That's 24 hours. I didn't check this beforehand. Nobody checked this beforehand.
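The check we skipped is trivial: `dig +noall +answer` prints the remaining TTL as the second field of each answer line. A minimal sketch of what that looks like in Python (the hostname and the sample answer line are illustrative, not our real records):

```python
import subprocess

def parse_ttls(dig_answer: str) -> list[int]:
    """Pull the TTL field (second column) out of `dig +noall +answer` output."""
    ttls = []
    for line in dig_answer.strip().splitlines():
        fields = line.split()
        if len(fields) >= 4:  # name, TTL, class, type, rdata...
            ttls.append(int(fields[1]))
    return ttls

def current_ttl(name: str) -> int:
    """Query the A records for `name` and return the highest TTL seen."""
    out = subprocess.run(
        ["dig", "+noall", "+answer", name, "A"],
        capture_output=True, text=True, check=True,
    ).stdout
    return max(parse_ttls(out))

# A record still on a 24-hour TTL (answer line is illustrative):
# parse_ttls("portal.healthco.example.com. 86400 IN A 203.0.113.10")  # -> [86400]
```

Thirty seconds of this before the cutover would have told us the whole story.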

6:05 AM — I verify from my laptop: dig portal.healthco.example.com returns the new IPs. Great, it's working! I post in Slack: "Migration complete, DNS propagated, all looks good." I am an idiot.

6:15 AM — My laptop resolves the new IPs only because I'd been hitting the portal all week during testing, so my local resolver's cached record had just expired and fetched fresh. Everyone else in the world still has the old IPs cached for up to 24 hours.

8:30 AM — First customer ticket: "portal is down." Then another. Then forty more. I check — the old datacenter is still running, so customers hitting old IPs are getting the old app, which we'd already pointed at read-only database replicas as part of the migration. They can log in but can't save anything. Some users are hitting the new infra and working fine. It's chaos.

9:00 AM — My manager asks why some customers work and others don't. I realize the TTL problem. We can't lower the TTL retroactively — resolvers already have the old records cached. We can only wait.

9:30 AM — We scramble to restore write access on the old datacenter by pointing it back at the primary database. Now we have two live environments writing to the same database. This works but it's terrifying.

6:00 PM — Twelve hours later, most resolvers have refreshed. By Sunday morning, traffic to old IPs drops to near zero. We finally decommission the old environment Monday morning.

The Moment of Truth

When I ran dig from my machine and saw the right answer, I declared victory. I verified from inside my own bubble. The lesson branded itself into my brain: DNS changes must be verified from outside your network, from multiple geographic locations, and you must understand what TTL the rest of the world is working with.

The Aftermath

We wrote a pre-migration checklist that starts with "lower TTL to 300 seconds, 48 hours before migration." We added external DNS verification: dig @8.8.8.8, dig @1.1.1.1, and a quick script querying DNS from three AWS regions. I also learned that "migration complete" is never something you say five minutes after changing DNS.
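The multi-resolver check can be sketched as a short script. The resolver IPs below are real public resolvers; the hostname and expected IPs are illustrative:

```python
import subprocess

RESOLVERS = ["8.8.8.8", "1.1.1.1", "9.9.9.9"]  # Google, Cloudflare, Quad9

def query(name: str, resolver: str) -> set[str]:
    """Ask one specific resolver for the A records of `name` via dig +short."""
    out = subprocess.run(
        ["dig", "+short", f"@{resolver}", name, "A"],
        capture_output=True, text=True, check=True,
    ).stdout
    return set(out.split())

def resolvers_agree(answers: dict[str, set[str]], expected: set[str]) -> bool:
    """True only when every resolver returned exactly the expected IP set."""
    return all(ips == expected for ips in answers.values())

# Usage (illustrative hostname and IPs) — only declare victory when all agree:
# answers = {r: query("portal.healthco.example.com", r) for r in RESOLVERS}
# assert resolvers_agree(answers, {"203.0.113.10", "203.0.113.11"})
```

Note this still only checks the resolvers you query; it can't see into every corporate or ISP cache, which is why lowering TTL beforehand remains the real fix.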

The Lessons

  1. Lower TTL before migration: At least 48 hours before a DNS cutover, drop TTL to 60-300 seconds. This is non-negotiable.
  2. DNS propagation is not instant: Resolvers worldwide cache based on TTL. Your local resolver is not representative of anything but your own machine.
  3. Verify from outside your network: Use external resolvers, check from multiple regions, and don't trust a single dig result from your laptop.

What I'd Do Differently

I'd build a pre-migration verification script that checks the current TTL of every record being changed and refuses to proceed if any TTL is above 300 seconds. I'd also run a blue-green DNS pattern — keeping both old and new endpoints fully functional and writeable for at least 48 hours after the cutover, with a clear rollback path.
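That verification script could be sketched roughly as follows; the 300-second threshold comes from our checklist, while the record names and `dig` plumbing are illustrative:

```python
import subprocess

MAX_TTL = 300  # refuse to proceed if any record's TTL exceeds this

def record_ttl(name: str) -> int:
    """Return the highest TTL among the A answers for `name` (shells out to dig)."""
    out = subprocess.run(
        ["dig", "+noall", "+answer", name, "A"],
        capture_output=True, text=True, check=True,
    ).stdout
    # dig answer lines look like: "name. 86400 IN A 203.0.113.10"
    return max(int(line.split()[1]) for line in out.strip().splitlines())

def preflight(ttls: dict[str, int], max_ttl: int = MAX_TTL) -> list[str]:
    """Return the records whose TTL is still too high; an empty list means go."""
    return [name for name, ttl in sorted(ttls.items()) if ttl > max_ttl]

# Usage (illustrative record list):
# offenders = preflight({r: record_ttl(r) for r in ["portal.healthco.example.com"]})
# if offenders:
#     raise SystemExit(f"ABORT: TTL above {MAX_TTL}s on: {', '.join(offenders)}")
```

Wiring this into the cutover runbook as a hard gate means nobody can repeat my Saturday, regardless of how confident they feel at 6 AM.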

The Quote

"It's not DNS. There's no way it's DNS. It was DNS."

Cross-References