Anti-Primer: AWS Route 53

Everything that can go wrong, will — and in this story, it does.

The Setup

The team is migrating DNS from an old provider to Route 53 during a planned maintenance window. The domain serves 10 million requests per day. The migration plan assumes DNS propagation is instant.

The Timeline

Hour 0: TTL Not Lowered Before Migration

The engineer migrates the DNS records without first lowering TTLs from 86400 seconds to 300. The deadline was looming, and this seemed like the fastest path forward. As a result, resolvers worldwide keep serving stale cached records for up to 24 hours, and half the traffic goes to the old servers.

Footgun #1: TTL Not Lowered Before Migration. Migrating DNS records without first lowering TTLs from 86400 to 300 leaves stale records cached worldwide for up to 24 hours; half the traffic goes to the old servers.
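The timing behind this footgun is simple arithmetic, sketched below under stated assumptions. The function names and the timestamp are hypothetical; the two TTL values come from the story. The idea is that a resolver which cached a record just before the TTL was lowered may keep serving it for the full old TTL, so the cutover should wait at least that long.

```python
from datetime import datetime, timedelta, timezone

OLD_TTL = 86400  # seconds (24 hours), the value in the story
NEW_TTL = 300    # seconds (5 minutes), the value in the story


def earliest_safe_cutover(ttl_lowered_at: datetime, old_ttl: int = OLD_TTL) -> datetime:
    """A resolver that cached the record just before the TTL was lowered
    may keep serving it for up to old_ttl seconds."""
    return ttl_lowered_at + timedelta(seconds=old_ttl)


def worst_case_stale_window(ttl_in_effect: int) -> timedelta:
    """Longest time a cached pre-cutover answer can outlive the cutover."""
    return timedelta(seconds=ttl_in_effect)


# Hypothetical moment at which the TTL was lowered at the old provider.
lowered = datetime(2024, 1, 1, 12, 0, tzinfo=timezone.utc)

print(earliest_safe_cutover(lowered))    # 2024-01-02 12:00:00+00:00
print(worst_case_stale_window(OLD_TTL))  # 1 day, 0:00:00
print(worst_case_stale_window(NEW_TTL))  # 0:05:00
```

With the TTL already at 300 seconds when the records move, the worst-case stale window shrinks from a day to five minutes.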

Nobody notices yet. The engineer moves on to the next task.

Hour 1: Missing Alias vs CNAME Distinction

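The engineer creates a CNAME for the zone apex (bare domain). Under time pressure, the team chose speed over caution. Route 53 rejects the record, because the apex already holds the zone's NS and SOA records and a CNAME cannot coexist with other records at the same name. The engineer creates an A record pointing to a single IP instead.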

Footgun #2: Missing Alias vs CNAME Distinction. Creating a CNAME for the zone apex (bare domain) fails because Route 53 rejects it; the engineer creates an A record pointing to a single IP instead.
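For contrast, here is a minimal sketch of what an Alias record at the apex might look like, written as the `ChangeBatch` dictionary that Route 53's ChangeResourceRecordSets API accepts. The helper name, the domain, the target DNS name, and the target hosted zone ID are all hypothetical, and the story does not say what the apex should actually point at; a load balancer is assumed here purely for illustration.

```python
def apex_alias_change(zone_apex: str, target_dns_name: str, target_zone_id: str) -> dict:
    """Build an UPSERT for an Alias A record at the zone apex.

    Alias records carry an AliasTarget instead of TTL and ResourceRecords.
    """
    return {
        "Changes": [{
            "Action": "UPSERT",
            "ResourceRecordSet": {
                "Name": zone_apex,
                "Type": "A",
                "AliasTarget": {
                    "HostedZoneId": target_zone_id,   # zone ID of the target, not of your own zone
                    "DNSName": target_dns_name,
                    "EvaluateTargetHealth": True,
                },
            },
        }]
    }


# Hypothetical values for illustration only.
batch = apex_alias_change(
    zone_apex="example.com.",
    target_dns_name="my-load-balancer.example-elb.amazonaws.com.",
    target_zone_id="ZEXAMPLETARGET",
)
record = batch["Changes"][0]["ResourceRecordSet"]
print(sorted(record))  # ['AliasTarget', 'Name', 'Type']
```

A dictionary in this shape could then be passed as `ChangeBatch` to the boto3 Route 53 client's `change_resource_record_sets` call, together with the `HostedZoneId` of the zone being edited.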

The first mistake is still invisible, making the next shortcut feel justified.

Hour 2: Health Check Not Configured

The engineer sets up failover routing without configuring a health check on the primary. Nobody pushed back because the shortcut looked harmless in the moment. When the primary goes down, Route 53 never fails over, and all traffic hits a dead endpoint.

Footgun #3: Health Check Not Configured. With failover routing set up but no health check on the primary, the primary goes down and Route 53 never fails over; all traffic hits a dead endpoint.
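A pre-flight check for this footgun could look roughly like the sketch below. It inspects record sets in the dictionary shape Route 53 uses (`Failover`, `HealthCheckId`, `AliasTarget`) and flags PRIMARY failover records that have no health signal. The function names, record names, IPs, and health check ID are hypothetical.

```python
def failover_can_trigger(record_set: dict) -> bool:
    """Return True if Route 53 has a health signal for this record:
    either an attached health check or an Alias with EvaluateTargetHealth."""
    if record_set.get("HealthCheckId"):
        return True
    alias = record_set.get("AliasTarget")
    return bool(alias and alias.get("EvaluateTargetHealth"))


def unprotected_primaries(record_sets: list[dict]) -> list[str]:
    """Names of PRIMARY failover records that can never be marked unhealthy."""
    return [
        rs["Name"]
        for rs in record_sets
        if rs.get("Failover") == "PRIMARY" and not failover_can_trigger(rs)
    ]


# Hypothetical record sets mirroring the mistake in the story.
records = [
    {"Name": "app.example.com.", "Type": "A", "SetIdentifier": "primary",
     "Failover": "PRIMARY", "TTL": 60,
     "ResourceRecords": [{"Value": "192.0.2.10"}]},           # no HealthCheckId
    {"Name": "app.example.com.", "Type": "A", "SetIdentifier": "secondary",
     "Failover": "SECONDARY", "TTL": 60,
     "ResourceRecords": [{"Value": "192.0.2.20"}]},
]
print(unprotected_primaries(records))  # ['app.example.com.']

# Attaching a (hypothetical) health check ID clears the warning.
records[0]["HealthCheckId"] = "hypothetical-health-check-id"
print(unprotected_primaries(records))  # []
```

Running a check like this against the planned record sets before applying them would have surfaced the missing health check while it was still cheap to fix.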

Pressure is mounting. The team is behind schedule and cutting more corners.

Hour 3: Hosted Zone ID Mismatch

The engineer updates the domain's NS records to point at the name servers of the wrong hosted zone. The team had gotten away with similar shortcuts before, so nobody raised a flag. All DNS resolution fails, and the domain is effectively offline for the TTL duration.

Footgun #4: Hosted Zone ID Mismatch. Updating the domain's NS records to point at the wrong hosted zone makes all DNS resolution fail; the domain is effectively offline for the TTL duration.
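One way to catch this before it bites is to compare the name servers set at the registrar with the ones assigned to the intended hosted zone. The sketch below does only the comparison; the function names and name server values are hypothetical. In practice the hosted zone side could come from the `DelegationSet.NameServers` field of Route 53's GetHostedZone response, and the registrar side from whatever lookup the registrar offers.

```python
def normalize(ns: str) -> str:
    """Compare name servers case-insensitively and without the trailing dot."""
    return ns.strip().rstrip(".").lower()


def delegation_mismatch(registrar_ns: list[str], hosted_zone_ns: list[str]) -> dict:
    """Report name servers present on one side of the delegation but not the other."""
    reg = {normalize(n) for n in registrar_ns}
    zone = {normalize(n) for n in hosted_zone_ns}
    return {
        "only_at_registrar": sorted(reg - zone),
        "only_in_hosted_zone": sorted(zone - reg),
    }


# Hypothetical name servers for illustration only.
registrar = ["ns-1.awsdns-old.example.", "ns-2.awsdns-old.example."]
hosted_zone = ["ns-1.awsdns-new.example.", "NS-2.awsdns-old.example"]

print(delegation_mismatch(registrar, hosted_zone))
# {'only_at_registrar': ['ns-1.awsdns-old.example'], 'only_in_hosted_zone': ['ns-1.awsdns-new.example']}
```

Two empty lists mean the registrar delegation and the hosted zone agree; anything else is worth resolving before the NS change goes live.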

By hour 3, the compounding failures have reached critical mass. Pages fire. The war room fills up. The team scrambles to understand what went wrong while the system burns.

The Postmortem

Root Cause Chain

| # | Mistake | Consequence | Could Have Been Prevented By |
|---|---------|-------------|------------------------------|
| 1 | TTL Not Lowered Before Migration | Stale records cached worldwide for 24 hours; half the traffic goes to the old servers | Primer: Lower TTLs days before migration; wait for old TTL to expire |
| 2 | Missing Alias vs CNAME Distinction | Route 53 rejects it; the engineer creates an A record pointing to a single IP instead | Primer: Use Alias records for zone apex; they work like CNAMEs but are allowed at apex |
| 3 | Health Check Not Configured | Primary goes down but Route 53 never fails over; all traffic hits a dead endpoint | Primer: Health checks are required for failover routing policies |
| 4 | Hosted Zone ID Mismatch | All DNS resolution fails; the domain is effectively offline for the TTL duration | Primer: Verify hosted zone NS records match the domain registrar delegation |

Damage Report

  • Downtime: 3-6 hours of degraded or unavailable cloud services
  • Data loss: Possible if storage or database resources were affected
  • Customer impact: API errors, failed transactions, or service unavailability for end users
  • Engineering time to remediate: 12-24 engineer-hours across incident response, root cause analysis, and remediation
  • Reputation cost: Internal trust erosion; potential AWS billing surprises; customer-facing impact report required

What the Primer Teaches

  • Footgun #1: If the engineer had read the primer's section on "TTL Not Lowered Before Migration", they would have learned to lower TTLs days before migration and wait for the old TTL to expire.
  • Footgun #2: If the engineer had read the primer's section on "Missing Alias vs CNAME Distinction", they would have learned to use Alias records for the zone apex; they work like CNAMEs but are allowed at the apex.
  • Footgun #3: If the engineer had read the primer's section on "Health Check Not Configured", they would have learned that health checks are required for failover routing policies.
  • Footgun #4: If the engineer had read the primer's section on "Hosted Zone ID Mismatch", they would have learned to verify that the hosted zone's NS records match the delegation at the domain registrar.

Cross-References