
AWS Route 53 Footguns

Mistakes that cause outages, failed migrations, silent failover failures, and DNS nightmares with Route 53.


1. CNAME at zone apex (the bare domain)

You try to create a CNAME record for example.com (no subdomain) pointing to your ALB. Route 53 rejects it. Standard DNS prohibits CNAME at the zone apex because CNAME means "this name is an alias for another name" — but the apex already has NS and SOA records, which would conflict.

Some people work around this by using a redirect service or by pointing an A record at a static IP. Both are fragile.

Fix: Use an alias record instead of CNAME. Alias records can be created at the zone apex, return the target's IP directly, and cost nothing for queries to AWS resources.

# Wrong: CNAME at apex (will fail)
# "Name": "example.com", "Type": "CNAME" ... REJECTED

# Right: alias record at apex
# Note: AliasTarget.HostedZoneId is the ALB's canonical hosted zone ID
# (Z35SXDOTRQ7X7K for us-east-1 ALBs), NOT your own hosted zone's ID
aws route53 change-resource-record-sets \
  --hosted-zone-id $ZONE_ID \
  --change-batch '{
    "Changes": [{
      "Action": "UPSERT",
      "ResourceRecordSet": {
        "Name": "example.com",
        "Type": "A",
        "AliasTarget": {
          "HostedZoneId": "Z35SXDOTRQ7X7K",
          "DNSName": "my-alb-1234.us-east-1.elb.amazonaws.com",
          "EvaluateTargetHealth": true
        }
      }
    }]
  }'

2. TTL too long before migration

You are migrating DNS from another provider to Route 53. Your A records have a TTL of 86400 (24 hours). You import all records into Route 53, update the NS records at the registrar, and announce the migration is complete. For the next 24 hours, some users still resolve using the old provider's cached records. If you have already decommissioned the old infrastructure, those users get errors.

Fix: At least 48 hours before the migration, lower all TTLs at the old provider to 60-300 seconds. Wait for the old TTL to expire. Then make the NS change. After migration is confirmed, raise TTLs to production values.

# Timeline:
# T-48h: Lower TTLs to 60s at old provider
# T-24h: Verify TTLs are low everywhere (dig from multiple resolvers)
# T-0:   Update NS records at registrar
# T+2h:  Verify resolution from multiple locations
# T+24h: Raise TTLs to production values (300-3600s)
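The T-24h verification can be scripted by parsing dig's answer section, where the second field is the TTL a resolver currently reports. A minimal offline sketch, using a canned answer line (hostname and IP are placeholders) in place of a live `dig example.com A +noall +answer` run against each resolver:

```shell
# Canned stand-in for one line of `dig example.com A +noall +answer`
# output; in a real check, run dig against several public resolvers.
answer="example.com.    86400   IN  A   203.0.113.10"

# Field 2 of a dig answer line is the TTL (in seconds).
ttl=$(echo "$answer" | awk '{print $2}')

if [ "$ttl" -gt 300 ]; then
  echo "TTL is ${ttl}s -- too high, do not cut over yet"
else
  echo "TTL is ${ttl}s -- safe to proceed"
fi
```

Run this per resolver (8.8.8.8, 1.1.1.1, your ISP's) and only proceed when every one reports the lowered TTL.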

3. Health check endpoint returning 200 when unhealthy

Your health check monitors /health on your application. The application has a bug: even when the database is down, the health endpoint returns HTTP 200 with {"status": "degraded"}. Route 53 sees 200 and considers the endpoint healthy. Failover never triggers.

Fix: Your health endpoint must return a non-2xx status code (e.g., 503) when the service is unable to handle requests. Or use string matching:

# Health check with string matching
aws route53 create-health-check \
  --caller-reference "strict-hc-$(date +%s)" \
  --health-check-config '{
    "Type": "HTTPS_STR_MATCH",
    "FullyQualifiedDomainName": "app.example.com",
    "Port": 443,
    "ResourcePath": "/health",
    "SearchString": "\"status\":\"ok\"",
    "FailureThreshold": 3,
    "RequestInterval": 10
  }'
# Healthy only if the first 5120 bytes of the body contain "status":"ok"

Test your failover before you need it. Intentionally break the health endpoint and verify that Route 53 switches to the secondary.
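The difference between a plain status check and string matching can be sketched offline. This hypothetical function mirrors the HTTPS_STR_MATCH rule: healthy only when the status code is 2xx AND the body contains the search string (the sample responses are made up):

```shell
# Rough simulation of Route 53 HTTPS_STR_MATCH semantics.
is_healthy() {
  local status="$1" body="$2" search='"status":"ok"'
  case "$status" in
    2*) echo "$body" | grep -q -F "$search" && return 0 ;;
  esac
  return 1
}

# The footgun: HTTP 200 with a "degraded" body. A plain status-code
# check would pass this; the string match correctly fails it.
is_healthy 200 '{"status":"degraded"}' && echo healthy || echo unhealthy  # unhealthy
is_healthy 200 '{"status":"ok"}'       && echo healthy || echo unhealthy  # healthy
```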


4. Private hosted zone not associated with VPC

You create a private hosted zone for internal.example.com and add records. EC2 instances in your VPC cannot resolve those records. You check the records — they exist. You check the zone — it is private. The problem: the private hosted zone is not associated with the VPC where the instances live.

# Check VPC associations
aws route53 get-hosted-zone --id Z0987654321XYZ \
  --query 'VPCs[].{Region:VPCRegion,VPC:VPCId}'

# If empty or missing your VPC:
aws route53 associate-vpc-with-hosted-zone \
  --hosted-zone-id Z0987654321XYZ \
  --vpc VPCRegion=us-east-1,VPCId=vpc-abc123

This also applies when you create a new VPC. Private hosted zones are not automatically associated with new VPCs — you must explicitly add each one.


5. NS record mismatch between registrar and hosted zone

You create a hosted zone in Route 53, which assigns four NS records. Then you delete the zone and recreate it (maybe to start fresh). The new zone gets a different set of NS records. But the registrar still has the old NS records pointing to the deleted zone. All DNS queries fail.

# Get NS records from Route 53
aws route53 get-hosted-zone --id $ZONE_ID \
  --query 'DelegationSet.NameServers[]' --output text

# Compare with what the registrar has
dig example.com NS +short

# These MUST match. If they do not, update the registrar.

Fix: After any hosted zone recreation, immediately check and update the NS records at the registrar. Keep a record of your NS delegation outside of Route 53.
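The comparison can be automated with `comm`. The two lists below are hypothetical stand-ins for the outputs of the `aws route53 get-hosted-zone` and `dig NS` commands above:

```shell
# Hypothetical delegation set from Route 53 (aws route53 get-hosted-zone).
zone_ns="ns-1.awsdns-01.org
ns-2.awsdns-02.com
ns-3.awsdns-03.net
ns-4.awsdns-04.co.uk"

# Hypothetical NS records the registrar publishes (dig NS +short).
registrar_ns="ns-1.awsdns-01.org
ns-9.awsdns-99.com
ns-3.awsdns-03.net
ns-4.awsdns-04.co.uk"

# comm needs sorted input; any line unique to either side is a mismatch.
mismatch=$(comm -3 <(sort <<<"$zone_ns") <(sort <<<"$registrar_ns"))
if [ -n "$mismatch" ]; then
  echo "NS MISMATCH -- update the registrar:"
  echo "$mismatch"
else
  echo "Delegation matches."
fi
```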


6. Alias vs CNAME cost difference

You have a high-traffic domain serving 100 million queries per month. You use CNAME records to point to your ALB. Each CNAME query costs money ($0.40 per million = $40/month for this traffic). If you switched to alias records, queries to AWS resources are free.

At 100M queries/month:
- CNAME: $40/month
- Alias: $0/month

Fix: Always use alias records when the target is an AWS resource (ALB, CloudFront, S3, API Gateway, etc.). Reserve CNAME for non-AWS targets.
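The arithmetic behind those numbers, as a quick sanity check (standard queries bill at $0.40 per million; alias queries to AWS resources are free):

```shell
queries=100000000  # 100M queries/month
awk -v q="$queries" 'BEGIN {
  printf "CNAME (standard) queries: $%.2f/month\n", q / 1000000 * 0.40
  print  "Alias queries to AWS resources: $0.00/month"
}'
# CNAME (standard) queries: $40.00/month
```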


7. Route 53 not propagating — health check is the real problem

You update a weighted or failover record. You verify the record exists in the hosted zone. But dig keeps returning the old IP. You wait. You check TTLs. You flush your local DNS cache. Still the old IP.

The actual cause: the record set has a health check attached, and the health check is failing. Route 53 excludes unhealthy records from responses. The record is there, but Route 53 is not serving it because it considers the endpoint down.

# Check if the record has a health check
aws route53 list-resource-record-sets \
  --hosted-zone-id $ZONE_ID \
  --query "ResourceRecordSets[?Name=='app.example.com.'].HealthCheckId"

# Check health check status
aws route53 get-health-check-status --health-check-id hc-abc123 \
  --query 'HealthCheckObservations[].{Region:Region,Status:StatusReport.Status}'

# Use test-dns-answer to see what Route 53 would return
aws route53 test-dns-answer \
  --hosted-zone-id $ZONE_ID \
  --record-name app.example.com \
  --record-type A

Fix: Always check health check status when debugging routing policy records. The test-dns-answer API is your best friend — it shows exactly what Route 53 would return.
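Reading test-dns-answer output is easier with jq. The sample below is a hypothetical response shaped like the API's output, so the filter can be tried without credentials; an empty RecordData alongside NOERROR is the signature of health checks excluding every record:

```shell
# Hypothetical sample shaped like `aws route53 test-dns-answer` output.
sample='{"Nameserver":"ns-123.awsdns-15.com","RecordName":"app.example.com","RecordType":"A","RecordData":["203.0.113.10"],"ResponseCode":"NOERROR","Protocol":"UDP"}'

# One line: response code plus whatever Route 53 would actually serve.
echo "$sample" | jq -r '"\(.ResponseCode): \(.RecordData | join(", "))"'
# NOERROR: 203.0.113.10
```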


8. Overlapping private hosted zones

You create a private hosted zone for internal.example.com in Account A and another for internal.example.com in Account B. Both are associated with the same VPC (via cross-account association). Now the VPC has two private hosted zones for the same domain. Route 53 Resolver picks one non-deterministically. Some records resolve, others do not, depending on which zone is consulted.

Fix: Never associate two private hosted zones with the same name to the same VPC. Use a single zone and share it across accounts via cross-account VPC association. If you need per-account records, use subdomains: account-a.internal.example.com, account-b.internal.example.com.

# List all private hosted zones and their VPC associations
aws route53 list-hosted-zones \
  --query 'HostedZones[?Config.PrivateZone==`true`].{Name:Name,Id:Id}' --output table

# For each zone, check VPC associations
aws route53 get-hosted-zone --id Z_ZONE_ID \
  --query 'VPCs[].{Region:VPCRegion,VPC:VPCId}'

9. DNSSEC breaking resolvers

You enable DNSSEC signing on your hosted zone and create the DS record at the registrar. Everything works. Six months later, the KMS key used for signing is scheduled for deletion (someone ran aws kms schedule-key-deletion on it as part of a cleanup). The key is deleted. Route 53 can no longer sign responses. DNSSEC-validating resolvers now return SERVFAIL for your entire domain because the signatures are invalid.

Non-validating resolvers still work fine, which makes this a nightmare to debug: "it works for me but not for them."

Fix: Protect the KMS key used for DNSSEC signing. Apply a key policy that prevents deletion. Set up CloudWatch alarms for KMS key state changes. Before disabling DNSSEC, remove the DS record at the registrar first and wait for propagation (at least 48 hours based on the DS TTL).

# Check DNSSEC signing status
aws route53 get-dnssec --hosted-zone-id $ZONE_ID

# To safely disable DNSSEC:
# 1. Remove DS record at registrar
# 2. Wait 48+ hours for DS TTL to expire
# 3. Then disable DNSSEC signing in Route 53
aws route53 disable-hosted-zone-dnssec --hosted-zone-id $ZONE_ID
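One way to protect the signing key, as suggested above, is a key-policy statement that denies deletion outright. A sketch (the Sid and exact action list are up to you; removing this statement becomes a deliberate extra step before anyone can schedule deletion):

```json
{
  "Sid": "PreventDnssecKeyDeletion",
  "Effect": "Deny",
  "Principal": { "AWS": "*" },
  "Action": [
    "kms:ScheduleKeyDeletion",
    "kms:DisableKey"
  ],
  "Resource": "*"
}
```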

10. Health check in wrong region (or checking the wrong thing)

You create a health check that monitors your endpoint from Route 53's health check locations. But your endpoint is behind a WAF or security group that only allows traffic from specific IPs. The health checkers cannot reach the endpoint, so the check always fails, and failover is permanently triggered.

Route 53 health checks originate from known IP ranges in multiple AWS regions. You must allow these IPs.

# Get Route 53 health checker IP ranges
curl -s https://ip-ranges.amazonaws.com/ip-ranges.json | \
  jq -r '.prefixes[] | select(.service=="ROUTE53_HEALTHCHECKS") | .ip_prefix'

# Add these to your security group or WAF allowlist

Another variant: you set up a health check monitoring a specific region's ALB (us-east-1-alb.internal.example.com) but the health checkers run from all regions. If the health check uses a private/internal DNS name that only resolves inside the VPC, the health checkers cannot resolve it.

Fix: Health check endpoints must be publicly reachable from Route 53's health checker IPs. If you cannot expose the endpoint publicly, use a CloudWatch alarm-based health check instead.
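A CloudWatch alarm-based health check sidesteps the reachability problem entirely: Route 53 tracks the alarm's state instead of probing the endpoint from outside. A sketch of the health-check-config (the alarm name and region are hypothetical):

```json
{
  "Type": "CLOUDWATCH_METRIC",
  "AlarmIdentifier": {
    "Region": "us-east-1",
    "Name": "app-5xx-rate-high"
  },
  "InsufficientDataHealthStatus": "LastKnownStatus"
}
```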