
Thinking Out Loud: DNS Ops

A senior SRE's internal monologue while working through a real DNS issue. This isn't a tutorial — it's a window into how experienced engineers actually think.

The Situation

Users in the EU region report intermittent "site not found" errors when accessing our API at api.example.com. US users are unaffected. The errors started about 20 minutes ago. No DNS changes were made today.

The Monologue

Intermittent DNS failures in one region but not another. That rules out a total DNS zone failure. This is either: a regional resolver issue, a TTL-related propagation problem, or a delegation/DNSSEC issue affecting specific nameservers.

Let me first verify what EU users are actually seeing by testing from multiple locations.

dig api.example.com @8.8.8.8 +short
dig api.example.com @1.1.1.1 +short

Both return our IP. But that's from my location (US), and those resolvers may be serving cached answers. Let me query each of our authoritative nameservers directly; if one of them is misbehaving, only the resolvers that happen to pick it will fail.

dig api.example.com @ns-1234.awsdns-12.co.uk +short
dig api.example.com @ns-567.awsdns-34.net +short

The UK nameserver returns the correct IP. The .net nameserver returns... SERVFAIL. There it is. One of our authoritative nameservers is returning SERVFAIL. Users whose resolvers happen to query that nameserver get failures.

Mental Model: DNS Failure is Probabilistic

When you have multiple authoritative nameservers and one is failing, users experience intermittent failures — not a total outage. A resolver picks one of the listed nameservers for each query, so if 1 of 4 nameservers is broken, roughly 25% of first attempts fail; retries against another nameserver soften that, but not to zero. This makes DNS issues notoriously hard to reproduce because "it works for me" depends on which nameserver your resolver picks.
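A quick back-of-envelope makes the point. The counts here are this incident's (4 nameservers, 1 broken); the arithmetic is just the model above, plus what one retry against a fresh pick does to the rate:

```shell
# Hypothetical numbers: 4 authoritative nameservers, 1 returning SERVFAIL.
total=4
broken=1
awk -v n="$total" -v b="$broken" 'BEGIN {
  p = b / n                                      # chance one query hits the broken NS
  printf "single-query failure rate: %.2f%%\n", p * 100
  printf "failure rate with one retry: %.2f%%\n", p * p * 100
}'
```

Real resolvers retry and also remember slow or broken nameservers, so the user-visible rate drifts below 1/N over time — which is exactly why the reports were intermittent rather than a clean 25%.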

SERVFAIL from one nameserver. Let me check if this is a Route 53 issue or something else. Let me look at the SOA and NS records.

dig api.example.com SOA +trace

The trace shows the delegation is correct. All four NS records point to Route 53 nameservers. But one is returning SERVFAIL. Let me check if it's a Route 53 health check issue.

aws route53 get-hosted-zone --id Z1234567890ABC --output json | jq '.HostedZone'
aws route53 list-resource-record-sets --hosted-zone-id Z1234567890ABC --query "ResourceRecordSets[?Name=='api.example.com.']" --output json

The record is a weighted routing policy with health checks. Two endpoints: us-east-1 and eu-west-1. Let me check the health check status.

aws route53 get-health-check-status --health-check-id hc-eu-west-12345 --output json | jq '.HealthCheckObservations[].StatusReport'

The EU health check is reporting the target as unhealthy. "Connection timed out." So Route 53 thinks the EU endpoint is down and is removing it from DNS responses. But the health check is failing, not the service itself.

Let me check if the EU service is actually up.

curl -v --connect-timeout 5 https://eu-api.example.com/health

200 OK. The service is fine. So why is the Route 53 health checker failing? The health checker runs from AWS IP ranges. Let me check if there's a firewall or WAF rule blocking AWS health checker IPs.

aws route53 get-health-check --health-check-id hc-eu-west-12345 --output json | jq '.HealthCheck.HealthCheckConfig'

The health check is hitting eu-api.example.com on port 443, path /health. Request interval: 30 seconds. Failure threshold: 3. So after 3 consecutive failures (90 seconds), the endpoint is marked unhealthy.
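Worth doing the timeline arithmetic with those numbers (interval and threshold from the config above; the 60-second record TTL is an assumption at this point):

```shell
# Interval and threshold from the health check config; TTL is assumed.
interval=30      # seconds between health check requests
threshold=3      # consecutive failures before the target is marked unhealthy
ttl=60           # how long resolvers may keep serving the old answer
detect=$((interval * threshold))
echo "time to mark unhealthy: ${detect}s"
echo "worst case until cached answers change: $((detect + ttl))s"
```

About 90 seconds to flip the health check plus a TTL of cache expiry — consistent with users noticing within a couple of minutes of whatever broke the checker.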

Mental Model: The Health Check Paradox

Health checks protect against real failures but can also CAUSE apparent failures. If a firewall, WAF, rate limiter, or network ACL blocks the health checker, the healthy endpoint gets removed from DNS, causing an outage of a perfectly working service. Always allowlist health checker IP ranges and test the health check path independently.

I bet someone added a WAF rule. Let me check.

aws wafv2 list-web-acls --scope REGIONAL --region eu-west-1 --query 'WebACLs[*].Name'

There's a WAF attached to the EU ALB. Let me check recent rule changes.

aws wafv2 get-web-acl --name eu-api-waf --scope REGIONAL --region eu-west-1 --id $(aws wafv2 list-web-acls --scope REGIONAL --region eu-west-1 --query 'WebACLs[?Name==`eu-api-waf`].Id' --output text) --query 'WebACL.Rules[*].{Name:Name,Action:Action}' --output table

There's a rate-limiting rule that was added... let me check the CloudWatch metrics for blocked requests.

aws cloudwatch get-metric-statistics --namespace "AWS/WAFV2" --metric-name BlockedRequests --dimensions Name=WebACL,Value=eu-api-waf Name=Rule,Value=rate-limit-rule --start-time $(date -u -d '1 hour ago' +%Y-%m-%dT%H:%M:%S) --end-time $(date -u +%Y-%m-%dT%H:%M:%S) --period 300 --statistics Sum --region eu-west-1

Blocked requests spiked 25 minutes ago. The rate-limit rule is blocking the Route 53 health checkers because they come from the same IP ranges as other AWS traffic that's being rate-limited. Classic.

The fix: add an allowlist rule for Route 53 health checker IPs that evaluates BEFORE the rate-limit rule. Route 53 publishes its health checker IP ranges.

curl -s https://ip-ranges.amazonaws.com/ip-ranges.json | jq '.prefixes[] | select(.service=="ROUTE53_HEALTHCHECKS") | .ip_prefix' | head -10
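That jq filter can be sanity-checked offline against a stub of the published file. The prefixes below are invented documentation ranges, not real AWS ranges — the real list comes from ip-ranges.amazonaws.com:

```shell
# Stub of ip-ranges.json with made-up prefixes; the filter itself is the
# same one used against the real file.
cat <<'EOF' | jq -r '.prefixes[] | select(.service=="ROUTE53_HEALTHCHECKS") | .ip_prefix'
{
  "prefixes": [
    {"ip_prefix": "203.0.113.0/26",  "region": "eu-west-1", "service": "ROUTE53_HEALTHCHECKS"},
    {"ip_prefix": "198.51.100.0/24", "region": "us-east-1", "service": "EC2"}
  ]
}
EOF
```

Only the ROUTE53_HEALTHCHECKS prefix comes back, which is what we want to feed into the IP set.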

I'll create an IP set with these ranges and add a WAF rule to allow them. But for the immediate fix, I can disable the rate-limit rule temporarily to restore the health check.
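The allow rule needs a lower Priority number than the rate-limit rule so it evaluates first. A sketch of what the rule would look like in the web ACL — the rule name, account ID, and IP set ARN are placeholders, not real resources:

```json
{
  "Name": "allow-route53-health-checkers",
  "Priority": 0,
  "Statement": {
    "IPSetReferenceStatement": {
      "ARN": "arn:aws:wafv2:eu-west-1:111122223333:regional/ipset/route53-health-checkers/EXAMPLE-ID"
    }
  },
  "Action": { "Allow": {} },
  "VisibilityConfig": {
    "SampledRequestsEnabled": true,
    "CloudWatchMetricsEnabled": true,
    "MetricName": "allow-route53-health-checkers"
  }
}
```

Allow is a terminating action in WAFv2, so a matching health checker request never reaches the rate-limit statement at all.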

Actually, the fastest fix is to switch the health check type from HTTP to TCP. TCP health checks are simpler and less likely to be caught by WAF rules.

aws route53 update-health-check --health-check-id hc-eu-west-12345 --port 443 --resource-path "" --health-check-type TCP 2>/dev/null || echo "Can't change type on existing check"

Can't change the type on an existing check. Let me just add the IP allowlist to the WAF. That's the proper fix anyway.

While I set that up, let me also manually set the health check to healthy to restore DNS immediately.

Actually, there's no API to force a health check's status directly — though disabling it with update-health-check --disabled would make Route 53 treat it as healthy. The more explicit move under pressure is to temporarily change the routing policy to remove the health check association.

# Quick fix: remove health check requirement to restore DNS immediately
aws route53 change-resource-record-sets --hosted-zone-id Z1234567890ABC --change-batch '{
  "Changes": [{
    "Action": "UPSERT",
    "ResourceRecordSet": {
      "Name": "api.example.com",
      "Type": "A",
      "SetIdentifier": "eu-west-1",
      "Weight": 50,
      "TTL": 60,
      "ResourceRecords": [{"Value": "52.18.xx.xx"}]
    }
  }]
}'
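When making record changes under pressure, I'd rather write the change batch to a file and syntax-check it before calling the API (the path here is arbitrary; jq exits non-zero on malformed JSON):

```shell
# Write the change batch to a file and validate it locally first.
cat > /tmp/change-batch.json <<'EOF'
{
  "Changes": [{
    "Action": "UPSERT",
    "ResourceRecordSet": {
      "Name": "api.example.com",
      "Type": "A",
      "SetIdentifier": "eu-west-1",
      "Weight": 50,
      "TTL": 60,
      "ResourceRecords": [{"Value": "52.18.xx.xx"}]
    }
  }]
}
EOF
jq -er '.Changes[0].ResourceRecordSet.TTL' /tmp/change-batch.json   # prints 60
```

The same file can then be passed as --change-batch file:///tmp/change-batch.json, which avoids the shell-quoting mistakes that inline JSON invites.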

The record change propagates to Route 53's nameservers within a minute or so, and resolvers drop cached answers within the 60-second TTL. EU users should start resolving correctly within a few minutes. Now let me properly fix the WAF while DNS recovers.

What Made This Senior-Level

  • Junior would: Test from their own machine and say "DNS works for me."
    Senior does: Test against each authoritative nameserver individually.
    Why: "It works for me" depends on which nameserver the resolver picks — you must test all of them.

  • Junior would: Investigate the application when users report "site not found."
    Senior does: Recognize that "site not found" is a DNS resolution failure, not an application error.
    Why: The error message points at the resolution layer, not the service layer.

  • Junior would: Miss the connection between the WAF rate-limit rule and the DNS health check failure.
    Senior does: Trace the chain: WAF blocks health checker -> health check fails -> DNS record withdrawn -> resolution fails for affected users.
    Why: DNS health check failures can be caused by infrastructure far from DNS.

  • Junior would: Only fix the WAF rule.
    Senior does: Restore DNS immediately (remove the health check association) AND fix the WAF rule.
    Why: Fix the symptom first (restore DNS for users), then fix the root cause (WAF allowlist).

Key Heuristics Used

  1. DNS Failure is Probabilistic: With multiple nameservers, a single failing NS causes intermittent failures proportional to 1/N. Test each NS individually.
  2. Health Check Paradox: A health check protecting an endpoint can also cause its removal from DNS if the checker itself is blocked. Always allowlist health checker IPs.
  3. Trace the Full Chain: DNS issues often have root causes in unexpected places (WAF, firewall, certificate expiry, rate limits). Follow the dependency chain.

Cross-References

  • Primer — DNS architecture, resolution flow, and Route 53 routing policies
  • Street Ops — DNS debugging with dig, trace, and multi-resolver testing
  • Footguns — WAF blocking health checkers, TTL propagation delays, and DNSSEC validation failures