
DNS Deep Dive - Footguns

Mistakes that silently break name resolution, corrupt caches, cause outages during migrations, and make "it's always DNS" the truest sentence in operations.


1. CNAME at the zone apex

You want example.com (no subdomain) to point at your CDN or load balancer. You add example.com. IN CNAME mycdn.example.net. and BIND refuses to load the zone, or worse, some resolvers accept it and others reject it.

The DNS specs (RFC 1034, clarified in RFC 2181) forbid CNAME records at the zone apex because the apex must also have SOA and NS records, and a CNAME means "this name is an alias: ignore all other records for this name." SOA + CNAME = contradiction.

Fix: Use your DNS provider's ALIAS or ANAME record type (supported by Route 53, Cloudflare, DNSimple, NS1). These synthesize an A record from the CNAME target at query time. If your provider does not support ALIAS records, use A records pointing to static IPs, or move to a provider that does.
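In zone-file terms, the conflict and the workable alternative look like this (names, addresses, and SOA values are illustrative):

```
; Rejected: the apex already owns SOA and NS records,
; while a CNAME claims to be the only record at its name
example.com.   IN SOA   ns1.example.com. hostmaster.example.com. (
                        2024061101 7200 3600 1209600 300 )
example.com.   IN NS    ns1.example.com.
example.com.   IN CNAME mycdn.example.net.   ; contradiction

; Safe: a plain A record at the apex (or a provider ALIAS/ANAME,
; which synthesizes this answer from the CDN name at query time)
example.com.   IN A     203.0.113.50
```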


2. TTL too long before a migration

Your records have TTL 86400 (24 hours). You change the IP at 9 AM on Tuesday. Some ISP resolvers cached the old IP at 8:59 AM. Those clients hit the old server for up to 24 hours. Your monitoring shows the migration succeeded because your checks use the authoritative server, but customer complaints roll in all day.

Fix: Lower TTL to 60 seconds at least 48 hours before any planned change. The wait is essential — you need the old high-TTL cached entries to expire. Then make the change, verify, and raise TTL back to normal after 24-48 hours of stability. Budget this into your migration plan timeline, not as an afterthought.
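A quick way to sanity-check the schedule is to work backwards from the cutover time. A minimal Python sketch (the 2x safety factor and the 24-hour stabilization window follow the guidance above; they are rules of thumb, not hard limits):

```python
from datetime import datetime, timedelta

def migration_timeline(cutover: datetime, old_ttl_s: int = 86400,
                       safety_factor: int = 2) -> dict:
    """Work backwards from a planned DNS cutover: resolvers may serve the
    old record for up to old_ttl_s after you lower the TTL, so lower it
    at least that long (doubled here for safety) before the change."""
    lower_ttl_by = cutover - timedelta(seconds=old_ttl_s * safety_factor)
    raise_ttl_after = cutover + timedelta(hours=24)  # earliest restore point
    return {"lower_ttl_by": lower_ttl_by,
            "make_change": cutover,
            "raise_ttl_after": raise_ttl_after}

plan = migration_timeline(datetime(2025, 6, 10, 9, 0))  # 9 AM Tuesday cutover
print(plan["lower_ttl_by"])   # 2025-06-08 09:00:00 (48 h before the change)
```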

Gotcha: Some recursive resolvers (notably older versions of dnsmasq and certain ISP resolvers) enforce a minimum TTL floor, ignoring your TTL if it is below their threshold. A TTL of 60s may be cached for 300s by these resolvers. You cannot force all resolvers to respect low TTLs — plan for stragglers even after the TTL window.


3. TTL too short causes a query storm

You set all your records to TTL 5 because you want fast propagation. Your authoritative nameservers now handle every single query from every resolver, every 5 seconds, for every client. During a traffic spike, your authoritative servers get overwhelmed. DNS latency spikes. Resolution timeouts cause application-level failures across your entire infrastructure.

Fix: Use TTL 300 (5 minutes) as a reasonable default. Only lower TTL for records you need to change frequently or during planned migrations. Monitor query volume on your authoritative servers. If you need sub-minute failover, use DNS health checks and failover routing (Route 53, Cloudflare) rather than ultra-low TTLs.
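The load difference is easy to estimate. A back-of-envelope Python sketch (the resolver count is an invented illustrative number, and real traffic is burstier than this steady-state model):

```python
def authoritative_qps(resolver_count: int, ttl_s: int) -> float:
    """Rough steady-state upper bound on queries per second reaching the
    authoritative servers for one record: each recursive resolver has to
    re-fetch at most once per TTL (assumes constant client demand)."""
    return resolver_count / ttl_s

# 50,000 recursive resolvers worldwide, all caching one popular record:
print(authoritative_qps(50_000, 5))    # TTL 5 s   -> 10000.0 qps
print(authoritative_qps(50_000, 300))  # TTL 300 s -> ~166.7 qps
```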


4. CNAME and MX/NS on the same name

You have mail.example.com. IN CNAME hosted-mail.provider.com. and then add mail.example.com. IN MX 10 mail.example.com.. The CNAME makes BIND ignore the MX record (or refuse the zone entirely). Mail delivery breaks because the MX target resolves through a CNAME chain that some mail servers refuse to follow.

Fix: Never combine CNAME with any other record type for the same name. If a name needs MX, NS, TXT, or any other record, it cannot be a CNAME. Use A records instead, or restructure your naming.
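In zone-file terms (names from the example above, address illustrative), the broken and corrected versions:

```
; Broken: the CNAME hides (or conflicts with) every other record here
mail.example.com.   IN CNAME hosted-mail.provider.com.
mail.example.com.   IN MX    10 mail.example.com.   ; ignored or rejected

; Fixed: the name owns its own A record, so MX (and TXT, etc.) can coexist
mail.example.com.   IN A     203.0.113.50
mail.example.com.   IN MX    10 mail.example.com.
```

Note that an MX target itself must also not be a CNAME (RFC 2181), which the fixed version satisfies.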


5. Missing reverse DNS breaks email deliverability

Your mail server sends email from 203.0.113.50. Recipients' mail servers do a reverse lookup (dig -x 203.0.113.50) and get nothing (NXDOMAIN) or a generic ISP hostname. The receiving server flags your email as suspicious. SPF passes, DKIM passes, but the mail still lands in spam because the PTR record is missing or mismatched.

Fix: Ensure forward and reverse DNS match. If mail.example.com resolves to 203.0.113.50, then dig -x 203.0.113.50 should return mail.example.com. PTR records are managed by whoever controls the IP block — usually your hosting provider or ISP. Request a PTR record from them. For cloud providers, set the reverse DNS through their console (EC2 elastic IP settings, GCP external IP settings).
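The matching pair looks like this (address illustrative; the reverse zone is delegated to whoever controls the IP block, so you usually request the PTR rather than publish it yourself):

```
; Forward zone, example.com:
mail.example.com.            IN A    203.0.113.50

; Reverse zone, 113.0.203.in-addr.arpa:
50.113.0.203.in-addr.arpa.   IN PTR  mail.example.com.
```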


6. /etc/resolv.conf overwritten by DHCP or NetworkManager

You manually edit /etc/resolv.conf to point at your internal DNS servers. An hour later, DHCP renewal runs, or NetworkManager restarts, and your changes are overwritten with the DHCP-provided nameservers. Your applications start resolving through the wrong DNS servers. Internal names break.

Fix: Never hand-edit /etc/resolv.conf on systems managed by systemd-resolved, NetworkManager, or DHCP clients. Configure DNS through the manager:

- systemd-resolved: set DNS= in /etc/systemd/resolved.conf
- NetworkManager: nmcli con mod <conn> ipv4.dns "10.0.1.10", or drop a file in /etc/NetworkManager/conf.d/
- DHCP: use supersede or prepend directives in /etc/dhcp/dhclient.conf

If you need manual control, configure NetworkManager with dns=none or use chattr +i /etc/resolv.conf as a fragile last resort.
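Two illustrative fragments for the approaches above (paths are the stock locations; adjust server IPs and filenames to your environment):

```
# /etc/NetworkManager/conf.d/90-dns-none.conf
# Tell NetworkManager to leave /etc/resolv.conf alone entirely
[main]
dns=none

# /etc/dhcp/dhclient.conf
# Ignore DHCP-provided nameservers and always use the internal ones
supersede domain-name-servers 10.0.1.10, 10.0.1.11;
```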


7. Search domain appending causes wrong resolution

Your /etc/resolv.conf has search example.com internal.corp. You look up api in your code. The resolver first tries api.example.com and finds a record — but it is the wrong api. You wanted the Kubernetes service api in your namespace, but you got the external api.example.com instead. Traffic goes to the wrong service, potentially leaking data or causing authentication failures.

Fix: Use fully qualified domain names (with trailing dot in DNS tools, or full name in application code). Understand the ndots setting — with ndots:1, any name containing a dot is queried as-is first. Review your search domain list and ensure it does not create ambiguous resolution. In Kubernetes pods, use the full <service>.<namespace>.svc.cluster.local form for cross-namespace calls.
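The expansion order can be modeled in a few lines. A simplified Python sketch of the glibc-style ndots/search rules (not the exact libc implementation; the search list is from the example above):

```python
def candidate_names(name: str, search: list[str], ndots: int = 1) -> list[str]:
    """Return the order in which a stub resolver tries a name, per the
    resolv.conf ndots/search rules (simplified model)."""
    if name.endswith("."):                    # trailing dot: fully qualified,
        return [name]                         # search list is never consulted
    via_search = [f"{name}.{domain}." for domain in search]
    absolute = name + "."
    if name.count(".") >= ndots:              # "enough" dots: try as-is first
        return [absolute] + via_search
    return via_search + [absolute]            # otherwise search list first

print(candidate_names("api", ["example.com", "internal.corp"]))
# ['api.example.com.', 'api.internal.corp.', 'api.']
```

This is why a lookup for api hits api.example.com before the bare name is ever tried, and why a trailing dot skips the search list entirely.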


8. DNS caching hides configuration changes

You update a record in your authoritative DNS server. You test with dig @ns1.example.com and it shows the new value. You tell the team the change is live. But applications are still using the old value because the recursive resolver, systemd-resolved, nscd, the JVM, and the browser all have their own caches, each with their own TTL interpretation.

Fix: After making DNS changes, flush caches at every layer you control:

1. Authoritative server (confirm the change is actually served)
2. Recursive resolver cache (rndc flush for BIND, unbound-control flush_zone for Unbound)
3. Local system cache (resolvectl flush-caches or nscd -i hosts)
4. Application cache (restart the JVM, clear the browser cache)

Test from a client, not from the server. Use dig @8.8.8.8 +short to see what public resolvers serve.


9. Kubernetes ndots:5 causes excessive DNS lookups

Kubernetes sets ndots:5 in pod resolv.conf by default. Any name with fewer than 5 dots gets the search domain list appended first. A lookup for api.github.com (2 dots, less than 5) tries api.github.com.default.svc.cluster.local, api.github.com.svc.cluster.local, api.github.com.cluster.local before finally trying api.github.com. as a FQDN. That is 3 wasted NXDOMAIN queries per external lookup.

For a service making 1,000 external DNS calls per second, this means 3,000 extra queries hitting CoreDNS. In large clusters, this can overwhelm CoreDNS, cause latency spikes, and trigger cascading timeouts.

Fix: For pods that primarily call external services:

spec:
  dnsConfig:
    options:
      - name: ndots
        value: "2"
Or append a trailing dot to external hostnames in your application code (api.github.com.). Do not set ndots to 0 or 1 cluster-wide — that would break short-name Kubernetes service resolution.


10. DNSSEC breaking resolution when misconfigured

You enable DNSSEC on your zone but make a mistake: the DS record in the parent zone does not match your KSK, or you let a signature expire, or you rotate keys without updating the DS record at the registrar. DNSSEC-validating resolvers (8.8.8.8, 1.1.1.1, and many ISPs) now return SERVFAIL for every query to your domain. Your entire zone is unreachable from any resolver that validates DNSSEC.

Non-validating resolvers still work, which makes debugging confusing — "it works for me" while half the internet cannot reach you.

Fix: Before enabling DNSSEC:

- Use managed DNSSEC (Route 53, Cloudflare) rather than manual key management
- Compare dig +dnssec with dig +dnssec +cd (checking disabled): if the query fails only without +cd, validation itself is what is breaking
- Monitor for DNSSEC signature expiry
- Have a plan to disable DNSSEC quickly (remove the DS record from the parent zone) if something goes wrong
- After enabling, verify from multiple DNSSEC-validating resolvers: dig @8.8.8.8 +dnssec example.com


11. Wildcard DNS catching unintended subdomains

You add *.example.com. IN A 10.0.2.100 so that any subdomain resolves to your web server. This catches everything — including typo.example.com, internal-service.example.com, and _acme-challenge.example.com. Let's Encrypt HTTP-01 validation for a subdomain might hit the wildcard instead of the correct server. Internal service names that should not resolve externally now do.

Fix: Use wildcard records only when you genuinely want every possible subdomain to resolve to the same target. For most setups, explicit records are safer. If you must use wildcards, be aware that:

- Explicit records take precedence over wildcards (an A record for app.example.com overrides *.example.com)
- Wildcards can match multiple labels (*.example.com matches bar.foo.example.com if no record exists at foo.example.com), but any existing intermediate name cuts off wildcard synthesis below it (RFC 4592)
- Wildcards can interfere with ACME DNS-01 challenges, service discovery, and security scanning
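The precedence and blocking rules can be modeled directly. A simplified Python sketch of the RFC 4592 lookup logic (zone contents and addresses are illustrative; real resolution also involves record types, NSEC, and more):

```python
def resolve(qname: str, zone: dict[str, str], origin: str = "example.com"):
    """Simplified wildcard lookup: an explicit name always wins, and any
    existing name between qname and the origin blocks wildcard synthesis
    for everything below it (RFC 4592 "closest encloser" rule)."""
    if qname in zone:
        return zone[qname]                    # exact match beats the wildcard
    wildcard = "*." + origin
    if wildcard not in zone:
        return None                           # NXDOMAIN
    labels = qname[: -len("." + origin)].split(".")
    for i in range(1, len(labels)):
        if ".".join(labels[i:]) + "." + origin in zone:
            return None                       # blocked by an existing node
    return zone[wildcard]                     # synthesized from the wildcard

zone = {"*.example.com": "10.0.2.100", "app.example.com": "10.0.2.50"}
print(resolve("typo.example.com", zone))      # 10.0.2.100 (wildcard)
print(resolve("app.example.com", zone))       # 10.0.2.50 (explicit record wins)
print(resolve("x.app.example.com", zone))     # None (app.example.com blocks it)
```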


12. Split-horizon DNS causing debugging confusion

You have split-horizon DNS: internal clients get 10.0.2.100 for app.example.com, external clients get 203.0.113.50. A developer on VPN reports that app.example.com is down. You check from your desk (also internal) and it works. You check from an external server and it works. The developer's VPN routes DNS through a different path that hits the external view, getting the public IP, which is not reachable from the VPN subnet.

Fix: When debugging split-horizon issues:

1. Always ask "which DNS resolver is the client using?" (cat /etc/resolv.conf or resolvectl status)
2. Test against multiple resolvers: dig @internal-dns, dig @external-dns, dig @8.8.8.8
3. Check BIND view match-clients to verify which source IPs get which view
4. Document your split-horizon setup so on-call engineers know it exists
5. Give VPN configurations special attention: ensure VPN clients use the internal resolver for internal domains
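A minimal named.conf sketch of such a setup (subnets and file paths illustrative); note that the VPN subnet must be listed in the internal match-clients, or VPN users fall through to the external view:

```
view "internal" {
    match-clients { 10.0.0.0/8; 192.168.0.0/16; };   // include VPN subnets!
    zone "example.com" {
        type master;
        file "zones/internal/example.com.zone";      // app -> 10.0.2.100
    };
};
view "external" {
    match-clients { any; };                          // everyone else
    zone "example.com" {
        type master;
        file "zones/external/example.com.zone";      // app -> 203.0.113.50
    };
};
```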

Debug clue: dig +short @<resolver-ip> example.com lets you test against a specific resolver. When debugging split-horizon, test against at least three: your internal resolver, your external resolver, and a public resolver (8.8.8.8). If all three return different answers, you have confirmed split-horizon is in play and can trace which view each client is hitting.