
DNS Operations Footguns

Mistakes that break name resolution and make everything look like an application failure.


1. Forgetting the trailing dot in zone files

You add www IN CNAME app.example.com to the zone file for example.com. Without the trailing dot, BIND interprets this as app.example.com.example.com. — a name that does not exist. Your CNAME resolves to nothing and www goes down.

Fix: Always use trailing dots for fully qualified domain names in zone files. www IN CNAME app.example.com. (with the dot). Run named-checkzone after every edit.

Remember: The trailing dot means "this name is absolute." Without it, BIND appends the zone's origin. So in zone example.com, the entry app.example.com (no dot) becomes app.example.com.example.com. — a double-suffix that resolves to nothing.
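The difference is easiest to see side by side in a zone fragment (same hypothetical names as above):

```
; Zone file for example.com — sketch, not a complete zone.
$ORIGIN example.com.
$TTL 3600

; WRONG: no trailing dot, so BIND appends the origin.
; This record actually points at app.example.com.example.com.
;www    IN  CNAME  app.example.com

; RIGHT: the trailing dot marks the target as absolute.
www     IN  CNAME  app.example.com.
```

A run of named-checkzone example.com db.example.com will not catch this particular mistake (the double-suffixed name is syntactically valid), which is why the trailing-dot habit matters: the error only surfaces at resolution time.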


2. Editing the zone file but not incrementing the serial

You add a new A record to the zone file. You reload BIND on the primary. The secondary servers ignore the update because the serial number did not increase. External clients querying the secondary get NXDOMAIN for your new record.

Fix: Always increment the serial before reloading. Use YYYYMMDDNN format. After editing, run rndc reload <zone> and verify the serial on both primary and secondary with dig SOA.
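A sketch of the SOA record with the serial in YYYYMMDDNN form (all names and timer values here are illustrative):

```
; Bump the serial on EVERY edit, or secondaries will not transfer.
example.com.  IN  SOA  ns1.example.com. hostmaster.example.com. (
    2024061501  ; serial, YYYYMMDDNN — was 2024061500 before this edit
    7200        ; refresh
    900         ; retry
    1209600     ; expire
    300 )       ; negative-caching TTL
```

After rndc reload example.com, dig SOA example.com against each nameserver should return the new serial on the primary and, once the transfer completes, on every secondary.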


3. Setting a 24-hour TTL the day before a migration

Your A record has TTL 86400 (24 hours). You change the IP at 9 AM. Some ISP resolvers cached the old IP at 8:59 AM. Those clients hit the old IP for up to 24 hours. Your migration is "complete" but half your users are still going to the old server.

Fix: Lower the TTL to 60 seconds at least 48 hours before any planned IP change. Wait for the old high-TTL entries to expire before making the change. Raise TTL back after migration is verified.

Gotcha: Some ISP resolvers (notably older versions of Comcast and certain Asian ISPs) ignore low TTLs and cache for a minimum of 30 minutes regardless. Plan for a tail of stragglers even after TTL expiry. Monitor DNS query distribution across old and new IPs during migration.
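The migration can be pictured as three snapshots of the same record over time (hypothetical IPs; 192.0.2.10 is the old server, 198.51.100.20 the new one):

```
; T-48h: lower the TTL but keep the old IP, so the 86400s
; entries cached by resolvers can age out.
www  60     IN  A  192.0.2.10

; T-0: switch the IP while the TTL is still low — a rollback
; now takes at most ~60 seconds to propagate.
www  60     IN  A  198.51.100.20

; T+verified: raise the TTL back once traffic has moved.
www  86400  IN  A  198.51.100.20
```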


4. CNAME at the zone apex

You try example.com. IN CNAME mycdn.cloudfront.net. because you want the naked domain to point to your CDN. This is illegal per RFC 1034 (and restated in RFC 2181): the apex must carry SOA and NS records, and a CNAME cannot coexist with any other record type at the same name.

Fix: Use your DNS provider's ALIAS or ANAME record (Route 53, Cloudflare, and DNSimple all offer an equivalent). Or use A records pointing directly to the CDN IPs. Never put a CNAME at the zone apex.
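In standard zone-file syntax the choice looks like this (hypothetical CDN IPs — note that ALIAS/ANAME is provider-side flattening, not something you can express in a plain zone file):

```
; ILLEGAL: CNAME cannot coexist with the mandatory SOA/NS at the apex.
;example.com.  IN  CNAME  mycdn.cloudfront.net.

; Valid standard alternative: A records at the apex.
; Downside: CDN IPs change, so these need active maintenance —
; which is exactly why provider ALIAS/ANAME records are preferred.
example.com.  IN  A  192.0.2.44
example.com.  IN  A  192.0.2.45
```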


5. Allowing zone transfers to the world

Your BIND config has allow-transfer { any; }; because you copied a tutorial config. Anyone can run dig AXFR example.com @your-ns and dump your entire zone — all hostnames, internal IPs, service names. This is a reconnaissance goldmine for attackers.

Fix: Restrict zone transfers to secondary nameserver IPs only. allow-transfer { 10.0.1.11; };. Verify with dig AXFR example.com @ns1.example.com from an unauthorized IP — it should fail.
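A named.conf sketch of the locked-down setup (10.0.1.11 and 10.0.1.12 are hypothetical secondary addresses):

```
// Restrict AXFR to the known secondaries only.
acl secondaries { 10.0.1.11; 10.0.1.12; };

zone "example.com" {
    type primary;                // "type master" on older BIND versions
    file "db.example.com";
    allow-transfer { secondaries; };
    also-notify { 10.0.1.11; 10.0.1.12; };
};
```

An AXFR attempt from any other IP should now return "Transfer failed." For defense in depth, BIND also supports authenticating transfers with TSIG keys on top of the IP restriction.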


6. Kubernetes ndots causing 4x DNS traffic

Your application in Kubernetes makes 1,000 external API calls per second. With the default ndots: 5, each call generates 5 DNS queries (4 NXDOMAIN search-suffix attempts + 1 real). Your CoreDNS is handling 5,000 queries/sec instead of 1,000 and latency spikes.

Fix: Set dnsConfig.options: [{name: ndots, value: "2"}] on pods that primarily call external services. Or append a trailing dot to external hostnames in your application code.

Under the hood: With ndots: 5 (Kubernetes default), a lookup for api.stripe.com (3 dots, less than 5) first tries api.stripe.com.<namespace>.svc.cluster.local, then api.stripe.com.svc.cluster.local, then api.stripe.com.cluster.local, then api.stripe.com.<search-domain>, and finally the actual FQDN. Four wasted queries before the real one.
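The fix as a pod spec fragment (pod name and image are placeholders; dnsConfig merges with the default ClusterFirst policy):

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: api-client
spec:
  containers:
    - name: app
      image: example/app:latest
  dnsConfig:
    options:
      - name: ndots
        value: "2"   # names with >= 2 dots are tried as absolute first
```

With ndots: 2, api.stripe.com (3 dots by the resolver's count of dots in the name) is queried as-is on the first attempt; short in-cluster names like mysvc.myns still go through the search path and resolve normally.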


7. Deleting a DNS record that other records depend on

You delete the A record for ns1.example.com because you are migrating to new nameservers. But ns1.example.com is a glue record at the registrar. The delegation chain breaks. Your entire zone becomes unresolvable.

Fix: Update the delegation at the registrar first. Add new NS records and glue. Wait for propagation (check with dig +trace). Only then remove old NS/A records from the zone file.
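During the transition the zone should carry both generations of nameservers side by side (ns3/ns4 and all IPs here are hypothetical):

```
; Transitional state: old and new NS sets coexist while the
; registrar delegation and glue records propagate.
example.com.  IN  NS  ns1.example.com.
example.com.  IN  NS  ns2.example.com.
example.com.  IN  NS  ns3.example.com.
example.com.  IN  NS  ns4.example.com.
ns1           IN  A   192.0.2.1
ns2           IN  A   192.0.2.2
ns3           IN  A   198.51.100.1
ns4           IN  A   198.51.100.2
```

Only after dig +trace example.com shows the parent zone delegating to ns3/ns4 — and the old NS TTL has expired — is it safe to drop the ns1/ns2 lines and their A records.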


8. Using /etc/hosts as a permanent DNS solution

You add entries to /etc/hosts on 50 servers to work around a DNS issue. Six months later, the IPs change. Nobody remembers the hosts file entries. Half your fleet talks to the old IPs while the other half uses DNS correctly.

Fix: Fix the actual DNS problem. If you must use /etc/hosts, document it, limit it to a known scope, and have automation (Ansible) manage it. Treat hosts file entries as temporary workarounds with an expiration date.
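If a hosts-file workaround is unavoidable, an Ansible task like this sketch keeps it visible and removable (hostname, IP, and ticket reference are placeholders):

```yaml
# Every managed /etc/hosts override is explicit, marker-delimited,
# and trivially removable by deleting this task and re-running.
- name: Temporary override for db1 during DNS outage (remove by 2024-07-01)
  ansible.builtin.blockinfile:
    path: /etc/hosts
    marker: "# {mark} ANSIBLE MANAGED - DNS workaround TICKET-1234"
    block: |
      10.0.5.20 db1.example.com db1
```

The marker lines make the entry greppable across the fleet, so "nobody remembers the hosts file entries" stops being possible.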


9. Pointing resolv.conf at a single DNS server

/etc/resolv.conf has one nameserver entry. That DNS server goes down. Every service on the machine that needs DNS resolution fails. Connection timeouts cascade across your application stack.

Fix: Always configure at least two nameservers in /etc/resolv.conf. Place them on different networks if possible. Add options timeout:2 attempts:3 to fail over faster.
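A resolv.conf sketch with the redundancy and failover tuning in place (hypothetical resolver IPs on separate networks):

```
# /etc/resolv.conf — glibc tries nameservers in listed order.
nameserver 10.0.1.53
nameserver 10.2.1.53
options timeout:2 attempts:3
```

Note that glibc honors at most three nameserver lines (MAXNS), and on many distributions this file is generated by systemd-resolved, NetworkManager, or DHCP — edit the generator's config, not the file itself, or your change will be overwritten.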


10. Running a recursive resolver on a public-facing server

Your BIND server is authoritative for your zones but also has recursion yes and is accessible from the internet. Attackers use it as a DNS amplification relay for DDoS attacks. Your server IP gets blacklisted. Your legitimate DNS queries get rate-limited.

Fix: Disable recursion on authoritative servers. Run recursive resolvers on separate, internal-only servers. If you must serve both roles, use BIND views: recursion only for internal clients, authoritative-only for external.
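A named.conf sketch of the dual-role views setup (hypothetical internal range; note that once views are enabled, every zone must be declared inside a view):

```
acl internal { 10.0.0.0/8; 127.0.0.1; };

// Internal clients get recursion.
view "internal" {
    match-clients { internal; };
    recursion yes;
    zone "example.com" { type primary; file "db.example.com"; };
};

// Everyone else gets authoritative answers only.
view "external" {
    match-clients { any; };
    recursion no;
    zone "example.com" { type primary; file "db.example.com"; };
};
```

Views are matched in order, so the internal view must come first; an external client asking for a name outside your zones gets REFUSED instead of being relayed.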