Anti-Primer: DNS Deep Dive

Everything that can go wrong, will — and in this story, it does.

The Setup

A team is migrating their internal DNS infrastructure from BIND to CoreDNS during a scheduled maintenance window. The migration affects service discovery for 200 microservices. The engineer assumes a clean cutover will 'just work.'

The Timeline

Hour 0: Missing Search Domain

The new resolv.conf does not include the internal search domain. The deadline was looming, and this seemed like the fastest path forward. As a result, short hostnames like db-primary stop resolving, and every service that uses short names breaks.

Footgun #1: Missing Search Domain. The new resolv.conf omits the internal search domain, so short hostnames like db-primary stop resolving and every service that uses short names breaks.
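
A pre-cutover check along the following lines might have caught this. It is a minimal sketch that compares the search lists of two resolv.conf-style texts; the helper names, the domain corp.example.internal, and the nameserver addresses are hypothetical and not taken from the incident.

```python
def search_domains(resolv_conf_text):
    """Return the search domains from resolv.conf-style text, in order."""
    domains = []
    for raw in resolv_conf_text.splitlines():
        line = raw.strip()
        # resolv.conf treats lines starting with '#' or ';' as comments.
        if not line or line[0] in "#;":
            continue
        fields = line.split()
        if fields[0] == "search":
            # If several search lines exist, the last one wins.
            domains = fields[1:]
    return domains


def missing_search_domains(old_text, new_text):
    """Search domains present in the old config but absent from the new one."""
    new = set(search_domains(new_text))
    return [d for d in search_domains(old_text) if d not in new]


# Hypothetical before/after configs for illustration only.
OLD = """\
nameserver 10.0.0.2
search corp.example.internal
"""
NEW = """\
nameserver 10.0.0.53
"""

print(missing_search_domains(OLD, NEW))  # ['corp.example.internal']
```

An empty list means every old search domain survived the migration; anything else is a reason to stop before cutover.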

Nobody notices yet. The engineer moves on to the next task.

Hour 1: TTL Mismatch

The team sets the TTL to 0 on all records for 'instant updates'. Under time pressure, they chose speed over caution. As a result, DNS query volume increases 100x; CoreDNS is overwhelmed and starts dropping queries.

Footgun #2: TTL Mismatch. The team sets the TTL to 0 on all records for 'instant updates', so DNS query volume increases 100x and CoreDNS, overwhelmed, starts dropping queries.
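
The sketch below illustrates two things: a simple lint that flags TTLs outside the 30-300 second range the primer recommends, and a rough model of why TTL 0 multiplies load. The caching model, the lookup rate, the 50-second baseline TTL, and the record names are assumptions chosen for illustration; only the figure of 200 services comes from the story.

```python
MIN_TTL, MAX_TTL = 30, 300  # seconds; the range the primer recommends


def ttl_violations(records, min_ttl=MIN_TTL, max_ttl=MAX_TTL):
    """Return the (name, ttl) pairs whose TTL lies outside [min_ttl, max_ttl]."""
    return [(name, ttl) for name, ttl in records if not (min_ttl <= ttl <= max_ttl)]


def server_qps(clients, lookups_per_second, ttl):
    """Rough queries/second reaching the server for one name.

    Simplified model (an assumption): each client caches an answer for
    `ttl` seconds, so it asks at most once per TTL; with TTL 0 nothing
    is cached and every lookup reaches the server.
    """
    if ttl <= 0:
        return clients * lookups_per_second
    return clients * min(lookups_per_second, 1.0 / ttl)


# Hypothetical records as (name, ttl) pairs.
RECORDS = [
    ("db-primary.corp.example.internal.", 0),
    ("api.corp.example.internal.", 60),
]
print(ttl_violations(RECORDS))  # [('db-primary.corp.example.internal.', 0)]

# 200 clients, each looking the name up twice per second (assumed rate).
ratio = server_qps(200, 2, ttl=0) / server_qps(200, 2, ttl=50)
print(round(ratio))  # 100
```

Under these assumed numbers the load ratio works out to 100; real ratios depend on client caching behavior and lookup rates.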

The first mistake is still invisible, making the next shortcut feel justified.

Hour 2: Recursive Loop

The team configures CoreDNS to forward unresolved queries to itself. Nobody pushed back because the shortcut looked harmless in the moment. As a result, DNS requests loop back into the server until they time out or exhaust its resources, and all name resolution fails.

Footgun #3: Recursive Loop. The team configures CoreDNS to forward unresolved queries to itself, so DNS requests loop back into the server until they time out or exhaust its resources, and all name resolution fails.
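
A static check of the Corefile could flag this before deployment. The following is a minimal sketch that scans `forward FROM TO...` lines and reports upstreams that point back at the server. The addresses and Corefile contents are hypothetical, and treating every loopback address as "self" is an assumption that may not hold if a different resolver listens on loopback.

```python
LOOPBACK = {"127.0.0.1", "::1", "localhost"}


def forward_targets(corefile_text):
    """Collect the upstream targets from each `forward FROM TO...` line."""
    targets = []
    for raw in corefile_text.splitlines():
        fields = raw.split("#", 1)[0].split()  # drop trailing comments
        if len(fields) >= 3 and fields[0] == "forward":
            targets.extend(f for f in fields[2:] if f != "{")
    return targets


def host_of(target):
    """Strip an optional scheme (such as tls://) and a trailing :port.

    Only handles hostnames and IPv4; bracketed IPv6 with a port is left as-is.
    """
    host = target.split("://", 1)[-1]
    return host.rsplit(":", 1)[0] if host.count(":") == 1 else host


def self_forwards(corefile_text, own_addresses):
    """Return forward targets that point back at this server."""
    own = set(own_addresses) | LOOPBACK
    return [t for t in forward_targets(corefile_text) if host_of(t) in own]


# Hypothetical Corefiles; 10.0.0.53 stands in for the new server's address.
BAD = """\
. {
    forward . 127.0.0.1:53
    cache 30
}
"""
GOOD = """\
. {
    forward . 10.0.0.2 10.0.0.3
    cache 30
}
"""

print(self_forwards(BAD, ["10.0.0.53"]))   # ['127.0.0.1:53']
print(self_forwards(GOOD, ["10.0.0.53"]))  # []
```

CoreDNS also ships a loop plugin that detects simple forwarding loops and halts the server, which can serve as a runtime backstop alongside a static check like this one.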

Pressure is mounting. The team is behind schedule and cutting more corners.

Hour 3: DNSSEC Validation Mismatch

The team enables DNSSEC validation on the new server, but the internal zones are not signed. They had gotten away with similar shortcuts before, so nobody raised a flag. As a result, all internal DNS queries fail validation, and services cannot discover each other.

Footgun #4: DNSSEC Validation Mismatch. The team enables DNSSEC validation on the new server while the internal zones are unsigned, so all internal DNS queries fail validation and services cannot discover each other.
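
One way to find out which zones are safe for validation is to ask whether each zone publishes any DNSKEY records. The sketch below does that by shelling out to dig; the zone names are hypothetical, and the presence of a DNSKEY is only a heuristic for "signed" (it does not verify signatures or the chain of trust).

```python
import subprocess


def has_dnskey(zone, server):
    """True if `server` returns at least one DNSKEY record for `zone`.

    Shells out to dig, so dig must be installed. Lines starting with ';'
    are dig diagnostics (timeouts and similar), not answers.
    """
    out = subprocess.run(
        ["dig", "+short", f"@{server}", zone, "DNSKEY"],
        capture_output=True, text=True, timeout=10, check=False,
    ).stdout
    return any(line and not line.startswith(";") for line in out.splitlines())


def unsigned_zones(zones, is_signed):
    """Return the zones for which the `is_signed` callable reports False."""
    return [zone for zone in zones if not is_signed(zone)]


# Offline demonstration with a stub in place of a live dig query.
SIGNED = {"example.com."}
ZONES = ["example.com.", "corp.example.internal."]
print(unsigned_zones(ZONES, lambda zone: zone in SIGNED))
# ['corp.example.internal.']
```

Against a live server, the stub could be replaced with something like `lambda zone: has_dnskey(zone, "10.0.0.2")`, where the address is a placeholder for the authoritative server. Any zone in the result should be excluded from validation until it is signed.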

By hour 3, the compounding failures have reached critical mass. Pages fire. The war room fills up. The team scrambles to understand what went wrong while the system burns.

The Postmortem

Root Cause Chain

| # | Mistake | Consequence | Could Have Been Prevented By |
| --- | --- | --- | --- |
| 1 | Missing Search Domain | Short hostnames like db-primary stop resolving; every service using short names breaks | Primer: Verify search domains in resolv.conf match the old configuration exactly |
| 2 | TTL Mismatch | DNS query volume increases 100x; CoreDNS is overwhelmed and starts dropping queries | Primer: Use appropriate TTLs (30-300 seconds); TTL 0 creates massive query amplification |
| 3 | Recursive Loop | DNS requests loop back into the server until they time out or exhaust its resources; all name resolution fails | Primer: Forward to upstream resolvers, never to self; test with dig before cutover |
| 4 | DNSSEC Validation Mismatch | All internal DNS queries fail validation; services cannot discover each other | Primer: Only enable DNSSEC validation for zones that are actually signed |
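
The advice to test before cutover can be sketched as a parity check: resolve the same names against the old and new servers and report any differences. In this minimal sketch the two lookup functions are stubs backed by dictionaries; in practice each might wrap a query such as `dig @server name +short`. All names and addresses are hypothetical.

```python
def resolution_diffs(names, old_lookup, new_lookup):
    """Return {name: (old_answer, new_answer)} for names whose answers differ."""
    diffs = {}
    for name in names:
        old, new = old_lookup(name), new_lookup(name)
        if old != new:
            diffs[name] = (old, new)
    return diffs


# Stub answers standing in for the old (BIND) and new (CoreDNS) servers.
OLD_ANSWERS = {
    "db-primary.corp.example.internal.": ["10.1.0.10"],
    "api.corp.example.internal.": ["10.1.0.20"],
}
NEW_ANSWERS = {
    "api.corp.example.internal.": ["10.1.0.20"],
}

diffs = resolution_diffs(
    sorted(OLD_ANSWERS),
    lambda name: OLD_ANSWERS.get(name, []),
    lambda name: NEW_ANSWERS.get(name, []),
)
print(diffs)  # {'db-primary.corp.example.internal.': (['10.1.0.10'], [])}
```

Note that dig does not apply the resolv.conf search list unless invoked with +search, so a check built on it would compare fully qualified names; short-name resolution needs a separate test through the system resolver.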

Damage Report

  • Downtime: 1-4 hours of connectivity loss or degraded throughput
  • Data loss: None directly, but dependent services may lose in-flight data
  • Customer impact: Timeouts, connection failures, or complete network unreachability
  • Engineering time to remediate: 8-16 engineer-hours including physical layer verification
  • Reputation cost: Network team credibility damaged; possible SLA credits to internal customers

What the Primer Teaches

  • Footgun #1: The primer's section on missing search domains would have taught the engineer to verify that the search domains in resolv.conf match the old configuration exactly.
  • Footgun #2: The primer's section on TTL mismatch would have taught the engineer to use appropriate TTLs (30-300 seconds), because TTL 0 creates massive query amplification.
  • Footgun #3: The primer's section on recursive loops would have taught the engineer to forward to upstream resolvers, never to self, and to test with dig before cutover.
  • Footgun #4: The primer's section on DNSSEC validation mismatch would have taught the engineer to enable DNSSEC validation only for zones that are actually signed.

Cross-References