Incident Replay: DNS Resolution Slow¶
Setup¶
- System context: Production web application experiencing 5-second delays on every external API call. Internal services are fast but any DNS lookup for external hostnames takes 5+ seconds.
- Time: Wednesday 11:15 UTC
- Your role: On-call SRE
Round 1: Alert Fires¶
[Pressure cue: "Application response time jumped from 200ms to 5200ms. All external API integrations are slow. Internal services are fine. Users are complaining."]
What you see:
`dig @10.0.0.53 api.partner.com` takes 5.1 seconds to respond. `dig @8.8.8.8 api.partner.com` responds in 20ms from the same server. The internal DNS resolver (10.0.0.53) is slow for all external lookups.
Choose your action:

- A) Switch all servers to use 8.8.8.8 directly
- B) Check the internal DNS resolver's health and upstream configuration
- C) Flush the DNS cache on the resolver
- D) Check if the resolver is overloaded
If you chose B (recommended):¶
[Result: The internal resolver (BIND) forwards external queries to two upstream resolvers: 10.0.0.2 (primary) and 10.0.0.3 (secondary). `dig @10.0.0.2 api.partner.com` times out completely; `dig @10.0.0.3 api.partner.com` responds in 15ms. The primary upstream is dead, and BIND waits 5 seconds before failing over to the secondary. Proceed to Round 2.]
If you chose A:¶
[Result: Bypasses the issue but now all servers directly query Google DNS, bypassing internal DNS policies, split-horizon, and caching. Security and compliance concerns.]
If you chose C:¶
[Result: Cache flush does not help — the issue is the upstream forwarder timeout, not stale cache entries.]
If you chose D:¶
[Result: Resolver CPU and memory are normal. It is not overloaded — it is waiting on a dead upstream.]
Round 2: First Triage Data¶
[Pressure cue: "Every external DNS lookup adds 5 seconds. Application SLA is breached."]
What you see:
The primary upstream forwarder 10.0.0.2 is a DNS appliance that crashed overnight. BIND's `forward first` policy queries the primary, waits 5 seconds for a timeout, then queries the secondary, so every external lookup incurs this 5-second penalty.
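The forwarding setup described above would look roughly like this as a `named.conf` excerpt (a sketch; the real config likely has more options):

```
options {
    forward first;          // try forwarders, then fall back to full recursion
    forwarders {
        10.0.0.2;           // primary -- crashed; every query waits out its timeout
        10.0.0.3;           // secondary -- healthy, answers in ~15ms
    };
};
```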
Choose your action:

- A) Remove the dead upstream from BIND's forwarder list temporarily
- B) Reduce BIND's forwarder timeout from 5 seconds to 1 second
- C) Fix the crashed DNS appliance at 10.0.0.2
- D) Reorder the forwarders to put the working one first
If you chose A (recommended):¶
[Result: Comment out `10.0.0.2` from the forwarders list and reload BIND with `rndc reload`. External DNS lookups now go directly to 10.0.0.3 and resolve in <20ms. Application response time returns to normal. Proceed to Round 3.]
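The hotfix amounts to a one-line change in the same excerpt (sketch):

```
options {
    forward first;
    forwarders {
        // 10.0.0.2;        // dead primary -- disabled until the appliance is fixed
        10.0.0.3;
    };
};
```

Running `named-checkconf` before `rndc reload` catches syntax errors before they take the resolver down along with the forwarder.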
If you chose B:¶
[Result: Reduces the penalty from 5 seconds to 1 second but every external lookup still has an unnecessary 1-second delay.]
If you chose C:¶
[Result: Fixing the appliance is the right long-term action but it could take hours. The application cannot wait.]
If you chose D:¶
[Result: Depending on the BIND version and configuration, forwarders are selected based on measured round-trip times rather than strict list order, so reordering alone may not reliably avoid the dead primary.]
Round 3: Root Cause Identification¶
[Pressure cue: "DNS fast again. Fix the DNS appliance and prevent recurrence."]
What you see:
Root cause: the primary DNS forwarder (10.0.0.2) crashed overnight. There was no health monitoring on the forwarder, BIND's default 5-second timeout added latency to every external lookup, and there was no automatic failover from primary to secondary.
Choose your action:

- A) Add DNS forwarder health monitoring and alerting
- B) Configure BIND to use both forwarders concurrently (blast mode)
- C) Add a third forwarder for redundancy
- D) All of the above — plus fix the crashed appliance
If you chose D (recommended):¶
[Result: Appliance restarted and root-caused (memory leak). Health monitoring added. Third forwarder deployed. BIND reconfigured for better failover. Proceed to Round 4.]
If you chose A:¶
[Result: Detection improves but response is still manual.]
If you chose B:¶
[Result: Concurrent queries avoid the timeout penalty but double the query load to upstream.]
If you chose C:¶
[Result: More redundancy helps but you still need monitoring.]
Round 4: Remediation¶
[Pressure cue: "DNS healthy. Application response times nominal. Close."]
Actions:
1. Verify DNS response times: `dig api.partner.com` shows <50ms
2. Verify all forwarders are healthy
3. Re-add the fixed primary forwarder to BIND's config
4. Add DNS response time monitoring and alerting
5. Add forwarder health checks to monitoring
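Step 5's forwarder health check can be sketched as a small shell probe. The threshold and the canned sample line are illustrative; a real probe would pipe live `dig` output through `query_time_ms` for each forwarder:

```shell
#!/bin/sh
# Extract the "Query time" value (ms) from dig's statistics output.
query_time_ms() {
    awk '/Query time:/ {print $4}'
}

# Canned dig stats line for illustration; a real probe would run e.g.
#   dig @10.0.0.3 api.partner.com +time=1 +tries=1 | query_time_ms
sample=';; Query time: 15 msec'
ms=$(printf '%s\n' "$sample" | query_time_ms)

# Alert threshold (hypothetical): anything over 500 ms counts as unhealthy.
if [ "$ms" -le 500 ]; then
    echo "healthy ($ms ms)"
else
    echo "SLOW ($ms ms)"    # page on this
fi
```

A missing answer (timeout) yields no `Query time:` line at all, so the probe should also treat an empty `$ms` as a failure when wired into alerting.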
Damage Report¶
- Total downtime: 0 (services worked, just slowly)
- Blast radius: All external API integrations delayed by 5 seconds; application SLA breached for ~10 hours
- Optimal resolution time: 5 minutes (identify dead forwarder -> remove from config -> reload)
- If every wrong choice was made: 2+ hours with DNS config changes and security policy debates
Cross-References¶
- Primer: DNS Ops
- Primer: DNS Deep Dive
- Primer: Networking
- Footguns: DNS Ops