Incident Replay: DNS Resolution Slow

Setup

  • System context: Production web application experiencing 5-second delays on every external API call. Internal services are fast but any DNS lookup for external hostnames takes 5+ seconds.
  • Time: Wednesday 11:15 UTC
  • Your role: On-call SRE

Round 1: Alert Fires

[Pressure cue: "Application response time jumped from 200ms to 5200ms. All external API integrations are slow. Internal services are fine. Users are complaining."]

What you see: dig @10.0.0.53 api.partner.com takes 5.1 seconds to respond. dig @8.8.8.8 api.partner.com responds in 20ms from the same server. The internal DNS resolver (10.0.0.53) is slow for all external lookups.

Choose your action:

  • A) Switch all servers to use 8.8.8.8 directly
  • B) Check the internal DNS resolver's health and upstream configuration
  • C) Flush the DNS cache on the resolver
  • D) Check if the resolver is overloaded

[Result: The internal resolver (BIND) forwards external queries to two upstream resolvers: 10.0.0.2 (primary) and 10.0.0.3 (secondary). dig @10.0.0.2 api.partner.com times out completely. dig @10.0.0.3 api.partner.com responds in 15ms. The primary upstream is dead, and BIND waits 5 seconds before failing over to the secondary. Proceed to Round 2.]

If you chose A:

[Result: Bypasses the issue but now all servers directly query Google DNS, bypassing internal DNS policies, split-horizon, and caching. Security and compliance concerns.]

If you chose C:

[Result: Cache flush does not help — the issue is the upstream forwarder timeout, not stale cache entries.]

If you chose D:

[Result: Resolver CPU and memory are normal. It is not overloaded — it is waiting on a dead upstream.]
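The failover penalty described above can be sketched with a toy model (illustrative only; real BIND also tracks forwarder round-trip times rather than purely walking the list):

```python
def lookup_latency_ms(forwarder_rtts, timeout_ms=5000):
    """Total lookup latency when forwarders are tried in order.
    Each entry is a forwarder's response time in ms (None = dead)."""
    elapsed = 0
    for rtt in forwarder_rtts:
        if rtt is not None and rtt < timeout_ms:
            return elapsed + rtt          # this forwarder answered
        elapsed += timeout_ms             # timed out; fail over to the next
    raise RuntimeError("all forwarders timed out")

print(lookup_latency_ms([15, 15]))    # healthy primary: 15
print(lookup_latency_ms([None, 15]))  # dead primary: 5015 -- the incident
print(lookup_latency_ms([15]))        # dead primary removed: 15 again
```

The model makes the incident's signature obvious: nothing is down, but every external lookup pays the full timeout before the working secondary is even asked.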

Round 2: First Triage Data

[Pressure cue: "Every external DNS lookup adds 5 seconds. Application SLA is breached."]

What you see: The primary upstream forwarder, 10.0.0.2, is a DNS appliance that crashed overnight. Under BIND's forward first policy, each external query goes to the primary, waits out the 5-second timeout, then fails over to the secondary. Every external lookup pays this 5-second penalty.

Choose your action:

  • A) Remove the dead upstream from BIND's forwarder list temporarily
  • B) Reduce BIND's forwarder timeout from 5 seconds to 1 second
  • C) Fix the crashed DNS appliance at 10.0.0.2
  • D) Reorder the forwarders to put the working one first

[Result: Comment out 10.0.0.2 from the forwarders list and reload BIND: rndc reload. External DNS lookups now go directly to 10.0.0.3 and resolve in <20ms. Application response time returns to normal. Proceed to Round 3.]
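The temporary fix in named.conf might look like the following; the surrounding options block and comments are illustrative, not the actual production config:

```
options {
    forward first;
    forwarders {
        // 10.0.0.2;   // dead primary appliance, removed during the incident
        10.0.0.3;      // healthy secondary handles all external lookups
    };
};
```

After editing, rndc reload applies the change without restarting named.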

If you chose B:

[Result: Reduces the penalty from 5 seconds to 1 second but every external lookup still has an unnecessary 1-second delay.]

If you chose C:

[Result: Fixing the appliance is the right long-term action but it could take hours. The application cannot wait.]

If you chose D:

[Result: BIND tracks forwarder round-trip times and does not strictly honor list order, so reordering alone may not have the desired effect depending on the BIND version and configuration. The dead forwarder is still in the rotation.]

Round 3: Root Cause Identification

[Pressure cue: "DNS fast again. Fix the DNS appliance and prevent recurrence."]

What you see: Root cause: Primary DNS forwarder crashed overnight. No health monitoring on the DNS forwarder. BIND's default 5-second timeout added latency to every external lookup. No automatic failover from primary to secondary.

Choose your action:

  • A) Add DNS forwarder health monitoring and alerting
  • B) Configure BIND to query both forwarders concurrently
  • C) Add a third forwarder for redundancy
  • D) All of the above, plus fix the crashed appliance

[Result: Appliance restarted and root-caused (memory leak). Health monitoring added. Third forwarder deployed. BIND reconfigured for better failover. Proceed to Round 4.]

If you chose A:

[Result: Detection improves but response is still manual.]

If you chose B:

[Result: Concurrent queries avoid the timeout penalty but double the query load to upstream.]

If you chose C:

[Result: More redundancy helps but you still need monitoring.]
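A forwarder health check of the kind chosen in this round could be sketched with only the Python standard library (the hostname, IPs, and timeout here are assumptions; production monitoring would more likely wrap dig or use an existing agent):

```python
import socket
import struct

def build_dns_query(name, txid=0x1234):
    """Build a minimal DNS A query packet for `name` (RFC 1035 wire format)."""
    # Header: id, flags (RD=1), 1 question, 0 answer/authority/additional.
    header = struct.pack(">HHHHHH", txid, 0x0100, 1, 0, 0, 0)
    qname = b"".join(
        bytes([len(label)]) + label.encode() for label in name.split(".")
    ) + b"\x00"
    question = qname + struct.pack(">HH", 1, 1)  # QTYPE=A, QCLASS=IN
    return header + question

def forwarder_is_healthy(ip, name="api.partner.com", timeout=1.0):
    """Send one query over UDP; healthy means any answer within `timeout`."""
    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    sock.settimeout(timeout)
    try:
        sock.sendto(build_dns_query(name), (ip, 53))
        sock.recvfrom(512)
        return True
    except OSError:  # timeout or network error -> unhealthy, fire an alert
        return False
    finally:
        sock.close()

# During this incident, forwarder_is_healthy("10.0.0.2") would have
# returned False, paging someone long before users noticed the slowdown.
print(len(build_dns_query("api.partner.com")))  # 33-byte query packet
```

Run per forwarder on a short interval; a single timeout on an otherwise-healthy network is exactly the signal that was missing overnight.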

Round 4: Remediation

[Pressure cue: "DNS healthy. Application response times nominal. Close."]

Actions:

  1. Verify DNS response times: dig api.partner.com shows <50ms
  2. Verify all forwarders are healthy
  3. Re-add the fixed primary forwarder to BIND's config
  4. Add DNS response time monitoring and alerting
  5. Add forwarder health checks to monitoring

Damage Report

  • Total downtime: 0 (services worked, just slowly)
  • Blast radius: All external API integrations delayed by 5 seconds; application SLA breached for ~10 hours
  • Optimal resolution time: 5 minutes (identify dead forwarder -> remove from config -> reload)
  • If every wrong choice was made: 2+ hours with DNS config changes and security policy debates

Cross-References