Incident Replay: DNS Split-Horizon Confusion
Setup
- System context: Corporate network with split-horizon DNS. The internal hostname api.company.com resolves to a private IP (10.1.1.50) from inside the network and to a public IP (203.0.113.10) from outside. An application on the internal network is hitting the public IP instead of the private one.
- Time: Friday 14:00 UTC
- Your role: Network engineer / on-call SRE
Round 1: Alert Fires
[Pressure cue: "Application team reports intermittent timeouts connecting to api.company.com from internal servers. Works from laptops on the office WiFi."]
What you see:
dig api.company.com from the server returns the public IP (203.0.113.10) instead of the internal IP (10.1.1.50). Traffic goes out to the internet, through the firewall, and NAT hairpins back in — adding 50ms latency and sometimes failing.
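A quick way to confirm the split-horizon mismatch is to compare what the system resolver returns with what the internal DNS server returns directly. A minimal sketch using the addresses from this scenario:

```bash
# What the server's configured resolver returns
dig +short api.company.com               # 203.0.113.10 (public view -- wrong from inside)

# What the internal DNS server returns for the same name
dig +short @10.0.0.53 api.company.com    # 10.1.1.50 (internal view -- expected)

# Public resolver, for comparison
dig +short @8.8.8.8 api.company.com      # 203.0.113.10
```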
Choose your action:
- A) Add an /etc/hosts entry for the internal IP
- B) Check the DNS resolver configuration on the server
- C) Check if the firewall is causing the timeouts
- D) Contact the application team about their DNS settings
If you chose A:
[Result: Works for this server but not scalable to 200+ servers. Band-aid.]
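For reference, the band-aid in option A is a one-line hosts override per server, which is why it does not scale (a sketch):

```bash
# Static override -- bypasses DNS entirely for this one name on this one host
echo "10.1.1.50 api.company.com" | sudo tee -a /etc/hosts

# Confirm the override is picked up via the normal NSS lookup path
getent hosts api.company.com             # 10.1.1.50 api.company.com
```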
If you chose B (recommended):
[Result: Server's /etc/resolv.conf points to 8.8.8.8 instead of the internal DNS server 10.0.0.53. The server was provisioned with public DNS. Internal split-horizon view is never queried. Proceed to Round 2.]
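Checking the resolver configuration might look like this (a sketch; on distros running systemd-resolved the effective upstreams sit behind a stub resolver, so resolvectl gives the more reliable view):

```bash
cat /etc/resolv.conf
# nameserver 8.8.8.8      <- public resolver; the internal zone view is never consulted

# On systemd-resolved hosts, check the effective per-link DNS servers instead
resolvectl status | grep -i 'dns server'
```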
If you chose C:
[Result: Firewall NAT hairpin works but adds latency. The issue is resolving the wrong IP, not the firewall path.]
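If you want to quantify the hairpin penalty, an optional illustrative check (assuming the API serves HTTPS on port 443) is to pin the hostname to each IP with curl's --resolve flag and compare connect times:

```bash
# Force the public/hairpin path
curl --resolve api.company.com:443:203.0.113.10 -o /dev/null -s \
     -w 'hairpin connect:  %{time_connect}s\n' https://api.company.com/

# Force the direct internal path
curl --resolve api.company.com:443:10.1.1.50 -o /dev/null -s \
     -w 'internal connect: %{time_connect}s\n' https://api.company.com/
```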
If you chose D:
[Result: Application uses system DNS. The problem is the system resolver, not the app.]
Round 2: Apply the Fix
[Pressure cue: "Problem scoped. Apply the fix."]
What you see: Round 1 identified the cause: the server's resolver points at 8.8.8.8, so the internal split-horizon view is never consulted. You need to repoint the server at the internal DNS server 10.0.0.53, verify the result, and consider how the same misconfiguration reached the other servers provisioned the same way.
Choose your action:
- A) Apply the quick targeted fix
- B) Apply the comprehensive fix with verification
- C) Apply a workaround while planning the proper fix
- D) Escalate to a specialist team
If you chose A:
[Result: Repointing this one server's resolver resolves the immediate issue, but the change may not survive re-provisioning and does nothing for the other servers built from the same template. Proceed cautiously.]
If you chose B (recommended):
[Result: The resolver configuration is corrected to use the internal DNS server 10.0.0.53, the provisioning source that set 8.8.8.8 is fixed, and the change is verified: dig now returns 10.1.1.50. Issue resolved. Proceed to Round 3.]
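On a single host, the fix can be as simple as repointing the resolver and re-checking the answer. A sketch that assumes /etc/resolv.conf is not managed by systemd-resolved, NetworkManager, or DHCP; on managed hosts, and for the wider fleet, the same change belongs in the provisioning template or configuration management instead:

```bash
# Point the resolver at the internal DNS server
sudo sed -i 's/^nameserver 8\.8\.8\.8/nameserver 10.0.0.53/' /etc/resolv.conf

# Verify the internal (split-horizon) answer is now returned
dig +short api.company.com               # expect 10.1.1.50
grep ^nameserver /etc/resolv.conf        # expect 10.0.0.53
```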
If you chose C:
[Result: Workaround buys time but the root cause remains. Acceptable short-term.]
If you chose D:
[Result: Specialist is unavailable or adds delay. Try the fix yourself first.]
Round 3: Root Cause and Prevention
[Pressure cue: "Fix applied. Document root cause and prevention."]
What you see: Root cause is confirmed: the server was provisioned with the public resolver 8.8.8.8, so the internal split-horizon view was never queried. The gap is in the provisioning process, which does not enforce the internal DNS servers.
Choose your action:
- A) Fix the specific instance only
- B) Fix the instance and add monitoring
- C) Fix the instance, add monitoring, and update procedures
- D) Comprehensive: fix + monitor + procedure + automation
If you chose D (recommended):
[Result: All layers addressed. Immediate fix, detection, process, and automation. Proceed to Round 4.]
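A small automated check can cover both the monitoring and automation layers: alert whenever the internal name resolves to something outside the private range from an internal host. A minimal sketch; the script name, expected prefix, and exit codes are illustrative, not part of the original incident:

```bash
#!/usr/bin/env bash
# check_split_horizon.sh -- hypothetical cron/monitoring check
set -euo pipefail

NAME="api.company.com"     # internal name that should resolve to the private view
EXPECTED_PREFIX="10."      # expected internal address space

resolved=$(dig +short "$NAME" | head -n1)

if [[ "$resolved" != ${EXPECTED_PREFIX}* ]]; then
  # Nagios-style CRITICAL: this host is seeing the public view from inside
  echo "CRITICAL: $NAME resolved to ${resolved:-nothing} (expected ${EXPECTED_PREFIX}x.x.x)"
  exit 2
fi

echo "OK: $NAME resolves to $resolved"
```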
If you chose A:
[Result: Fixes this case but the same mistake can recur.]
If you chose B:
[Result: Better detection next time but does not prevent recurrence.]
If you chose C:
[Result: Good coverage but automation reduces human error further.]
Round 4: Verify and Close
[Pressure cue: "Service restored. Verify and close."]
Actions:
1. Verify the service is functioning correctly (see the verification sketch below)
2. Verify monitoring reflects the fix and would catch a regression
3. Update runbooks and procedures
4. Schedule follow-up actions (automation, infrastructure changes)
5. Close the incident with a post-mortem
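A verification pass for steps 1 and 2 might look like this (a sketch using the values from this scenario; check_split_horizon.sh is the hypothetical monitoring check sketched in Round 3):

```bash
# 1. Service functioning: internal answer and LAN-level connect time
dig +short api.company.com               # 10.1.1.50
curl -o /dev/null -s -w 'connect: %{time_connect}s\n' https://api.company.com/
# connect time should be LAN latency, not the ~50ms hairpin path

# 2. Monitoring would catch a regression: run the Round 3 check once by hand
./check_split_horizon.sh                 # OK: api.company.com resolves to 10.1.1.50
```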
Damage Report
- Total downtime: Varies based on path taken
- Blast radius: Affected service and dependent systems
- Optimal resolution time: 10 minutes
- If every wrong choice was made: 90 minutes + additional damage
Cross-References
- Primer: Networking
- Primer: Networking Troubleshooting
- Footguns: Networking