Incident Replay: CoreDNS Timeout — Pod DNS Resolution Failing¶
Setup¶
- System context: Production Kubernetes cluster with 200+ pods. CoreDNS serves internal DNS. Multiple applications report intermittent DNS resolution failures.
- Time: Thursday 10:15 UTC
- Your role: Platform engineer / on-call SRE
Round 1: Alert Fires¶
[Pressure cue: "Multiple teams reporting 'connection refused' and 'name resolution failed' errors. Intermittent — some lookups work, some do not. Escalation in 5 minutes."]
What you see:
kubectl exec into a pod and nslookup kubernetes.default sometimes works, sometimes times out. CoreDNS pods show Ready 1/1 and Running. No obvious crashes.
Choose your action:
- A) Restart the CoreDNS deployment
- B) Check CoreDNS pod logs for errors
- C) Scale up CoreDNS replicas to handle more load
- D) Check if the issue is specific to certain nodes
If you chose B (recommended):¶
[Result: CoreDNS logs show repeated i/o timeout errors when forwarding external queries to the upstream resolvers, while internal lookups are answered by the kubernetes plugin. The upstream path is suspect. Proceed to Round 2.]
If you chose A:¶
[Result: CoreDNS restarts. DNS works for 30 seconds, then the same timeouts return. Not a pod-level issue.]
If you chose C:¶
[Result: More replicas hit the same upstream DNS issue. Scaling does not help when the backend is the bottleneck.]
If you chose D:¶
[Result: The issue affects all nodes equally — it is cluster-wide, not node-specific.]
Round 2: First Triage Data¶
[Pressure cue: "DNS is intermittent cluster-wide. Applications are failing SLA. This is a Sev-1."]
What you see:
CoreDNS is configured to forward external DNS queries to upstream resolvers at 10.0.0.2 and 10.0.0.3. kubectl exec into a CoreDNS pod and dig @10.0.0.2 times out. The upstream DNS server is unreachable from within the cluster.
Choose your action:
- A) Change CoreDNS upstream to a public DNS (8.8.8.8) temporarily
- B) Check NetworkPolicy or firewall rules blocking DNS egress from kube-system
- C) Check if the upstream DNS servers are actually down
- D) Check the CoreDNS Corefile for misconfigurations
If you chose B (recommended):¶
[Result: A NetworkPolicy was applied yesterday to the kube-system namespace that restricts egress. It allows port 53 to ClusterIP but blocks egress to the upstream DNS IPs (10.0.0.2/3). The policy was intended to restrict other pods but caught CoreDNS. Proceed to Round 3.]
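A minimal sketch of the kind of policy that produces this failure mode; the policy name and the cluster service CIDR are assumptions, not taken from the incident:

```yaml
# Hypothetical reconstruction of the restrictive policy.
# Egress is allowed only to the in-cluster service CIDR on port 53;
# the upstream resolvers (10.0.0.2/3) fall outside that range, so
# CoreDNS's forwarded queries are silently dropped.
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: restrict-egress          # hypothetical name
  namespace: kube-system
spec:
  podSelector: {}                # matches every pod in kube-system, including CoreDNS
  policyTypes:
  - Egress
  egress:
  - to:
    - ipBlock:
        cidr: 10.96.0.0/12      # assumed cluster service CIDR
    ports:
    - port: 53
      protocol: UDP
    - port: 53
      protocol: TCP
```

Note the empty `podSelector`: the policy author intended to restrict other workloads, but an empty selector applies to every pod in the namespace.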
If you chose A:¶
[Result: Not a fix: the same egress policy blocks 8.8.8.8 just as it blocks 10.0.0.2/3, and internal names (kubernetes.default, *.svc.cluster.local) never needed the upstream anyway. The real issue is the NetworkPolicy blocking DNS egress from kube-system.]
If you chose C:¶
[Result: Upstream DNS servers are healthy — reachable from nodes outside the cluster. The block is from the pod network, not the host network.]
If you chose D:¶
[Result: Corefile is standard — forward to upstream, with caching. The config is correct; the network path is blocked.]
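For reference, a Corefile of the shape described above: cluster-internal names served by the kubernetes plugin, everything else forwarded upstream, with caching. The plugin details beyond "forward to upstream, with caching" are assumptions based on a typical default Corefile:

```
# Hedged sketch of a standard Corefile matching the description.
.:53 {
    errors
    health
    kubernetes cluster.local in-addr.arpa ip6.arpa {
        pods insecure
        fallthrough in-addr.arpa ip6.arpa
    }
    forward . 10.0.0.2 10.0.0.3
    cache 30
    loop
    reload
}
```

Nothing here is wrong, which is the point of this branch: the config is correct and the failure is in the network path.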
Round 3: Root Cause Identification¶
[Pressure cue: "NetworkPolicy identified. Fix it now."]
What you see:
Root cause: A security team member applied a default-deny egress NetworkPolicy to kube-system to prevent lateral movement. The policy allowed DNS on port 53 to cluster IPs but omitted a rule permitting egress to the upstream DNS server IPs. CoreDNS could still resolve cluster-internal names (served from the kubernetes plugin's API watch and cache), but external lookups, which require forwarding upstream, timed out.
Choose your action:
- A) Delete the NetworkPolicy entirely
- B) Add an egress rule allowing CoreDNS pods to reach upstream DNS IPs on port 53
- C) Move CoreDNS to a namespace without the restrictive policy
- D) Remove the upstream forwarder from CoreDNS and use node-level resolv.conf
If you chose B (recommended):¶
[Result: You add an egress rule allowing CoreDNS to reach 10.0.0.2/32 and 10.0.0.3/32 on port 53 over both UDP and TCP. DNS resolution recovers immediately. Proceed to Round 4.]
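Expanded into manifest form, the added rule looks roughly like this; it is a fragment to append to the existing policy's egress list, not a standalone object:

```yaml
# Additional egress rule: let CoreDNS reach the upstream resolvers
# on port 53 over both UDP and TCP. Both are needed because DNS
# falls back to TCP for truncated responses.
- to:
  - ipBlock:
      cidr: 10.0.0.2/32
  - ipBlock:
      cidr: 10.0.0.3/32
  ports:
  - port: 53
    protocol: UDP
  - port: 53
    protocol: TCP
```

Using /32 blocks keeps the security team's default-deny intent intact: only the two known resolver IPs are opened, and only on the DNS port.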
If you chose A:¶
[Result: Fixes DNS but removes the security controls entirely. The security team will be unhappy.]
If you chose C:¶
[Result: Moving CoreDNS out of kube-system is non-standard and breaks many assumptions in Kubernetes.]
If you chose D:¶
[Result: Pointing workloads at the node's resolv.conf sidesteps the immediate blockage but loses CoreDNS's caching and cluster-internal service discovery, and it papers over the policy bug. Upstream forwarding belongs in CoreDNS; fix the policy instead.]
Round 4: Remediation¶
[Pressure cue: "DNS is back. Harden the process."]
Actions:
1. Verify DNS resolution: kubectl exec <pod> -- nslookup google.com and nslookup kubernetes.default
2. Verify the NetworkPolicy allows CoreDNS egress to upstream DNS
3. Add CoreDNS egress requirements to the NetworkPolicy documentation
4. Add a DNS health check to the cluster smoke test suite
5. Require kube-system NetworkPolicy changes to be reviewed by the platform team
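One way to implement step 4 is a Job that the smoke-test suite applies and waits on; it fails if either internal resolution or the upstream egress path breaks. The Job name and image are assumptions for illustration:

```yaml
# DNS smoke test: exercises both the kubernetes plugin (internal
# name) and the upstream forwarding path (external name). With
# backoffLimit 0, a single failure marks the Job as failed.
apiVersion: batch/v1
kind: Job
metadata:
  name: dns-smoke-test           # hypothetical name
spec:
  backoffLimit: 0
  template:
    spec:
      restartPolicy: Never
      containers:
      - name: dns-check
        image: busybox:1.36      # assumed image with nslookup
        command:
        - sh
        - -c
        - nslookup kubernetes.default.svc.cluster.local && nslookup google.com
```

The suite can run `kubectl wait --for=condition=complete job/dns-smoke-test --timeout=60s` and treat a timeout as a failed check; testing both names is what would have caught this incident, since internal lookups kept working while external ones failed.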
Damage Report¶
- Total hard downtime: none (DNS was intermittent, so services stayed partially functional)
- Blast radius: All 200+ pods affected by intermittent DNS failures for ~12 hours
- Optimal resolution time: 10 minutes (check logs -> identify NetworkPolicy -> add egress rule)
- If every wrong choice was made: 60+ minutes including DNS config changes that break internal resolution
Cross-References¶
- Primer: Kubernetes Ops
- Primer: DNS Ops
- Primer: Kubernetes Networking
- Footguns: Kubernetes Ops