Incident Replay: CoreDNS Timeout — Pod DNS Resolution Failing¶
Setup¶
- System context: Production Kubernetes cluster with 200+ pods. CoreDNS serves internal DNS. Multiple applications report intermittent DNS resolution failures.
- Time: Thursday 10:15 UTC
- Your role: Platform engineer / on-call SRE
Round 1: Alert Fires¶
[Pressure cue: "Multiple teams reporting 'connection refused' and 'name resolution failed' errors. Intermittent — some lookups work, some do not. Escalation in 5 minutes."]
What you see:
kubectl exec into a pod and nslookup kubernetes.default sometimes works, sometimes times out. CoreDNS pods show Ready 1/1 and Running. No obvious crashes.
Choose your action:
- A) Restart the CoreDNS deployment
- B) Check CoreDNS pod logs for errors
- C) Scale up CoreDNS replicas to handle more load
- D) Check if the issue is specific to certain nodes
If you chose B (recommended):¶
[Result: CoreDNS logs show repeated i/o timeout errors when forwarding external queries to the upstream resolvers, while internal lookups are answered by the kubernetes plugin. The upstream path is suspect. Proceed to Round 2.]
If you chose A:¶
[Result: CoreDNS restarts. DNS works for 30 seconds, then the same timeouts return. Not a pod-level issue.]
If you chose C:¶
[Result: More replicas hit the same upstream DNS issue. Scaling does not help when the backend is the bottleneck.]
If you chose D:¶
[Result: The issue affects all nodes equally — it is cluster-wide, not node-specific.]
Round 2: First Triage Data¶
[Pressure cue: "DNS is intermittent cluster-wide. Applications are failing SLA. This is a Sev-1."]
What you see:
CoreDNS is configured to forward external DNS queries to upstream resolvers at 10.0.0.2 and 10.0.0.3. kubectl exec into a CoreDNS pod and dig @10.0.0.2 times out. The upstream DNS server is unreachable from within the cluster.
Choose your action:
- A) Change CoreDNS upstream to a public DNS (8.8.8.8) temporarily
- B) Check NetworkPolicy or firewall rules blocking DNS egress from kube-system
- C) Check if the upstream DNS servers are actually down
- D) Check the CoreDNS Corefile for misconfigurations
If you chose B (recommended):¶
[Result: A NetworkPolicy was applied yesterday to the kube-system namespace that restricts egress. It allows port 53 to ClusterIP but blocks egress to the upstream DNS IPs (10.0.0.2/3). The policy was intended to restrict other pods but caught CoreDNS. Proceed to Round 3.]
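A minimal sketch of the kind of policy that produces this failure mode; the policy name and the cluster service CIDR are assumptions, not taken from the incident:

```yaml
# Hypothetical reconstruction of the restrictive policy.
# Egress is allowed only to the in-cluster service CIDR on port 53;
# the upstream resolvers (10.0.0.2/3) fall outside that range, so
# CoreDNS's forwarded queries are silently dropped.
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: restrict-egress          # hypothetical name
  namespace: kube-system
spec:
  podSelector: {}                # matches every pod in kube-system, including CoreDNS
  policyTypes:
  - Egress
  egress:
  - to:
    - ipBlock:
        cidr: 10.96.0.0/12      # assumed cluster service CIDR
    ports:
    - port: 53
      protocol: UDP
    - port: 53
      protocol: TCP
```

Note the empty `podSelector`: the policy author intended to restrict other workloads, but an empty selector applies to every pod in the namespace.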
If you chose A:¶
[Result: Not a fix: the same egress policy blocks 8.8.8.8 just as it blocks 10.0.0.2/3, and internal names (kubernetes.default, *.svc.cluster.local) never needed the upstream anyway. The real issue is the NetworkPolicy blocking DNS egress from kube-system.]
If you chose C:¶
[Result: Upstream DNS servers are healthy — reachable from nodes outside the cluster. The block is from the pod network, not the host network.]
If you chose D:¶
[Result: Corefile is standard — forward to upstream, with caching. The config is correct; the network path is blocked.]
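For reference, a Corefile of the shape described above: cluster-internal names served by the kubernetes plugin, everything else forwarded upstream, with caching. The plugin details beyond "forward to upstream, with caching" are assumptions based on a typical default Corefile:

```
# Hedged sketch of a standard Corefile matching the description.
.:53 {
    errors
    health
    kubernetes cluster.local in-addr.arpa ip6.arpa {
        pods insecure
        fallthrough in-addr.arpa ip6.arpa
    }
    forward . 10.0.0.2 10.0.0.3
    cache 30
    loop
    reload
}
```

Nothing here is wrong, which is the point of this branch: the config is correct and the failure is in the network path.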
Round 3: Root Cause Identification¶
[Pressure cue: "NetworkPolicy identified. Fix it now."]
What you see:
Root cause: A security team member applied a default-deny egress NetworkPolicy to kube-system to prevent lateral movement. The policy allowed DNS on port 53 to cluster IPs but omitted a rule permitting egress to the upstream DNS server IPs. CoreDNS could still resolve cluster-internal names (served from the kubernetes plugin's API watch and cache), but external lookups, which require forwarding upstream, timed out.
Choose your action:
- A) Delete the NetworkPolicy entirely
- B) Add an egress rule allowing CoreDNS pods to reach upstream DNS IPs on port 53
- C) Move CoreDNS to a namespace without the restrictive policy
- D) Remove the upstream forwarder from CoreDNS and use node-level resolv.conf
If you chose B (recommended):¶
[Result: You add an egress rule allowing CoreDNS to reach 10.0.0.2/32 and 10.0.0.3/32 on port 53 over both UDP and TCP. DNS resolution recovers immediately. Proceed to Round 4.]
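Expanded into manifest form, the added rule looks roughly like this; it is a fragment to append to the existing policy's egress list, not a standalone object:

```yaml
# Additional egress rule: let CoreDNS reach the upstream resolvers
# on port 53 over both UDP and TCP. Both are needed because DNS
# falls back to TCP for truncated responses.
- to:
  - ipBlock:
      cidr: 10.0.0.2/32
  - ipBlock:
      cidr: 10.0.0.3/32
  ports:
  - port: 53
    protocol: UDP
  - port: 53
    protocol: TCP
```

Using /32 blocks keeps the security team's default-deny intent intact: only the two known resolver IPs are opened, and only on the DNS port.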
If you chose A:¶
[Result: Fixes DNS but removes the security controls entirely. The security team will be unhappy.]
If you chose C:¶
[Result: Moving CoreDNS out of kube-system is non-standard and breaks many assumptions in Kubernetes.]
If you chose D:¶
[Result: Pointing workloads at the node's resolv.conf sidesteps the immediate blockage but loses CoreDNS's caching and cluster-internal service discovery, and it papers over the policy bug. Upstream forwarding belongs in CoreDNS; fix the policy instead.]
Round 4: Remediation¶
[Pressure cue: "DNS is back. Harden the process."]
Actions:
1. Verify DNS resolution: kubectl exec <pod> -- nslookup google.com and nslookup kubernetes.default
2. Verify the NetworkPolicy allows CoreDNS egress to upstream DNS
3. Add CoreDNS egress requirements to the NetworkPolicy documentation
4. Add a DNS health check to the cluster smoke test suite
5. Require kube-system NetworkPolicy changes to be reviewed by the platform team
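One way to implement step 4 is a Job that the smoke-test suite applies and waits on; it fails if either internal resolution or the upstream egress path breaks. The Job name and image are assumptions for illustration:

```yaml
# DNS smoke test: exercises both the kubernetes plugin (internal
# name) and the upstream forwarding path (external name). With
# backoffLimit 0, a single failure marks the Job as failed.
apiVersion: batch/v1
kind: Job
metadata:
  name: dns-smoke-test           # hypothetical name
spec:
  backoffLimit: 0
  template:
    spec:
      restartPolicy: Never
      containers:
      - name: dns-check
        image: busybox:1.36      # assumed image with nslookup
        command:
        - sh
        - -c
        - nslookup kubernetes.default.svc.cluster.local && nslookup google.com
```

The suite can run `kubectl wait --for=condition=complete job/dns-smoke-test --timeout=60s` and treat a timeout as a failed check; testing both names is what would have caught this incident, since internal lookups kept working while external ones failed.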
Damage Report¶
- Total hard downtime: none (DNS was intermittent, so services stayed partially functional)
- Blast radius: All 200+ pods affected by intermittent DNS failures for ~12 hours
- Optimal resolution time: 10 minutes (check logs -> identify NetworkPolicy -> add egress rule)
- If every wrong choice was made: 60+ minutes including DNS config changes that break internal resolution
Cross-References¶
- Primer: Kubernetes Ops
- Primer: DNS Ops
- Primer: Kubernetes Networking
- Footguns: Kubernetes Ops