Incident Replay: CNI Broken After Node Restart¶
Setup¶
- System context: a 5-node Kubernetes cluster running the Calico CNI. After a rolling OS update, one node was rebooted; pods on that node can no longer reach pods on other nodes.
- Time: Wednesday 06:30 UTC
- Your role: Platform engineer / on-call SRE
Round 1: Alert Fires¶
[Pressure cue: "PagerDuty fires — multiple pods on node k8s-worker-03 failing health checks. Cross-node service calls timing out. 5 minutes to auto-escalation."]
What you see:
Pods on k8s-worker-03 can reach each other (same node) but cannot reach pods on other nodes. `kubectl get pods -o wide` shows the pods as Running, but services backed by them return connection timeouts.
Choose your action:
- A) Drain the node and reschedule pods elsewhere
- B) Check the Calico pod status on the affected node
- C) Restart kubelet on the affected node
- D) Check whether the node's network interface is up
If you chose B (recommended):¶
[Result: `kubectl get pods -n kube-system -o wide | grep calico` shows the calico-node pod on k8s-worker-03 in CrashLoopBackOff. The CNI plugin is not running. Proceed to Round 2.]
If you chose A:¶
[Result: Drain works but moves the problem — new pods on other nodes are fine. You still need to fix k8s-worker-03 to restore cluster capacity.]
If you chose C:¶
[Result: Kubelet restarts but the CNI plugin is still crashed. Pods continue to get IPs but cannot route cross-node. No improvement.]
If you chose D:¶
[Result: Node network is fine — SSH works, host-level connectivity is normal. The issue is at the CNI overlay level.]
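Option B's check, written out as a command sequence (run against the live cluster; the `calico-node-xxxxx` suffix is the scenario's placeholder):

```shell
# Is the CNI agent healthy on the affected node?
kubectl get pods -n kube-system -o wide \
  --field-selector spec.nodeName=k8s-worker-03 | grep calico

# Why is it crashing? (pod name suffix is a placeholder)
kubectl logs -n kube-system calico-node-xxxxx --previous
kubectl describe pod -n kube-system calico-node-xxxxx
```

The `--field-selector spec.nodeName=...` filter narrows the listing to the suspect node, which matters when the DaemonSet runs a calico-node pod on all five nodes.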
Round 2: First Triage Data¶
[Pressure cue: "Services degraded due to pod scheduling concentration on 4 nodes. Need k8s-worker-03 back."]
What you see:
`kubectl logs -n kube-system calico-node-xxxxx` shows: "Failed to initialize BIRD: could not load existing routes — permission denied on /var/run/calico." The OS update changed the permissions on `/var/run/calico` during the reboot.
Choose your action:
- A) Fix the permissions: `chmod 755 /var/run/calico` and restart the calico-node pod
- B) Delete the calico-node pod and let the DaemonSet recreate it
- C) Reinstall Calico across the cluster
- D) Check whether `/var/run/calico` is on a tmpfs and was recreated with the wrong permissions
If you chose D (recommended):¶
[Result: `/var/run` is a tmpfs that is recreated on every boot. The systemd tmpfiles.d config for Calico was not preserved during the OS update, so the directory comes back with root-only permissions. The calico-node pod runs as a specific UID that needs access. Proceed to Round 3.]
If you chose A:¶
[Result: Fixes the symptom temporarily. The calico-node pod restarts and networking recovers. But the permissions will revert on the next reboot. Not a durable fix.]
If you chose B:¶
[Result: New pod hits the same permission error. DaemonSet recreation does not fix host-level permission issues.]
If you chose C:¶
[Result: Massive overkill for a single-node permission issue. Risk of cluster-wide disruption.]
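The mechanics behind options A and D come down to a directory mode. A minimal stand-alone sketch, using a temp directory as a stand-in for `/var/run/calico` (the 700/755 modes are assumptions about this cluster's setup):

```shell
# Stand-in for /var/run/calico: a fresh directory with root-only
# permissions, as tmpfiles.d leaves it when the custom config is missing.
dir=$(mktemp -d)
chmod 700 "$dir"
echo "before: $(stat -c '%a' "$dir")"

# Round 2 option A's stopgap: widen the mode so the CNI pod's UID can enter.
chmod 755 "$dir"
echo "after: $(stat -c '%a' "$dir")"

rm -rf "$dir"
```

This is why option B fails: the DaemonSet recreates the pod, but nothing recreates the directory with the right mode, so the new pod hits the same wall.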
Round 3: Root Cause Identification¶
[Pressure cue: "Identify the permanent fix."]
What you see:
Root cause: The OS update replaced `/etc/tmpfiles.d/calico.conf` (which set the correct permissions for `/var/run/calico`). The package manager's config merge chose the new default over the customized file. This only affected nodes that were rebooted after the update.
Choose your action:
- A) Restore the Calico tmpfiles.d config and fix the permissions
- B) Add the tmpfiles.d config to the node provisioning automation (Ansible/Ignition)
- C) Pin the OS package that conflicted with the Calico config
- D) Both A and B: fix now and prevent recurrence
If you chose D (recommended):¶
[Result: Immediate fix: restore `/etc/tmpfiles.d/calico.conf` with the correct permissions and restart calico-node. Prevention: add the config to the Ansible node provisioning so it survives future OS updates. Proceed to Round 4.]
If you chose A:¶
[Result: Fixes this node but the next OS update will overwrite the config again.]
If you chose B:¶
[Result: Correct for prevention but does not fix the current node.]
If you chose C:¶
[Result: Pinning the package blocks its future updates, which accumulates security exposure over time, and it still does not fix the current node.]
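The durable fix hinges on a single tmpfiles.d line. A sketch of the restored config (the 0755 mode and root ownership are assumptions; match them to what your Calico manifest actually expects):

```
# /etc/tmpfiles.d/calico.conf
# Type  Path              Mode  User  Group  Age
d       /var/run/calico   0755  root  root   -
```

Running `systemd-tmpfiles --create /etc/tmpfiles.d/calico.conf` applies it immediately without a reboot; committing the same file to the provisioning automation (option B) is what makes it survive the next OS update.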
Round 4: Remediation¶
[Pressure cue: "Node recovered. Verify and prevent."]
Actions:
1. Verify the calico-node pod is Running: `kubectl get pods -n kube-system -o wide | grep calico`
2. Verify cross-node pod connectivity: `kubectl exec` into a pod on worker-03 and curl a pod on worker-01
3. Apply the tmpfiles.d fix to all nodes in the cluster (even those not yet rebooted)
4. Add the calico tmpfiles.d config to the Ansible node bootstrap playbook
5. Add a post-reboot CNI health check to the rolling update procedure
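Steps 1-3 above, as a hedged command sketch (the pod name and target IP are placeholders, not values from the incident):

```shell
# 1. CNI agent back to Running on the recovered node?
kubectl get pods -n kube-system -o wide \
  --field-selector spec.nodeName=k8s-worker-03 | grep calico-node

# 2. Cross-node path: curl from a pod on worker-03 to a pod on worker-01.
#    (pod name and target IP are placeholders)
kubectl exec -n default some-pod-on-worker-03 -- \
  curl -s --max-time 5 http://<pod-ip-on-worker-01>:8080/healthz

# 3. On each node, confirm the config exists and the dir has the right mode:
#    test -f /etc/tmpfiles.d/calico.conf && stat -c '%a' /var/run/calico
```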
Damage Report¶
- Total downtime: none (pods were rescheduled; the cluster ran at degraded capacity)
- Blast radius: All pods on k8s-worker-03 lost cross-node connectivity; 1/5 cluster capacity degraded
- Optimal resolution time: 12 minutes (check calico pod -> read logs -> fix tmpfiles -> restart)
- If every wrong choice was made: 60+ minutes including unnecessary calico reinstall and node drain
Cross-References¶
- Primer: Kubernetes Ops
- Primer: Kubernetes Networking
- Footguns: Kubernetes Ops