
Incident Replay: CNI Broken After Node Restart

Setup

  • System context: 5-node Kubernetes cluster using Calico CNI. After a rolling OS update, one node was rebooted and pods on that node cannot communicate with pods on other nodes.
  • Time: Wednesday 06:30 UTC
  • Your role: Platform engineer / on-call SRE

Round 1: Alert Fires

[Pressure cue: "PagerDuty fires — multiple pods on node k8s-worker-03 failing health checks. Cross-node service calls timing out. 5 minutes to auto-escalation."]

What you see: Pods on k8s-worker-03 can reach each other (same-node) but cannot reach pods on other nodes. kubectl get pods -o wide shows pods are Running but services backed by those pods return connection timeouts.

Choose your action:

  • A) Drain the node and reschedule pods elsewhere
  • B) Check the Calico pod status on the affected node
  • C) Restart kubelet on the affected node
  • D) Check if the node's network interface is up

[Result: kubectl get pods -n kube-system -o wide | grep calico shows the calico-node pod on k8s-worker-03 is in CrashLoopBackOff. The CNI plugin is not running. Proceed to Round 2.]
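The triage filter from the result above can be sketched against a captured sample of the command's output (pod names, IPs, and restart counts here are illustrative, not from the real cluster):

```shell
# Illustrative capture of `kubectl get pods -n kube-system -o wide` output
# (pod suffixes and IPs are made up for this sketch)
cat <<'EOF' > /tmp/pods.txt
calico-node-7x2kq   1/1   Running            0   12d   10.0.0.11   k8s-worker-01
calico-node-9fj3m   1/1   Running            0   12d   10.0.0.12   k8s-worker-02
calico-node-b4wzp   0/1   CrashLoopBackOff   7   9m    10.0.0.13   k8s-worker-03
EOF

# The same filter used during triage: isolate the calico pod on the affected node
grep calico /tmp/pods.txt | grep k8s-worker-03
```

The affected node's calico-node pod stands out immediately as the only one not in Running state.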

If you chose A:

[Result: Drain works but moves the problem — new pods on other nodes are fine. You still need to fix k8s-worker-03 to restore cluster capacity.]

If you chose C:

[Result: Kubelet restarts but the CNI plugin is still crashed. Pods continue to get IPs but cannot route cross-node. No improvement.]

If you chose D:

[Result: Node network is fine — SSH works, host-level connectivity is normal. The issue is at the CNI overlay level.]

Round 2: First Triage Data

[Pressure cue: "Services degraded due to pod scheduling concentration on 4 nodes. Need k8s-worker-03 back."]

What you see: kubectl logs -n kube-system calico-node-xxxxx shows: "Failed to initialize BIRD: could not load existing routes — permission denied on /var/run/calico." The OS update changed the permissions on /var/run/calico during the reboot.

Choose your action:

  • A) Fix the permissions: chmod 755 /var/run/calico and restart the calico-node pod
  • B) Delete the calico-node pod and let the DaemonSet recreate it
  • C) Reinstall Calico across the cluster
  • D) Check if /var/run/calico is a tmpfs and was recreated with wrong permissions

[Result: /var/run is a tmpfs that gets recreated on boot. The systemd tmpfiles.d config for calico was not preserved during the OS update — the directory is recreated with root-only permissions. The calico-node pod runs as a specific UID that needs access. Proceed to Round 3.]
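The reboot behavior can be simulated locally in a scratch directory (standing in for the tmpfs, since /var/run itself starts empty on every boot); the modes shown are assumptions for illustration:

```shell
# Simulate /var/run/calico across a reboot using a scratch directory.
# /var/run is a tmpfs, so its contents do not survive a reboot.
scratch=$(mktemp -d)
d="$scratch/calico"

# Before the OS update: the tmpfiles.d entry recreated the directory
# world-traversable on every boot
mkdir -m 755 "$d"
stat -c '%a' "$d"    # prints 755 — the non-root calico UID can enter

# After the update dropped that entry: the directory comes back root-only
rm -r "$d"
mkdir -m 700 "$d"
stat -c '%a' "$d"    # prints 700 — "permission denied" for the calico UID

rm -r "$scratch"
```

This is why chmod alone cannot be durable: the tmpfs is rebuilt from the tmpfiles.d declarations on every boot, so the fix must live in those declarations.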

If you chose A:

[Result: Fixes the symptom temporarily. The calico-node pod restarts and networking recovers. But the permissions will revert on the next reboot. Not a durable fix.]

If you chose B:

[Result: New pod hits the same permission error. DaemonSet recreation does not fix host-level permission issues.]

If you chose C:

[Result: Massive overkill for a single-node permission issue. Risk of cluster-wide disruption.]

Round 3: Root Cause Identification

[Pressure cue: "Identify the permanent fix."]

What you see: Root cause: The OS update replaced /etc/tmpfiles.d/calico.conf (which set correct permissions for /var/run/calico). The package manager's config merge chose the new default over the customized file. This only affected nodes that were rebooted after the update.

Choose your action:

  • A) Restore the calico tmpfiles.d config and fix permissions
  • B) Add the tmpfiles.d config to the node provisioning automation (Ansible/ignition)
  • C) Pin the OS package that conflicted with the calico config
  • D) Both A and B — fix now and prevent recurrence

[Result: Immediate fix: restore /etc/tmpfiles.d/calico.conf with correct permissions and restart calico-node. Prevention: add the config to Ansible node provisioning so it survives future OS updates. Proceed to Round 4.]
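A restored /etc/tmpfiles.d/calico.conf might look like the following sketch; the mode and ownership are assumptions, so verify them against what your Calico install actually requires:

```
# /etc/tmpfiles.d/calico.conf — illustrative; confirm mode/owner for your install
# Type  Path             Mode  User  Group  Age  Argument
d       /var/run/calico  0755  root  root   -    -
```

Running systemd-tmpfiles --create applies the entry immediately, so the directory permissions can be corrected without waiting for the next reboot.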

If you chose A:

[Result: Fixes this node but the next OS update will overwrite the config again.]

If you chose B:

[Result: Correct for prevention but does not fix the current node.]

If you chose C:

[Result: Pinning packages prevents updates, which creates security vulnerabilities over time.]

Round 4: Remediation

[Pressure cue: "Node recovered. Verify and prevent."]

Actions:

  1. Verify the calico-node pod is Running: kubectl get pods -n kube-system -o wide | grep calico
  2. Verify cross-node pod connectivity: exec into a pod on k8s-worker-03 and curl a pod on k8s-worker-01
  3. Apply the tmpfiles.d fix to all nodes in the cluster (including those not yet rebooted)
  4. Add the calico tmpfiles.d config to the Ansible node bootstrap playbook
  5. Add a post-reboot CNI health check to the rolling update procedure
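Step 4 might be sketched as an Ansible task like the one below; the task name, handler name, and file content are assumptions, not your actual playbook, and the tmpfiles.d line should match whatever config your cluster restored in the immediate fix:

```yaml
# Illustrative task for the node bootstrap playbook — names and modes are
# assumptions; align the content line with your restored calico.conf.
- name: Ensure calico tmpfiles.d config survives OS updates
  ansible.builtin.copy:
    dest: /etc/tmpfiles.d/calico.conf
    content: "d /var/run/calico 0755 root root - -\n"
    owner: root
    group: root
    mode: "0644"
  notify: Apply tmpfiles config   # handler runs `systemd-tmpfiles --create`
```

Because provisioning automation reasserts the file on every run, a future OS update that overwrites it is corrected the next time the playbook executes, rather than lying dormant until a reboot.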

Damage Report

  • Total downtime: 0 (pods rescheduled; degraded capacity)
  • Blast radius: All pods on k8s-worker-03 lost cross-node connectivity; 1/5 cluster capacity degraded
  • Optimal resolution time: 12 minutes (check calico pod -> read logs -> fix tmpfiles -> restart)
  • If every wrong choice was made: 60+ minutes including unnecessary calico reinstall and node drain

Cross-References