Solution

Triage

  1. Check calico-node pod on the affected node:
    kubectl get pods -n kube-system -l k8s-app=calico-node -o wide | grep node-4
    
  2. Check kubelet logs for CNI errors:
    journalctl -u kubelet --since "1 hour ago" | grep -i cni
    
  3. Verify CNI config and binaries exist on the node:
    ls -la /etc/cni/net.d/
    ls -la /opt/cni/bin/
    
  4. Check network interfaces:
    ip link show
    ip route | grep -i calico
    

Root Cause

The kernel upgrade changed the running kernel version. After reboot, the vxlan kernel module was not automatically loaded because the new kernel's module directory does not contain it (or it was not included in the initramfs). Without the vxlan module, Calico cannot create the vxlan.calico tunnel interface. The calico-node pod enters a crash loop or reports degraded status, and the CNI plugin fails to configure networking for new pods.

Additionally, the calico-node init container that installs CNI binaries and config may have failed silently, leaving stale or missing configuration files in /etc/cni/net.d/.
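
The diagnosis above boils down to three module states. A hedged sketch of that decision, written as a function over captured inputs so the logic can be exercised off-node (the function name is mine; on the node you would pass "$(lsmod)" and /lib/modules/$(uname -r)):

```shell
# Classify the vxlan module state from captured inputs:
#   $1 = output of `lsmod`
#   $2 = the running kernel's module directory, e.g. /lib/modules/$(uname -r)
# Prints: loaded | available (just modprobe it) | missing (install extras pkg)
vxlan_status() {
  lsmod_out="$1"; moddir="$2"
  if printf '%s\n' "$lsmod_out" | grep -q '^vxlan '; then
    echo "loaded"
  elif find "$moddir" -name 'vxlan.ko*' 2>/dev/null | grep -q .; then
    echo "available"
  else
    echo "missing"
  fi
}
```

On a healthy node this prints "loaded"; "missing" is the signature of the root cause described above.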

Fix

  1. Load the missing kernel module:

    modprobe vxlan
    
    If the module is missing entirely, install the matching kernel-modules package:
    apt-get install linux-modules-extra-$(uname -r)   # Debian/Ubuntu
    

  2. Ensure module loads on boot:

    echo "vxlan" >> /etc/modules-load.d/kubernetes.conf
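
Note the plain append is not idempotent: re-running the runbook adds duplicate lines. A hedged sketch of an idempotent variant (the function name is mine, not part of any tool):

```shell
# Append a module name to a modules-load.d conf file only if absent.
#   $1 = module name, $2 = conf file path
persist_module() {
  grep -qx "$1" "$2" 2>/dev/null || echo "$1" >> "$2"
}
# On the node:
#   persist_module vxlan /etc/modules-load.d/kubernetes.conf
#   systemctl restart systemd-modules-load.service   # apply without a reboot
```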
    

  3. Restart the calico-node pod on this node:

    kubectl delete pod -n kube-system -l k8s-app=calico-node --field-selector spec.nodeName=node-4.internal
    

  4. Verify CNI recovery:

    ip link show vxlan.calico
    kubectl get pods -n kube-system -l k8s-app=calico-node -o wide | grep node-4
    

  5. Restart pods with broken networking: Existing pods that had stale network config will not self-heal. Delete them so their controllers recreate them with fresh CNI configuration (the selector below matches every pod on the node, including DaemonSet pods such as calico-node, which are simply recreated):

    kubectl get pods --all-namespaces -o wide --field-selector spec.nodeName=node-4.internal --no-headers | awk '{print $1, $2}' | xargs -n2 kubectl delete pod -n
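
What that one-liner does, traced on captured sample output instead of a live cluster (the pod names below are made up; only the column positions matter):

```shell
# `kubectl get pods -o wide --no-headers` puts NAMESPACE in field 1 and NAME
# in field 2; awk keeps those two fields, and `xargs -n2` then expands each
# pair into: kubectl delete pod -n <namespace> <pod>
sample='default    web-7f9c   1/1  Running  0  3h  10.42.1.5  node-4.internal
payments   api-5d2b   1/1  Running  0  2h  10.42.1.9  node-4.internal'
pairs=$(printf '%s\n' "$sample" | awk '{print $1, $2}')
printf '%s\n' "$pairs"
# -> default web-7f9c
#    payments api-5d2b
```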
    

Rollback / Safety

  • If the kernel module cannot be loaded, consider rolling back the kernel upgrade.
  • Before deleting pods, verify they are part of a controller (Deployment, StatefulSet) that will recreate them.
  • Test cross-node pod connectivity after the fix: kubectl exec <pod-on-node-4> -- ping <pod-ip-on-other-node> (assumes the pod image ships ping; otherwise exec a shell and use wget or nc against a port the target pod serves).
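
The controller check in the second bullet can be scripted. A hedged sketch (the function name is mine): feed it the output of kubectl get pod <pod> -n <ns> -o jsonpath='{.metadata.ownerReferences[0].kind}'.

```shell
# Classify whether deleting a pod is safe, from its owner's kind.
# Controller-owned pods are recreated after deletion; bare pods
# (empty owner kind) are gone for good.
owner_verdict() {
  case "$1" in
    ReplicaSet|StatefulSet|DaemonSet|Job) echo "safe: recreated by $1" ;;
    "")                                   echo "UNSAFE: bare pod" ;;
    *)                                    echo "review: owned by $1" ;;
  esac
}
```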

Common Traps

  • Assuming the CNI is fine because calico-node shows Running. The pod may be running but degraded. Check its logs.
  • Only restarting kubelet. Kubelet does not manage CNI plugin state. The CNI plugin (calico-node) must be restarted.
  • Not checking kernel modules after upgrade. Kernel upgrades can remove modules if the extras package is not installed.
  • Forgetting existing pods. Pods running before the CNI break have stale network config. They appear Running but cannot communicate. They must be recreated.
  • Not persisting the module load. modprobe is ephemeral. Without /etc/modules-load.d/ configuration, the next reboot will break again.