
Runbook: Network Partition (Split Brain / Partial Connectivity)

Field                 Value
Domain                Networking
Alert                 Multiple services unreachable, asymmetric connectivity, increased cross-node error rates
Severity              P1
Est. Resolution Time  30-90 minutes
Escalation Timeout    20 minutes — page if not resolved
Last Tested           2026-03-19
Prerequisites         kubectl access, SSH access to cluster nodes, network team contact information

Quick Assessment (30 seconds)

# Run this first — it tells you the scope of the problem
kubectl get nodes -o wide && kubectl get pods -A -o wide | grep -v Running
If output shows several nodes NotReady AND pods in Pending/Unknown → full network partition between node groups; proceed from Step 1.
If output shows all nodes Ready but pods showing errors → likely a CNI bug or application issue rather than a true partition; see the DNS Failure or Network Policy Block runbooks.

Step 1: Map Which Pods/Nodes Can Reach Which

Why: A partition is not all-or-nothing. Mapping connectivity isolates which zone, subnet, or node group is affected before you touch anything.

# Find which nodes are affected
kubectl get nodes -o wide --show-labels

# List all non-Running pods and their node assignments
kubectl get pods -A -o wide | grep -v "Running\|Completed"

# From a healthy pod, test connectivity to a pod on each node
kubectl exec -it <HEALTHY_POD_NAME> -n <NAMESPACE> -- \
  ping -c 3 <POD_IP_ON_NODE_A>

kubectl exec -it <HEALTHY_POD_NAME> -n <NAMESPACE> -- \
  ping -c 3 <POD_IP_ON_NODE_B>
Expected output:
# Healthy: 0% packet loss between all pods
PING 10.244.2.5 (10.244.2.5): 56 data bytes
3 packets transmitted, 3 received, 0% packet loss
If this fails: Note exactly which node pairs cannot reach each other — this tells you if the partition is along AZ boundaries, a specific rack, or a single node.
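The manual pod-to-pod pings above can be rolled into a small reachability matrix. This is a standalone sketch: `reachable()` is simulated (node-3 is cut off from the rest) and the node names are illustrative; in a real cluster, replace the function body with the `kubectl exec ... ping` probe from above.

```shell
#!/bin/sh
# Build a pairwise reachability matrix, one line per source node.
# reachable() is SIMULATED here — swap its body for something like:
#   kubectl exec "$SRC_POD" -n "$NS" -- ping -c 2 -W 1 "$DST_IP" >/dev/null 2>&1
NODES="node-1 node-2 node-3"
reachable() {
  case "$1 $2" in
    *node-3*) [ "$1" = "$2" ] ;;   # simulate: node-3 only reaches itself
    *)        true ;;
  esac
}
for src in $NODES; do
  line="$src:"
  for dst in $NODES; do
    if reachable "$src" "$dst"; then line="$line $dst=OK"; else line="$line $dst=FAIL"; fi
  done
  echo "$line"
done
```

A partition along an AZ boundary shows up as a block pattern in this matrix; a single bad node shows up as one row and column of FAILs.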

Step 2: Check Cluster Network Plugin Status

Why: The CNI plugin (Calico, Cilium, Flannel) runs as a DaemonSet and manages overlay routing. A crashed CNI pod means that node loses pod-to-pod routing.

# Calico
kubectl get pods -n calico-system -o wide
kubectl get pods -n kube-system -l k8s-app=calico-node -o wide

# Cilium
kubectl get pods -n kube-system -l k8s-app=cilium -o wide
cilium status --wait 2>/dev/null || true

# Flannel
kubectl get pods -n kube-flannel -o wide
kubectl get pods -n kube-system -l app=flannel -o wide

# Check CNI logs on an affected node
kubectl logs -n <CNI_NAMESPACE> <CNI_POD_ON_AFFECTED_NODE> --tail=100
Expected output:
NAME                READY   STATUS    RESTARTS   AGE   NODE
calico-node-xxxxx   1/1     Running   0          5d    node-1
calico-node-yyyyy   1/1     Running   0          5d    node-2
If this fails: CrashLoopBackOff on CNI pods is a strong indicator the CNI configuration is broken. Check the CNI logs for errors. If Calico, check BGP — proceed to Step 4.
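A quick way to spot unstable (rather than fully crashed) CNI pods is restart counts. The input lines below are simulated so the sketch runs standalone, and the threshold of 3 restarts is an arbitrary assumption; in practice, generate the "name count" lines with a `kubectl ... -o jsonpath` query over `.status.containerStatuses[0].restartCount`.

```shell
#!/bin/sh
# Flag CNI pods whose restart count exceeds a threshold (3 is an assumption).
# The printf below SIMULATES kubectl output of "pod-name restart-count" lines.
flagged=$(printf 'calico-node-aaa 0\ncalico-node-bbb 7\n' |
  awk '$2 > 3 { print $1 }')
echo "pods to investigate: $flagged"
```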

Step 3: Test Node-to-Node Connectivity

Why: Overlay networks like VXLAN and IPIP tunnel over the underlying node network. If the nodes themselves cannot communicate, no CNI overlay will work.

# SSH to an affected node
ssh <NODE_USERNAME>@<AFFECTED_NODE_IP>

# Test ICMP to all other nodes (replace with your node IPs)
for NODE_IP in <NODE_IP_1> <NODE_IP_2> <NODE_IP_3>; do
  echo -n "Pinging $NODE_IP: "
  ping -c 2 -W 1 $NODE_IP > /dev/null 2>&1 && echo "OK" || echo "FAILED"
done

# Test the tunnel port (VXLAN uses UDP 8472, Geneve uses UDP 6081)
# Note: a UDP probe is inconclusive: "open" only means no ICMP port-unreachable reply
nc -zvu <REMOTE_NODE_IP> 8472

# Check interface routing table
ip route show
Expected output:
Pinging 10.0.1.5: OK
Pinging 10.0.1.6: OK
Pinging 10.0.1.7: OK
If this fails: Node-level ICMP failures point to a physical network or cloud VPC routing issue — loop in network team immediately (this is beyond Kubernetes remediation). If only tunnel ports fail, check firewall/security group for UDP 8472.
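Which firewall rules to check depends on the CNI backend. This lookup table records the common Linux/CNI defaults (deployments can override them), which is useful when asking the network team to verify security groups:

```shell
#!/bin/sh
# Port/protocol that must be open between nodes, per backend.
# These are common defaults — your deployment may override them.
required_ports() {
  case "$1" in
    vxlan)  echo "udp/8472" ;;   # Linux kernel VXLAN default (Flannel, Calico VXLAN)
    geneve) echo "udp/6081" ;;   # Geneve encapsulation
    bgp)    echo "tcp/179"  ;;   # BGP peering (Calico BGP mode)
    ipip)   echo "proto/4"  ;;   # IP-in-IP is IP protocol 4, not a port
    *)      return 1 ;;
  esac
}
required_ports vxlan   # prints udp/8472
```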

Step 4: Check BGP Peering if Applicable

Why: Calico in BGP mode distributes pod CIDR routes via BGP. If BGP sessions drop, cross-node pod routing is lost even though nodes can ping each other.

# Check BGP peer status with calicoctl
calicoctl node status

# Or via kubectl if calicoctl is not available
kubectl exec -it -n calico-system <CALICO_NODE_POD> -- calico-node -bird-ready
kubectl exec -it -n calico-system <CALICO_NODE_POD> -- birdcl show protocols
Expected output:
Calico process is running.
IPv4 BGP status
+--------------+-------------------+-------+------------+-------------+
| PEER ADDRESS | PEER TYPE         | STATE | SINCE      | INFO        |
+--------------+-------------------+-------+------------+-------------+
| 10.0.1.1     | node-to-node mesh | up    | 2026-03-19 | Established |
+--------------+-------------------+-------+------------+-------------+
If this fails: BGP sessions showing Idle or Connect mean the peering is broken. Check ASN configuration and ensure BGP port 179 is not blocked between nodes.
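The session STATE column maps to a next action roughly as follows (these are the standard BGP finite-state-machine state names that BIRD reports):

```shell
#!/bin/sh
# Translate a BGP session state into a triage action.
bgp_action() {
  case "$1" in
    Established)          echo "healthy" ;;
    Idle|Connect|Active)  echo "broken: check ASN config and tcp/179 reachability" ;;
    OpenSent|OpenConfirm) echo "handshake in progress: re-check shortly" ;;
    *)                    echo "unknown state: $1" ;;
  esac
}
bgp_action Idle   # prints: broken: check ASN config and tcp/179 reachability
```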

Step 5: Check for MTU Issues Causing Silent Packet Loss

Why: MTU mismatches can silently drop large packets, making some connections work (small requests) while others fail (large data transfers). This can look like a partition.

# Test with increasing packet sizes to find the MTU break point
# Do this between a pod on each side of the suspected partition
kubectl exec -it <POD_NAME> -n <NAMESPACE> -- \
  ping -M do -s 1400 <REMOTE_POD_IP>

kubectl exec -it <POD_NAME> -n <NAMESPACE> -- \
  ping -M do -s 1450 <REMOTE_POD_IP>

# Check node interface MTU (the interface name may differ: eth0, ens5, bond0, ...)
ssh <NODE_USERNAME>@<NODE_IP> ip link show eth0
Expected output:
# 1400 byte ping should succeed; adjust size until you find the failure threshold
64 bytes from 10.244.2.5: icmp_seq=1 ttl=64 time=0.4 ms
If this fails: MTU mismatch confirmed — follow the MTU Mismatch runbook to resolve.
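The manual size-stepping above can be automated as a bisection between a known-good and a known-bad payload size. `try_size` is simulated here with a fake path MTU of 1400 so the sketch runs standalone; in practice, replace its body with the `kubectl exec ... ping -M do` probe. The `+ 28` accounts for the 8-byte ICMP header plus the 20-byte IPv4 header.

```shell
#!/bin/sh
# Bisect for the largest ICMP payload that passes with the DF bit set.
# try_size is SIMULATED (fake path MTU of 1400); in practice use:
#   kubectl exec "$POD" -n "$NS" -- ping -M do -c 1 -W 1 -s "$1" "$REMOTE_IP"
FAKE_PMTU=1400
try_size() { [ $(( $1 + 28 )) -le "$FAKE_PMTU" ]; }   # payload + ICMP(8) + IPv4(20)

lo=1000   # known-good payload size
hi=1500   # known-bad payload size
while [ $((hi - lo)) -gt 1 ]; do
  mid=$(( (lo + hi) / 2 ))
  if try_size "$mid"; then lo=$mid; else hi=$mid; fi
done
echo "largest passing payload: $lo bytes, path MTU: $((lo + 28))"
```

With the simulated value this converges to a payload of 1372 bytes, i.e. a path MTU of 1400.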

Step 6: Restart CNI Plugin on Affected Nodes

Why: If the CNI state is corrupted on specific nodes, a controlled restart re-establishes routes without requiring a node drain.

# Identify CNI pods on affected nodes
kubectl get pods -n <CNI_NAMESPACE> -o wide | grep <AFFECTED_NODE_NAME>

# Delete (restart) the CNI pod on the affected node
kubectl delete pod -n <CNI_NAMESPACE> <CNI_POD_NAME>

# Watch the pod come back and confirm it reaches Running
kubectl get pods -n <CNI_NAMESPACE> -w

# After restart, re-test pod connectivity from Step 1
kubectl exec -it <HEALTHY_POD_NAME> -n <NAMESPACE> -- ping -c 3 <POD_IP_ON_AFFECTED_NODE>
Expected output:
NAME              READY   STATUS    RESTARTS   AGE
calico-node-xxx   1/1     Running   0          30s
If this fails: A CNI pod that crashes again immediately after restart means the node itself has a network configuration issue (bad route, broken interface). Cordon and drain the node:

kubectl cordon <NODE_NAME>
kubectl drain <NODE_NAME> --ignore-daemonsets --delete-emptydir-data

Verification

# Confirm the issue is resolved — test cross-node pod connectivity
kubectl exec -it <POD_ON_NODE_A> -n <NAMESPACE> -- ping -c 5 <POD_IP_ON_NODE_B>
kubectl get pods -A --no-headers | grep -cv "Running\|Completed"
Success looks like: all pings succeed with 0% packet loss and the pod count above is zero (no pods outside Running/Completed).
If still broken: escalate (see below).
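Recovery is rarely instant after a CNI restart, so it helps to poll rather than eyeball. `count_bad` is simulated here (the cluster "recovers" on the third check) so the sketch runs standalone; in practice its body would be the `kubectl get pods -A --no-headers | grep -cv "Running\|Completed"` count, with a sleep between polls.

```shell
#!/bin/sh
# Poll until no pods remain outside Running/Completed, bounded at max checks.
# count_bad is SIMULATED: it reports 4 bad pods until the 3rd check.
count_bad() { [ "$1" -ge 3 ] && echo 0 || echo 4; }

tries=0
max=10
while [ "$(count_bad "$tries")" -gt 0 ] && [ "$tries" -lt "$max" ]; do
  tries=$((tries + 1))
  # sleep 10   # pause between polls against a real cluster
done
if [ "$(count_bad "$tries")" -gt 0 ]; then
  echo "still broken after $tries checks: escalate"
else
  echo "cluster settled after $tries checks"
fi
```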

Escalation

Condition → Who to Page → What to Say:

  • Not resolved in 20 min → Platform/Network on-call: "Confirmed network partition in cluster , nodes cannot reach , all cross-partition traffic failing"
  • Data loss suspected → SRE lead + application owners: "Network partition may have caused split-brain in stateful services (databases, queues), requesting data integrity check"
  • Scope expanding beyond one cluster → Infrastructure director: "Multi-cluster network failure suspected, possible physical network or cloud provider incident"

Post-Incident

  • Update monitoring if alert was noisy or missing
  • File postmortem if P1/P2
  • Update this runbook if steps were wrong or incomplete

Common Mistakes

  1. Assuming it's an app bug: When error rates spike suddenly across multiple unrelated services, rule out network partition before debugging application code. The blast radius pattern is the giveaway.
  2. Testing only pod-to-pod instead of node-to-node: Pod-to-pod tests go through the CNI overlay. Node-to-node tests go through the underlying infrastructure. You need both to isolate whether the problem is CNI or physical network.
  3. Not looping in the network team early enough: If node-to-node ICMP is failing (Step 3), Kubernetes cannot fix that — it requires action from whoever owns the physical or cloud network. Page them at the 20-minute mark, not after an hour of frustration.

Cross-References

  • DNS Failure
  • Network Policy Block
  • MTU Mismatch