---
tags:
  - networking
  - l2
  - runbook
  - networking-troubleshooting
---
Portal | Level: L2: Operations | Topics: Networking Troubleshooting | Domain: Networking
Runbook: Network Partition (Split Brain / Partial Connectivity)¶
| Field | Value |
|---|---|
| Domain | Networking |
| Alert | Multiple services unreachable, asymmetric connectivity, increased cross-node error rates |
| Severity | P1 |
| Est. Resolution Time | 30-90 minutes |
| Escalation Timeout | 20 minutes — page if not resolved |
| Last Tested | 2026-03-19 |
| Prerequisites | kubectl access, SSH access to cluster nodes, network team contact information |
Quick Assessment (30 seconds)¶
```shell
# Run this first — it tells you the scope of the problem
kubectl get nodes -o wide && kubectl get pods -A -o wide | grep -v Running
```
- If output shows nodes NotReady AND pods in Pending/Unknown → full network partition between node groups; proceed from Step 1.
- If output shows nodes all Ready but pods erroring → this may be a CNI bug or application issue, not a true partition; see the DNS Resolution Failure runbook or Network Policy Block.
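The triage above can be scripted. A minimal sketch, assuming plain-text `kubectl get nodes` output on stdin (note that cordoned nodes report `Ready,SchedulingDisabled` and would also be counted here):

```shell
#!/usr/bin/env sh
# count_notready: read `kubectl get nodes` output on stdin, print how many
# nodes have a STATUS column other than exactly "Ready" (header skipped).
count_notready() {
  awk 'NR > 1 && $2 != "Ready" { n++ } END { print n + 0 }'
}

# classify: given the NotReady-node count and the broken-pod count,
# print which branch of the runbook to follow.
classify() {
  notready_nodes=$1; broken_pods=$2
  if [ "$notready_nodes" -gt 0 ] && [ "$broken_pods" -gt 0 ]; then
    echo "possible partition: start at Step 1"
  elif [ "$broken_pods" -gt 0 ]; then
    echo "nodes healthy: suspect CNI bug or app issue"
  else
    echo "no obvious problem"
  fi
}
```

Feed it the output of the two commands above, e.g. `kubectl get nodes | count_notready`.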
Step 1: Map Which Pods/Nodes Can Reach Which¶
Why: A partition is not all-or-nothing. Mapping connectivity isolates which zone, subnet, or node group is affected before you touch anything.
```shell
# Find which nodes are affected
kubectl get nodes -o wide --show-labels

# List all non-Running pods and their node assignments
kubectl get pods -A -o wide | grep -v "Running\|Completed"

# From a healthy pod, test connectivity to a pod on each node
kubectl exec -it <HEALTHY_POD_NAME> -n <NAMESPACE> -- \
  ping -c 3 <POD_IP_ON_NODE_A>
kubectl exec -it <HEALTHY_POD_NAME> -n <NAMESPACE> -- \
  ping -c 3 <POD_IP_ON_NODE_B>
```

Healthy output shows 0% packet loss between all pods:

```
PING 10.244.2.5 (10.244.2.5): 56 data bytes
3 packets transmitted, 3 received, 0% packet loss
```
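With more than two nodes, pairwise testing by hand gets tedious. A sketch of a connectivity-matrix loop; the `probe` function wraps the `kubectl exec ... ping` call above (pod names, IPs, and `$NAMESPACE` are placeholders for your own values):

```shell
#!/usr/bin/env sh
# probe <pod> <ip>: exit 0 if <pod> can reach <ip>; in a real cluster this
# wraps kubectl exec + ping exactly as in the commands above.
probe() {
  kubectl exec -n "$NAMESPACE" "$1" -- ping -c 1 -W 1 "$2" >/dev/null 2>&1
}

# matrix "<pods...>" "<ips...>": print one "src -> dst: OK/FAIL" line per pair,
# giving a quick picture of which side of the partition each pod sits on.
matrix() {
  for pod in $1; do
    for ip in $2; do
      if probe "$pod" "$ip"; then status=OK; else status=FAIL; fi
      echo "$pod -> $ip: $status"
    done
  done
}
```

Rows that fail only toward one node group outline the partition boundary.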
Step 2: Check Cluster Network Plugin Status¶
Why: The CNI plugin (Calico, Cilium, Flannel) runs as a DaemonSet and manages overlay routing. A crashed CNI pod means that node loses pod-to-pod routing.
```shell
# Calico
kubectl get pods -n calico-system -o wide
kubectl get pods -n kube-system -l k8s-app=calico-node -o wide

# Cilium
kubectl get pods -n kube-system -l k8s-app=cilium -o wide
cilium status --wait 2>/dev/null || true

# Flannel
kubectl get pods -n kube-flannel -o wide
kubectl get pods -n kube-system -l app=flannel -o wide

# Check CNI logs on an affected node
kubectl logs -n <CNI_NAMESPACE> <CNI_POD_ON_AFFECTED_NODE> --tail=100
```
Healthy output: one CNI pod per node, all Running:

```
NAME               READY   STATUS    RESTARTS   AGE   NODE
calico-node-xxxxx  1/1     Running   0          5d    node-1
calico-node-yyyyy  1/1     Running   0          5d    node-2
```
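On a large cluster, scanning the DaemonSet listing by eye is error-prone. A sketch that flags unhealthy CNI pods, assuming the plain-text `kubectl get pods -n <CNI_NAMESPACE> -o wide` column layout shown above (NAME READY STATUS ...):

```shell
#!/usr/bin/env sh
# unready_cni_pods: read `kubectl get pods -o wide` output on stdin and
# print the name of every pod that is not fully Ready and Running.
unready_cni_pods() {
  awk 'NR > 1 {
    split($2, r, "/")                       # READY column, e.g. "0/1"
    if (r[1] != r[2] || $3 != "Running") print $1
  }'
}
```

Any pod it prints is a candidate for the restart procedure in Step 6.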
Step 3: Test Node-to-Node Connectivity¶
Why: Overlay networks like VXLAN and IPIP tunnel over the underlying node network. If the nodes themselves cannot communicate, no CNI overlay will work.
```shell
# SSH to an affected node
ssh <NODE_USERNAME>@<AFFECTED_NODE_IP>

# Test ICMP to all other nodes (replace with your node IPs)
for NODE_IP in <NODE_IP_1> <NODE_IP_2> <NODE_IP_3>; do
  echo -n "Pinging $NODE_IP: "
  ping -c 2 -W 1 $NODE_IP > /dev/null 2>&1 && echo "OK" || echo "FAILED"
done

# Test the tunnel port. Linux VXLAN defaults to UDP 8472 (the IANA-assigned
# VXLAN port is 4789); Geneve uses UDP 6081. Note that a UDP probe with nc
# is only conclusive when it fails fast (ICMP port-unreachable); silence
# can mean either "open" or "filtered".
nc -zvu <REMOTE_NODE_IP> 8472

# Check the node routing table
ip route show
```
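Because these overlays add encapsulation headers, the pod-network MTU must be smaller than the node MTU; a sketch of the arithmetic, using commonly cited per-encapsulation overheads (VXLAN/Geneve ~50 bytes, IPIP 20, WireGuard 60; verify against your CNI's documentation):

```shell
#!/usr/bin/env sh
# overlay_mtu <underlay_mtu> <encap>: print the expected pod-interface MTU
# for a given node (underlay) MTU and encapsulation mode.
overlay_mtu() {
  case $2 in
    vxlan|geneve) echo $(($1 - 50)) ;;   # outer Eth 14 + IP 20 + UDP 8 + encap 8
    ipip)         echo $(($1 - 20)) ;;   # one extra IPv4 header
    wireguard)    echo $(($1 - 60)) ;;   # IPv4 + UDP + WireGuard framing
    none)         echo "$1" ;;
    *)            echo "unknown encap: $2" >&2; return 1 ;;
  esac
}
```

If `ip link show` on the pod interfaces reports a larger MTU than this arithmetic allows, expect the silent large-packet loss described in Step 5.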
Step 4: Check BGP Peering if Applicable¶
Why: Calico in BGP mode distributes pod CIDR routes via BGP. If BGP sessions drop, cross-node pod routing is lost even though nodes can ping each other.
```shell
# Check BGP peer status with calicoctl
calicoctl node status

# Or via kubectl if calicoctl is not available
kubectl exec -it -n calico-system <CALICO_NODE_POD> -- calico-node -bird-ready
kubectl exec -it -n calico-system <CALICO_NODE_POD> -- birdcl show protocols
```
Healthy output: all peers Established:

```
Calico process is running.

IPv4 BGP status
+--------------+-------------------+-------+------------+-------------+
| PEER ADDRESS | PEER TYPE         | STATE | SINCE      | INFO        |
+--------------+-------------------+-------+------------+-------------+
| 10.0.1.1     | node-to-node mesh | up    | 2026-03-19 | Established |
+--------------+-------------------+-------+------------+-------------+
```

A state of Idle or Connect means the peering is broken. Check the ASN configuration and ensure BGP port TCP 179 is not blocked between nodes.
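To pull only the broken peers out of that table across many nodes, a parsing sketch, assuming the `calicoctl node status` table format shown above:

```shell
#!/usr/bin/env sh
# broken_peers: read `calicoctl node status` output on stdin and print the
# PEER ADDRESS of every peer whose INFO column is not "Established".
broken_peers() {
  awk -F'|' 'NF >= 6 && $2 !~ /PEER ADDRESS/ {
    addr = $2; info = $6
    gsub(/ /, "", addr); gsub(/ /, "", info)   # strip table padding
    if (info != "Established") print addr
  }'
}
```

Run it per node, e.g. `calicoctl node status | broken_peers`; any address printed needs its BGP session investigated.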
Step 5: Check for MTU Issues Causing Silent Packet Loss¶
Why: MTU mismatches can silently drop large packets, making some connections work (small requests) while others fail (large data transfers). This can look like a partition.
```shell
# Test with increasing packet sizes to find the MTU break point.
# Do this between a pod on each side of the suspected partition.
kubectl exec -it <POD_NAME> -n <NAMESPACE> -- \
  ping -M do -s 1400 <REMOTE_POD_IP>
kubectl exec -it <POD_NAME> -n <NAMESPACE> -- \
  ping -M do -s 1450 <REMOTE_POD_IP>

# Check node interface MTU
ssh <NODE_USERNAME>@<NODE_IP> ip link show eth0
```

The 1400-byte ping should succeed; adjust the size until you find the failure threshold:

```
64 bytes from 10.244.2.5: icmp_seq=1 ttl=64 time=0.4 ms
```
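Manually bisecting the size is slow; the search can be automated. A sketch, where `can_pass` wraps the `ping -M do` probe above (`$NAMESPACE`, `$POD_NAME`, and `$REMOTE_POD_IP` are placeholders you must set):

```shell
#!/usr/bin/env sh
# can_pass <size>: exit 0 if a don't-fragment ping with <size> bytes of
# payload gets through; wraps the kubectl exec + ping command above.
can_pass() {
  kubectl exec -n "$NAMESPACE" "$POD_NAME" -- \
    ping -M do -s "$1" -c 1 -W 1 "$REMOTE_POD_IP" >/dev/null 2>&1
}

# max_payload <lo> <hi>: binary-search the largest payload that passes.
max_payload() {
  lo=$1; hi=$2; best=0
  while [ "$lo" -le "$hi" ]; do
    mid=$(( (lo + hi) / 2 ))
    if can_pass "$mid"; then best=$mid; lo=$((mid + 1)); else hi=$((mid - 1)); fi
  done
  echo "$best"
}

# Path MTU = payload + 8 (ICMP header) + 20 (IPv4 header)
```

For example, `max_payload 1000 1500` returning 1422 implies a path MTU of 1450, pointing at a ~50-byte encapsulation overhead somewhere on the path.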
Step 6: Restart CNI Plugin on Affected Nodes¶
Why: If the CNI state is corrupted on specific nodes, a controlled restart re-establishes routes without requiring a node drain.
```shell
# Identify CNI pods on affected nodes
kubectl get pods -n <CNI_NAMESPACE> -o wide | grep <AFFECTED_NODE_NAME>

# Delete (restart) the CNI pod on the affected node
kubectl delete pod -n <CNI_NAMESPACE> <CNI_POD_NAME>

# Watch the pod come back and confirm it reaches Running
kubectl get pods -n <CNI_NAMESPACE> -w

# After restart, re-test pod connectivity from Step 1
kubectl exec -it <HEALTHY_POD_NAME> -n <NAMESPACE> -- ping -c 3 <POD_IP_ON_AFFECTED_NODE>

# If the node still cannot route pod traffic after the CNI restart, take it
# out of service for deeper investigation:
kubectl cordon <NODE_NAME> && kubectl drain <NODE_NAME> --ignore-daemonsets --delete-emptydir-data
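Rather than watching `kubectl get pods -w` by hand, the readiness check can be polled with a bounded retry. A generic sketch (the example condition in the comment is illustrative, not a specific API):

```shell
#!/usr/bin/env sh
# retry <attempts> <delay_s> <command...>: run <command> up to <attempts>
# times, sleeping <delay_s> between tries; succeed as soon as it does.
# Example: retry 12 5 sh -c 'kubectl get pod -n <CNI_NAMESPACE> <CNI_POD_NAME> | grep -q Running'
retry() {
  attempts=$1; delay=$2; shift 2
  i=0
  while [ "$i" -lt "$attempts" ]; do
    if "$@"; then return 0; fi
    i=$((i + 1))
    sleep "$delay"
  done
  return 1
}
```

A non-zero exit after the final attempt is the signal to fall back to the cordon-and-drain step above.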
Verification¶
```shell
# Confirm the issue is resolved: test cross-node pod connectivity
kubectl exec -it <POD_ON_NODE_A> -n <NAMESPACE> -- ping -c 5 <POD_IP_ON_NODE_B>

# Count remaining broken pods; should be 0 (--no-headers keeps the header
# line out of the count)
kubectl get pods -A --no-headers | grep -cv "Running\|Completed"
```
Escalation¶
| Condition | Who to Page | What to Say |
|---|---|---|
| Not resolved in 20 min | Platform/Network on-call | "Confirmed network partition in cluster, cross-node connectivity failing, need network-layer investigation" |
| Data loss suspected | SRE lead + application owners | "Network partition may have caused split-brain in stateful services (databases, queues), requesting data integrity check" |
| Scope expanding beyond one cluster | Infrastructure director | "Multi-cluster network failure suspected, possible physical network or cloud provider incident" |
Post-Incident¶
- Update monitoring if alert was noisy or missing
- File postmortem if P1/P2
- Update this runbook if steps were wrong or incomplete
Common Mistakes¶
- Assuming it's an app bug: When error rates spike suddenly across multiple unrelated services, rule out network partition before debugging application code. The blast radius pattern is the giveaway.
- Testing only pod-to-pod instead of node-to-node: Pod-to-pod tests go through the CNI overlay. Node-to-node tests go through the underlying infrastructure. You need both to isolate whether the problem is CNI or physical network.
- Not looping in the network team early enough: If node-to-node ICMP is failing (Step 3), Kubernetes cannot fix that — it requires action from whoever owns the physical or cloud network. Page them at the 20-minute mark, not after an hour of frustration.
Cross-References¶
- Topic Pack: Kubernetes Networking and CNI (deep background)
- Related Runbook: MTU Mismatch
Wiki Navigation¶
Related Content¶
- Networking Troubleshooting (Topic Pack, L1) — Networking Troubleshooting
- Runbook: DNS Resolution Failure (Runbook, L1) — Networking Troubleshooting
- Runbook: Load Balancer Health Check Failure (Runbook, L2) — Networking Troubleshooting
- Runbook: MTU Mismatch (Runbook, L2) — Networking Troubleshooting