Cilium & eBPF Networking - Street-Level Ops¶
Quick Diagnosis Commands¶
# --- Cilium Overall Health ---
# Check Cilium agent status on all nodes
cilium status
cilium status --all-health # verbose health breakdown
# Check Cilium operator and agent pods
kubectl get pods -n kube-system -l k8s-app=cilium
kubectl get pods -n kube-system -l name=cilium-operator
# Validate Cilium installation
cilium connectivity test # end-to-end connectivity test (takes ~5 min)
# --- Endpoint Status ---
# List all endpoints on the local node
cilium endpoint list
cilium endpoint list --output json | jq '.[] | {id, name: .status.identity.labels}'
# Get detailed endpoint status
cilium endpoint get <endpoint-id>
cilium endpoint get -l k8s:io.kubernetes.pod.name=<pod-name> # lookup by pod-name label also works
# Check endpoint policy enforcement
cilium endpoint get <id> | grep -A5 "policy"
# --- Hubble (Observability) ---
# Enable Hubble UI port-forward
cilium hubble ui
# Enable Hubble CLI
cilium hubble port-forward &
# Watch all flows in real time
hubble observe --follow
# Watch flows for a specific pod
hubble observe --from-pod default/myapp --follow
# Watch dropped flows only (policy violations)
hubble observe --verdict DROPPED --follow
# Filter by namespace
hubble observe -n production --follow
# DNS flows
hubble observe --protocol dns --follow
# L7 HTTP flows
hubble observe --protocol http --follow
# --- Network Policy Status ---
# List all CiliumNetworkPolicies
kubectl get ciliumnetworkpolicy -A
kubectl get cnp -A # shorthand
# Describe a policy
kubectl describe cnp my-policy -n production
# Check what policies apply to a pod
cilium policy get
cilium endpoint get <id> | jq '.[0].status.policy' # endpoint get returns a JSON array
# Verify a NetworkPolicy was imported
cilium policy get | grep -A10 "my-policy"
# --- BPF Map Inspection ---
# List all BPF maps
cilium bpf map list
# View CT (connection tracking) table size
cilium bpf ct list global | wc -l
# View NAT table
cilium bpf nat list
# Check LoadBalancer services
cilium bpf lb list
# View endpoint map
cilium bpf endpoint list
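A raw CT entry count is only meaningful relative to the configured map size. A minimal sketch for turning the count into a utilization percentage -- the sample numbers here are hypothetical; on a live node the count comes from cilium bpf ct list and the maximum from the cilium-config ConfigMap (bpf-ct-global-tcp-max / bpf-ct-global-any-max, unless sized dynamically via bpf-map-dynamic-size-ratio):

```shell
# Integer utilization percentage of the CT map (pure shell arithmetic)
ct_pct() { echo $(( $1 * 100 / $2 )); }

# On a live node you would feed it real numbers, e.g.:
#   entries=$(cilium bpf ct list global | wc -l)
#   max=$(kubectl get cm cilium-config -n kube-system \
#           -o jsonpath='{.data.bpf-ct-global-tcp-max}')
# Hypothetical sample values:
entries=50000
max=262144
echo "CT utilization: $(ct_pct "$entries" "$max")%"
```

Sustained high utilization is a warning sign: when the CT map fills up, Cilium starts evicting entries, which shows up as mysteriously reset long-lived connections.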
# --- Node Connectivity ---
# Check node health (Cilium internal health checks)
cilium-health status
cilium-health status --verbose
# Run path test between nodes
cilium-health ping <node-ip>
# Check if kube-proxy replacement is active
cilium status | grep "KubeProxyReplacement"
Debug clue: When hubble observe --verdict DROPPED shows drops with reason POLICY_DENIED, the source and destination identities are printed in the output. Use cilium identity get <id> to resolve a numeric identity back to pod labels -- this tells you exactly which security identity pair is missing from the policy.
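That lookup is easy to script. A sketch that pulls the two numeric identities out of a dropped flow's JSON -- the flow below is a hypothetical sample in the shape of Hubble's flow JSON; on a live cluster you would pipe hubble observe --verdict DROPPED -o json in instead:

```shell
# Hypothetical sample of one flow, shaped like the output of:
#   hubble observe --verdict DROPPED -o json
flow='{"source":{"identity":48813,"pod_name":"client-7d9f"},"destination":{"identity":22137,"pod_name":"myapp-5c8b"},"verdict":"DROPPED"}'

# Extract the two numeric identities with grep alone (no jq needed)
src_id=$(printf '%s' "$flow" | grep -o '"source":{"identity":[0-9]*' | grep -o '[0-9]*$')
dst_id=$(printf '%s' "$flow" | grep -o '"destination":{"identity":[0-9]*' | grep -o '[0-9]*$')

# On a live cluster, resolve each back to pod labels:
echo "cilium identity get $src_id"
echo "cilium identity get $dst_id"
```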
Common Scenarios¶
Scenario 1: Pod Cannot Reach Another Pod (Policy Dropped)¶
Traffic between pods is being dropped; the application sees connection refused or timeouts.
# Step 1: Check if it's a policy drop
hubble observe --from-pod <namespace>/<source-pod> \
--to-pod <namespace>/<dest-pod> --verdict DROPPED
# Step 2: If drops found, check the drop reason
hubble observe --from-pod <namespace>/<source-pod> \
--verdict DROPPED --output json | jq '{
src: .source.pod_name,
dst: .destination.pod_name,
reason: .drop_reason_desc,
policy: .traffic_direction
}'
# Step 3: Inspect realized policy on the destination endpoint
cilium endpoint get <dest-endpoint-id> | jq '.[0].status.policy.realized'
# Step 4: Check ingress policy on destination
kubectl get cnp -n <namespace> -o yaml | grep -A20 "ingress"
# Step 5: For standard NetworkPolicy, check it's imported correctly
kubectl get networkpolicy -n <namespace>
kubectl describe networkpolicy <name> -n <namespace>
# Step 6: Temporarily confirm it's a policy issue (test only)
# Apply an allow-all NetworkPolicy to the namespace. Note: a rule list with one
# empty rule ({}) allows all traffic; an empty list ([]) would deny all
cat <<EOF | kubectl apply -f -
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: allow-all-debug
  namespace: <namespace>
spec:
  podSelector: {}
  ingress:
  - {}
  egress:
  - {}
  policyTypes: [Ingress, Egress]
EOF
# Test, then delete
kubectl delete networkpolicy allow-all-debug -n <namespace>
Scenario 2: Cilium Agent Crashing or Restarting¶
Cilium agent pods in CrashLoopBackOff or repeatedly restarting.
# Step 1: Check why the agent is failing
kubectl logs -n kube-system <cilium-pod> --previous | tail -50
# Step 2: Check node kernel version (recent Cilium releases require 4.19.57+)
uname -r
# Note: some features need newer kernels -- e.g. the bandwidth manager and BBR
# congestion control; check the system-requirements matrix for your Cilium version
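Comparing dotted version strings by eye is error-prone. A small sketch for checking the running kernel against a feature's minimum, assuming GNU sort with -V (version sort) is available:

```shell
# Succeeds if version $2 (default: the running kernel) is >= minimum $1
kernel_at_least() {
  min="$1"
  cur="${2:-$(uname -r | cut -d- -f1)}"   # strip distro suffix like -generic
  [ "$(printf '%s\n%s\n' "$min" "$cur" | sort -V | head -n1)" = "$min" ]
}

kernel_at_least 5.10 && echo "5.10+ features available" || echo "kernel too old for 5.10+ features"
```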
# Step 3: Check BPF filesystem is mounted
mount | grep bpf
# Should show: bpffs on /sys/fs/bpf type bpf
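The mount check can be scripted. A sketch with the mount-table parsing factored out so it can be exercised against a captured /proc/mounts (actually mounting bpffs requires root on the node, so the command is only printed here):

```shell
# True if a bpf filesystem is mounted at /sys/fs/bpf in the given mount table
bpffs_mounted() {
  printf '%s\n' "$1" | awk '$2 == "/sys/fs/bpf" && $3 == "bpf" { found=1 } END { exit !found }'
}

table=$( [ -r /proc/mounts ] && cat /proc/mounts || true )
if bpffs_mounted "$table"; then
  echo "bpffs OK"
else
  echo "bpffs missing; on the node (as root): mount -t bpf bpffs /sys/fs/bpf"
fi
```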
# Step 4: Check for BPF map corruption
cilium bpf map list # if this errors, maps may be corrupted
# Recovery: restart Cilium agent (it will rebuild maps)
kubectl rollout restart daemonset/cilium -n kube-system
# Step 5: Check Cilium config for misconfiguration
kubectl get configmap cilium-config -n kube-system -o yaml
# Step 6: Check etcd/KVStore connectivity (if using external KV store)
cilium status | grep "KVStore"
# Step 7: Check Cilium operator logs
kubectl logs -n kube-system deploy/cilium-operator | tail -50
# Step 8: Validate Cilium configuration
cilium config view
Scenario 3: kube-proxy Replacement Issue — Services Not Reachable¶
After enabling kubeProxyReplacement=true, ClusterIP or NodePort services stop working.
# Step 1: Verify kube-proxy replacement is fully active
cilium status | grep "KubeProxyReplacement"
# Should show: KubeProxyReplacement: True
# Step 2: Check if kube-proxy is still running (conflict)
kubectl get pods -n kube-system | grep kube-proxy
# If running alongside Cilium kube-proxy replacement: potential conflict
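If kube-proxy is indeed still running, one documented migration approach is to node-select the DaemonSet away rather than delete it, so it can be restored easily. A sketch -- the patch JSON is the interesting part; the kubectl line is commented out so nothing is applied by accident:

```shell
# Pin kube-proxy to a label no node carries, effectively scaling it to zero
patch='{"spec":{"template":{"spec":{"nodeSelector":{"non-existing":"true"}}}}}'

# To apply on a live cluster:
#   kubectl -n kube-system patch daemonset kube-proxy -p "$patch"
# To restore later, patch the nodeSelector back to its original value
# (typically kubernetes.io/os: linux)
echo "$patch"
```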
# Step 3: Verify BPF LB maps have service entries
cilium bpf lb list | grep <service-clusterip>
# Step 4: Check service backends are populated
cilium service list | grep <service-name>
# Step 5: Test service reachability from a client pod
kubectl exec -it <client-pod> -n <namespace> -- curl http://<clusterip>:<port>/health
# Step 6: Verify the agent can reach kube-apiserver (required for kube-proxy replacement)
kubectl logs -n kube-system <cilium-pod> | grep -i "apiserver"
kubectl logs -n kube-system <cilium-pod> | grep "kube-proxy\|nodePort"
# Step 7: Check if NodePort forwarding is configured correctly
cilium status | grep "NodePort"
Scenario 4: L7 HTTP Policy Not Enforcing¶
CiliumNetworkPolicy with L7 HTTP path matching is not blocking requests it should block.
# Step 1: Confirm the policy is imported and valid
kubectl get cnp myapp-l7-policy -n production
kubectl describe cnp myapp-l7-policy -n production
# Look for "Status: OK"
# Step 2: Check Hubble for L7 flows
hubble observe --from-pod production/client \
--to-pod production/myapp \
--protocol http --follow
# Step 3: Confirm the proxy redirect is active on the endpoint (required for L7)
cilium endpoint get <myapp-endpoint-id> | grep -i "proxy"
# Should show proxy statistics / a redirect port
# Step 4: Watch L7 proxy verdicts from the agent
cilium monitor --type l7
# Step 5: Validate L7 policy syntax
# L7 policy in CiliumNetworkPolicy:
cat <<EOF | kubectl apply -f -
apiVersion: cilium.io/v2
kind: CiliumNetworkPolicy
metadata:
  name: myapp-l7-policy
  namespace: production
spec:
  endpointSelector:
    matchLabels:
      app: myapp
  ingress:
  - fromEndpoints:
    - matchLabels:
        app: client
    toPorts:
    - ports:
      - port: "8080"
        protocol: TCP
      rules:
        http:
        - method: GET
          path: "/api/.*" # Only allow GET /api/*
EOF
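Before applying, the path regex is worth sanity-checking locally: Cilium applies HTTP path rules as full-string regex matches, so test with anchors. A sketch using grep -E, with hypothetical sample paths:

```shell
# Full-string match, mirroring how the policy's path regex is applied
path_allowed() { printf '%s' "$2" | grep -Eq "^($1)$"; }

path_allowed '/api/.*' '/api/users' && echo "/api/users allowed"
path_allowed '/api/.*' '/adminapi'  || echo "/adminapi blocked"
```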
# Step 6: Check if issue is Envoy not starting
kubectl logs -n kube-system <cilium-pod> | grep -i "envoy\|proxy"
Key Patterns¶
Hubble Observability in Production¶
# Find the top talkers in a namespace
hubble observe -n production --output json | \
jq -r '[.source.pod_name, .destination.pod_name] | @csv' | \
sort | uniq -c | sort -rn | head -20
# Policy audit mode: see what would be blocked
hubble observe --verdict DROPPED -n production --output json | \
jq '{src: .source.pod_name, dst: .destination.pod_name, reason: .drop_reason_desc}'
# DNS failure detection
hubble observe --protocol dns --verdict DROPPED --follow
# Monitor inter-namespace traffic
hubble observe --from-namespace frontend --to-namespace backend --follow
CiliumNetworkPolicy vs Standard NetworkPolicy¶
# Standard Kubernetes NetworkPolicy (limited to L3/L4)
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: allow-frontend
  namespace: backend
spec:
  podSelector:
    matchLabels:
      app: api
  ingress:
  - from:
    - namespaceSelector:
        matchLabels:
          name: frontend
    ports:
    - port: 8080
---
# CiliumNetworkPolicy (L7 HTTP rules, CIDR, DNS FQDN)
apiVersion: cilium.io/v2
kind: CiliumNetworkPolicy
metadata:
  name: allow-external-api
  namespace: backend
spec:
  endpointSelector:
    matchLabels:
      app: api
  egress:
  - toFQDNs:
    - matchName: api.github.com
    toPorts:
    - ports:
      - port: "443"
        protocol: TCP
Gotcha: Standard Kubernetes NetworkPolicies and CiliumNetworkPolicies can coexist and are evaluated together as one additive allow-list: once any policy of either type selects an endpoint for a direction, that direction becomes default-deny, and traffic flows if any rule allows it. The exception is explicit CiliumNetworkPolicy deny rules (ingressDeny/egressDeny), which take precedence -- a standard NetworkPolicy allow cannot override a Cilium deny.
Node-to-Node Connectivity Debug¶
# From a pod, ping another pod's IP directly
kubectl exec -it <pod> -- ping <dest-pod-ip>
# Use cilium-dbg for deep inspection
kubectl exec -n kube-system <cilium-pod> -- cilium-dbg endpoint list
kubectl exec -n kube-system <cilium-pod> -- cilium-dbg bpf policy get <endpoint-id>
# Trace a packet path
kubectl exec -n kube-system <cilium-pod> -- cilium-dbg debuginfo > debuginfo.txt
See Also¶
- K8s Networking
- Deep Dive: Kubernetes Networking
- Case Study: CNI Broken After Restart
- API Gateways & Ingress