Cilium & eBPF Networking Footguns
Mistakes that cause outages or security incidents.
1. Running kube-proxy Alongside kube-proxy Replacement
You enable kubeProxyReplacement=true in Cilium's config (or Helm values) but forget to delete the kube-proxy DaemonSet. Both systems now program Service load balancing for the same Services: kube-proxy via iptables/IPVS rules, Cilium via BPF maps. Services become unreachable or intermittently flaky as the two fight over NAT, and the failure mode is non-obvious and hard to trace.
Fix: When enabling kube-proxy replacement, explicitly disable or delete the kube-proxy DaemonSet before or simultaneously with enabling the Cilium setting. Verify with cilium status | grep KubeProxyReplacement — it should report True. Run kubectl get pods -n kube-system | grep kube-proxy and confirm none are running. On managed clusters (EKS, GKE), check if the provider's control plane re-creates kube-proxy automatically.
Debug clue: Intermittent Service connectivity issues after enabling kube-proxy replacement? Check iptables -t nat -L | wc -l — if you see thousands of NAT rules, kube-proxy is still writing iptables entries that conflict with Cilium's BPF maps. The rule count should be near-zero when kube-proxy is fully removed.
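The debug clue can be turned into a scripted check. A minimal sketch: nat_rules_suspicious is a hypothetical helper, and the threshold of 100 is an assumption to tune per cluster, not a value Cilium documents.

```shell
# Hypothetical heuristic: with kube-proxy fully removed, a node's nat table
# should hold only a handful of rules. Threshold of 100 is an assumption.
nat_rules_suspicious() {
  [ "$1" -gt 100 ]
}

# On a node (requires root):
#   count=$(iptables -t nat -S | wc -l)
#   nat_rules_suspicious "$count" && echo "kube-proxy rules likely still present"
```

Wiring this into a node health check catches the case where a managed control plane quietly re-creates the kube-proxy DaemonSet after you deleted it.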
2. Forgetting That FQDN Policy Requires the DNS Proxy
You write a CiliumNetworkPolicy egress rule using toFQDNs to allow traffic to api.github.com. The policy is created and shows Status: OK. But without a DNS rule, lookups never pass through Cilium's DNS proxy, so Cilium never learns which IPs api.github.com resolves to, and all traffic to it is dropped. Applications start timing out on connections to external services.
Fix: FQDN-based policies rely on Cilium's DNS proxy to intercept DNS responses and learn IP-to-domain mappings, and the proxy only sees lookups that are matched by a dns rule. Allow the DNS lookup itself with a dns rule alongside the toFQDNs rule (and make sure the L7 proxy has not been disabled via enable-l7-proxy=false):
```yaml
egress:
  - toEndpoints:
      - matchLabels:
          k8s:io.kubernetes.pod.namespace: kube-system
    toPorts:
      - ports: [{port: "53", protocol: ANY}]
        rules:
          dns: [{matchPattern: "*"}]
  - toFQDNs:
      - matchName: api.github.com
```
3. CiliumNetworkPolicy Denies All Traffic When Partially Applied
You create a policy that specifies only egress rules for a pod. With a standard Kubernetes NetworkPolicy (which Cilium also enforces), policyTypes defaults to include Ingress whenever it is unset — so an egress-only policy silently default-denies all ingress too, and your service stops receiving traffic from its own load balancer or ingress. A CiliumNetworkPolicy behaves differently: default deny applies per direction, and only for the directions the policy actually contains, which makes mixing the two policy types an easy way to get surprised.
Fix: Be explicit about which directions a policy governs. In a standard NetworkPolicy, always set policyTypes rather than relying on the default; in a CiliumNetworkPolicy, remember that once a pod is selected by a policy with an ingress (or egress) section, anything in that direction not explicitly allowed is denied. Policies are additive across objects — allows from multiple policies union together — so start broad and tighten incrementally. Use Hubble (hubble observe --verdict DROPPED) to identify what's being blocked before going to production.
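Setting policyTypes explicitly removes the ambiguity. A hedged sketch of an egress-only standard NetworkPolicy — namespace, labels, and port are hypothetical:

```yaml
# Hypothetical example: explicit policyTypes means only egress enters
# default deny; ingress to these pods is left untouched.
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: frontend-egress-only
  namespace: shop            # hypothetical namespace
spec:
  podSelector:
    matchLabels:
      app: frontend
  policyTypes:
    - Egress                 # omit this and Ingress is implied, denying all inbound traffic
  egress:
    - to:
        - podSelector:
            matchLabels:
              app: backend
      ports:
        - port: 8080
          protocol: TCP
```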
4. Kernel Version Too Old for Required Features
You deploy Cilium with features like bandwidthManager, endpointRoutes, or wireguard enabled, but the cluster nodes run kernel 4.15 or older. Cilium agents crash on startup with cryptic BPF compilation errors or silently fall back to a degraded mode that doesn't enforce your expected policies. Node kernel version is often overlooked on long-running bare-metal clusters or certain cloud VM images.
Fix: Check Cilium's feature requirements matrix before enabling features — most advanced features require kernel 5.10+. Run uname -r on all nodes and verify against docs.cilium.io. Upgrade kernels before enabling the feature, not after. In Cilium Helm values, set kubeProxyReplacement=false if kernel is below 4.19.57 to avoid a hard crash.
Gotcha: Ubuntu 20.04 LTS ships with kernel 5.4, which predates the BPF features needed for Cilium's BPF host routing (kernel 5.10+) and BBR support in the bandwidth manager (5.18+). Ubuntu 22.04 (kernel 5.15) or a backported HWE kernel covers most features; check the feature matrix for anything that needs newer.
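A version check is easy to automate before flipping a feature flag. A sketch: kernel_at_least is a hypothetical helper, and 5.10 is used here as an illustrative baseline — verify the actual minimum for your feature against the docs.

```shell
# kernel_at_least MIN CURRENT — true if CURRENT >= MIN (version-aware compare).
# Hypothetical helper, not a Cilium tool.
kernel_at_least() {
  [ "$(printf '%s\n' "$1" "$2" | sort -V | head -n1)" = "$1" ]
}

# Check this node's kernel against an assumed 5.10 baseline:
if kernel_at_least 5.10 "$(uname -r | cut -d- -f1)"; then
  echo "kernel OK for the feature"
else
  echo "kernel too old: upgrade before enabling the feature"
fi
```

Across a whole cluster, kubectl get nodes -o custom-columns=NAME:.metadata.name,KERNEL:.status.nodeInfo.kernelVersion lists every node's kernel in one shot.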
5. Default-Deny Policy Applied Cluster-Wide Without Allowlist
You apply a "default deny all" CiliumNetworkPolicy to all namespaces as a security baseline before writing the allow rules. DNS breaks immediately — pods can't resolve service names. Then kube-apiserver health checks fail (webhook calls from the API server are denied). Then Prometheus stops scraping metrics. Within minutes, multiple unrelated systems are broken in cascade.
Fix: Apply default-deny policies namespace by namespace, not cluster-wide in one step. Always write the allowlist rules before or together with the deny policy — especially for DNS (UDP/TCP port 53 to kube-dns) and for the webhook and health-check ports the control plane depends on. Dry-run with Cilium's policy audit mode (policyAuditMode=true) and watch hubble observe --verdict AUDIT for 24 hours in each namespace before enforcing. Test in staging with cilium connectivity test.
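A per-namespace baseline can carry its DNS allowlist in the same object. A hedged sketch with hypothetical names — because this policy contains an egress section, every pod it selects enters default deny for egress, with only DNS to kube-dns allowed:

```yaml
# Hypothetical baseline for one namespace: egress default deny plus DNS allow.
apiVersion: cilium.io/v2
kind: CiliumNetworkPolicy
metadata:
  name: baseline-egress-dns
  namespace: payments        # hypothetical namespace; roll out one at a time
spec:
  endpointSelector: {}       # all pods in this namespace
  egress:
    - toEndpoints:
        - matchLabels:
            k8s:io.kubernetes.pod.namespace: kube-system
            k8s-app: kube-dns
      toPorts:
        - ports:
            - port: "53"
              protocol: ANY
          rules:
            dns:
              - matchPattern: "*"
```

Further allow rules are then added as separate policies per workload, which keeps the baseline stable while the allowlist grows.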
6. Hubble Not Deployed, Debugging in the Dark
Cilium is deployed without Hubble. When a network issue occurs in production — pods can't reach each other, DNS is flaky, policy is silently dropping traffic — there's no observability layer to diagnose the problem. You're forced to use tcpdump inside pods, which is slow, requires privileges, and doesn't understand Cilium's policy context.
Fix: Deploy Hubble from day one. It's part of the Cilium Helm chart:
```shell
helm upgrade cilium cilium/cilium -n kube-system --reuse-values \
  --set hubble.enabled=true \
  --set hubble.relay.enabled=true \
  --set hubble.ui.enabled=true
```
Also set hubble.metrics.enabled (a list of metric names, e.g. "{dns,drop,tcp,flow}") to export Hubble metrics to Prometheus for persistent observability.
7. Identity-Based Policy Broken by Label Changes
Cilium builds security identities from pod labels. A CiliumNetworkPolicy says "allow traffic from pods with label app=frontend." A developer renames the label to component=frontend in a Deployment update. The pods get a new Cilium identity, the old policy no longer matches, and frontend traffic to backend is silently dropped. The application appears down even though all pods are running.
Fix: Treat the labels referenced by network policies as a contract — changing them changes the pod's Cilium security identity and can silently break policies. Match on stable labels, such as the namespace label k8s:io.kubernetes.pod.namespace together with the app label, rather than labels likely to be renamed. Document which labels are used in network policies and include them in your change management process. After any Deployment label change, run hubble observe to confirm expected traffic flows are still allowed.
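A policy that anchors on both the namespace label and the app label makes the contract explicit. A sketch with hypothetical namespace and labels:

```yaml
# Hypothetical example: both matchLabels below are part of the workload's
# contract — renaming either in the Deployment silently breaks this policy.
apiVersion: cilium.io/v2
kind: CiliumNetworkPolicy
metadata:
  name: backend-allow-frontend
  namespace: shop
spec:
  endpointSelector:
    matchLabels:
      app: backend
  ingress:
    - fromEndpoints:
        - matchLabels:
            k8s:io.kubernetes.pod.namespace: shop
            app: frontend    # the label a developer is tempted to rename
```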
8. Cilium Upgrade with Breaking CRD Changes
You upgrade Cilium from 1.13 to 1.15 using helm upgrade without reading the migration guide. The new version introduces CRD schema changes. Existing CiliumNetworkPolicy objects that were valid in 1.13 are now rejected or silently ignored because a field was renamed or restructured. Network policies that were protecting workloads stop enforcing — the cluster is now effectively open — with no alerts because the pods are still running.
Fix: Always read the Cilium upgrade guide for the target version before upgrading — minor versions can rename or restructure CRD fields, so check the CHANGELOG.md for CRD changes. Deploy the pre-flight check (Helm value preflight.enabled=true) before upgrading the agents; it pre-pulls images and validates existing policies against the new version. Test the upgrade in staging with cilium connectivity test before production. Use kubectl get cnp -A -o yaml before and after to diff policy objects.
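The upgrade guide also supports stepping only one minor version at a time, so 1.13 to 1.15 should go through 1.14. A guard like this in an upgrade script can catch the jump; minor_step_ok is a hypothetical helper, not a Cilium tool:

```shell
# minor_step_ok FROM TO — true only if TO is the same or the next minor
# release line (e.g. 1.13 -> 1.14 ok, 1.13 -> 1.15 not). Hypothetical helper.
minor_step_ok() {
  from_minor=${1#*.}; from_minor=${from_minor%%.*}
  to_minor=${2#*.}; to_minor=${to_minor%%.*}
  [ $((to_minor - from_minor)) -le 1 ]
}

minor_step_ok 1.13 1.15 || echo "unsupported jump: step through 1.14 first"
```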
War story: The Cilium 1.13 to 1.14 upgrade renamed CiliumClusterwideNetworkPolicy CRD fields. Teams that skipped the preflight check found their cluster-wide deny policies silently stopped enforcing, leaving workloads unprotected until the drift was discovered during a routine audit.