Cilium & eBPF Networking - Street-Level Ops¶
Quick Diagnosis Commands¶
# --- Cilium Overall Health ---
# Check Cilium agent status on all nodes
cilium status
cilium status --all-health # verbose health breakdown
# Check Cilium operator and agent pods
kubectl get pods -n kube-system -l k8s-app=cilium
kubectl get pods -n kube-system -l name=cilium-operator
# Validate Cilium installation
cilium connectivity test # end-to-end connectivity test (takes ~5 min)
# --- Endpoint Status ---
# List all endpoints on the local node
cilium endpoint list
cilium endpoint list --output json | jq '.[] | {id, name: .status.identity.labels}'
# Get detailed endpoint status
cilium endpoint get <endpoint-id>
cilium endpoint get -l k8s:io.kubernetes.pod.name=<pod-name> # lookup by pod-name label also works
# Check endpoint policy enforcement
cilium endpoint get <id> | grep -A5 "policy"
# --- Hubble (Observability) ---
# Enable Hubble UI port-forward
cilium hubble ui
# Enable Hubble CLI
cilium hubble port-forward &
# Watch all flows in real time
hubble observe --follow
# Watch flows for a specific pod
hubble observe --from-pod default/myapp --follow
# Watch dropped flows only (policy violations)
hubble observe --verdict DROPPED --follow
# Filter by namespace
hubble observe -n production --follow
# DNS flows
hubble observe --protocol dns --follow
# L7 HTTP flows
hubble observe --protocol http --follow
# --- Network Policy Status ---
# List all CiliumNetworkPolicies
kubectl get ciliumnetworkpolicy -A
kubectl get cnp -A # shorthand
# Describe a policy
kubectl describe cnp my-policy -n production
# Check what policies apply to a pod
cilium policy get
cilium endpoint get <id> | jq '.[0].status.policy' # endpoint get returns a JSON array
# Verify a NetworkPolicy was imported
cilium policy get | grep -A10 "my-policy"
# --- BPF Map Inspection ---
# List all BPF maps
cilium bpf map list
# View CT (connection tracking) table size
cilium bpf ct list global | wc -l
# View NAT table
cilium bpf nat list
# Check LoadBalancer services
cilium bpf lb list
# View endpoint map
cilium bpf endpoint list
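A raw CT entry count is only meaningful relative to the configured map size. A minimal sketch for turning the count into a utilization percentage -- the sample numbers here are hypothetical; on a live node the count comes from cilium bpf ct list and the maximum from the cilium-config ConfigMap (bpf-ct-global-tcp-max / bpf-ct-global-any-max, unless sized dynamically via bpf-map-dynamic-size-ratio):

```shell
# Integer utilization percentage of the CT map (pure shell arithmetic)
ct_pct() { echo $(( $1 * 100 / $2 )); }

# On a live node you would feed it real numbers, e.g.:
#   entries=$(cilium bpf ct list global | wc -l)
#   max=$(kubectl get cm cilium-config -n kube-system \
#           -o jsonpath='{.data.bpf-ct-global-tcp-max}')
# Hypothetical sample values:
entries=50000
max=262144
echo "CT utilization: $(ct_pct "$entries" "$max")%"
```

Sustained high utilization is a warning sign: when the CT map fills up, Cilium starts evicting entries, which shows up as mysteriously reset long-lived connections.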
# --- Node Connectivity ---
# Check node health (Cilium internal health checks)
cilium-health status
cilium-health status --verbose
# Run path test between nodes
cilium-health ping <node-ip>
# Check if kube-proxy replacement is active
cilium status | grep "KubeProxyReplacement"
Debug clue: When hubble observe --verdict DROPPED shows drops with reason POLICY_DENIED, the source and destination identities are printed in the output. Use cilium identity get <id> to resolve a numeric identity back to pod labels -- this tells you exactly which security identity pair is missing from the policy.
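That lookup is easy to script. A sketch that pulls the two numeric identities out of a dropped flow's JSON -- the flow below is a hypothetical sample in the shape of Hubble's flow JSON; on a live cluster you would pipe hubble observe --verdict DROPPED -o json in instead:

```shell
# Hypothetical sample of one flow, shaped like the output of:
#   hubble observe --verdict DROPPED -o json
flow='{"source":{"identity":48813,"pod_name":"client-7d9f"},"destination":{"identity":22137,"pod_name":"myapp-5c8b"},"verdict":"DROPPED"}'

# Extract the two numeric identities with grep alone (no jq needed)
src_id=$(printf '%s' "$flow" | grep -o '"source":{"identity":[0-9]*' | grep -o '[0-9]*$')
dst_id=$(printf '%s' "$flow" | grep -o '"destination":{"identity":[0-9]*' | grep -o '[0-9]*$')

# On a live cluster, resolve each back to pod labels:
echo "cilium identity get $src_id"
echo "cilium identity get $dst_id"
```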
Common Scenarios¶
Scenario 1: Pod Cannot Reach Another Pod (Policy Dropped)¶
Traffic between pods is being dropped; the application sees connection refused or timeouts.
# Step 1: Check if it's a policy drop
hubble observe --from-pod <namespace>/<source-pod> \
--to-pod <namespace>/<dest-pod> --verdict DROPPED
# Step 2: If drops found, check the drop reason
hubble observe --from-pod <namespace>/<source-pod> \
--verdict DROPPED --output json | jq '{
src: .source.pod_name,
dst: .destination.pod_name,
reason: .drop_reason_desc,
policy: .traffic_direction
}'
# Step 3: Inspect realized policy on the destination endpoint
cilium endpoint get <dest-endpoint-id> | jq '.[0].status.policy.realized'
# Step 4: Check ingress policy on destination
kubectl get cnp -n <namespace> -o yaml | grep -A20 "ingress"
# Step 5: For standard NetworkPolicy, check it's imported correctly
kubectl get networkpolicy -n <namespace>
kubectl describe networkpolicy <name> -n <namespace>
# Step 6: Temporarily confirm it's a policy issue (test only)
# Apply an allow-all NetworkPolicy to the namespace. Note: a rule list with one
# empty rule ({}) allows all traffic; an empty list ([]) would deny all
cat <<EOF | kubectl apply -f -
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: allow-all-debug
  namespace: <namespace>
spec:
  podSelector: {}
  ingress:
  - {}
  egress:
  - {}
  policyTypes: [Ingress, Egress]
EOF
# Test, then delete
kubectl delete networkpolicy allow-all-debug -n <namespace>
Scenario 2: Cilium Agent Crashing or Restarting¶
Cilium agent pods in CrashLoopBackOff or repeatedly restarting.
# Step 1: Check why the agent is failing
kubectl logs -n kube-system <cilium-pod> --previous | tail -50
# Step 2: Check node kernel version (recent Cilium releases require 4.19.57+)
uname -r
# Note: some features need newer kernels -- e.g. the bandwidth manager and BBR
# congestion control; check the system-requirements matrix for your Cilium version
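Comparing dotted version strings by eye is error-prone. A small sketch for checking the running kernel against a feature's minimum, assuming GNU sort with -V (version sort) is available:

```shell
# Succeeds if version $2 (default: the running kernel) is >= minimum $1
kernel_at_least() {
  min="$1"
  cur="${2:-$(uname -r | cut -d- -f1)}"   # strip distro suffix like -generic
  [ "$(printf '%s\n%s\n' "$min" "$cur" | sort -V | head -n1)" = "$min" ]
}

kernel_at_least 5.10 && echo "5.10+ features available" || echo "kernel too old for 5.10+ features"
```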
# Step 3: Check BPF filesystem is mounted
mount | grep bpf
# Should show: bpffs on /sys/fs/bpf type bpf
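The mount check can be scripted. A sketch with the mount-table parsing factored out so it can be exercised against a captured /proc/mounts (actually mounting bpffs requires root on the node, so the command is only printed here):

```shell
# True if a bpf filesystem is mounted at /sys/fs/bpf in the given mount table
bpffs_mounted() {
  printf '%s\n' "$1" | awk '$2 == "/sys/fs/bpf" && $3 == "bpf" { found=1 } END { exit !found }'
}

table=$( [ -r /proc/mounts ] && cat /proc/mounts || true )
if bpffs_mounted "$table"; then
  echo "bpffs OK"
else
  echo "bpffs missing; on the node (as root): mount -t bpf bpffs /sys/fs/bpf"
fi
```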
# Step 4: Check for BPF map corruption
cilium bpf map list # if this errors, maps may be corrupted
# Recovery: restart Cilium agent (it will rebuild maps)
kubectl rollout restart daemonset/cilium -n kube-system
# Step 5: Check Cilium config for misconfiguration
kubectl get configmap cilium-config -n kube-system -o yaml
# Step 6: Check etcd/KVStore connectivity (if using external KV store)
cilium status | grep "KVStore"
# Step 7: Check Cilium operator logs
kubectl logs -n kube-system deploy/cilium-operator | tail -50
# Step 8: Validate Cilium configuration
cilium config view
Scenario 3: kube-proxy Replacement Issue — Services Not Reachable¶
After enabling kubeProxyReplacement=true, ClusterIP or NodePort services stop working.
# Step 1: Verify kube-proxy replacement is fully active
cilium status | grep "KubeProxyReplacement"
# Should show: KubeProxyReplacement: True
# Step 2: Check if kube-proxy is still running (conflict)
kubectl get pods -n kube-system | grep kube-proxy
# If running alongside Cilium kube-proxy replacement: potential conflict
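If kube-proxy is indeed still running, one documented migration approach is to node-select the DaemonSet away rather than delete it, so it can be restored easily. A sketch -- the patch JSON is the interesting part; the kubectl line is commented out so nothing is applied by accident:

```shell
# Pin kube-proxy to a label no node carries, effectively scaling it to zero
patch='{"spec":{"template":{"spec":{"nodeSelector":{"non-existing":"true"}}}}}'

# To apply on a live cluster:
#   kubectl -n kube-system patch daemonset kube-proxy -p "$patch"
# To restore later, patch the nodeSelector back to its original value
# (typically kubernetes.io/os: linux)
echo "$patch"
```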
# Step 3: Verify BPF LB maps have service entries
cilium bpf lb list | grep <service-clusterip>
# Step 4: Check service backends are populated
cilium service list | grep <service-name>
# Step 5: Test service reachability from a client pod
kubectl exec -it <client-pod> -n <namespace> -- curl http://<clusterip>:<port>/health
# Step 6: Verify the agent can reach kube-apiserver (required for kube-proxy replacement)
kubectl logs -n kube-system <cilium-pod> | grep -i "apiserver"
kubectl logs -n kube-system <cilium-pod> | grep "kube-proxy\|nodePort"
# Step 7: Check if NodePort forwarding is configured correctly
cilium status | grep "NodePort"
Scenario 4: L7 HTTP Policy Not Enforcing¶
CiliumNetworkPolicy with L7 HTTP path matching is not blocking requests it should block.
# Step 1: Confirm the policy is imported and valid
kubectl get cnp myapp-l7-policy -n production
kubectl describe cnp myapp-l7-policy -n production
# Look for "Status: OK"
# Step 2: Check Hubble for L7 flows
hubble observe --from-pod production/client \
--to-pod production/myapp \
--protocol http --follow
# Step 3: Confirm the proxy redirect is active on the endpoint (required for L7)
cilium endpoint get <myapp-endpoint-id> | grep -i "proxy"
# Should show proxy statistics / a redirect port
# Step 4: Watch L7 proxy verdicts from the agent
cilium monitor --type l7
# Step 5: Validate L7 policy syntax
# L7 policy in CiliumNetworkPolicy:
cat <<EOF | kubectl apply -f -
apiVersion: cilium.io/v2
kind: CiliumNetworkPolicy
metadata:
  name: myapp-l7-policy
  namespace: production
spec:
  endpointSelector:
    matchLabels:
      app: myapp
  ingress:
  - fromEndpoints:
    - matchLabels:
        app: client
    toPorts:
    - ports:
      - port: "8080"
        protocol: TCP
      rules:
        http:
        - method: GET
          path: "/api/.*" # Only allow GET /api/*
EOF
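Before applying, the path regex is worth sanity-checking locally: Cilium applies HTTP path rules as full-string regex matches, so test with anchors. A sketch using grep -E, with hypothetical sample paths:

```shell
# Full-string match, mirroring how the policy's path regex is applied
path_allowed() { printf '%s' "$2" | grep -Eq "^($1)$"; }

path_allowed '/api/.*' '/api/users' && echo "/api/users allowed"
path_allowed '/api/.*' '/adminapi'  || echo "/adminapi blocked"
```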
# Step 6: Check if issue is Envoy not starting
kubectl logs -n kube-system <cilium-pod> | grep -i "envoy\|proxy"
Key Patterns¶
Hubble Observability in Production¶
# Find the top talkers in a namespace
hubble observe -n production --output json | \
jq -r '[.source.pod_name, .destination.pod_name] | @csv' | \
sort | uniq -c | sort -rn | head -20
# Policy audit mode: see what would be blocked
hubble observe --verdict DROPPED -n production --output json | \
jq '{src: .source.pod_name, dst: .destination.pod_name, reason: .drop_reason_desc}'
# DNS failure detection
hubble observe --protocol dns --verdict DROPPED --follow
# Monitor inter-namespace traffic
hubble observe --from-namespace frontend --to-namespace backend --follow
CiliumNetworkPolicy vs Standard NetworkPolicy¶
# Standard Kubernetes NetworkPolicy (limited to L3/L4)
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: allow-frontend
  namespace: backend
spec:
  podSelector:
    matchLabels:
      app: api
  ingress:
  - from:
    - namespaceSelector:
        matchLabels:
          name: frontend
    ports:
    - port: 8080
---
# CiliumNetworkPolicy (L7 HTTP rules, CIDR, DNS FQDN)
apiVersion: cilium.io/v2
kind: CiliumNetworkPolicy
metadata:
  name: allow-external-api
  namespace: backend
spec:
  endpointSelector:
    matchLabels:
      app: api
  egress:
  - toFQDNs:
    - matchName: api.github.com
    toPorts:
    - ports:
      - port: "443"
        protocol: TCP
Gotcha: Standard Kubernetes NetworkPolicies and CiliumNetworkPolicies can coexist and are evaluated together as one additive allow-list: once any policy of either type selects an endpoint for a direction, that direction becomes default-deny, and traffic flows if any rule allows it. The exception is explicit CiliumNetworkPolicy deny rules (ingressDeny/egressDeny), which take precedence -- a standard NetworkPolicy allow cannot override a Cilium deny.
Node-to-Node Connectivity Debug¶
# From a pod, ping another pod's IP directly
kubectl exec -it <pod> -- ping <dest-pod-ip>
# Use cilium-dbg for deep inspection
kubectl exec -n kube-system <cilium-pod> -- cilium-dbg endpoint list
kubectl exec -n kube-system <cilium-pod> -- cilium-dbg bpf policy get <endpoint-id>
# Trace a packet path
kubectl exec -n kube-system <cilium-pod> -- cilium-dbg debuginfo > debuginfo.txt
See Also¶
- K8s Networking
- Deep Dive: Kubernetes Networking
- Case Study: CNI Broken After Restart
- API Gateways & Ingress