Chaos Engineering & Fault Injection - Street-Level Ops¶
What experienced chaos practitioners know that the conference talks leave out.
Quick Diagnosis Commands¶
# Check if tc rules are active (leftover from experiments)
tc qdisc show dev eth0
tc -s qdisc show dev eth0 # With statistics
# Check for iptables rules injected by chaos tools
iptables -L -n --line-numbers
iptables -t nat -L -n --line-numbers
# Litmus Chaos status
kubectl get chaosengines -A
kubectl get chaosresults -A
kubectl get chaosexperiments -A
# Chaos Mesh status
kubectl get podchaos -A
kubectl get networkchaos -A
kubectl get stresschaos -A
kubectl get iochaos -A
# Check if chaos experiment is still running (cleanup failed)
kubectl get pods -A | grep -i chaos
kubectl get jobs -A | grep -i chaos
# Verify steady state before experiment
kubectl top pods -n production
kubectl get hpa -n production
curl -s http://api-server:8080/health | jq .
One-liner: The first thing to check after any chaos experiment is whether leftover
tc rules or iptables entries are still active. Chaos tools inject these at the kernel level; if the cleanup step fails, the "experiment" becomes a permanent production degradation.
Gotcha: Chaos Experiment Didn't Clean Up¶
You ran a Litmus Chaos experiment. The engine pod crashed mid-experiment. The injected failure (tc netem rules, iptables rules, or killed pods) is still active. Your "60-second experiment" has been running for two hours.
Fix: Manually clean up the residual chaos:
# For tc/netem rules stuck on pods:
kubectl exec -it <affected-pod> -- tc qdisc del dev eth0 root
# For Litmus: force-delete the engine
kubectl delete chaosengine <name> -n <namespace> --force
# For Chaos Mesh: delete the experiment object
kubectl delete podchaos <name> -n <namespace>
# Nuclear option: restart the affected pods
kubectl rollout restart deployment/<name> -n <namespace>
Remember: Every chaos experiment needs a cleanup verification step. The experiment template should include: "Verify no residual tc rules, iptables rules, or chaos CRDs remain." Make it a checklist item, not a hope.
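A sketch of that checklist item as a script gate (the CRD names cover Litmus and Chaos Mesh; the residual_chaos helper is a name invented here, so adjust for the tools you actually run):

```shell
# Fail loudly if any chaos objects survived the experiment's cleanup phase.
residual_chaos() {
  # reads `kubectl get ... --no-headers` output on stdin, prints the line count
  grep -c . || true
}

count=$(kubectl get chaosengines,podchaos,networkchaos,stresschaos -A \
  --no-headers 2>/dev/null | residual_chaos)
if [ "$count" -gt 0 ]; then
  echo "cleanup FAILED: $count residual chaos objects"
  exit 1
fi
echo "cleanup verified: no residual chaos objects"
```

Wire this into the same pipeline step that launched the experiment, and pair it with the tc/iptables checks from the diagnosis section for the kernel-level half of the cleanup.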
Gotcha: Pod Kill Experiment Shows No Impact (False Confidence)¶
You kill pods and the service stays healthy. You declare victory. But you were running 10 replicas and killed 1. The load balancer instantly shifted traffic. You tested nothing meaningful — the experiment was too gentle to surface real failure modes.
Fix: Scale the experiment to match realistic failure scenarios:
# Don't just kill 1 of 10 pods. Test meaningful scenarios:
# - Kill 50% of pods simultaneously
# - Kill pods during a traffic spike
# - Kill pods while a deployment is rolling out
# - Kill the pod that owns the leader election
# - Kill pods on a specific node (simulating node failure)
# Combine with load testing for realistic conditions:
# Terminal 1: generate load
hey -n 10000 -c 50 http://api-server:8080/endpoint
# Terminal 2: inject chaos during load
kubectl delete pod -l app=api-server --force --grace-period=0
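For the "kill 50% of pods" scenario, a Chaos Mesh PodChaos manifest along these lines does the kill in one shot (names, namespaces, and labels are placeholders; check the v1alpha1 API of your Chaos Mesh version):

```yaml
apiVersion: chaos-mesh.org/v1alpha1
kind: PodChaos
metadata:
  name: kill-half-api-server
  namespace: chaos-testing
spec:
  action: pod-kill
  mode: fixed-percent   # kill a percentage of matching pods, not just one
  value: "50"
  selector:
    namespaces:
      - production
    labelSelectors:
      app: api-server
```

Unlike the manual kubectl delete above, the CRD records what was injected, which makes the cleanup-verification step auditable.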
Gotcha: Network Delay Experiment Triggers Cascading Timeouts¶
You added 200ms latency between the API and database. Expected: slightly slower responses. Actual: the API's 500ms timeout starts firing. Requests back up. The connection pool exhausts. Health checks fail. Kubernetes restarts the pods. New pods hit the same slow database and crash too. Cascading failure from a "gentle" 200ms delay.
Fix: This is actually a valuable finding — your system can't handle network jitter. But control the blast radius:
Under the hood: tc netem delay 200ms 10ms adds 200ms +/- 10ms of jitter using a uniform distribution. Real network latency follows a long-tail distribution: occasional spikes of 500ms+ around a 50ms median. Use tc netem delay 50ms 200ms distribution pareto for a more realistic simulation.
# Start with much smaller delays
tc qdisc add dev eth0 root netem delay 20ms 10ms
# Increase incrementally: 20ms → 50ms → 100ms → 200ms
# Stop when you see degradation. That's your resilience boundary.
# The fix for the underlying issue:
# 1. Increase timeouts to handle realistic network variance
# 2. Add circuit breakers (fail fast instead of hanging)
# 3. Add connection pool limits with overflow rejection
# 4. Add retry with exponential backoff (not immediate retry)
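Point 4 can be sketched as a tiny shell helper, handy in the smoke-test scripts you run alongside experiments (retry_backoff and RETRY_BASE are names invented here, not part of any chaos tool):

```shell
# Retry a command up to 5 times with exponential backoff (1s, 2s, 4s, 8s).
# RETRY_BASE overrides the initial delay (handy for tests).
retry_backoff() {
  attempt=0
  delay="${RETRY_BASE:-1}"
  until "$@"; do
    attempt=$((attempt + 1))
    [ "$attempt" -ge 5 ] && return 1
    sleep "$delay"
    delay=$((delay * 2))
  done
}
```

For example, retry_backoff curl -sf http://api:8080/health gives a recovering API breathing room, where an immediate-retry loop would pile more load onto the already-degraded path.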
Gotcha: Chaos Mesh / Litmus RBAC Insufficient¶
You install Chaos Mesh but experiments fail with permission errors. The chaos controller can't inject faults because it doesn't have the right RBAC to modify pods in your target namespace.
Fix:
# Verify the chaos service account has required permissions
kubectl auth can-i delete pods -n production \
--as=system:serviceaccount:chaos-testing:chaos-controller-manager
# For Litmus: the ChaosServiceAccount needs explicit permission
kubectl create clusterrolebinding litmus-admin \
--clusterrole=cluster-admin \
--serviceaccount=litmus:litmus-admin
# For Chaos Mesh: check the controller's ClusterRole
kubectl get clusterrole chaos-mesh-chaos-controller-manager -o yaml
Gotcha: Giving chaos tooling cluster-admin in production means a bug in the chaos controller can delete any resource in any namespace. Scope it to specific namespaces and resource types. The chaos tool should never have more permission than the failure mode it's simulating.
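A namespace-scoped alternative to the cluster-admin binding above might look like this (role, namespace, and service-account names are illustrative):

```yaml
apiVersion: rbac.authorization.k8s.io/v1
kind: Role
metadata:
  name: chaos-pod-killer
  namespace: production
rules:
  - apiGroups: [""]
    resources: ["pods"]
    verbs: ["get", "list", "delete"]   # only what pod-kill actually needs
---
apiVersion: rbac.authorization.k8s.io/v1
kind: RoleBinding
metadata:
  name: chaos-pod-killer
  namespace: production
subjects:
  - kind: ServiceAccount
    name: chaos-controller-manager
    namespace: chaos-testing
roleRef:
  kind: Role
  name: chaos-pod-killer
  apiGroup: rbac.authorization.k8s.io
```

The kubectl auth can-i check from the Fix section verifies exactly this binding before the first experiment runs.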
Pattern: Progressive Chaos Adoption¶
Month 1: Foundation
├── Install chaos tooling in staging
├── Run first pod-kill experiment on a non-critical service
├── Document the experiment and results
├── Present findings to the team
└── Fix any issues discovered
Month 2: Expand scope
├── Network delay experiments in staging
├── CPU/memory stress experiments
├── Run experiments on critical services (staging only)
├── Start building a catalog of experiments
└── Integrate experiments into CI/CD (staging gate)
Month 3: Production readiness
├── Run pod-kill experiments in production (canary first)
├── First game day with the full team
├── Validate alerting fires correctly during experiments
├── Document runbooks based on discovered failure modes
└── Establish regular game day cadence (monthly)
Month 4+: Continuous chaos
├── Automated experiments running on schedule
├── Chaos experiments as part of deployment validation
├── Expand to infrastructure-level experiments (node drain, AZ failure)
├── Cross-team game days (what happens when upstream service fails?)
└── Chaos engineering metrics in SRE reports
Pattern: The Chaos Experiment Template¶
Use this template for every experiment:
## Experiment: [Name]
Date: YYYY-MM-DD
Owner: [Name]
Environment: staging | production
### Steady State
- Request success rate: >99.5%
- P99 latency: <200ms
- Error rate: <0.5%
- Active pods: 3/3
### Hypothesis
When we [inject failure X], the steady-state indicators
remain within acceptable bounds because [expected behavior].
### Method
Tool: [Chaos Mesh / Litmus / tc / manual]
Target: [service name, namespace, label selector]
Failure type: [pod-kill / network-delay / cpu-stress / etc.]
Duration: [seconds]
Blast radius: [percentage of pods / specific node / etc.]
### Abort Conditions
- Error rate exceeds 5%
- P99 latency exceeds 2 seconds
- Any 5xx rate sustained for >30 seconds
- Customer impact detected
### Abort Procedure
1. Delete the chaos experiment: kubectl delete [resource] [name]
2. Verify recovery: check dashboards
3. If no recovery in 60s: rollout restart deployment
### Results
- Hypothesis held: [yes / no]
- Observations: [what happened]
- Surprises: [unexpected behavior]
- Action items: [fixes needed]
Under the hood: Chaos Mesh uses Linux kernel namespaces and cgroups to inject faults at the container level. Pod-kill uses the Kubernetes API. Network chaos injects tc netem rules inside the pod's network namespace. Stress chaos runs stress-ng in the target's cgroup. Understanding the injection mechanism helps you predict side effects and debug cleanup failures.
Pattern: Combining Chaos with Load Testing¶
The most realistic experiments combine fault injection with production-like load:
# Step 1: Establish baseline under load
hey -n 50000 -c 100 -q 200 http://api:8080/endpoint > baseline.txt
# Step 2: Run the same load with chaos active
# In another terminal, start the chaos experiment
kubectl apply -f chaos-experiment.yaml
# In the first terminal, run the same load
hey -n 50000 -c 100 -q 200 http://api:8080/endpoint > chaos.txt
# Step 3: Compare
# baseline.txt vs chaos.txt:
# - Latency distribution shift
# - Error rate change
# - Throughput degradation
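A small helper for step 3 can pull the P99 line out of the two saved hey reports and print the shift (p99 and compare_p99 are names invented here; the awk pattern matches hey's latency-distribution output):

```shell
# p99 FILE: print the P99 latency (seconds) from a saved `hey` report.
# hey's latency distribution includes a line like "  99% in 0.0552 secs".
p99() {
  awk '/99% in/ {print $3}' "$1"
}

# compare_p99 BASELINE CHAOS: report the latency shift between the two runs,
# guarding against a report that has no distribution line.
compare_p99() {
  b=$(p99 "$1"); c=$(p99 "$2")
  [ -n "$b" ] && [ -n "$c" ] || { echo "no P99 line found"; return 1; }
  awk -v b="$b" -v c="$c" \
    'BEGIN { printf "P99 %.4fs -> %.4fs (%.1fx)\n", b, c, c/b }'
}
```

Run compare_p99 baseline.txt chaos.txt after both load runs finish; the same pattern extends to the 95% and 50% lines if you want the whole distribution shift.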
Emergency: Chaos Experiment Caused a Real Outage¶
The experiment escaped its blast radius. Production is impacted.
1. STOP the experiment immediately:
kubectl delete chaosengine --all -n chaos-testing # Litmus
kubectl delete podchaos --all -A # Chaos Mesh
# Or kill the chaos controller entirely:
kubectl scale deployment chaos-controller -n chaos-testing --replicas=0
2. Verify chaos artifacts are cleaned up:
# Check for residual tc rules on affected pods
for pod in $(kubectl get pods -l app=api-server -o name); do
kubectl exec $pod -- tc qdisc show dev eth0
done
3. Recover the service:
kubectl rollout restart deployment/api-server -n production
# Wait for healthy pods
kubectl rollout status deployment/api-server -n production
4. Post-incident:
- This IS an incident. Write a postmortem.
- Root cause is the experiment design, not the chaos tool.
- What blast radius control was missing?
- What abort condition should have triggered?
Emergency: tc Rules Stuck on a Production Node¶
Someone ran tc qdisc add directly on a node for a "quick test" and forgot to remove it. All pods on that node have degraded networking.
# On the affected node:
# Show all active tc rules
tc -s qdisc show
# Remove all tc rules from the interface
tc qdisc del dev eth0 root 2>/dev/null
tc qdisc del dev cni0 root 2>/dev/null
tc qdisc del dev flannel.1 root 2>/dev/null
# Verify clean state (should show only default pfifo_fast or fq_codel)
tc qdisc show dev eth0
# If pods are still affected, restart them to get fresh network setup
kubectl drain <node-name> --ignore-daemonsets --delete-emptydir-data
kubectl uncordon <node-name>
Prevention: never run tc directly on production nodes. Use Chaos Mesh or similar tools that handle cleanup automatically. Add a cron job that checks for unexpected tc rules:
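One way to sketch that audit (check_qdisc is a helper invented here; extend the interface list and match patterns for your CNI and any qdiscs you run deliberately):

```shell
# Flag qdiscs that look like leftover chaos shaping rather than a
# default scheduler (pfifo_fast, fq_codel, mq, noqueue).
check_qdisc() {
  case "$1" in
    *netem*|*tbf*) echo "ALERT $2: $1" ;;
    *)             echo "OK $2" ;;
  esac
}

# Cron body (requires tc on the node):
# for dev in eth0 cni0 flannel.1; do
#   check_qdisc "$(tc qdisc show dev "$dev" 2>/dev/null | head -1)" "$dev"
# done
```

Ship the ALERT lines to whatever alerting path the node already uses; a leftover netem rule that pages within minutes never becomes a two-week mystery.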
War story: An engineer ran tc qdisc add dev eth0 root netem loss 1% on a production node "just to test something real quick" and forgot to remove it. For two weeks, 1% of all packets on that node were silently dropped. Intermittent timeout alerts fired but never consistently enough to trigger investigation. A routine chaos audit finally found the leftover rule.