Chaos Engineering Footguns¶
Mistakes that turn controlled experiments into uncontrolled outages.
1. Running your first chaos experiment in production¶
You installed LitmusChaos, watched a demo, and decided to try a pod-kill experiment on your production API server. No staging test first. The experiment kills more pods than expected because the label selector matched pods in two namespaces. Half your API tier goes down.
Fix: Always run experiments in staging first. Validate that label selectors match exactly the pods you intend to target. Start with a single pod, watch the blast radius, then expand. Production chaos comes after you've built confidence in staging.
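A minimal sketch of what "start small in staging" looks like as a LitmusChaos ChaosEngine — the names, namespace, and service account here are assumptions, not values from this article:

```yaml
# Hypothetical LitmusChaos pod-delete engine: scoped to one namespace,
# one label, one replica, for a short duration.
apiVersion: litmuschaos.io/v1alpha1
kind: ChaosEngine
metadata:
  name: api-pod-delete        # hypothetical name
  namespace: staging          # staging first, never production
spec:
  engineState: active
  appinfo:
    appns: staging            # namespace scoping, not just labels
    applabel: app=api-server  # verify first: kubectl get pods -l app=api-server -n staging
    appkind: deployment
  chaosServiceAccount: litmus-admin
  experiments:
  - name: pod-delete
    spec:
      components:
        env:
        - name: TOTAL_CHAOS_DURATION
          value: "30"         # seconds
        - name: PODS_AFFECTED_PERC
          value: "0"          # 0 (the default) targets a single replica
```

Run `kubectl get pods` with the exact selector before applying — if the list contains anything you did not expect, the experiment will too.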
War story: A large e-commerce company ran their first-ever chaos experiment in production to "prove we're resilient." The pod-kill experiment triggered a cascading failure that took down the entire checkout flow for 23 minutes during a sale. The post-incident review revealed they had never tested the circuit breakers they assumed would protect them. Staging would have revealed this for free.
2. No abort mechanism¶
Your experiment is running. Error rates are climbing. You realize you don't know how to stop it. The chaos controller is still injecting faults while you scramble to find the right kubectl command. What should have been a 60-second experiment becomes a 10-minute outage.
Fix: Document the abort procedure before starting. Test the abort procedure before running the experiment. Have the kill command ready in a terminal: kubectl delete chaosengine <name> -n <namespace>. Automate abort triggers: if error rate exceeds a threshold, stop the experiment automatically.
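If deleting the ChaosEngine is too blunt (it can race with cleanup), LitmusChaos also supports stopping an engine by flipping its `engineState`. A sketch of the patch, with placeholder names left as placeholders:

```yaml
# abort.yaml — merge-patch that tells Litmus to stop injecting and run cleanup.
# Apply with:
#   kubectl patch chaosengine <name> -n <namespace> --type merge --patch-file abort.yaml
spec:
  engineState: stop
```

Keep this file (or the one-liner) staged in a terminal before the experiment starts, and rehearse it once in staging so you know how long the stop actually takes.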
Remember: Netflix's Chaos Monkey principle is that every experiment must have a "big red button." Before running any chaos test, answer three questions: (1) How do I stop it? (2) How long will it take to stop? (3) What's the blast radius if stopping fails? If you can't answer all three, you're not ready.
3. Forgetting that chaos tools need cleanup¶
You ran a network delay experiment via Chaos Mesh. The experiment CRD was deleted but the tc rules injected into the pod's network namespace are still active. Your service has had 200ms of added latency for three days. Nobody noticed because it looked like "normal" network variance.
Fix: After every experiment, verify the cleanup completed. Check tc qdisc show on affected pods. Restart affected deployments if in doubt. Monitor latency baselines — a sudden persistent shift after an experiment is a cleanup failure.
Gotcha: Chaos Mesh and Litmus inject tc rules into the pod's network namespace via the container runtime. If the chaos controller pod crashes mid-experiment, the cleanup finalizer never runs and the injected rules persist until the affected pod is restarted. Always verify cleanup independently of the chaos tool's status.
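One mitigation is to always time-box the experiment in the spec itself, so the controller reverts the rules when the duration elapses rather than waiting for a manual delete. A sketch using the Chaos Mesh NetworkChaos CRD (names and selector values are assumptions):

```yaml
# Hypothetical Chaos Mesh network-delay experiment with an explicit duration.
apiVersion: chaos-mesh.org/v1alpha1
kind: NetworkChaos
metadata:
  name: api-delay             # hypothetical name
  namespace: staging
spec:
  action: delay
  mode: one                   # inject into a single matching pod
  selector:
    namespaces:
    - staging
    labelSelectors:
      app: api-server
  delay:
    latency: "200ms"
  duration: "60s"             # hard time-box; rules are reverted when it elapses
```

Note the limitation: a duration only helps while the controller is healthy. After the experiment, still run tc qdisc show inside an affected pod to confirm the rules are actually gone.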
4. Using chaos engineering in lieu of proper testing¶
You skip unit tests, integration tests, and load tests because "we do chaos engineering." Chaos engineering tests system resilience — it doesn't test correctness. Your service might survive a pod kill gracefully while returning wrong data to every customer.
Fix: Chaos engineering is one layer in the testing pyramid, not a replacement for it. You need unit tests (correctness), integration tests (contracts), load tests (capacity), and chaos tests (resilience). Each tests something different.
Remember: Netflix's "Principles of Chaos Engineering" defines chaos as: "the discipline of experimenting on a system in order to build confidence in the system's capability to withstand turbulent conditions in production." Key word: confidence. Chaos engineering builds confidence in existing resilience mechanisms — it doesn't create them.
5. Label selectors that match too broadly¶
Your chaos experiment targets app: web. You have pods with that label in production, staging, and the monitoring namespace. The experiment kills pods across all three environments. Your monitoring goes down at the exact moment you need it most.
Fix: Always use namespace selectors in addition to label selectors. Be specific: namespace: production AND app: api-server AND version: v2. Verify the selector before applying: kubectl get pods -l app=api-server -n production to see exactly what matches.
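In Chaos Mesh terms, that fix is a selector that combines namespaces with labelSelectors, so an app: web pod in another namespace can never match. A fragment, with hypothetical label values:

```yaml
# Scope by namespace AND multiple labels — broad labels alone are not a boundary.
selector:
  namespaces:
  - production                # hard namespace fence
  labelSelectors:
    app: api-server
    version: v2               # narrows further to one rollout version
```

The equivalent dry run is the kubectl command above: if `kubectl get pods -l app=api-server,version=v2 -n production` lists one pod more than you expect, stop and fix the labels first.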
6. Running experiments during deployment rollouts¶
You scheduled a chaos experiment via cron. It fires during an unrelated deployment rollout. Kubernetes is trying to roll out new pods while the chaos tool is killing them. The rollout stalls: old pods are being killed by chaos while new pods can't start because resource limits are hit. Both the deployment and the experiment fail catastrophically.
Fix: Integrate chaos scheduling with deployment awareness. Pause automated experiments during deployments. Use a deployment lock or feature flag that chaos tools check before executing. At minimum, don't schedule automated chaos during known deployment windows.
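One concrete hook, assuming a recent Chaos Mesh release: experiments can be paused in place with an annotation, which a deploy pipeline can set before a rollout and clear afterwards. A sketch:

```yaml
# Pausing a Chaos Mesh experiment without deleting it.
# A pipeline step can apply this before a rollout:
#   kubectl annotate networkchaos <name> -n <namespace> experiment.chaos-mesh.org/pause=true
# ...and remove it (annotation value deleted) once the rollout settles.
metadata:
  annotations:
    experiment.chaos-mesh.org/pause: "true"
```

Pausing beats deleting here because the experiment's definition and history survive the deployment window.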
7. Injecting disk fill on a shared volume¶
Your disk-fill experiment targets a PVC. That PVC is shared by multiple pods (ReadWriteMany). The experiment fills it to 95%. All pods on that volume start failing — including the log aggregator that was supposed to capture the experiment results. You can't see what happened because the evidence volume is full.
Fix: Never target shared volumes without understanding all consumers. Prefer pod-local ephemeral storage for disk-fill experiments. If targeting PVCs, make sure your observability stack uses separate storage. Keep fill percentage conservative (70-80%, not 95%).
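For the conservative-fill part of the fix, LitmusChaos's disk-fill experiment exposes the percentage as an environment variable and targets the container's ephemeral storage rather than a shared PVC. A fragment with assumed values:

```yaml
# Conservative settings for the LitmusChaos disk-fill experiment (hypothetical values).
experiments:
- name: disk-fill
  spec:
    components:
      env:
      - name: FILL_PERCENTAGE
        value: "70"           # fill to 70%, not 95% — leave headroom for logs
      - name: TOTAL_CHAOS_DURATION
        value: "60"           # seconds; time-box the pressure
```

If you genuinely need to test a shared RWX volume, do it with every consumer enumerated in the experiment plan and your observability stack confirmed to be on separate storage.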
8. tc netem on the wrong network interface¶
You run tc qdisc add dev eth0 root netem delay 200ms on a Kubernetes node. But the pod traffic flows through cni0 or flannel.1 or veth interfaces, not eth0. eth0 is the management interface. You just added 200ms latency to your SSH session and the Kubernetes API server connection. The node becomes unresponsive.
Fix: Understand your CNI's network topology before using tc directly. In most cases, use Chaos Mesh or similar tools that inject tc rules into the pod's network namespace, not the host. If you must use tc directly, target the correct interface and test on a non-critical node first.
9. Chaos experiments without observability¶
You run a pod-kill experiment. It completes. The chaos tool reports "success." But you have no idea what happened to your application during the experiment. Did latency spike? Did errors increase? Did the load balancer reroute correctly? Without observability, you're flying blind.
Fix: Set up dashboards before running experiments. At minimum: request rate, error rate, latency (p50/p95/p99), and pod count. Watch the dashboard during the experiment. Record the metrics. If you can't see the impact, the experiment produced zero value.
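Beyond dashboards, it helps to have an error-rate alert armed before the experiment starts — it doubles as an automated abort trigger. A sketch as a prometheus-operator PrometheusRule; the metric names assume a typical HTTP server instrumented with request counters, not anything from this article:

```yaml
# Watchdog alert to arm before any chaos run (hypothetical metric names).
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: chaos-watchdog
  namespace: monitoring
spec:
  groups:
  - name: chaos
    rules:
    - alert: ChaosErrorBudgetBurn
      expr: |
        sum(rate(http_requests_total{status=~"5.."}[1m]))
          / sum(rate(http_requests_total[1m])) > 0.05
      for: 1m                 # sustained 5%+ error rate, not a single blip
      labels:
        severity: page
```

Wire the alert's notification channel to whoever holds the abort command, or to automation that runs it.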
10. Confusing "the service stayed up" with "the service handled it gracefully"¶
You killed a pod. The service responded to health checks and returned 200s. You call it a success. But during the pod restart, 50 in-flight requests got dropped, 3 WebSocket connections died without reconnect, and one background job lost 10 minutes of work because it wasn't checkpointing.
Fix: Measure more than just availability. Check: in-flight request handling (graceful shutdown), connection draining, data loss (jobs, queues, caches), client-side experience (retries, reconnects), and recovery time (how long until all metrics return to baseline). A service that drops requests on every pod restart is not resilient — it's just not down.
Under the hood: Kubernetes sends SIGTERM, then waits terminationGracePeriodSeconds (default 30s), then sends SIGKILL. If your app doesn't handle SIGTERM (trap it, stop accepting new requests, drain in-flight requests), every pod kill drops active requests. Test this explicitly: send a slow request, kill the pod, and check whether the response completes.
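The pod-spec side of graceful shutdown can be sketched as a deployment fragment — values here are illustrative, and the app itself must still trap SIGTERM:

```yaml
# Give the app room to drain: a longer grace period plus a short preStop
# sleep so the pod is removed from Service endpoints before SIGTERM arrives.
spec:
  terminationGracePeriodSeconds: 45   # default is 30s; size to your slowest request
  containers:
  - name: api                         # hypothetical container name
    lifecycle:
      preStop:
        exec:
          command: ["sh", "-c", "sleep 5"]
```

The preStop sleep papers over the race between endpoint removal and signal delivery; the real fix is still application-level draining, which is exactly what a pod-kill experiment should verify.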