Chaos Engineering: Breaking Things on Purpose
- lesson
- chaos-engineering
- fault-injection
- game-days
- chaos-monkey
- litmus
- steady-state
- l2
---

# Chaos Engineering: Breaking Things on Purpose
Topics: chaos engineering, fault injection, game days, Chaos Monkey, Litmus, steady state
Level: L2 (Operations)
Time: 45–60 minutes
Prerequisites: Basic Kubernetes or infrastructure understanding
## The Mission
Your system has redundancy: 3 replicas, auto-scaling, health checks, circuit breakers. It should survive a pod crashing, a node dying, or a network hiccup. But does it?
You have two ways to find out: wait for a real failure (3am, during peak traffic, with customers watching), or break things yourself (Tuesday afternoon, controlled conditions, with a rollback plan).
Chaos engineering is the second option.
## The Principle
"Chaos Engineering is the discipline of experimenting on a system in order to build confidence in the system's capability to withstand turbulent conditions in production." — Principles of Chaos Engineering (principlesofchaos.org, from the Netflix chaos team)
The key word is confidence. You're not breaking things for fun. You're testing a hypothesis: "If X fails, the system should still work." If the hypothesis is wrong, you found a problem before users did.
Name Origin: Netflix created Chaos Monkey in 2011 — a tool that randomly kills production EC2 instances during business hours. Engineers were forced to build systems that survived instance failure because failure happened daily. The name: a monkey loose in your datacenter, randomly unplugging cables. Netflix later created the Simian Army: Chaos Gorilla (kills an AZ), Chaos Kong (kills a region), Latency Monkey (injects network delays), and Conformity Monkey (shuts down non-compliant instances).
## The Chaos Experiment Loop
1. Define steady state → "What does 'working' look like?"
(error rate < 1%, p99 < 500ms, all health checks pass)
2. Hypothesize → "If we kill one pod, steady state should hold"
3. Inject failure → Kill the pod (controlled)
4. Observe → Did error rate spike? Did latency increase?
Did the system recover automatically?
5. Learn → If hypothesis held: confidence increased.
If not: found a real weakness. Fix it.
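The observe step is just repeated measurement against the steady-state definition. A minimal sketch of such a checker in Python (the thresholds, sample count, and the throwaway local server are illustrative stand-ins, not a specific tool's API):

```python
import http.server
import socketserver
import threading
import time
import urllib.request

def check_steady_state(url, samples=20, max_error_rate=0.01, max_p99_s=0.5):
    """Probe url repeatedly; return True if error rate and p99 stay in bounds."""
    errors, latencies = 0, []
    for _ in range(samples):
        start = time.monotonic()
        try:
            with urllib.request.urlopen(url, timeout=2) as resp:
                if resp.status >= 500:
                    errors += 1
        except OSError:  # connection refused, timeout, HTTP errors, etc.
            errors += 1
        latencies.append(time.monotonic() - start)
    latencies.sort()
    # Nearest-rank p99 over the collected samples.
    p99 = latencies[min(len(latencies) - 1, int(len(latencies) * 0.99))]
    return errors / samples <= max_error_rate and p99 <= max_p99_s

# Demo against a throwaway local server standing in for a real service.
server = socketserver.TCPServer(("127.0.0.1", 0), http.server.SimpleHTTPRequestHandler)
threading.Thread(target=server.serve_forever, daemon=True).start()
ok = check_steady_state(f"http://127.0.0.1:{server.server_address[1]}/")
print("steady state holds:", ok)
server.shutdown()
```

Run the same check before injecting the fault (to confirm the baseline) and during it (to test the hypothesis).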
### Example experiment
Hypothesis: "If one of three API replicas is killed, error rate stays below 1%
and p99 latency stays below 500ms."
Steady state:
- Error rate: 0.2%
- p99 latency: 120ms
- 3 replicas serving traffic
Inject:
kubectl delete pod myapp-xyz --grace-period=0 --force
Observe (for 5 minutes):
- Error rate spiked to 0.8% for 30 seconds (requests to dying pod)
- p99 latency spiked to 450ms for 15 seconds
- New pod started in 12 seconds
- Steady state restored at 2 minutes
Result: PASSED (with margin). Error rate stayed under 1%.
But: 30-second spike is longer than expected. Readiness probe
initialDelaySeconds might be too high.
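The fix suggested by that finding would live in the pod spec. A hypothetical readiness probe tuning (path, port, and values are illustrative): a shorter `initialDelaySeconds` lets the replacement pod join the endpoints sooner, and a tight `periodSeconds` pulls a dying pod out of rotation faster.

```yaml
readinessProbe:
  httpGet:
    path: /healthz
    port: 8080
  initialDelaySeconds: 5   # was higher; new pod joins rotation sooner
  periodSeconds: 2         # probe frequently enough to catch a dying pod
  failureThreshold: 2      # two failed probes -> removed from endpoints
```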
## What to Break (Ranked by Impact)
### Tier 1: Instance/pod failure (start here)
# Kill a random pod (--force is required with --grace-period=0;
# kubectl get -o name already yields "pod/...", so no resource type is repeated)
kubectl delete $(kubectl get pods -l app=myapp -o name | shuf -n1) --grace-period=0 --force
# Kill a random node (drain it; remember to kubectl uncordon it afterwards)
kubectl drain $(kubectl get nodes -o name | shuf -n1) --ignore-daemonsets --delete-emptydir-data
If your system can't survive one pod or node dying, nothing else matters.
### Tier 2: Network failures
# Add 200ms latency to all traffic (using tc)
tc qdisc add dev eth0 root netem delay 200ms
# Drop 10% of packets
tc qdisc add dev eth0 root netem loss 10%
# Block traffic to a specific service (simulate dependency failure)
iptables -A OUTPUT -d 10.0.2.100 -j DROP
# Remove after testing
tc qdisc del dev eth0 root
iptables -D OUTPUT -d 10.0.2.100 -j DROP
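Injected latency shows up directly in your percentile metrics, and not just in the tail: a flat delay shifts the whole distribution. A quick illustration with hypothetical latency samples:

```python
import random

random.seed(42)

def p99(samples):
    """99th percentile via nearest-rank on sorted samples."""
    s = sorted(samples)
    return s[min(len(s) - 1, int(len(s) * 0.99))]

# Hypothetical baseline: ~100ms median service latency.
baseline = [abs(random.gauss(0.10, 0.02)) for _ in range(1000)]

# "tc netem delay 200ms" adds a constant 0.2s to every request.
with_netem = [t + 0.2 for t in baseline]

print(f"baseline p99:   {p99(baseline):.3f}s")
print(f"with netem p99: {p99(with_netem):.3f}s")  # shifted by exactly 0.2s
```

This is why a 200ms injection is a good first network experiment: the expected effect on your dashboards is easy to predict, so a deviation from the prediction is a real finding.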
### Tier 3: Resource exhaustion
# Fill disk (controlled — use a tmpfs or limit)
dd if=/dev/zero of=/tmp/fill bs=1M count=500
# CPU stress
stress-ng --cpu 4 --timeout 60
# Memory pressure
stress-ng --vm 2 --vm-bytes 80% --timeout 60
# Clean up
rm /tmp/fill
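Injecting resource pressure is only half the experiment; you also need to observe it. A small observation sketch using only the Python standard library (the path and the 0.9 threshold are illustrative; `os.getloadavg` is Unix-only):

```python
import os
import shutil

def resource_snapshot(path="/"):
    """Return disk usage fraction for path's filesystem and the 1-minute load average."""
    usage = shutil.disk_usage(path)
    load1, _, _ = os.getloadavg()  # (1m, 5m, 15m) load averages; Unix only
    return {"disk_used_frac": usage.used / usage.total, "load_1m": load1}

snap = resource_snapshot("/tmp")
print(snap)
# During a dd fill or a stress-ng run, poll this in a loop and flag when,
# e.g., disk_used_frac > 0.9 -- the steady-state threshold is up to you.
```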
### Tier 4: DNS failure
# Block DNS traffic (simulates CoreDNS failure or an upstream DNS outage)
iptables -A OUTPUT -p udp --dport 53 -j DROP
# → Every DNS lookup fails. What breaks?
# Restore
iptables -D OUTPUT -p udp --dport 53 -j DROP
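To see which lookups survive the outage, a resolution check helps: names served from `/etc/hosts` keep working while anything needing the resolver fails. A sketch using Python's resolver (the hostnames are examples; the `.invalid` TLD is reserved by RFC 2606 and never resolves):

```python
import socket

def dns_resolves(hostname):
    """Return True if hostname resolves to at least one address."""
    try:
        socket.getaddrinfo(hostname, None)
        return True
    except socket.gaierror:  # resolution failure
        return False

# "localhost" typically comes from /etc/hosts, so it survives a DNS outage;
# names that require an actual resolver query do not.
print(dns_resolves("localhost"))             # True
print(dns_resolves("no-such-host.invalid"))  # False
```

Running a check like this during the iptables block tells you whether your services depend on cached entries, hosts files, or live resolution.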
## Kubernetes Chaos Tools
### Litmus Chaos
apiVersion: litmuschaos.io/v1alpha1
kind: ChaosEngine
metadata:
  name: pod-kill-test
spec:
  appinfo:
    appns: default
    applabel: app=myapp
  chaosServiceAccount: litmus-admin
  experiments:
    - name: pod-delete
      spec:
        components:
          env:
            - name: TOTAL_CHAOS_DURATION
              value: "30"
            - name: CHAOS_INTERVAL
              value: "10"
            - name: FORCE
              value: "true"
### Other tools
| Tool | Focus | Complexity |
|---|---|---|
| Litmus | Kubernetes-native, CRD-based | Medium |
| Chaos Mesh | Kubernetes, rich fault types | Medium |
| Gremlin | SaaS, multi-platform | Low (paid) |
| tc + iptables | Network faults, any Linux | Low (manual) |
| stress-ng | CPU/memory/disk stress | Low |
| kill/kubectl delete | Process/pod termination | Lowest |
## Game Days: Chaos with a Schedule
A game day is a planned chaos experiment with the whole team:
Before (prep):
✓ Define experiments and success criteria
✓ Set up monitoring dashboards
✓ Prepare rollback commands
✓ Notify stakeholders ("we're testing resilience Tuesday 2-4pm")
✓ Have incident response ready (just in case)
During (execute):
✓ Run experiments one at a time
✓ Observe metrics after each
✓ Document what happened
✓ Stop if anything goes wrong
After (learn):
✓ What worked as expected?
✓ What surprised us?
✓ What do we need to fix?
✓ Create action items (tracked, assigned, dated)
Gotcha: Don't run chaos experiments in production without a rollback plan. And don't run them during peak hours the first time. Start in staging. Graduate to production during low-traffic hours. Only test during peak when you're confident the system can handle it.
## Flashcard Check
Q1: What is the goal of chaos engineering?
Build confidence that the system can withstand turbulent conditions. Not breaking things for fun — testing specific hypotheses about resilience.
Q2: What should you break first?
Pod/instance failure. If your system can't survive one pod dying, testing network failures is pointless.
Q3: Netflix Chaos Monkey — what does it do?
Randomly kills production EC2 instances during business hours. Forces engineers to build systems that survive instance failure as a normal condition.
Q4: Should you run chaos experiments in production?
Eventually yes — that's where real confidence comes from. But start in staging, graduate to low-traffic production hours, and always have a rollback plan.
## Takeaways
- Break it yourself before it breaks itself. Tuesday 2pm > Saturday 3am.
- Start with pod/instance kills. This is the most likely real failure and the easiest to test. If this doesn't work, nothing else matters.
- Define steady state FIRST. You can't evaluate results without knowing what "normal" looks like. Error rate, latency, health checks: measure before and after.
- Chaos is not random destruction. It's controlled experimentation with a hypothesis, observation, and learning. The experiment is worthless without the analysis.
- Game days build team confidence. Knowing your system survives failure (because you've seen it) is better than hoping it does (because the architecture diagram says so).
## Exercises
- Define steady state for a local service. Run a simple web server locally (e.g., `python3 -m http.server 8080`). Define three steady-state metrics you would measure: response code, response time, and process presence. Use `curl -o /dev/null -s -w "%{http_code} %{time_total}\n" http://localhost:8080/` to collect baseline values. Document your steady state in a text file.
- Simulate process failure and recovery. Start two instances of a simple HTTP server on different ports (`python3 -m http.server 8080 &` and `python3 -m http.server 8081 &`). Write a loop that curls both every second. Kill one process with `kill -9`. Observe the error rate in your loop output. Restart the killed process. Document: how long was the gap? Did the other instance stay healthy?
- Inject network latency with tc. In a container or VM (not your host), use `tc qdisc add dev lo root netem delay 200ms` to add 200ms latency to loopback traffic. Run `curl -w "%{time_total}\n" http://localhost:8080/` before and after. Observe the difference. Remove with `tc qdisc del dev lo root`. Write a hypothesis and result statement in chaos experiment format.
- Simulate disk pressure. Create a tmpfs mount with a small size limit: `mkdir /tmp/chaostest && mount -t tmpfs -o size=10M tmpfs /tmp/chaostest`. Write a script that fills it: `dd if=/dev/zero of=/tmp/chaostest/fill bs=1M count=20`. Observe the error when the disk fills. Clean up with `umount /tmp/chaostest`. Document what application behavior you would expect if this were a real data directory.
- Plan a game day. Write a one-page game day plan for a service you operate (or an imaginary three-tier web app). Include: steady-state definition, three experiments ranked by severity, success criteria for each, rollback steps, and a communication plan (who to notify before, during, and after).
## Related Lessons
- The Cascading Timeout — the failure modes chaos engineering tests
- How Incident Response Actually Works — what to do when chaos reveals a real problem
- The Split-Brain Nightmare — network partition testing