
Chaos Engineering: Breaking Things on Purpose


Topics: chaos engineering, fault injection, game days, Chaos Monkey, Litmus, steady state
Level: L2 (Operations)
Time: 45–60 minutes
Prerequisites: Basic Kubernetes or infrastructure understanding


The Mission

Your system has redundancy: 3 replicas, auto-scaling, health checks, circuit breakers. It should survive a pod crashing, a node dying, or a network hiccup. But does it?

You have two ways to find out: wait for a real failure (3am, during peak traffic, with customers watching), or break things yourself (Tuesday afternoon, controlled conditions, with a rollback plan).

Chaos engineering is the second option.


The Principle

"Chaos Engineering is the discipline of experimenting on a system in order to build confidence in the system's ability to withstand turbulent conditions in production." — Principles of Chaos Engineering (principlesofchaos.org)

The key word is confidence. You're not breaking things for fun. You're testing a hypothesis: "If X fails, the system should still work." If the hypothesis is wrong, you found a problem before users did.

Name Origin: Netflix created Chaos Monkey in 2011 — a tool that randomly kills production EC2 instances during business hours. Engineers were forced to build systems that survived instance failure because failure happened daily. The name: a monkey loose in your datacenter, randomly unplugging cables. Netflix later created the Simian Army: Chaos Gorilla (kills an AZ), Chaos Kong (kills a region), Latency Monkey (injects network delays), and Conformity Monkey (shuts down non-compliant instances).


The Chaos Experiment Loop

1. Define steady state → "What does 'working' look like?"
   (error rate < 1%, p99 < 500ms, all health checks pass)

2. Hypothesize → "If we kill one pod, steady state should hold"

3. Inject failure → Kill the pod (controlled)

4. Observe → Did error rate spike? Did latency increase?
   Did the system recover automatically?

5. Learn → If hypothesis held: confidence increased.
   If not: found a real weakness. Fix it.

Example experiment

Hypothesis: "If one of three API replicas is killed, error rate stays below 1%
            and p99 latency stays below 500ms."

Steady state:
  - Error rate: 0.2%
  - p99 latency: 120ms
  - 3 replicas serving traffic

Inject:
  kubectl delete pod myapp-xyz --grace-period=0 --force

Observe (for 5 minutes):
  - Error rate spiked to 0.8% for 30 seconds (requests to dying pod)
  - p99 latency spiked to 450ms for 15 seconds
  - New pod started in 12 seconds
  - Steady state restored at 2 minutes

Result: PASSED (with margin). Error rate stayed under 1%.
But: 30-second spike is longer than expected. Readiness probe
     initialDelaySeconds might be too high.
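That pass/fail judgment can be scripted. A minimal sketch in shell, where the 1% threshold and the request counts are illustrative assumptions:

```shell
#!/bin/sh
# Decide whether the error-rate hypothesis held, given failed/total
# request counts observed during the experiment window.
check_error_rate() {
  failed=$1; total=$2; threshold_pct=$3
  # awk handles the floating-point math (POSIX sh arithmetic is integer-only)
  awk -v f="$failed" -v t="$total" -v th="$threshold_pct" 'BEGIN {
    rate = (t > 0) ? 100 * f / t : 100
    verdict = (rate < th) ? "PASS" : "FAIL"
    printf "error rate: %.2f%% (threshold %s%%) -> %s\n", rate, th, verdict
  }'
}

check_error_rate 16 2000 1    # the 0.8% spike from the run above
check_error_rate 50 2000 1
```

Feed it the failed and total request counts from your monitoring window; the same shape works for any ratio-based steady-state metric.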

What to Break (Ranked by Impact)

Tier 1: Instance/pod failure (start here)

# Kill a random pod (-o name already yields pod/<name>, so no type after delete)
kubectl delete $(kubectl get pods -l app=myapp -o name | shuf -n1) --grace-period=0 --force

# Kill a random node (drain it; evicts its pods)
kubectl drain $(kubectl get nodes --no-headers | shuf -n1 | awk '{print $1}') --ignore-daemonsets --delete-emptydir-data

If your system can't survive one pod or node dying, nothing else matters.
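One way to turn that first experiment into a measurement is to time the recovery. A sketch, assuming the app=myapp label and defining "recovered" as the ready count returning to its pre-kill value; adapt the jsonpath to your pods:

```shell
#!/bin/sh
# Kill one random pod, then time how long until all replicas are Ready.
# Naive on purpose: if the pod terminates slowly, the loop may see full
# readiness before the count ever drops and report ~0s.
LABEL=${LABEL:-app=myapp}

ready_count() {
  # Count pods whose first container reports ready=true
  kubectl get pods -l "$LABEL" \
    -o jsonpath='{range .items[*]}{.status.containerStatuses[0].ready}{"\n"}{end}' \
    | grep -c true
}

kill_and_time() {
  want=$(ready_count)
  kubectl delete "$(kubectl get pods -l "$LABEL" -o name | shuf -n 1)" \
    --grace-period=0 --force
  start=$(date +%s)
  until [ "$(ready_count)" -ge "$want" ]; do sleep 1; done
  echo "recovered in $(( $(date +%s) - start ))s"
}
```

If the recovery time lands close to your readiness probe's initialDelaySeconds, the probe, not the scheduler, is likely the bottleneck.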

Tier 2: Network failures

# Add 200ms latency to all traffic (using tc)
tc qdisc add dev eth0 root netem delay 200ms

# Drop 10% of packets (use 'change' here: a second 'add' on the root qdisc
# fails with "File exists"; or combine both faults: netem delay 200ms loss 10%)
tc qdisc change dev eth0 root netem loss 10%

# Block traffic to a specific service (simulate dependency failure)
iptables -A OUTPUT -d 10.0.2.100 -j DROP

# Remove after testing
tc qdisc del dev eth0 root
iptables -D OUTPUT -d 10.0.2.100 -j DROP
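A fault that outlives the experiment turns a drill into an incident. One defensive pattern is a wrapper that always reverts on exit, sketched here with a DRY_RUN mode since tc needs root; the device name and duration are assumptions:

```shell
#!/bin/sh
# Self-reverting latency injection: the trap removes the qdisc on exit or
# interrupt, so a crashed or Ctrl-C'd experiment can't leave latency behind.
DEV=${DEV:-eth0}
DURATION=${DURATION:-60}
DRY_RUN=${DRY_RUN:-1}   # default to printing commands, not running them

run() {
  if [ "$DRY_RUN" = "1" ]; then echo "+ $*"; else "$@"; fi
}

inject_latency() {
  trap 'run tc qdisc del dev "$DEV" root' EXIT INT TERM
  run tc qdisc add dev "$DEV" root netem delay 200ms
  run sleep "$DURATION"
}

inject_latency   # with DRY_RUN=1 this only prints what it would do
```

Set DRY_RUN=0 (as root, inside a test container or VM) to actually inject.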

Tier 3: Resource exhaustion

# Fill disk (controlled — use a tmpfs or limit)
dd if=/dev/zero of=/tmp/fill bs=1M count=500

# CPU stress
stress-ng --cpu 4 --timeout 60

# Memory pressure
stress-ng --vm 2 --vm-bytes 80% --timeout 60

# Clean up
rm /tmp/fill

Tier 4: DNS failure

# Block outbound DNS queries (UDP port 53); add a matching tcp rule to cover
# TCP fallback. This simulates CoreDNS failure or an upstream DNS outage.
iptables -A OUTPUT -p udp --dport 53 -j DROP
# → Every DNS lookup fails. What breaks?

# Restore
iptables -D OUTPUT -p udp --dport 53 -j DROP

Kubernetes Chaos Tools

Litmus Chaos

apiVersion: litmuschaos.io/v1alpha1
kind: ChaosEngine
metadata:
  name: pod-kill-test
spec:
  engineState: active
  appinfo:
    appns: default
    applabel: app=myapp
    appkind: deployment
  chaosServiceAccount: litmus-admin
  experiments:
    - name: pod-delete
      spec:
        components:
          env:
            - name: TOTAL_CHAOS_DURATION
              value: "30"
            - name: CHAOS_INTERVAL
              value: "10"
            - name: FORCE
              value: "true"

Other tools

Tool                  Focus                          Complexity
Litmus                Kubernetes-native, CRD-based   Medium
Chaos Mesh            Kubernetes, rich fault types   Medium
Gremlin               SaaS, multi-platform           Low (paid)
tc + iptables         Network faults, any Linux      Low (manual)
stress-ng             CPU/memory/disk stress         Low
kill/kubectl delete   Process/pod termination        Lowest
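To make the "Lowest" row concrete: the original Chaos Monkey idea is essentially a loop. A bare-bones sketch, where the label, interval, and dry-run default are assumptions, and ROUNDS defaults to 0 so nothing happens until you opt in:

```shell
#!/bin/sh
# Periodically pick one pod matching LABEL and (pretend to) kill it.
LABEL=${LABEL:-app=myapp}
INTERVAL=${INTERVAL:-300}
ROUNDS=${ROUNDS:-0}        # deliberate: no chaos unless you ask for it
DRY_RUN=${DRY_RUN:-1}

pick_victim() {
  kubectl get pods -l "$LABEL" -o name | shuf -n 1
}

round=0
while [ "$round" -lt "$ROUNDS" ]; do
  victim=$(pick_victim)
  if [ "$DRY_RUN" = "1" ]; then
    echo "would kill: $victim"
  else
    kubectl delete "$victim" --grace-period=0 --force
  fi
  sleep "$INTERVAL"
  round=$((round + 1))
done
```

The dedicated tools above add what this lacks: scoping, scheduling, abort conditions, and result reporting.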

Game Days: Chaos with a Schedule

A game day is a planned chaos experiment with the whole team:

Before (prep):
  ✓ Define experiments and success criteria
  ✓ Set up monitoring dashboards
  ✓ Prepare rollback commands
  ✓ Notify stakeholders ("we're testing resilience Tuesday 2-4pm")
  ✓ Have incident response ready (just in case)

During (execute):
  ✓ Run experiments one at a time
  ✓ Observe metrics after each
  ✓ Document what happened
  ✓ Stop if anything goes wrong

After (learn):
  ✓ What worked as expected?
  ✓ What surprised us?
  ✓ What do we need to fix?
  ✓ Create action items (tracked, assigned, dated)

Gotcha: Don't run chaos experiments in production without a rollback plan. And don't run them during peak hours the first time. Start in staging. Graduate to production during low-traffic hours. Only test during peak when you're confident the system can handle it.
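The "observe metrics after each" step works best when "before" exists as a file, not a memory. A sketch of that comparison step; collect_metrics here is a placeholder you would replace with real curl or PromQL queries:

```shell
#!/bin/sh
# Snapshot steady-state metrics before and after an experiment, then diff.
collect_metrics() {
  # Placeholder values in "name value" form; swap in real measurements
  echo "error_rate_pct 0.2"
  echo "p99_latency_ms 120"
  echo "replicas_ready 3"
}

collect_metrics > /tmp/steady-before.txt
# ... inject the fault here, wait for recovery ...
collect_metrics > /tmp/steady-after.txt

if diff -u /tmp/steady-before.txt /tmp/steady-after.txt; then
  echo "steady state restored"
else
  echo "steady state NOT restored: stop and investigate"
fi
```

The diff output doubles as game-day documentation: attach it to the experiment's notes.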


Flashcard Check

Q1: What is the goal of chaos engineering?

Build confidence that the system can withstand turbulent conditions. Not breaking things for fun — testing specific hypotheses about resilience.

Q2: What should you break first?

Pod/instance failure. If your system can't survive one pod dying, testing network failures is pointless.

Q3: Netflix Chaos Monkey — what does it do?

Randomly kills production EC2 instances during business hours. Forces engineers to build systems that survive instance failure as a normal condition.

Q4: Should you run chaos experiments in production?

Eventually yes — that's where real confidence comes from. But start in staging, graduate to low-traffic production hours, and always have a rollback plan.


Takeaways

  1. Break it yourself before it breaks itself. Tuesday 2pm > Saturday 3am.

  2. Start with pod/instance kills. This is the most likely real failure and the easiest to test. If this doesn't work, nothing else matters.

  3. Define steady state FIRST. You can't evaluate results without knowing what "normal" looks like. Error rate, latency, health checks — measure before and after.

  4. Chaos is not random destruction. It's controlled experimentation with a hypothesis, observation, and learning. The experiment is worthless without the analysis.

  5. Game days build team confidence. Knowing your system survives failure (because you've seen it) is better than hoping it does (because the architecture diagram says so).


Exercises

  1. Define steady state for a local service. Run a simple web server locally (e.g., python3 -m http.server 8080). Define three steady-state metrics you would measure: response code, response time, and process presence. Use curl -o /dev/null -s -w "%{http_code} %{time_total}\n" http://localhost:8080/ to collect baseline values. Document your steady state in a text file.

  2. Simulate process failure and recovery. Start two instances of a simple HTTP server on different ports (python3 -m http.server 8080 & and python3 -m http.server 8081 &). Write a loop that curls both every second. Kill one process with kill -9. Observe the error rate in your loop output. Restart the killed process. Document: how long was the gap? Did the other instance stay healthy?

  3. Inject network latency with tc. In a container or VM (not your host), use tc qdisc add dev lo root netem delay 200ms to add 200ms latency to loopback traffic. Run curl -w "%{time_total}\n" http://localhost:8080/ before and after. Observe the difference. Remove with tc qdisc del dev lo root. Write a hypothesis and result statement in chaos experiment format.

  4. Simulate disk pressure. Create a tmpfs mount with a small size limit: mkdir /tmp/chaostest && mount -t tmpfs -o size=10M tmpfs /tmp/chaostest. Write a script that fills it: dd if=/dev/zero of=/tmp/chaostest/fill bs=1M count=20. Observe the error when disk fills. Clean up with umount /tmp/chaostest. Document what application behavior you would expect if this were a real data directory.

  5. Plan a game day. Write a one-page game day plan for a service you operate (or an imaginary three-tier web app). Include: steady-state definition, three experiments ranked by severity, success criteria for each, rollback steps, and a communication plan (who to notify before, during, and after).
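For exercise 2, one possible shape for the monitoring loop (ports match the exercise; requires curl, which prints 000 as the status code when the connection fails):

```shell
#!/bin/sh
# Probe both servers once per second and print a timestamped status line.
probe() {
  # HTTP status code; curl itself reports 000 when the request fails
  curl -s -o /dev/null -w '%{http_code}' --max-time 1 "http://localhost:$1/" || true
}

i=0
while [ "$i" -lt 3 ]; do    # 3 rounds for the sketch; use 'while :' for real
  printf '%s 8080=%s 8081=%s\n' "$(date +%T)" "$(probe 8080)" "$(probe 8081)"
  sleep 1
  i=$((i + 1))
done
```

Redirect the output to a file so you can count the 000 lines afterwards and compute the gap duration.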


Related Lessons

  • The Cascading Timeout — the failure modes chaos engineering tests
  • How Incident Response Actually Works — what to do when chaos reveals a real problem
  • The Split-Brain Nightmare — network partition testing