Portal | Level: L2: Operations | Topics: Chaos Engineering, Kubernetes Core | Domain: DevOps & Tooling

Chaos Engineering & Fault Injection - Primer

Why This Matters

Every distributed system fails. The question is whether you discover the failure modes in a controlled experiment at 2pm on a Tuesday, or at 3am on a Saturday when your biggest customer's traffic is peaking. Chaos engineering is the discipline of proactively injecting failures to find weaknesses before they become outages. It's not about breaking things for fun — it's about building confidence in your system's ability to withstand turbulent conditions.

Who made it: Netflix released Chaos Monkey as open source in 2012. It was part of the Simian Army (Chaos Gorilla for AZ failure, Chaos Kong for region failure, Latency Monkey for network delays). The term "chaos engineering" was formalized by Casey Rosenthal and Nora Jones in the 2017 O'Reilly book Chaos Engineering.

You don't need Netflix-scale to benefit. If you run more than one service that talks to another service, you have a distributed system with failure modes you haven't discovered yet. Chaos engineering helps you find them.

Core Concepts

1. The Scientific Method Applied to Infrastructure

Chaos engineering follows a structured process, not random destruction:

┌───────────────────────────────────────────────────────┐
  1. Define steady state
     What does "healthy" look like? (metrics, SLIs)
  2. Hypothesize
     "If we kill 1 of 3 API pods, latency stays <200ms"
  3. Introduce variables
     Inject the failure (pod kill, network delay, etc.)
  4. Observe
     Did steady state hold? What broke? What recovered?
  5. Learn
     Document findings. Fix weaknesses. Repeat.
└───────────────────────────────────────────────────────┘

The difference between chaos engineering and just breaking things:

Chaos Engineering          Just Breaking Things
─────────────────────────  ────────────────────────
Hypothesis first           "Let's see what happens"
Steady-state defined       No baseline
Blast radius controlled    Anything goes
Automated experiments      Manual destruction
Results documented         No record
Builds confidence          Creates fear

2. Steady-State Hypothesis

Before injecting any failure, you need to define what "normal" looks like. This is your steady-state hypothesis.

Steady-State Indicators (pick the relevant ones):
├── Request success rate       > 99.5%
├── P99 latency                < 500ms
├── Error rate                 < 0.5%
├── Queue depth                < 1000
├── Pod restart count          = 0
├── Database connection pool   < 80% utilized
└── CPU utilization            < 70% across nodes

Your hypothesis takes the form: "When we inject [failure X], these steady-state indicators remain within acceptable bounds."

If the hypothesis holds, you've gained confidence. If it fails, you've found a weakness before your customers did.
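A steady-state hypothesis is concrete enough to evaluate mechanically. The sketch below checks observed metric values against the thresholds from the indicator list above; the metric names and the `hypothesis_holds` helper are illustrative, not tied to any real monitoring API.

```python
# Hypothetical steady-state check: each indicator maps to a predicate
# encoding its acceptable bound from the list above.
STEADY_STATE = {
    "request_success_rate": lambda v: v > 0.995,
    "p99_latency_ms":       lambda v: v < 500,
    "error_rate":           lambda v: v < 0.005,
    "queue_depth":          lambda v: v < 1000,
    "pod_restart_count":    lambda v: v == 0,
}

def hypothesis_holds(observed: dict) -> tuple[bool, list[str]]:
    """Return (holds, list of violated indicators)."""
    violations = [name for name, ok in STEADY_STATE.items()
                  if name in observed and not ok(observed[name])]
    return (not violations, violations)

holds, violated = hypothesis_holds({
    "request_success_rate": 0.998,
    "p99_latency_ms": 620,        # regression: exceeds the 500ms bound
    "pod_restart_count": 0,
})
print(holds, violated)            # False ['p99_latency_ms']
```

In practice the observed values would come from your metrics pipeline (e.g. a Prometheus query) sampled during the experiment window.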

3. Kubernetes Pod Experiments

The most common starting point — kill pods and see what happens:

# Litmus Chaos: pod-delete experiment
apiVersion: litmuschaos.io/v1alpha1
kind: ChaosEngine
metadata:
  name: api-pod-delete
  namespace: production
spec:
  appinfo:
    appns: production
    applabel: "app=api-server"
    appkind: deployment
  chaosServiceAccount: litmus-admin
  experiments:
    - name: pod-delete
      spec:
        components:
          env:
            - name: TOTAL_CHAOS_DURATION
              value: "60"           # Run for 60 seconds
            - name: CHAOS_INTERVAL
              value: "10"           # Kill a pod every 10 seconds
            - name: FORCE
              value: "false"        # Graceful termination first
            - name: PODS_AFFECTED_PERC
              value: "50"           # Kill 50% of matching pods

# Chaos Mesh: pod-kill experiment
apiVersion: chaos-mesh.org/v1alpha1
kind: PodChaos
metadata:
  name: api-pod-kill
  namespace: chaos-testing
spec:
  action: pod-kill
  mode: one                          # Kill one pod at a time
  selector:
    namespaces:
      - production
    labelSelectors:
      app: api-server
  duration: "60s"
  scheduler:
    cron: "@every 2m"                # Repeat every 2 minutes
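The Litmus parameters above interact: `TOTAL_CHAOS_DURATION=60` with `CHAOS_INTERVAL=10` gives roughly six kill rounds, each deleting `PODS_AFFECTED_PERC` of the matching pods. A sketch of that arithmetic (the `pod_delete_impact` helper is illustrative, not part of Litmus):

```python
import math

def pod_delete_impact(total_duration_s: int, interval_s: int,
                      matching_pods: int, affected_perc: int) -> dict:
    """Rough impact estimate for a pod-delete run (illustrative only)."""
    rounds = total_duration_s // interval_s
    killed_per_round = max(1, math.floor(matching_pods * affected_perc / 100))
    return {"rounds": rounds, "killed_per_round": killed_per_round}

# Values from the ChaosEngine above, against a 3-replica deployment
print(pod_delete_impact(60, 10, matching_pods=3, affected_perc=50))
# {'rounds': 6, 'killed_per_round': 1}
```

Working this out before the run is part of blast radius control: you know in advance that at most one of three replicas is down at a time.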

4. Network Fault Injection

Network failures are the most common and least tested failure mode:

Under the hood: tc (traffic control) uses the Linux netem (network emulator) queueing discipline. It operates at the kernel level on egress — packets are delayed, dropped, or reordered after the application sends them but before they hit the wire. This is the same mechanism that tools like Chaos Mesh use internally.

# tc (traffic control) — built into Linux, no dependencies
# Add 200ms latency to all traffic on eth0
tc qdisc add dev eth0 root netem delay 200ms

# Add 200ms latency with 50ms jitter (more realistic)
tc qdisc add dev eth0 root netem delay 200ms 50ms

# 5% packet loss
tc qdisc add dev eth0 root netem loss 5%

# Combine: latency + packet loss + reordering
tc qdisc add dev eth0 root netem delay 100ms 25ms loss 2% reorder 10%

# Remove all tc rules
tc qdisc del dev eth0 root

# Target specific destination (more surgical)
tc qdisc add dev eth0 root handle 1: prio
tc qdisc add dev eth0 parent 1:3 handle 30: netem delay 500ms
tc filter add dev eth0 parent 1:0 protocol ip u32 \
  match ip dst 10.0.1.50/32 flowid 1:3
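The `delay 200ms 50ms` form adds a per-packet delay drawn around the base value with up to the jitter amount of variation (uniform by default, unless you pass netem a `distribution` such as `normal`). A quick Python simulation of what that does to per-packet latency, useful for sanity-checking your hypothesis before running tc:

```python
import random
import statistics

def netem_delay_sample(base_ms=200.0, jitter_ms=50.0, n=10_000, seed=42):
    """Simulate netem 'delay 200ms 50ms': uniform jitter around the base
    delay (assumed default when no distribution keyword is given)."""
    rng = random.Random(seed)
    return [rng.uniform(base_ms - jitter_ms, base_ms + jitter_ms)
            for _ in range(n)]

samples = netem_delay_sample()
print(f"min={min(samples):.0f}ms max={max(samples):.0f}ms "
      f"mean={statistics.mean(samples):.0f}ms")
```

Every simulated packet lands in the 150-250ms band with a mean near 200ms, which is the distribution your retry timeouts must tolerate.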

# Chaos Mesh: network chaos
apiVersion: chaos-mesh.org/v1alpha1
kind: NetworkChaos
metadata:
  name: api-network-delay
  namespace: chaos-testing
spec:
  action: delay
  mode: all
  selector:
    namespaces:
      - production
    labelSelectors:
      app: api-server
  delay:
    latency: "300ms"
    jitter: "50ms"
    correlation: "75"
  direction: to
  target:
    selector:
      namespaces:
        - production
      labelSelectors:
        app: database
  duration: "120s"

5. CPU and Memory Stress

Test how your services behave under resource pressure:

# stress-ng — comprehensive stress testing tool
# CPU stress: 4 workers, 60 second duration
stress-ng --cpu 4 --timeout 60s

# Memory stress: allocate 2GB
stress-ng --vm 2 --vm-bytes 2G --timeout 60s

# Disk I/O stress
stress-ng --hdd 2 --timeout 60s

# Combined (realistic: everything at once)
stress-ng --cpu 2 --vm 1 --vm-bytes 1G --hdd 1 --timeout 120s

# Chaos Mesh: stress chaos
apiVersion: chaos-mesh.org/v1alpha1
kind: StressChaos
metadata:
  name: api-cpu-stress
  namespace: chaos-testing
spec:
  mode: one
  selector:
    namespaces:
      - production
    labelSelectors:
      app: api-server
  stressors:
    cpu:
      workers: 2
      load: 80                       # 80% CPU pressure
    memory:
      workers: 1
      size: "512MB"
  duration: "120s"
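Conceptually, `stress-ng --cpu N` just spins up N busy-loop workers. A minimal Python equivalent, for illustration only (stress-ng remains the right tool, with far more stressor types and better control):

```python
import multiprocessing
import time

def burn_cpu(stop_at: float) -> None:
    """Busy-loop until the deadline, pegging one core."""
    x = 0
    while time.monotonic() < stop_at:
        x += 1  # meaningless work to keep the core busy

def cpu_stress(workers: int, seconds: float) -> None:
    """Run `workers` busy-loop processes for `seconds`."""
    deadline = time.monotonic() + seconds
    procs = [multiprocessing.Process(target=burn_cpu, args=(deadline,))
             for _ in range(workers)]
    for p in procs:
        p.start()
    for p in procs:
        p.join()

if __name__ == "__main__":
    cpu_stress(workers=2, seconds=1.0)   # ~1s of 2-core load
```

Whichever tool generates the load, the interesting question is the same: does your service shed load gracefully, or does it fall over when CPU is scarce?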

6. Disk Pressure Testing

# Fill a filesystem to trigger pressure responses
# WARNING: do this on test volumes, never on /
dd if=/dev/zero of=/mnt/data/fill_file bs=1M count=10000

# Watch how your application responds:
# - Does it log errors clearly?
# - Does the health check fail?
# - Does Kubernetes evict the pod?
# - Does the PVC resize trigger work?

# Clean up
rm /mnt/data/fill_file

# Litmus Chaos: disk-fill experiment
apiVersion: litmuschaos.io/v1alpha1
kind: ChaosEngine
metadata:
  name: disk-fill-test
spec:
  appinfo:
    appns: production
    applabel: "app=database"
    appkind: statefulset
  experiments:
    - name: disk-fill
      spec:
        components:
          env:
            - name: FILL_PERCENTAGE
              value: "90"           # Fill to 90%
            - name: TOTAL_CHAOS_DURATION
              value: "120"
            - name: EPHEMERAL_STORAGE_MEBIBYTES
              value: ""             # Leave empty when using FILL_PERCENTAGE
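Whether you fill the disk with dd or Litmus, the verification step is the same: did usage actually cross the target, and did the application react? A sketch that computes filesystem usage the way `df` does, using only the standard library (the 90% threshold mirrors the experiment above):

```python
import shutil

def usage_percent(path: str) -> float:
    """Return used-space percentage for the filesystem containing `path`."""
    total, used, _free = shutil.disk_usage(path)
    return 100.0 * used / total

def over_threshold(path: str, threshold_pct: float = 90.0) -> bool:
    """True once the filesystem has been filled past the target."""
    return usage_percent(path) >= threshold_pct

print(f"current dir filesystem is {usage_percent('.'):.1f}% full")
```

Run the check in a loop during the experiment and correlate the crossing time with when your health checks, evictions, or alerts fired.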

7. Game Day Planning

A game day is a scheduled chaos experiment with the whole team watching:

Game Day Runbook:

Before:
├── 1. Define scope and blast radius (which services, which env)
├── 2. Write the steady-state hypothesis
├── 3. Prepare rollback plan (how to stop the experiment immediately)
├── 4. Notify stakeholders (other teams, on-call, management)
├── 5. Set up dashboards with steady-state metrics visible
├── 6. Confirm the experiment automation is tested in staging
└── 7. Schedule during business hours (people available to respond)

During:
├── 1. Announce start on incident channel
├── 2. Run experiment
├── 3. Observe dashboards — everyone watches the same screen
├── 4. Note anomalies, unexpected behaviors, and recovery times
├── 5. Stop immediately if blast radius exceeds plan
└── 6. Announce completion

After:
├── 1. Document results vs. hypothesis
├── 2. Create tickets for discovered weaknesses
├── 3. Prioritize fixes by blast radius and likelihood
├── 4. Schedule follow-up experiment to verify fixes
└── 5. Share findings with the broader engineering org

8. Blast Radius Control

The most important concept in chaos engineering — limit the scope of damage:

Blast Radius Progression (start small, grow with confidence):

Level 1: Single pod in staging
         Risk: minimal
         Learning: does the service handle pod restarts?

Level 2: Multiple pods in staging
         Risk: staging instability
         Learning: does the deployment maintain availability?

Level 3: Single pod in production (canary)
         Risk: slight degradation for some users
         Learning: does production match staging behavior?

Level 4: Percentage of production pods
         Risk: measurable user impact possible
         Learning: load balancing, retry behavior, circuit breaking

Level 5: Entire availability zone
         Risk: significant impact if HA is broken
         Learning: multi-AZ resilience, failover behavior

Level 6: Full region failure
         Risk: major outage potential
         Learning: disaster recovery, cross-region failover

Never jump levels. A team that hasn't validated Level 1 has no business attempting Level 4.
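The "never jump levels" rule can be enforced in your experiment tooling rather than by convention. A sketch of such a guard (the level names come from the progression above; the helper functions are hypothetical):

```python
BLAST_RADIUS_LEVELS = [
    "single pod in staging",
    "multiple pods in staging",
    "single pod in production (canary)",
    "percentage of production pods",
    "entire availability zone",
    "full region failure",
]

def next_allowed_level(validated_through: int) -> int:
    """Given the highest level already validated (1-based, 0 if none),
    return the next level a team may attempt."""
    if not 0 <= validated_through <= len(BLAST_RADIUS_LEVELS):
        raise ValueError("invalid level")
    return min(validated_through + 1, len(BLAST_RADIUS_LEVELS))

def may_attempt(level: int, validated_through: int) -> bool:
    """Never jump levels: only the next unvalidated level is allowed."""
    return level <= next_allowed_level(validated_through)

print(may_attempt(4, validated_through=1))  # False: Levels 2-3 not yet validated
print(may_attempt(2, validated_through=1))  # True
```

Wiring a check like this into the pipeline that launches experiments turns the rule into a hard gate instead of a team norm.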

Interview tip: When discussing chaos engineering in interviews, emphasize the hypothesis-driven approach. The question "What happens if we kill a pod?" is chaos engineering. "Let's kill pods and see what breaks" is not. Interviewers want to hear about steady-state definition, blast radius control, and documented learnings.

9. Integration with Observability

Chaos experiments are only valuable if you can see the impact:

┌──────────────────────────────────────────┐
│  Chaos Experiment                        │
│  (Litmus/Chaos Mesh/tc)                  │
└─────────────┬────────────────────────────┘
              │ triggers
              v
┌──────────────────────────────────────────┐
│  Metrics Pipeline                        │
│  ├── Prometheus (request rate, errors,   │
│  │   latency, saturation)               │
│  ├── Grafana (real-time dashboards)      │
│  └── Alertmanager (did alerts fire?)     │
├──────────────────────────────────────────┤
│  Logging Pipeline                        │
│  ├── Loki/ELK (error logs, stack traces) │
│  └── Application logs (retry attempts,   │
│      circuit breaker state)              │
├──────────────────────────────────────────┤
│  Tracing Pipeline                        │
│  ├── Jaeger/Tempo (distributed traces)   │
│  └── Trace-to-log correlation            │
└──────────────────────────────────────────┘

Key questions to answer during each experiment:
├── Did the alerts fire within the expected time?
├── Were the dashboards useful for diagnosis?
├── Could you see the failure in traces?
└── Were the logs clear enough to identify root cause?
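The alert-timing question can itself be checked automatically against the injection start time. A sketch assuming alert events arrive as (seconds-since-injection, alert-name) pairs; the event format and helper are illustrative:

```python
def alerts_fired_in_time(events, expected: dict) -> dict:
    """For each expected alert, report whether it fired within its deadline.

    events:   iterable of (seconds_after_injection, alert_name) pairs
    expected: {alert_name: deadline_seconds}
    """
    first_seen = {}
    for t, name in sorted(events):
        first_seen.setdefault(name, t)   # keep only the earliest firing
    return {name: (name in first_seen and first_seen[name] <= deadline)
            for name, deadline in expected.items()}

events = [(95, "HighErrorRate"), (30, "HighLatency")]
result = alerts_fired_in_time(events, {"HighLatency": 60, "HighErrorRate": 60})
print(result)   # {'HighLatency': True, 'HighErrorRate': False}
```

An alert that fires 95 seconds after injection against a 60-second expectation is itself a finding: your detection is too slow, even if the system recovered.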

10. Network Partition Simulation

The most dangerous distributed system failure — split brain:

# Simulate a network partition between two services using iptables
# Block traffic from api-server pods to database pods
iptables -A OUTPUT -d 10.0.1.50 -j DROP      # Block outbound
iptables -A INPUT -s 10.0.1.50 -j DROP        # Block inbound

# Restore
iptables -D OUTPUT -d 10.0.1.50 -j DROP
iptables -D INPUT -s 10.0.1.50 -j DROP

# Chaos Mesh: network partition
apiVersion: chaos-mesh.org/v1alpha1
kind: NetworkChaos
metadata:
  name: partition-api-from-db
spec:
  action: partition
  mode: all
  selector:
    namespaces:
      - production
    labelSelectors:
      app: api-server
  direction: both
  target:
    selector:
      namespaces:
        - production
      labelSelectors:
        app: database
  duration: "60s"
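What makes partitions uniquely dangerous is that both sides may believe they should keep serving, which is exactly what majority quorum prevents. A sketch of the quorum rule a partition experiment should validate (node names are illustrative):

```python
def surviving_side(partitions, total_nodes):
    """With majority quorum, only a partition holding more than half of all
    nodes may keep serving writes; every other side must stop (otherwise
    you have split brain). Returns the surviving partition, or None."""
    quorum = total_nodes // 2 + 1
    winners = [p for p in partitions if len(p) >= quorum]
    assert len(winners) <= 1, "two sides cannot both hold a majority"
    return winners[0] if winners else None

# A 5-node cluster partitioned 3 / 2: the 3-node side retains quorum.
side_a = {"n1", "n2", "n3"}
side_b = {"n4", "n5"}
print(sorted(surviving_side([side_a, side_b], total_nodes=5)))
# ['n1', 'n2', 'n3']
```

During the experiment, verify that the minority side actually stops accepting writes; a datastore that keeps serving on both sides of the partition has failed the test even if no errors appear.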

Common Pitfalls

  • Running chaos experiments without a hypothesis. "Let's see what happens" is not chaos engineering — it's just breaking things. Define what you expect to happen first.
  • Starting in production on day one. Start in staging. Validate your experiment tooling works correctly. Build confidence with low-risk experiments before touching production.
  • No abort mechanism. Every experiment needs a kill switch. If things go worse than expected, you need to stop immediately. Automate the abort, don't rely on manual intervention.
  • Chaos experiments without observability. If you can't see the impact, the experiment is worthless. Set up dashboards, alerts, and logging before injecting failures.
  • Treating chaos engineering as a one-time event. Systems change. New code deploys, new services are added, infrastructure evolves. Experiments that passed six months ago might fail today. Run them continuously.

    Remember: The chaos engineering loop: Hypothesize, Inject, Observe, Learn (HIOL). If you skip any step, you are not doing chaos engineering.

  • Confusing fault tolerance with tested fault tolerance. Having three replicas doesn't mean your service survives a pod kill. Having multi-AZ doesn't mean your service survives an AZ failure. The only way to know is to test it.
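The abort-mechanism pitfall is worth automating: a watchdog that polls steady state during the experiment and halts injection on the first violation. A minimal sketch; the steady-state poller and stop hook are placeholders you would wire to your own monitoring and chaos tooling:

```python
import time

def run_with_kill_switch(check_steady_state, stop_experiment,
                         duration_s: float, poll_s: float = 1.0) -> bool:
    """Poll steady state for the experiment's duration; abort on first violation.

    check_steady_state: () -> bool, True while metrics are within bounds
    stop_experiment:    () -> None, halts the fault injection immediately
    Returns True if the experiment ran to completion, False if aborted.
    """
    deadline = time.monotonic() + duration_s
    while time.monotonic() < deadline:
        if not check_steady_state():
            stop_experiment()   # automated abort: do not wait for a human
            return False
        time.sleep(poll_s)
    return True

# Example: steady state breaks on the third poll, so the run aborts early.
checks = iter([True, True, False])
aborted = []
ok = run_with_kill_switch(lambda: next(checks), lambda: aborted.append(True),
                          duration_s=10.0, poll_s=0.01)
print(ok, aborted)   # False [True]
```

The stop hook should do something decisive, such as deleting the ChaosEngine or NetworkChaos resource, so the abort works even when the person running the game day is busy diagnosing.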

