Portal | Level: L2: Operations | Topics: Chaos Engineering, Kubernetes Core | Domain: DevOps & Tooling
Chaos Engineering & Fault Injection - Primer¶
Why This Matters¶
Every distributed system fails. The question is whether you discover the failure modes in a controlled experiment at 2pm on a Tuesday, or at 3am on a Saturday when your biggest customer's traffic is peaking. Chaos engineering is the discipline of proactively injecting failures to find weaknesses before they become outages. It's not about breaking things for fun — it's about building confidence in your system's ability to withstand turbulent conditions.
Who made it: Netflix released Chaos Monkey as open source in 2012. It was part of the Simian Army (Chaos Gorilla for AZ failure, Chaos Kong for region failure, Latency Monkey for network delays). The practice was formalized in Netflix's Principles of Chaos Engineering, and Casey Rosenthal and Nora Jones later wrote the O'Reilly book Chaos Engineering.
Netflix popularized this with Chaos Monkey, but you don't need Netflix-scale to benefit. If you run more than one service that talks to another service, you have a distributed system with failure modes you haven't discovered yet. Chaos engineering helps you find them.
Core Concepts¶
1. The Scientific Method Applied to Infrastructure¶
Chaos engineering follows a structured process, not random destruction:
┌───────────────────────────────────────────────────────┐
│ 1. Define steady state │
│ What does "healthy" look like? (metrics, SLIs) │
│ │
│ 2. Hypothesize │
│ "If we kill 1 of 3 API pods, latency stays <200ms"│
│ │
│ 3. Introduce variables │
│ Inject the failure (pod kill, network delay, etc.) │
│ │
│ 4. Observe │
│ Did steady state hold? What broke? What recovered? │
│ │
│ 5. Learn │
│ Document findings. Fix weaknesses. Repeat. │
└───────────────────────────────────────────────────────┘
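The loop above can be sketched as a small driver script. This is a hypothetical skeleton, not any particular tool's API: `run_experiment`, the hypothesis string, and the inject/check commands are all placeholders you would wire to your own tooling.

```shell
#!/usr/bin/env sh
# Hypothetical chaos-experiment driver: verify steady state, inject,
# re-verify. The inject/check commands are supplied by the operator.

check_steady_state() {
  # In a real setup this would query Prometheus or another SLI source;
  # here it simply runs whatever check command was passed in.
  "$@"
}

run_experiment() {
  hypothesis=$1; inject_cmd=$2; check_cmd=$3

  echo "HYPOTHESIS: $hypothesis"

  # 1. Confirm steady state before touching anything
  if ! check_steady_state $check_cmd; then
    echo "ABORT: steady state not healthy before injection"
    return 2
  fi

  # 3. Introduce the variable
  $inject_cmd

  # 4. Observe, 5. Learn (document the outcome either way)
  if check_steady_state $check_cmd; then
    echo "RESULT: hypothesis held"
  else
    echo "RESULT: hypothesis failed, weakness found"
    return 1
  fi
}

# Example wiring (placeholders):
# run_experiment "P99 stays under 200ms" \
#   "kubectl delete pod api-server-0" \
#   "./check_slis.sh"
```

Note that the driver aborts if steady state is already violated before injection: an experiment run against an unhealthy system tells you nothing about the fault you injected.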
The difference between chaos engineering and just breaking things:
| Chaos Engineering | Just Breaking Things |
|---|---|
| Hypothesis first | "Let's see what happens" |
| Steady-state defined | No baseline |
| Blast radius controlled | Anything goes |
| Automated experiments | Manual destruction |
| Results documented | No record |
| Builds confidence | Creates fear |
2. Steady-State Hypothesis¶
Before injecting any failure, you need to define what "normal" looks like. This is your steady-state hypothesis.
Steady-State Indicators (pick the relevant ones):
├── Request success rate > 99.5%
├── P99 latency < 500ms
├── Error rate < 0.5%
├── Queue depth < 1000
├── Pod restart count = 0
├── Database connection pool < 80% utilized
└── CPU utilization < 70% across nodes
Your hypothesis takes the form: "When we inject [failure X], these steady-state indicators remain within acceptable bounds."
If the hypothesis holds, you've gained confidence. If it fails, you've found a weakness before your customers did.
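A hypothesis like "success rate stays above 99.5% and P99 stays under 500ms" can be evaluated mechanically. A minimal sketch, assuming the metric values have already been fetched from your metrics backend and are passed in as plain numbers:

```shell
#!/usr/bin/env sh
# Evaluate steady-state indicators against thresholds.
# Values are passed in as numbers; in practice you would fetch them
# from Prometheus or similar before calling this.

steady_state_ok() {
  success_rate=$1   # percent, e.g. 99.7
  p99_ms=$2         # milliseconds

  # awk handles the float comparison (sh arithmetic is integer-only)
  awk -v sr="$success_rate" -v p99="$p99_ms" \
    'BEGIN { exit !(sr > 99.5 && p99 < 500) }'
}

if steady_state_ok 99.7 320; then
  echo "steady state holds"       # this branch runs for 99.7 / 320
else
  echo "steady state violated"
fi
```

An experiment harness can call a check like this before, during, and after injection, and use the exit code to decide whether to continue or abort.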
3. Kubernetes Pod Experiments¶
The most common starting point — kill pods and see what happens:
# Litmus Chaos: pod-delete experiment
apiVersion: litmuschaos.io/v1alpha1
kind: ChaosEngine
metadata:
  name: api-pod-delete
  namespace: production
spec:
  appinfo:
    appns: production
    applabel: "app=api-server"
    appkind: deployment
  chaosServiceAccount: litmus-admin
  experiments:
    - name: pod-delete
      spec:
        components:
          env:
            - name: TOTAL_CHAOS_DURATION
              value: "60"    # Run for 60 seconds
            - name: CHAOS_INTERVAL
              value: "10"    # Kill a pod every 10 seconds
            - name: FORCE
              value: "false" # Graceful termination first
            - name: PODS_AFFECTED_PERC
              value: "50"    # Kill 50% of matching pods
# Chaos Mesh: pod-kill experiment
apiVersion: chaos-mesh.org/v1alpha1
kind: PodChaos
metadata:
  name: api-pod-kill
  namespace: chaos-testing
spec:
  action: pod-kill
  mode: one              # Kill one pod at a time
  selector:
    namespaces:
      - production
    labelSelectors:
      app: api-server
  duration: "60s"
  scheduler:
    cron: "@every 2m"    # Repeat every 2 minutes
4. Network Fault Injection¶
Network failures are the most common and least tested failure mode:
Under the hood:
tc (traffic control) uses the Linux netem (network emulator) queueing discipline. It operates at the kernel level on egress — packets are delayed, dropped, or reordered after the application sends them but before they hit the wire. This is the same mechanism that tools like Chaos Mesh use internally.
# tc (traffic control) — built into Linux, no dependencies
# Add 200ms latency to all traffic on eth0
tc qdisc add dev eth0 root netem delay 200ms
# Add 200ms latency with 50ms jitter (more realistic)
tc qdisc add dev eth0 root netem delay 200ms 50ms
# 5% packet loss
tc qdisc add dev eth0 root netem loss 5%
# Combine: latency + packet loss + reordering
tc qdisc add dev eth0 root netem delay 100ms 25ms loss 2% reorder 10%
# Remove all tc rules
tc qdisc del dev eth0 root
# Target specific destination (more surgical)
tc qdisc add dev eth0 root handle 1: prio
tc qdisc add dev eth0 parent 1:3 handle 30: netem delay 500ms
tc filter add dev eth0 parent 1:0 protocol ip u32 \
match ip dst 10.0.1.50/32 flowid 1:3
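When driving tc by hand, one practical safeguard is to wrap the injection in a trap so the qdisc is always removed, even if the experiment script crashes mid-run. A sketch; `with_netem`, `TC_BIN`, and `DEV` are conventions of this sketch rather than tc features (`TC_BIN` exists so the wrapper can be dry-run with a stand-in binary):

```shell
#!/usr/bin/env sh
# Apply a netem fault for the duration of one command, then always
# clean up. Set TC_BIN=echo to dry-run without touching the network.
TC_BIN=${TC_BIN:-tc}
DEV=${DEV:-eth0}

with_netem() {
  netem_args=$1; shift

  $TC_BIN qdisc add dev "$DEV" root netem $netem_args
  # Remove the qdisc on exit, Ctrl-C, or termination, no matter what
  trap '$TC_BIN qdisc del dev "$DEV" root' EXIT INT TERM

  "$@"   # run the workload/observation under the injected fault
}

# Example: 200ms delay while probing a health endpoint
# with_netem "delay 200ms" curl -s http://localhost:8080/healthz
```

This is the shell-level version of the abort mechanism discussed under Common Pitfalls: the cleanup path must not depend on the experiment finishing cleanly.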
# Chaos Mesh: network chaos
apiVersion: chaos-mesh.org/v1alpha1
kind: NetworkChaos
metadata:
  name: api-network-delay
  namespace: chaos-testing
spec:
  action: delay
  mode: all
  selector:
    namespaces:
      - production
    labelSelectors:
      app: api-server
  delay:
    latency: "300ms"
    jitter: "50ms"
    correlation: "75"
  direction: to
  target:
    mode: all            # target needs its own selection mode
    selector:
      namespaces:
        - production
      labelSelectors:
        app: database
  duration: "120s"
5. CPU and Memory Stress¶
Test how your services behave under resource pressure:
# stress-ng — comprehensive stress testing tool
# CPU stress: 4 workers, 60 second duration
stress-ng --cpu 4 --timeout 60s
# Memory stress: 2 workers, 2GB each (--vm-bytes is per worker)
stress-ng --vm 2 --vm-bytes 2G --timeout 60s
# Disk I/O stress
stress-ng --hdd 2 --timeout 60s
# Combined (realistic: everything at once)
stress-ng --cpu 2 --vm 1 --vm-bytes 1G --hdd 1 --timeout 120s
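While a stress run is active, it helps to sample the same indicators you defined as steady state. A minimal Linux-only observation loop; the function name, interval, and sample count are arbitrary choices:

```shell
#!/usr/bin/env sh
# Sample load average and available memory while a stress run is
# active. Linux-only (reads /proc).

sample_pressure() {
  count=$1; interval=$2
  i=0
  while [ "$i" -lt "$count" ]; do
    load=$(cut -d' ' -f1 /proc/loadavg)
    avail_kb=$(awk '/^MemAvailable/ { print $2 }' /proc/meminfo)
    echo "load1=$load mem_available_kb=$avail_kb"
    i=$((i + 1))
    [ "$i" -lt "$count" ] && sleep "$interval"
  done
  return 0
}

# Run alongside a stress test, e.g.:
# stress-ng --cpu 2 --timeout 60s &
# sample_pressure 12 5   # 12 samples, 5 seconds apart
```

Capturing these samples alongside the stress-ng output gives you a record of how far the node was pushed, not just whether the service survived.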
# Chaos Mesh: stress chaos
apiVersion: chaos-mesh.org/v1alpha1
kind: StressChaos
metadata:
  name: api-cpu-stress
  namespace: chaos-testing
spec:
  mode: one
  selector:
    namespaces:
      - production
    labelSelectors:
      app: api-server
  stressors:
    cpu:
      workers: 2
      load: 80        # 80% CPU pressure
    memory:
      workers: 1
      size: "512MB"
  duration: "120s"
6. Disk Pressure Testing¶
# Fill a filesystem to trigger pressure responses
# WARNING: do this on test volumes, never on /
dd if=/dev/zero of=/mnt/data/fill_file bs=1M count=10000
# Watch how your application responds:
# - Does it log errors clearly?
# - Does the health check fail?
# - Does Kubernetes evict the pod?
# - Does the PVC resize trigger work?
# Clean up
rm /mnt/data/fill_file
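A safer variant of the raw dd fill stops at a target utilization instead of running the volume completely out of space. A sketch; the mount point, step size, and 90% ceiling are illustrative:

```shell
#!/usr/bin/env sh
# Fill a filesystem in 100MB steps, stopping at a target utilization.
# The mount point and ceiling are illustrative; never point this at /.

disk_usage_pct() {
  # POSIX df: second line, 5th column is "Use%"; strip the % sign
  df -P "$1" | awk 'NR==2 { gsub("%", "", $5); print $5 }'
}

fill_to() {
  mount=$1; ceiling=$2; fill_file=$mount/chaos_fill

  while [ "$(disk_usage_pct "$mount")" -lt "$ceiling" ]; do
    # Append 100MB per step so progress is observable and abortable
    dd if=/dev/zero of="$fill_file" bs=1M count=100 \
       oflag=append conv=notrunc 2>/dev/null
  done
  echo "reached ${ceiling}% on $mount; remove $fill_file to recover"
}

# Example (test volume only): fill_to /mnt/data 90
```

Stepping toward a ceiling rather than filling blindly is blast radius control applied to a single experiment: you can watch the pressure responses kick in and still stop short of a genuinely wedged volume.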
# Litmus Chaos: disk-fill experiment
apiVersion: litmuschaos.io/v1alpha1
kind: ChaosEngine
metadata:
  name: disk-fill-test
spec:
  appinfo:
    appns: production
    applabel: "app=database"
    appkind: statefulset
  experiments:
    - name: disk-fill
      spec:
        components:
          env:
            - name: FILL_PERCENTAGE
              value: "90"   # Fill to 90%
            - name: TOTAL_CHAOS_DURATION
              value: "120"
            - name: EPHEMERAL_STORAGE_MEBIBYTES
              value: ""     # Target PVC, not ephemeral
7. Game Day Planning¶
A game day is a scheduled chaos experiment with the whole team watching:
Game Day Runbook:
Before:
├── 1. Define scope and blast radius (which services, which env)
├── 2. Write the steady-state hypothesis
├── 3. Prepare rollback plan (how to stop the experiment immediately)
├── 4. Notify stakeholders (other teams, on-call, management)
├── 5. Set up dashboards with steady-state metrics visible
├── 6. Confirm the experiment automation is tested in staging
└── 7. Schedule during business hours (people available to respond)
During:
├── 1. Announce start on incident channel
├── 2. Run experiment
├── 3. Observe dashboards — everyone watches the same screen
├── 4. Note anomalies, unexpected behaviors, and recovery times
├── 5. Stop immediately if blast radius exceeds plan
└── 6. Announce completion
After:
├── 1. Document results vs. hypothesis
├── 2. Create tickets for discovered weaknesses
├── 3. Prioritize fixes by blast radius and likelihood
├── 4. Schedule follow-up experiment to verify fixes
└── 5. Share findings with the broader engineering org
8. Blast Radius Control¶
The most important concept in chaos engineering — limit the scope of damage:
Blast Radius Progression (start small, grow with confidence):
Level 1: Single pod in staging
  Risk: minimal
  Learning: does the service handle pod restarts?

Level 2: Multiple pods in staging
  Risk: staging instability
  Learning: does the deployment maintain availability?

Level 3: Single pod in production (canary)
  Risk: slight degradation for some users
  Learning: does production match staging behavior?

Level 4: Percentage of production pods
  Risk: measurable user impact possible
  Learning: load balancing, retry behavior, circuit breaking

Level 5: Entire availability zone
  Risk: significant impact if HA is broken
  Learning: multi-AZ resilience, failover behavior

Level 6: Full region failure
  Risk: major outage potential
  Learning: disaster recovery, cross-region failover
Never jump levels. A team that hasn't validated Level 1 has no business attempting Level 4.
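One hypothetical way to make the progression enforceable is to derive experiment parameters from the highest level the team has validated, so nobody hand-edits a percentage past what has been earned. The namespaces, percentages, and key names below are illustrative placeholders, not Litmus or Chaos Mesh fields:

```shell
#!/usr/bin/env sh
# Derive experiment scope from the validated blast-radius level.
# All values are illustrative placeholders, to be mapped onto your
# chaos tool's own parameters.

blast_radius_params() {
  case $1 in
    1) echo "env=staging    pods_affected_perc=0  mode=one" ;;        # single pod
    2) echo "env=staging    pods_affected_perc=50 mode=percentage" ;;
    3) echo "env=production pods_affected_perc=0  mode=one" ;;        # canary
    4) echo "env=production pods_affected_perc=25 mode=percentage" ;;
    *) echo "level $1 is a game day, not an automated run" >&2
       return 1 ;;
  esac
}

# Example: blast_radius_params 3
```

Refusing to emit parameters for levels 5 and 6 is deliberate: AZ and region experiments belong in a planned game day, not an unattended pipeline.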
Interview tip: When discussing chaos engineering in interviews, emphasize the hypothesis-driven approach. The question "What happens if we kill a pod?" is chaos engineering. "Let's kill pods and see what breaks" is not. Interviewers want to hear about steady-state definition, blast radius control, and documented learnings.
9. Integration with Observability¶
Chaos experiments are only valuable if you can see the impact:
┌──────────────────────────────────────────┐
│ Chaos Experiment │
│ (Litmus/Chaos Mesh/tc) │
└─────────────┬────────────────────────────┘
│ triggers
v
┌──────────────────────────────────────────┐
│ Metrics Pipeline │
│ ├── Prometheus (request rate, errors, │
│ │ latency, saturation) │
│ ├── Grafana (real-time dashboards) │
│ └── Alertmanager (did alerts fire?) │
├──────────────────────────────────────────┤
│ Logging Pipeline │
│ ├── Loki/ELK (error logs, stack traces) │
│ └── Application logs (retry attempts, │
│ circuit breaker state) │
├──────────────────────────────────────────┤
│ Tracing Pipeline │
│ ├── Jaeger/Tempo (distributed traces) │
│ └── Trace-to-log correlation │
└──────────────────────────────────────────┘
Key questions to answer during each experiment:
- Did the alerts fire within the expected time?
- Were the dashboards useful for diagnosis?
- Could you see the failure in traces?
- Were the logs clear enough to identify root cause?
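"Did the alerts fire?" can itself be checked mechanically after each run. A sketch against the Alertmanager v2 alerts endpoint; the function reads JSON from stdin so it can be exercised without a live Alertmanager, and the URL and alert name are placeholders:

```shell
#!/usr/bin/env sh
# Check whether a named alert appears in an Alertmanager v2 payload.
# The grep is deliberately crude (no jq dependency) for a sketch.

alert_firing() {
  grep -q "\"alertname\":\"$1\"" -
}

# Live usage (ALERTMANAGER_URL and the alert name are placeholders):
# curl -s "$ALERTMANAGER_URL/api/v2/alerts?active=true" \
#   | alert_firing HighErrorRate
```

Wiring a check like this into the experiment harness turns "the alert should fire" from a hope into part of the hypothesis.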
10. Network Partition Simulation¶
The most dangerous distributed system failure — split brain:
# Simulate a network partition between two services using iptables
# Block traffic from api-server pods to database pods
iptables -A OUTPUT -d 10.0.1.50 -j DROP # Block outbound
iptables -A INPUT -s 10.0.1.50 -j DROP # Block inbound
# Restore
iptables -D OUTPUT -d 10.0.1.50 -j DROP
iptables -D INPUT -s 10.0.1.50 -j DROP
# Chaos Mesh: network partition
apiVersion: chaos-mesh.org/v1alpha1
kind: NetworkChaos
metadata:
  name: partition-api-from-db
spec:
  action: partition
  mode: all
  selector:
    namespaces:
      - production
    labelSelectors:
      app: api-server
  direction: both
  target:
    mode: all            # target needs its own selection mode
    selector:
      namespaces:
        - production
      labelSelectors:
        app: database
  duration: "60s"
Common Pitfalls¶
- Running chaos experiments without a hypothesis. "Let's see what happens" is not chaos engineering — it's just breaking things. Define what you expect to happen first.
- Starting in production on day one. Start in staging. Validate your experiment tooling works correctly. Build confidence with low-risk experiments before touching production.
- No abort mechanism. Every experiment needs a kill switch. If things go worse than expected, you need to stop immediately. Automate the abort, don't rely on manual intervention.
- Chaos experiments without observability. If you can't see the impact, the experiment is worthless. Set up dashboards, alerts, and logging before injecting failures.
- Treating chaos engineering as a one-time event. Systems change. New code deploys, new services are added, infrastructure evolves. Experiments that passed six months ago might fail today. Run them continuously.
- Confusing fault tolerance with tested fault tolerance. Having three replicas doesn't mean your service survives a pod kill. Having multi-AZ doesn't mean your service survives an AZ failure. The only way to know is to test it.

Remember: The chaos engineering loop: Hypothesize, Inject, Observe, Learn (HIOL). If you skip any step, you are not doing chaos engineering.
Wiki Navigation¶
Prerequisites¶
- Kubernetes Ops (Production) (Topic Pack, L2)
- Observability Deep Dive (Topic Pack, L2)
Related Content¶
- Adversarial Interview Gauntlet (30 sequences) (Scenario, L2) — Kubernetes Core
- Case Study: Alert Storm — Flapping Health Checks (Case Study, L2) — Kubernetes Core
- Case Study: Canary Deploy Routing to Wrong Backend — Ingress Misconfigured (Case Study, L2) — Kubernetes Core
- Case Study: CrashLoopBackOff No Logs (Case Study, L1) — Kubernetes Core
- Case Study: DNS Looks Broken — TLS Expired, Fix Is Cert-Manager (Case Study, L2) — Kubernetes Core
- Case Study: DaemonSet Blocks Eviction (Case Study, L2) — Kubernetes Core
- Case Study: Deployment Stuck — ImagePull Auth Failure, Vault Secret Rotation (Case Study, L2) — Kubernetes Core
- Case Study: Drain Blocked by PDB (Case Study, L2) — Kubernetes Core
- Case Study: HPA Flapping — Metrics Server Clock Skew, Fix Is NTP (Case Study, L2) — Kubernetes Core
- Case Study: ImagePullBackOff Registry Auth (Case Study, L1) — Kubernetes Core