Lab 16: Chaos Engineering

| Field | Value |
|---|---|
| Tier | 4 (Advanced) |
| Estimated Time | 90 minutes |
| Prerequisites | k3s cluster, Helm |
| Auto-Grade | Yes |

Scenario

Your team claims the application stack is "highly available," but nobody has tested what happens when things actually fail. The VP of Engineering wants proof — not promises — that the system can handle node failures, pod crashes, network partitions, and resource exhaustion. You have been tasked with designing and executing chaos experiments that stress-test the resilience of the stack.

The application is a three-tier architecture (frontend, API, database) deployed with multiple replicas and health checks. Your job is to inject five different failure modes, observe the system's behavior, verify it self-heals (or document that it does not), and write a resilience report with recommendations.

Objectives

  • Deploy a resilient 3-tier app stack with multiple replicas and probes
  • Experiment 1: Kill a random API pod and verify traffic shifts to surviving pods
  • Experiment 2: Inject CPU stress on a node and verify pods are evicted/rescheduled
  • Experiment 3: Add network latency to the database and measure API response degradation
  • Experiment 4: Fill a container's ephemeral storage and verify it is restarted
  • Experiment 5: Simulate a DNS failure and verify retries work
  • Write a resilience report to /tmp/lab-chaos/resilience-report.txt

Setup

./setup.sh

Deploys the target application stack in namespace lab-chaos.
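
Before injecting any failure, it helps to confirm the stack is in a steady state, otherwise a later "failure" may just be a pod that never came up. A minimal sketch, assuming the stack is made of Deployments (the lab text does not say which workload kinds `setup.sh` creates); `baseline_check` is a hypothetical helper name:

```shell
#!/usr/bin/env bash
# Sketch: verify a steady state before running any chaos experiment.
# ASSUMPTION: the stack consists of Deployments; a database deployed as a
# StatefulSet would need its own check.
NS="${NS:-lab-chaos}"

# Block until every Deployment in the namespace reports Available, then
# print a snapshot of pods and the nodes they run on.
baseline_check() {
  kubectl wait --for=condition=Available deployment --all \
    -n "$NS" --timeout=120s || return 1
  kubectl get pods -n "$NS" -o wide
}
```

Source the file and run `baseline_check` once after `./setup.sh`; the pod-to-node snapshot is also useful as a "before" picture in the report.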

Hints

Hint 1: Killing pods. Delete one API pod with `kubectl delete pod <pod-name> -n lab-chaos`; the Deployment (through its ReplicaSet) should create a replacement. Monitor with `kubectl get pods -w -n lab-chaos`.
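
The "random" part of Experiment 1 can be scripted rather than picked by hand. The sketch below assumes the API pods carry the label `app=api` and belong to a Deployment named `api`; neither name appears in the lab text, so check what `setup.sh` actually creates. `pick_random_pod` and `kill_and_wait` are hypothetical helper names.

```shell
#!/usr/bin/env bash
# Sketch: delete one random API pod, then wait for the Deployment to recover.
# ASSUMPTIONS (not from the lab text): pods are labelled app=api and are
# managed by a Deployment named "api". Override via SELECTOR and DEPLOY.
NS="${NS:-lab-chaos}"
SELECTOR="${SELECTOR:-app=api}"
DEPLOY="${DEPLOY:-api}"

# Print the name (pod/<name>) of one randomly chosen matching pod.
pick_random_pod() {
  kubectl get pods -n "$NS" -l "$SELECTOR" -o name | shuf -n 1
}

# Delete the chosen pod and block until the Deployment is fully available.
kill_and_wait() {
  local victim
  victim="$(pick_random_pod)"
  if [ -z "$victim" ]; then
    echo "no pods match $SELECTOR in $NS" >&2
    return 1
  fi
  echo "deleting $victim"
  kubectl delete -n "$NS" "$victim" --wait=false
  kubectl rollout status "deployment/$DEPLOY" -n "$NS" --timeout=120s
}
```

Run `kill_and_wait` while a request loop is hitting the API from another terminal, so that the report can state whether any requests failed during the replacement. There is a small race here: if `rollout status` runs before the Deployment status reflects the deleted pod, it may return immediately, so confirm with `kubectl get pods -n lab-chaos` as well.
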
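Hint 2: CPU stress. Run a one-shot stress container: `kubectl run stress --image=alpine --restart=Never -n lab-chaos -- sh -c "apk add --no-cache stress-ng && stress-ng --cpu 4 --timeout 60s"`. Watch node resource usage with `kubectl top nodes`. CPU is a compressible resource, so CPU pressure alone normally leads to throttling rather than eviction; record what you actually observe.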
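
A single `kubectl top nodes` reading is easy to miss during a 60-second stress window. One option is to log timestamped samples to a file and quote them in the report. This is a sketch only; `sample_nodes` and the log path are hypothetical, and `kubectl top` depends on metrics-server, which k3s bundles by default.

```shell
#!/usr/bin/env bash
# Sketch: append timestamped `kubectl top nodes` samples to a log file.
# Usage: sample_nodes <count> <interval-seconds> <logfile>
sample_nodes() {
  local count="$1" interval="$2" log="$3" i
  for i in $(seq "$count"); do
    {
      date -u +%Y-%m-%dT%H:%M:%SZ   # UTC timestamp for this sample
      kubectl top nodes
    } >> "$log" 2>&1
    sleep "$interval"
  done
}
```

For example, start `sample_nodes 15 5 /tmp/cpu-stress.log &` just before launching the stress pod, then compare the samples taken before, during, and after the stress run.
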
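Hint 3: Network latency injection. Use `tc` inside the database container: `kubectl exec -n lab-chaos <db-pod> -- tc qdisc add dev eth0 root netem delay 500ms`. This needs the `tc` binary (iproute2) in the image and the NET_ADMIN capability on the container. Measure the impact with `time kubectl exec -n lab-chaos <client-pod> -- wget -qO- http://api:8080/`, and remove the delay afterwards with `tc qdisc del dev eth0 root`.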
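
To quantify the degradation rather than rely on a single `time` reading, the same request can be repeated and averaged, once before and once after adding the delay. A sketch, assuming GNU `date` (for the `%N` nanoseconds format); `measure_ms` is a hypothetical helper name:

```shell
#!/usr/bin/env bash
# Sketch: print the mean wall-clock time, in whole milliseconds, of running
# a command N times. Usage: measure_ms <runs> <command> [args...]
# ASSUMPTION: GNU date is available (date +%s%N).
measure_ms() {
  local runs="$1" start end total=0 i
  shift
  for i in $(seq "$runs"); do
    start="$(date +%s%N)"
    "$@" >/dev/null 2>&1          # output and exit status are ignored
    end="$(date +%s%N)"
    total=$(( total + (end - start) / 1000000 ))
  done
  echo $(( total / runs ))
}
```

For example, `measure_ms 10 kubectl exec -n lab-chaos <client-pod> -- wget -qO- http://api:8080/` run before and after the `tc` command gives two comparable averages. Each figure includes the `kubectl exec` overhead, which is roughly constant, so the difference between the two is the number worth reporting.
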
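Hint 4: Ephemeral storage exhaustion. Write a large file inside the container: `kubectl exec -n lab-chaos <pod-name> -- dd if=/dev/zero of=/tmp/fill bs=1M count=500`. If an ephemeral-storage limit is set and usage exceeds it, the kubelet evicts the pod and the Deployment creates a replacement.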
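
Because the outcome depends on whether a limit exists, it is worth checking that first and then looking for eviction events afterwards. A sketch under that assumption; `storage_limit` and `eviction_events` are hypothetical helper names:

```shell
#!/usr/bin/env bash
# Sketch: inspect ephemeral-storage limits and eviction events.
NS="${NS:-lab-chaos}"

# Print the ephemeral-storage limits of a pod's containers.
# Returns non-zero if no container in the pod sets one.
storage_limit() {
  local pod="$1" limit
  limit="$(kubectl get pod "$pod" -n "$NS" \
    -o jsonpath='{.spec.containers[*].resources.limits.ephemeral-storage}')"
  if [ -z "$limit" ]; then
    echo "no ephemeral-storage limit set on $pod" >&2
    return 1
  fi
  echo "$limit"
}

# List eviction events recorded in the namespace.
eviction_events() {
  kubectl get events -n "$NS" --field-selector reason=Evicted
}
```

If `storage_limit <pod-name>` fails, the `dd` command is unlikely to trigger a limit-based eviction, and that absence is itself a finding for the report. After the fill, `eviction_events` shows whether the kubelet acted.
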
Hint 5: Resilience report structure. For each experiment, record the hypothesis, method, observation, result (pass/fail), and recommendation. Close with a summary of overall system resilience.
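
The structure above can be generated as an empty skeleton and filled in as each experiment finishes. This is a sketch: the section titles mirror the lab objectives, but the exact layout is an assumption and may differ from what `grade.sh` or the template in `solution/` expects. `write_report_skeleton` is a hypothetical helper name.

```shell
#!/usr/bin/env bash
# Sketch: write an empty report skeleton with one section per experiment.
# ASSUMPTION: the layout is illustrative, not the format grade.sh checks.
# Usage: write_report_skeleton [output-path]
write_report_skeleton() {
  local out="${1:-/tmp/lab-chaos/resilience-report.txt}" n=1 title
  mkdir -p "$(dirname "$out")"
  {
    echo "Resilience Report: lab-chaos"
    for title in "API pod kill" "Node CPU stress" "Database network latency" \
                 "Ephemeral storage exhaustion" "DNS failure"; do
      echo
      echo "Experiment $n: $title"
      echo "Hypothesis:"
      echo "Method:"
      echo "Observation:"
      echo "Result (pass/fail):"
      echo "Recommendation:"
      n=$((n + 1))
    done
    echo
    echo "Summary of overall resilience:"
  } > "$out"
}
```

Calling `write_report_skeleton` with no argument writes to the path named in the objectives. Note that it overwrites an existing file, so run it once at the start.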

Grading

./grade.sh

Solution

See the `solution/` directory for experiment scripts and a report template.