Quiz: Chaos Engineering

7 questions

L0 (1 question)

1. What is the difference between chaos engineering and just breaking things?

Answer: Chaos engineering starts with a hypothesis, defines steady state, controls blast radius, uses automated experiments, and documents results. Just breaking things has no hypothesis, no baseline, no controlled scope, uses manual destruction, and keeps no record. Chaos engineering builds confidence in system resilience; random destruction creates fear.

L1 (3 questions)

1. What is a steady-state hypothesis in chaos engineering, and why must you define it before injecting any failure?

Answer: A steady-state hypothesis defines what 'normal' looks like using measurable indicators (e.g., success rate >99.5%, p99 latency <500ms, error rate <0.5%). You must define it before the experiment so you have an objective baseline to compare against. Without it, you cannot determine whether the system maintained acceptable behavior during the failure injection.
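The indicators above can be made executable, which is what most chaos tooling expects. A minimal sketch (metric names and values are assumed, not from any particular monitoring system):

```python
# A steady-state hypothesis as executable checks. The three thresholds
# mirror the example indicators in the answer above.
STEADY_STATE = {
    "success_rate": lambda v: v > 0.995,   # >99.5%
    "p99_latency_ms": lambda v: v < 500,   # <500ms
    "error_rate": lambda v: v < 0.005,     # <0.5%
}

def steady_state_holds(metrics: dict) -> bool:
    """Return True only if every indicator is within its threshold."""
    return all(check(metrics[name]) for name, check in STEADY_STATE.items())

# Baseline captured BEFORE injecting any failure:
baseline = {"success_rate": 0.999, "p99_latency_ms": 320, "error_rate": 0.001}
print(steady_state_holds(baseline))  # True — safe to start the experiment
```

Re-running the same check during and after the injection gives the objective pass/fail comparison the answer describes.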

2. What are the most common chaos engineering tools and when would you use each?

Answer: Chaos Monkey (Netflix): randomly terminates instances — tests instance-level resilience. Litmus (CNCF): Kubernetes-native chaos with ChaosEngine CRDs — pod kill, network loss, disk fill. Gremlin: SaaS platform with broad attack types and safety controls — good for enterprises needing audit trails. Toxiproxy (Shopify): simulates network conditions (latency, packet loss) between services — great for integration tests. AWS Fault Injection Simulator: native AWS chaos (instance stop, AZ failure, API throttling). Start with pod/instance kill, graduate to network chaos, then infrastructure-level.
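The tool-to-target mapping above can be captured as a simple lookup, which is a reasonable starting point for a team deciding where to begin. This is purely illustrative (the target-layer labels are assumed, and your platform may dictate a different choice):

```python
# Tool selection by target layer, derived from the answer above.
TOOL_BY_TARGET = {
    "instance": "Chaos Monkey",           # random instance termination
    "kubernetes-pod": "Litmus",           # ChaosEngine CRDs: pod kill, disk fill
    "network": "Toxiproxy",               # latency / packet loss between services
    "enterprise-saas": "Gremlin",         # broad attacks, audit trails
    "aws-infrastructure": "AWS Fault Injection Simulator",
}

def suggest_tool(target: str) -> str:
    """Return a tool for the target layer, defaulting to the simplest start."""
    return TOOL_BY_TARGET.get(target, "start with pod/instance kill")
```

The default reflects the answer's progression: begin with pod/instance kill before moving to network or infrastructure chaos.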

3. What is the difference between chaos engineering and traditional failure testing (disaster recovery drills)?

Answer: DR drills test known failure modes with known recovery procedures (failover to secondary DC, restore from backup). Chaos engineering explores unknown failure modes by injecting realistic faults and observing emergent behavior. DR asks 'can we recover from this known scenario?' Chaos asks 'what happens to our system under this condition — and do we even know?' DR is pass/fail. Chaos produces learning outcomes even when the system handles the fault perfectly. Both are needed: DR validates your runbooks, chaos discovers gaps your runbooks do not cover.

L2 (3 questions)

1. What is the blast radius progression in chaos engineering, and why should you never skip levels?

Answer: The progression goes: (1) single pod in staging, (2) multiple pods in staging, (3) single pod in production, (4) percentage of production pods, (5) entire availability zone, (6) full region failure. You should never skip levels because each level validates assumptions needed for the next. A team that has not confirmed their service survives a single pod kill has no business testing AZ failure — they would risk a real outage with unknown failure modes.
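The "never skip levels" rule can be enforced programmatically by gating each level on all earlier ones having passed. A minimal sketch (level names are assumed labels for the six stages above):

```python
# The blast-radius ladder from the answer, in order of increasing risk.
BLAST_RADIUS_LEVELS = [
    "single-pod-staging",
    "multi-pod-staging",
    "single-pod-production",
    "percent-pods-production",
    "availability-zone",
    "full-region",
]

def next_allowed_level(passed: set) -> str:
    """Return the first level not yet passed — the only one the team may run."""
    for level in BLAST_RADIUS_LEVELS:
        if level not in passed:
            return level
    return "all-levels-validated"

# A team that has only passed staging experiments must prove single-pod
# resilience in production before touching an AZ:
print(next_allowed_level({"single-pod-staging", "multi-pod-staging"}))
# single-pod-production
```

Encoding the ladder this way makes skipping a level a tooling error rather than a judgment call made under deadline pressure.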

2. How do you implement automated chaos experiments in a CI/CD pipeline without risking production stability?

Answer:
1. Run chaos tests in a dedicated staging environment that mirrors production topology.
2. Gate production deploys on chaos test pass (e.g., kill 1 pod during integration tests, verify SLO holds).
3. In production: use GameDay scheduling with explicit opt-in and runbook.
4. Implement automated abort conditions: if error rate exceeds 2x baseline, automatically halt the experiment and roll back.
5. Use canary analysis during chaos — compare control group vs experiment group metrics.
6. Start with low-blast-radius experiments in production (single pod, short duration) and expand only after confidence is established.
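The automated abort condition in step 4 is the key production safety net, and is worth sketching. In this hypothetical example, `fetch_error_rate` and `halt_experiment` stand in for hooks into your monitoring and chaos tooling (names assumed):

```python
import time

ABORT_MULTIPLIER = 2.0  # abort when error rate exceeds 2x baseline (step 4)

def run_with_abort(baseline_error_rate, fetch_error_rate, halt_experiment,
                   duration_s=300, poll_s=10):
    """Poll metrics while the experiment runs; abort on threshold breach.

    fetch_error_rate: callable returning the current error rate.
    halt_experiment: callable that stops fault injection and rolls back.
    """
    deadline = time.monotonic() + duration_s
    while time.monotonic() < deadline:
        if fetch_error_rate() > ABORT_MULTIPLIER * baseline_error_rate:
            halt_experiment()
            return "aborted"
        time.sleep(poll_s)
    return "completed"
```

The same loop structure extends naturally to step 5: compare the experiment group's metrics against a control group instead of a static baseline.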

3. How do you measure the value of a chaos engineering program and justify it to leadership?

Answer: Track:
1. Incidents prevented — chaos experiments that revealed weaknesses fixed before they caused outages (with estimated cost of avoided downtime).
2. Mean time to recovery (MTTR) improvement — teams with chaos practice recover faster because they have seen failures before.
3. Confidence level — percentage of services with validated resilience (passed chaos tests for their tier's failure modes).
4. Findings backlog — number of resilience gaps discovered and their severity.
5. Coverage — percentage of critical paths tested.
Present the results in business terms: 'We found and fixed 12 resilience gaps that would have caused X hours of downtime at Y cost.'