
Chaos Engineering


20 cards — 🟢 5 easy | 🟡 10 medium | 🔴 5 hard

🟢 Easy (5)

1. What is Chaos Engineering?

[Wikipedia](https://en.wikipedia.org/wiki/Chaos_engineering): "Chaos Engineering is the discipline of experimenting on a system in order to build confidence in the system's capability to withstand turbulent conditions in production."

[TechTarget](https://www.techtarget.com/searchitoperations/definition/chaos-engineering): "Chaos engineering is the process of testing a distributed computing system to ensure that it can withstand unexpected disruptions."

Remember: chaos engineering = 'break things on purpose to find weaknesses before they find you.' Proactive, not reactive.

Name origin: Netflix coined the term with Chaos Monkey (2011) — randomly kills EC2 instances in production to test resilience.

2. What is a steady-state hypothesis in chaos engineering?

A steady-state hypothesis defines what "normal" looks like for your system before an experiment begins. It is a measurable assertion (e.g., p99 latency < 200ms, error rate < 0.1%) that you expect to remain true during the experiment. If the steady state breaks, the experiment has found a weakness.

Remember: chaos experiments follow the scientific method: hypothesis -> inject fault -> observe -> conclude. 'We believe killing one node will not increase error rate above 0.1%.'
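
The hypothesis becomes actionable once it is written as executable assertions. A minimal sketch in Python (the metric names and thresholds are illustrative; in practice the values would come from your monitoring system):

```python
# Minimal steady-state hypothesis check. Thresholds are illustrative;
# real values would be derived from your SLOs and monitoring data.

def steady_state_holds(metrics: dict) -> bool:
    """Return True if the system still looks 'normal'."""
    checks = [
        metrics["p99_latency_ms"] < 200,   # p99 latency under 200 ms
        metrics["error_rate"] < 0.001,     # error rate under 0.1%
    ]
    return all(checks)

baseline = {"p99_latency_ms": 120, "error_rate": 0.0004}
during_fault = {"p99_latency_ms": 450, "error_rate": 0.02}

print(steady_state_holds(baseline))      # True: hypothesis holds
print(steady_state_holds(during_fault))  # False: the experiment found a weakness
```

Checking the hypothesis before the experiment starts matters too: if the baseline already fails, the environment is unhealthy and the experiment should not run.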

3. Name the main categories of failure injection in chaos engineering.

Infrastructure: instance/node termination, disk fill, clock skew.
Network: latency injection, packet loss, DNS failure, partition.
Application: process kill, memory pressure, CPU stress, exception injection.
Dependency: external service unavailability, slow responses, certificate expiry.
State: data corruption, cache invalidation, queue backup.

Remember: the goal of chaos engineering is confidence in your system's resilience, not finding bugs (though that's a welcome side effect). Run experiments regularly, not just once.
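
Application-level faults from the list above can be simulated in-process. A sketch of a fault-injection decorator that adds latency and intermittent exceptions around a call (an illustration, not any particular tool's API):

```python
import random
import time

def inject_faults(latency_s=0.0, failure_rate=0.0):
    """Decorator sketch for application-level fault injection:
    adds artificial latency and random failures to a callable."""
    def wrap(fn):
        def inner(*args, **kwargs):
            time.sleep(latency_s)                  # simulate a slow dependency
            if random.random() < failure_rate:     # simulate an intermittent error
                raise RuntimeError("injected fault")
            return fn(*args, **kwargs)
        return inner
    return wrap

# latency only, failure_rate=0.0 keeps this call deterministic
@inject_faults(latency_s=0.05, failure_rate=0.0)
def fetch_profile(user_id):
    return {"id": user_id}

print(fetch_profile(42))  # {'id': 42}, after ~50 ms of injected latency
```

Real tools inject the same fault classes at lower layers (tc/netem for network latency, cgroups for CPU/memory stress), but the shape of the experiment is the same.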

4. List the core principles of chaos engineering.

1) Build a hypothesis around steady-state behavior; 2) Vary real-world events (simulate realistic failures); 3) Run experiments in production when possible; 4) Automate experiments to run continuously; 5) Minimize blast radius. These principles come from the Principles of Chaos Engineering (principlesofchaos.org).

5. What is the difference between resilience testing and chaos engineering?

Resilience testing verifies known failure modes with expected outcomes (like a test suite). Chaos engineering explores unknown failure modes by injecting real-world turbulence and observing what happens. Resilience testing asks "does our retry logic work?" while chaos engineering asks "what breaks when the database is slow for 30 seconds?"

🟡 Medium (10)

1. Name a few tools used to run chaos experiments.

- AWS Fault Injection Simulator: injects failures into AWS resources
- Azure Chaos Studio: injects failures into Azure resources
- Chaos Monkey: Netflix's pioneering tool; randomly terminates instances, running against the cloud backends supported by Spinnaker
- Litmus: a CNCF chaos framework for Kubernetes
- Chaos Mesh: a CNCF chaos platform for Kubernetes


See an extensive list [here](https://github.com/dastergon/awesome-chaos-engineering)

2. What's a typical Chaos Engineering workflow?

According to [Gremlin](https://gremlin.com), the workflow has three steps:

1. Plan an experiment: design a failure scenario and form a hypothesis about how your system should behave under it
2. Execute the smallest possible experiment that can test your hypothesis
3. If nothing goes wrong, scale up the experiment and widen the blast radius; if the system breaks, investigate why and fix the weakness before proceeding

The process then repeats, either with the same scenario or a new one.

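
The loop above can be sketched as control flow. Here `run_experiment` is a hypothetical callable that runs one experiment at a given blast radius and reports whether the steady state held:

```python
def run_campaign(run_experiment, max_radius=4):
    """Start with the smallest blast radius and double it only while
    the system keeps behaving; stop at the first failure.
    `run_experiment(radius)` is a hypothetical callable returning
    True if the steady state held at that scope."""
    radius = 1
    while radius <= max_radius:
        if not run_experiment(radius):
            return radius        # weakness found at this scope: investigate
        radius *= 2              # nothing broke: widen the blast radius
    return None                  # steady state held at every tested scope

# Simulated system that breaks once 4 or more targets are affected.
result = run_campaign(lambda radius: radius < 4)
print(result)  # 4: the scope at which the steady state first broke
```

A `None` result means the campaign completed without finding a weakness at any tested scope, which is the "confidence" outcome the discipline aims for.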

3. What safety controls should chaos scripts have?

1) Namespace scoping (only approved namespaces); 2) Dry-run mode by default; 3) Explicit --yes flag for destructive operations; 4) Built-in restore/cleanup; 5) Never operate on kube-system without explicit override; 6) Reversible changes only.

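
Several of these controls fit naturally into a script skeleton: dry-run by default, an explicit `--yes` for destructive actions, and a namespace allow-list. A sketch (the namespace names and messages are illustrative):

```python
import argparse

# Illustrative allow-list; kube-system is deliberately absent.
ALLOWED_NAMESPACES = {"chaos-staging", "chaos-dev"}

def parse_args(argv):
    p = argparse.ArgumentParser(description="chaos script skeleton")
    p.add_argument("--namespace", required=True)
    p.add_argument("--yes", action="store_true",
                   help="actually perform destructive actions")
    return p.parse_args(argv)

def plan(argv):
    """Decide what the script would do; destructive only with --yes."""
    args = parse_args(argv)
    if args.namespace not in ALLOWED_NAMESPACES:
        return "refused: namespace not in allow-list"
    if not args.yes:
        return "dry-run: would kill pods in " + args.namespace
    return "executing in " + args.namespace

print(plan(["--namespace", "chaos-staging"]))         # dry-run by default
print(plan(["--namespace", "kube-system", "--yes"]))  # refused by the allow-list
```

Defaulting to dry-run means an accidental invocation only prints a plan; destruction requires both an approved namespace and an explicit `--yes`.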

4. Name 5 safe chaos experiments for a Kubernetes namespace.

1) Kill pods (deployment controller recreates them); 2) Break readiness probe (pod drops from service); 3) CPU/memory stress pods (test resource limits); 4) Apply deny-all NetworkPolicy (test network resilience); 5) Scale deployment to zero then restore (test recovery time).

5. Compare Litmus, Gremlin, and Chaos Monkey as chaos engineering tools.

Chaos Monkey (Netflix): random instance termination in AWS; pioneered the discipline.
Gremlin: commercial SaaS platform with a UI; supports CPU/memory/network/state attacks, rollback controls, and team collaboration.
Litmus: CNCF open-source project, Kubernetes-native; uses ChaosEngine CRDs and pulls predefined experiments from the ChaosHub.
Key difference: Chaos Monkey is narrow (kill instances), Gremlin is broad and commercial, Litmus is K8s-native and open-source.

See also: Chaos Monkey (Netflix), LitmusChaos (CNCF), Gremlin (SaaS), Chaos Mesh (K8s-native), Pumba (container chaos).

6. What is blast radius in chaos engineering and how do you control it?

Blast radius is the scope of impact an experiment can have. Control it by: 1) Start in non-production environments; 2) Target a single service or pod first; 3) Use namespace scoping; 4) Set duration limits and auto-abort conditions; 5) Run during low-traffic windows; 6) Have a kill switch to stop the experiment instantly; 7) Gradually expand scope only after confidence grows.

Remember: start with the smallest blast radius (one pod, one instance) and expand. Never go straight to 'kill an AZ' without validating smaller failures first.

7. What is a GameDay and how do you plan one?

A GameDay is a scheduled chaos exercise where teams inject failures into a system while observing the impact. Planning steps: 1) Define scope and objectives; 2) Select experiments (start simple); 3) Brief all participants on the plan and rollback procedures; 4) Establish communication channels; 5) Run experiments while monitoring dashboards; 6) Document observations in real time; 7) Hold a blameless retrospective; 8) Create follow-up action items.

8. What rollback triggers should chaos experiments have?

Auto-abort when: 1) Error rate exceeds a threshold (e.g., 5x baseline); 2) Latency spikes beyond SLO; 3) A dependent service becomes unreachable; 4) Customer-facing impact is detected; 5) The experiment duration exceeds the planned window; 6) An operator manually hits the kill switch. Always define these triggers before the experiment starts.

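
These triggers can be folded into one abort check evaluated on every monitoring tick. A sketch with illustrative thresholds:

```python
import time

def should_abort(metrics, baseline, started_at, max_duration_s=300,
                 error_multiplier=5.0, latency_slo_ms=300):
    """Return the first rollback trigger that fires, or None.
    Thresholds are illustrative defaults, not from any specific tool."""
    if metrics["error_rate"] > error_multiplier * baseline["error_rate"]:
        return "error rate above threshold"
    if metrics["p99_latency_ms"] > latency_slo_ms:
        return "latency beyond SLO"
    if time.monotonic() - started_at > max_duration_s:
        return "experiment window exceeded"
    if metrics.get("kill_switch"):
        return "operator kill switch"
    return None

# Evaluate once per monitoring tick during the experiment.
started = time.monotonic()
baseline = {"error_rate": 0.001}
print(should_abort({"error_rate": 0.01, "p99_latency_ms": 100},
                   baseline, started))  # error rate above threshold
```

Defining the function (and its thresholds) before the experiment starts is the point of the card: the abort logic must not be improvised mid-incident.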

9. Why is observability critical during chaos experiments?

Without observability you cannot measure impact or validate your steady-state hypothesis. You need: 1) Real-time dashboards showing key metrics (latency, error rate, throughput); 2) Distributed tracing to see how failures propagate; 3) Log aggregation to capture error details; 4) Alerting that fires during the experiment to validate alert coverage. Chaos experiments often reveal observability gaps themselves.

10. Why must you validate observability before running chaos experiments?

If dashboards and alerts cannot detect the failure you are about to inject, the experiment is wasted — you learn nothing about detection capability. Run a pre-check: verify the relevant SLI dashboard shows the baseline, alerts are active, and on-call knows the experiment is happening.

🔴 Hard (5)

1. How do you introduce chaos safely in a production-like environment?

Start small: 1) Namespace-scoped only; 2) One failure at a time; 3) Have runbooks ready; 4) Monitor during experiment; 5) Automated rollback on unexpected impact; 6) GameDay format with the team present; 7) Document findings and improve resilience. Never surprise people.

2. How can chaos experiments be integrated into a CI/CD pipeline?

Run lightweight chaos tests as a pipeline stage after deployment to a staging environment. Approach: 1) Deploy to staging; 2) Verify steady state via health checks; 3) Inject a controlled failure (e.g., kill a pod, add latency); 4) Assert steady-state hypothesis holds; 5) Fail the pipeline if assertions break. Tools like Litmus, Gremlin, or Chaos Mesh provide CLI/API interfaces suitable for pipeline integration. Keep experiments short and auto-revert.

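
The five steps reduce to a small gate script. In this sketch the steady-state check, fault injection, and revert are stubbed placeholders for whatever the pipeline actually calls (e.g., a chaos tool's CLI):

```python
def chaos_gate(check_steady_state, inject_fault, revert_fault):
    """Pipeline-stage sketch: verify steady state, inject one short
    fault, assert the hypothesis still holds, and always revert.
    All three arguments are hypothetical callables supplied by the
    pipeline; returns a process-style exit code."""
    if not check_steady_state():
        return 1    # environment unhealthy before the experiment: abort early
    inject_fault()
    try:
        ok = check_steady_state()
    finally:
        revert_fault()           # auto-revert no matter what happens
    return 0 if ok else 1        # non-zero exit fails the pipeline

# Simulated run: the fault does not break the steady state.
exit_code = chaos_gate(lambda: True, lambda: None, lambda: None)
print(exit_code)  # 0: pipeline stage passes
```

Wrapping the revert in `finally` is the important part: the pipeline must never leave the staging environment in a faulted state, even when the assertion fails.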

3. How does AWS Fault Injection Simulator (FIS) work?

FIS is a managed service for running chaos experiments on AWS resources. You define an experiment template specifying: 1) Targets (EC2 instances, ECS tasks, RDS clusters); 2) Actions (stop instances, inject API errors, throttle network); 3) Stop conditions (CloudWatch alarm thresholds). FIS integrates with IAM for permissions and CloudWatch for monitoring. It supports gradual rollout and automatic stop on alarm.

4. What is Chaos Mesh and how does it inject faults in Kubernetes?

Chaos Mesh is a CNCF open-source project for cloud-native chaos engineering on Kubernetes. It uses CRDs to define experiments (PodChaos, NetworkChaos, StressChaos, IOChaos). It injects faults by manipulating pod containers via a privileged chaos-daemon DaemonSet. Features include a web dashboard, RBAC integration, scheduled experiments, and workflow orchestration for multi-step chaos scenarios.

5. What does a chaos engineering maturity progression look like?

Level 1: Kill single processes, observe recovery. Level 2: Network faults (latency, partition) between services. Level 3: Dependency failures (database, cache, DNS). Level 4: Multi-AZ/region failures. Level 5: Automated chaos in CI/CD with auto-rollback. Each level requires stronger observability and blast-radius controls before advancing.
