What is chaos engineering?
Answer
Chaos engineering is the practice of intentionally introducing failures into a system, in production or in a production-like environment, to discover weaknesses before they cause unplanned outages. It originated at Netflix with Chaos Monkey and the wider Simian Army tool suite.

The discipline follows an experimental loop: formulate a hypothesis about steady-state behavior, introduce a controlled failure, observe any deviation from the steady state, then learn from the result and improve the system.

Common experiments: terminating random instances, injecting network latency or packet loss, simulating the loss of a whole datacenter or cloud region (Chaos Kong), exhausting CPU or memory, and slowing down database responses.

Tools: Chaos Monkey (Netflix), Gremlin, LitmusChaos (Kubernetes), AWS Fault Injection Service, and Istio (service mesh fault injection).

Principles: (1) Start with a well-defined hypothesis. (2) Run experiments in production, or in a faithful replica. (3) Have automated rollback so an experiment can be aborted quickly. (4) Limit the blast radius initially and widen it as confidence grows. (5) Automate experiments to run continuously.

Chaos engineering reveals problems that test environments tend to miss: unhandled failure modes, missing circuit breakers, cascading failures, and single points of failure.
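The hypothesis, inject, observe, and roll back loop can be sketched in a few lines of Python. This is a minimal simulation and not how any of the tools above work: `ToyCluster`, `success_rate`, `run_experiment`, the 95% threshold, and the retry policy are all hypothetical names and values chosen for illustration. The "service" is an in-memory dictionary of instances, and the "fault" is marking one of them as down.

```python
import random
from dataclasses import dataclass, field


@dataclass
class ToyCluster:
    """Hypothetical in-memory stand-in for a service with four replicas."""
    instances: dict = field(default_factory=lambda: {name: True for name in "abcd"})
    retries: int = 2  # extra attempts a client makes after hitting a dead instance

    def handle_request(self, rng):
        names = sorted(self.instances)
        for _ in range(1 + self.retries):
            # Each attempt lands on a random instance; a live one serves the request.
            if self.instances[rng.choice(names)]:
                return True
        return False


def success_rate(cluster, rng, requests=1000):
    """Steady-state metric: fraction of simulated requests that succeed."""
    ok = sum(cluster.handle_request(rng) for _ in range(requests))
    return ok / requests


def run_experiment(cluster, rng, threshold=0.95, blast_radius=1):
    """Hypothesis: success rate stays at or above `threshold` during the fault."""
    baseline = success_rate(cluster, rng)

    # Inject the failure: terminate `blast_radius` randomly chosen instances.
    victims = rng.sample(sorted(cluster.instances), blast_radius)
    for name in victims:
        cluster.instances[name] = False
    try:
        observed = success_rate(cluster, rng)
    finally:
        # Automated rollback: always restore the terminated instances.
        for name in victims:
            cluster.instances[name] = True

    return {
        "baseline": baseline,
        "observed": observed,
        "victims": victims,
        "hypothesis_held": observed >= threshold,
    }


if __name__ == "__main__":
    rng = random.Random(0)  # seeded so repeated runs are comparable
    print(run_experiment(ToyCluster(), rng))
    print(run_experiment(ToyCluster(retries=0), rng))
```

In this sketch the cluster with client retries should tolerate losing one instance, while the variant with `retries=0` should fall below the threshold, which is the kind of missing-resilience finding an experiment is meant to surface. The `try`/`finally` block plays the role of automated rollback, and `blast_radius` caps how many instances a single run may touch.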