What is chaos engineering?

Answer

Chaos engineering is the practice of intentionally introducing failures into a system in production (or production-like environments) to discover weaknesses before they cause unplanned outages. Originated at Netflix (Chaos Monkey, the Simian Army). The discipline: formulate a hypothesis about steady-state behavior, introduce controlled failures, observe deviations, learn and improve. Common experiments: terminate random instances, inject network latency/packet loss, simulate datacenter failure (Chaos Kong), exhaust CPU/memory, introduce database slowdowns. Tools: Chaos Monkey (Netflix), Gremlin, LitmusChaos (Kubernetes), AWS Fault Injection Service, Istio (service mesh fault injection). Principles: (1) Start with a well-defined hypothesis. (2) Run experiments in production (or a faithful replica). (3) Have automated rollback. (4) Limit blast radius initially. (5) Automate experiments to run continuously. Chaos engineering reveals: unhandled failure modes, missing circuit breakers, cascading failures, single points of failure that testing environments miss.