What is fault tolerance?
Why Interviewers Ask This
This question tests conceptual clarity. Interviewers want to hear a precise, confident definition before moving on to more complex system design topics. It also reveals how well you can explain technical ideas to non-experts.
Answer
Fault tolerance is the ability of a system to continue operating correctly, possibly at a reduced level, even when some of its components fail. Fault-tolerant systems detect, isolate, and recover from failures without manual intervention.
Key concepts:
(1) Redundancy: spare capacity beyond the N components needed to carry the load. Common strategies are N+1 (one spare), N+2 (two spares), and 2N (full duplication).
(2) Replication: data is copied across multiple nodes, so a surviving replica can serve it when other nodes fail.
(3) Failover: an automatic switch to a backup component when the primary fails, typically with a recovery time objective (RTO) measured in seconds.
(4) Circuit breaker: stops sending requests to a failing service to prevent cascading failures, and resumes normal traffic once the service recovers.
(5) Retry with backoff: automatically retry failed requests, using exponential backoff plus jitter so that clients do not all retry at the same moment.
(6) Timeout: don't wait indefinitely for a response; fail fast.
(7) Bulkhead: isolate failures so they don't spread, for example with a separate thread pool per downstream service. The name comes from ship compartments that keep the whole vessel from sinking if one floods.
(8) Graceful degradation: serve a simplified response when a component is unavailable (show cached data, disable non-critical features).
Difference from high availability: high availability (HA) focuses on maximizing uptime and usually accepts a brief interruption while failover completes. Fault tolerance focuses on continuing to operate correctly through the failure itself, ideally with no visible interruption and no loss of data consistency, which typically requires more redundancy and costs more.
Byzantine fault tolerance: handles failures where nodes behave arbitrarily or maliciously, for example by sending conflicting information to different peers. It is used in some blockchain consensus algorithms.
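Two of the patterns above, retry with backoff and the circuit breaker, are easier to reason about in code. The following is a minimal Python sketch, not a production implementation. The names (retry_with_backoff, CircuitBreaker, CircuitOpenError) and all default values are illustrative assumptions, and the injectable sleep, rng, and clock parameters exist only to make the behavior easy to test.

```python
import random
import time


def retry_with_backoff(call, max_attempts=4, base_delay=0.1, max_delay=2.0,
                       sleep=time.sleep, rng=random.random):
    """Call `call()` until it succeeds or the attempts run out.

    Between attempts, wait a random time in [0, cap], where
    cap = min(max_delay, base_delay * 2**attempt) ("full jitter").
    """
    for attempt in range(max_attempts):
        try:
            return call()
        except Exception:
            if attempt == max_attempts - 1:
                raise  # out of attempts: surface the last error
            cap = min(max_delay, base_delay * 2 ** attempt)
            sleep(rng() * cap)


class CircuitOpenError(Exception):
    """Raised when the breaker rejects a call without attempting it."""


class CircuitBreaker:
    """Hypothetical breaker: closed -> open after repeated failures,
    then a single trial call is allowed once reset_timeout has elapsed."""

    def __init__(self, failure_threshold=3, reset_timeout=30.0,
                 clock=time.monotonic):
        self.failure_threshold = failure_threshold
        self.reset_timeout = reset_timeout
        self.clock = clock
        self.failures = 0
        self.opened_at = None  # None means the circuit is closed

    def call(self, fn):
        if self.opened_at is not None:
            if self.clock() - self.opened_at < self.reset_timeout:
                raise CircuitOpenError("circuit open; failing fast")
            # Timeout elapsed: half-open, let one trial call through.
        try:
            result = fn()
        except Exception:
            self.failures += 1
            if (self.opened_at is not None
                    or self.failures >= self.failure_threshold):
                self.opened_at = self.clock()  # open (or re-open)
            raise
        self.failures = 0
        self.opened_at = None  # a success closes the circuit
        return result
```

In this sketch a caller would wrap each downstream request as breaker.call(lambda: fetch()), where fetch is whatever function performs the request. Combining the two patterns takes some care: a retry loop should usually treat CircuitOpenError as a signal to stop, otherwise it keeps retrying calls the breaker has already decided to reject.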
Common Mistake
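A common mistake is memorizing definitions without understanding their implications. When asked this question, go one level deeper and explain what happens when the concept is misused or ignored. For example, retries without backoff or timeouts can turn one slow dependency into a cascading outage.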