What is a single point of failure (SPOF)?

Why Interviewers Ask This

Interviewers use this question to quickly gauge a candidate's grasp of system design fundamentals. SPOFs underpin more advanced topics such as high availability, replication, and failover, so a shaky answer here suggests gaps in the building blocks those topics rely on.

Answer

A single point of failure (SPOF) is any component in a system whose failure brings down the entire system. Eliminating SPOFs is fundamental to building highly available systems.

Common SPOFs: a single database server, load balancer, DNS server, network switch, power supply, or datacenter.

Strategies to eliminate SPOFs:
(1) Redundancy: run duplicate components, such as two load balancers (active-passive or active-active), a database primary with replicas, and multiple app server instances.
(2) Failover: switch to a backup automatically when the primary fails, for example through DNS failover or load balancer health checks that remove failed servers from the pool.
(3) Geographic redundancy: deploy across multiple datacenters or regions so that a regional failure doesn't take down the whole system.
(4) Stateless application servers: keep no session state in server memory. Store it in an external store such as Redis (itself replicated, or it becomes the new SPOF), so any server can handle any request and losing one server loses no user state.
(5) Chaos engineering: deliberately kill components to verify that the system survives failures (e.g., Netflix Chaos Monkey).

Availability calculation: for components in series, where every one must be up, system availability is the product of the component availabilities: A = A₁ × A₂ × … × Aₙ. For two redundant components in parallel, where either one is enough, A = 1 - (1 - A₁) × (1 - A₂), assuming they fail independently. For example, two 99.9% components in parallel give 1 - 0.001 × 0.001 = 99.9999%.

Cost vs. availability: eliminating every SPOF is expensive, so prioritize by business impact. For reference, 99.9% uptime allows about 8.76 hours of downtime per year; 99.99% about 52.6 minutes; 99.999% about 5.26 minutes.
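Redundancy, health-check failover, and a Chaos Monkey style test can be illustrated in miniature with the Python sketch below. Everything in it is hypothetical (`Backend`, `LoadBalancer`, `chaos_kill_one`): the "health check" just reads an in-memory flag instead of probing a real server, and the load balancer is a single object, which in production would be a SPOF of its own.

```python
import random
from dataclasses import dataclass


@dataclass
class Backend:
    """Hypothetical app server; `healthy` stands in for a real liveness probe."""
    name: str
    healthy: bool = True

    def handle(self, request: str) -> str:
        if not self.healthy:
            raise ConnectionError(f"{self.name} is down")
        return f"{self.name} served {request}"


@dataclass
class LoadBalancer:
    """Round-robin balancer that skips backends failing the health check."""
    backends: list[Backend]
    _next: int = 0

    def healthy_pool(self) -> list[Backend]:
        # Health check: a real LB would send HTTP/TCP probes on an interval;
        # this sketch just reads each backend's flag on every request.
        return [b for b in self.backends if b.healthy]

    def route(self, request: str) -> str:
        pool = self.healthy_pool()
        if not pool:
            # Every redundant copy is gone: the outage a SPOF causes.
            raise RuntimeError("no healthy backends")
        backend = pool[self._next % len(pool)]
        self._next += 1
        return backend.handle(request)


def chaos_kill_one(backends: list[Backend], rng: random.Random) -> Backend:
    """Chaos-engineering step: mark one random healthy backend as failed."""
    victim = rng.choice([b for b in backends if b.healthy])
    victim.healthy = False
    return victim


if __name__ == "__main__":
    rng = random.Random(42)

    # Redundant pool: losing one instance does not interrupt service.
    lb = LoadBalancer([Backend("app-1"), Backend("app-2"), Backend("app-3")])
    victim = chaos_kill_one(lb.backends, rng)
    print(f"killed {victim.name}")
    for i in range(4):
        print(lb.route(f"req-{i}"))

    # Single instance: the same failure takes the whole service down.
    solo = LoadBalancer([Backend("only-app")])
    chaos_kill_one(solo.backends, rng)
    try:
        solo.route("req-x")
    except RuntimeError as err:
        print(f"outage: {err}")
```

With three backends, every request still succeeds after one is killed; with one backend, the identical failure becomes a full outage, which is exactly the difference between a redundant component and a SPOF.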
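The availability arithmetic can be checked with a short Python sketch. It assumes components fail independently and uses a 365-day year; the helper names (`serial`, `parallel`, `downtime_minutes_per_year`) and the example stack's availability figures are illustrative, not taken from any particular library or system.

```python
import math

# Assumption: a 365-day year (leap years ignored).
MINUTES_PER_YEAR = 365 * 24 * 60


def serial(*availabilities: float) -> float:
    """Components in series: every one must be up, so multiply availabilities."""
    return math.prod(availabilities)


def parallel(*availabilities: float) -> float:
    """Redundant components in parallel: the system is down only if all copies
    are down at once. Assumes failures are independent."""
    return 1.0 - math.prod(1.0 - a for a in availabilities)


def downtime_minutes_per_year(availability: float) -> float:
    """Expected downtime per year implied by an availability figure."""
    return (1.0 - availability) * MINUTES_PER_YEAR


if __name__ == "__main__":
    # Two 99.9% components in parallel.
    print(f"{parallel(0.999, 0.999):.6%}")  # 99.999900%

    # Hypothetical stack in series: one LB (99.99%) -> one app server (99.9%)
    # -> a DB primary + replica pair (99.9% each, in parallel).
    stack = serial(0.9999, 0.999, parallel(0.999, 0.999))
    print(f"stack availability: {stack:.4%}")

    for a in (0.999, 0.9999, 0.99999):
        print(f"{a:.3%} -> {downtime_minutes_per_year(a):.1f} min/year")
```

In the hypothetical stack, the single unreplicated app server caps the whole system just below 99.9% even though the database pair is redundant: a serial chain is never more available than its weakest link.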

Common Mistake

Don't just define the term; show that you can spot SPOFs in a concrete design and judge when eliminating one is worth the cost. Awareness of these trade-offs is what separates average system design candidates from strong ones.