What is high availability (HA)?
Why Interviewers Ask This
This is a classic screening question for System Design roles. Hiring managers ask it early in interviews to gauge your baseline understanding and determine if you can communicate technical concepts clearly.
Answer
High Availability (HA) is the ability of a system to remain operational and accessible for a very high percentage of the time, minimizing downtime. It is measured as a percentage of uptime: 99.9% is "three nines" (~8.76 hours of downtime per year), 99.99% is "four nines" (~52.6 minutes per year), and 99.999% is "five nines" (~5.3 minutes per year).

HA principles:
(1) Eliminate SPOFs: run redundant components at every layer (web, app, database, network).
(2) Automated failover: the system detects failures and reroutes traffic without human intervention, using health checks, heartbeats, and mechanisms such as Kubernetes pod rescheduling.
(3) Graceful degradation: when a component fails, serve a reduced-functionality response rather than failing completely. The circuit breaker pattern is a common way to do this: it stops calls to a failing dependency so that a fallback can be returned instead.
(4) Geographic distribution: multi-AZ (availability zone) deployments within a region, and multi-region deployments for disaster recovery.
(5) Zero-downtime deployments: rolling updates, blue-green deployments, and canary releases ship new code without taking the system down.
(6) Health checks and self-healing: Kubernetes restarts crashed pods, and load balancers remove unhealthy servers from rotation.

Active-Passive HA: one active server plus one standby. It is simple, but the standby capacity sits idle until a failover.
Active-Active HA: all servers are active behind a load balancer, so no capacity sits idle. The surviving servers still need enough headroom to absorb the load of a failed one.

HA vs DR (Disaster Recovery): HA prevents downtime from component failures. DR recovers from catastrophic failures (such as the loss of a datacenter) within an acceptable RTO (Recovery Time Objective, how long recovery may take) and RPO (Recovery Point Objective, how much recent data may be lost).
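The downtime figures and the case for redundancy can be checked with a little arithmetic. The sketch below is illustrative only: the function names are made up for this example, a year is taken as 365 days, and the redundancy formula assumes replicas fail independently, which real systems only approximate.

```python
MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600; leap years are ignored


def downtime_minutes_per_year(availability: float) -> float:
    """Downtime budget per year, in minutes, for an availability in [0, 1]."""
    return (1.0 - availability) * MINUTES_PER_YEAR


def parallel_availability(single: float, replicas: int) -> float:
    """Availability of redundant replicas where any one of them can serve traffic.

    Assumption: replica failures are independent of each other.
    """
    return 1.0 - (1.0 - single) ** replicas


def serial_availability(*components: float) -> float:
    """Availability of a chain in which every component must be up."""
    result = 1.0
    for availability in components:
        result *= availability
    return result


if __name__ == "__main__":
    for label, a in [("three nines", 0.999),
                     ("four nines", 0.9999),
                     ("five nines", 0.99999)]:
        print(f"{label}: {downtime_minutes_per_year(a):.1f} min/year")

    # Two redundant 99% servers versus two 99% components in series.
    print(f"redundant pair: {parallel_availability(0.99, 2):.4%}")
    print(f"serial chain:   {serial_availability(0.99, 0.99):.4%}")
```

Under these assumptions, two redundant 99% servers reach roughly 99.99%, while two 99% components that must both be up drop to about 98%. That is why removing single points of failure at every layer matters more than making any one component slightly more reliable.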
Pro Tip
Demonstrate both theoretical understanding and practical experience. Define the concept first, then give a concrete example of how you applied it in a system you built or operated.