What is high availability (HA)?

Why Interviewers Ask This

This is a classic screening question for System Design roles. Hiring managers ask it early in interviews to gauge your baseline understanding and determine if you can communicate technical concepts clearly.

Answer

High availability (HA) is the ability of a system to remain operational and accessible for a very high percentage of the time by minimizing downtime. It is measured as a percentage of uptime:

99.9% ("three nines"): ~8.8h downtime/year
99.99% ("four nines"): ~53min/year
99.999% ("five nines"): ~5min/year

HA principles:

(1) Eliminate single points of failure (SPOFs): run redundant components at every layer (web, app, database, network).
(2) Automated failover: the system detects failures and reroutes traffic without human intervention, using health checks, heartbeats, or Kubernetes pod rescheduling.
(3) Graceful degradation: when a component fails, serve a reduced-functionality response rather than failing completely. The circuit breaker pattern is a common way to do this.
(4) Geographic distribution: multi-AZ (availability zone) deployments within a region; multi-region deployments for disaster recovery.
(5) Zero-downtime deployments: rolling updates, blue-green deployments, and canary releases ship new code without taking the system down.
(6) Health checks and self-healing: Kubernetes restarts crashed containers and replaces failed pods; load balancers stop routing traffic to unhealthy servers.

Active-passive HA: one active server plus one standby. It is simple, but the standby's capacity sits idle until a failover.

Active-active HA: all servers handle traffic behind a load balancer. There is no idle standby, but the surviving servers need enough headroom to absorb the load of a failed one.

HA vs. DR (disaster recovery): HA minimizes downtime from component failures. DR restores service after catastrophic failures (such as the loss of a datacenter) within an agreed RTO (Recovery Time Objective, how long recovery may take) and RPO (Recovery Point Objective, how much recent data may be lost).
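The downtime figures quoted for each "nines" level follow from simple arithmetic, which is worth being able to reproduce in an interview. The sketch below assumes a 365-day year; the function name `downtime_per_year_hours` is illustrative, not part of any library.

```python
HOURS_PER_YEAR = 365 * 24  # 8760 hours, ignoring leap years (an assumption)


def downtime_per_year_hours(availability_percent: float) -> float:
    """Return the downtime budget in hours per year for an uptime percentage."""
    if not 0 <= availability_percent <= 100:
        raise ValueError("availability must be between 0 and 100")
    # The downtime budget is the fraction of the year the system may be unavailable.
    return HOURS_PER_YEAR * (1 - availability_percent / 100)


if __name__ == "__main__":
    for target in (99.9, 99.99, 99.999):
        hours = downtime_per_year_hours(target)
        print(f"{target}% uptime allows {hours:.2f} h/year ({hours * 60:.1f} min/year)")
```

Each extra nine cuts the downtime budget by a factor of ten, which is why moving from three nines to five nines usually requires automated failover rather than manual recovery.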
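The circuit breaker pattern mentioned under graceful degradation can be sketched roughly as follows. This is a minimal, single-threaded illustration and not a production implementation: the class name `CircuitBreaker`, its parameters, and the default thresholds are all assumptions chosen for the example.

```python
import time


class CircuitBreaker:
    """Stop calling a failing dependency for a while and serve a fallback instead."""

    def __init__(self, failure_threshold=3, reset_timeout=30.0, clock=time.monotonic):
        self.failure_threshold = failure_threshold  # consecutive failures before opening
        self.reset_timeout = reset_timeout          # seconds to wait before a trial call
        self._clock = clock                         # injectable so tests can fake time
        self._failures = 0
        self._opened_at = None                      # None means the circuit is closed

    def call(self, func, fallback):
        if self._opened_at is not None:
            if self._clock() - self._opened_at < self.reset_timeout:
                # Open: skip the dependency entirely and degrade gracefully.
                return fallback()
            # Timeout elapsed: half-open, so let one trial call through below.
        try:
            result = func()
        except Exception:
            self._failures += 1
            # A failed trial call, or too many consecutive failures, (re)opens the circuit.
            if self._opened_at is not None or self._failures >= self.failure_threshold:
                self._opened_at = self._clock()
            return fallback()
        # Success closes the circuit and clears the failure count.
        self._failures = 0
        self._opened_at = None
        return result
```

A caller might wrap a recommendation service as `breaker.call(fetch_recommendations, lambda: [])`, where `fetch_recommendations` is a hypothetical function: users see an empty recommendations panel instead of an error page while the dependency is down.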

Pro Tip

Demonstrate both theoretical understanding and practical experience. Say what it is, then give a concrete example of how you applied it in a system you built or operated.