How does Redis handle high availability in production and what are the failure modes to understand?
Answer
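In production, Redis high availability is implemented with either Sentinel or Cluster, and each has distinct failure modes.

With Sentinel: the primary fails → Sentinels that get no valid PING reply within down-after-milliseconds mark it subjectively down → once at least quorum Sentinels agree, it is marked objectively down → the Sentinels elect a leader, which requires a majority of all Sentinels → the leader promotes a replica, preferring the lowest replica-priority value and then the most advanced replication offset → the remaining replicas are repointed at the new primary, and clients discover its address by querying Sentinel (SENTINEL get-master-addr-by-name) or subscribing to its +switch-master events. Key failure mode: split-brain. If the primary is network-partitioned so that clients can still reach it but the Sentinels cannot, it keeps accepting writes. Those writes are lost when the Sentinels promote a replica and the old primary later rejoins as a replica, discarding its diverged data. Mitigation: min-replicas-to-write 1, together with min-replicas-max-lag (10 seconds by default), makes the primary reject writes when no replica has acknowledged within the lag window. Because replication is asynchronous, this bounds the loss window rather than eliminating it.

With Cluster: a master node fails → the other nodes detect it via gossip heartbeats after cluster-node-timeout, and it is marked as failed once a majority of masters agree → its replicas request votes from the remaining masters, with the most up-to-date replica starting its election first → the winner takes over the failed master's hash slots and the new slot ownership propagates through the cluster. Failure mode: if a shard's master and all its replicas fail simultaneously, that portion of the keyspace becomes unavailable, and with the default cluster-require-full-coverage yes the whole cluster stops serving queries. Setting cluster-require-full-coverage no allows the rest of the cluster to continue serving the slots that are still covered. Cluster replication is also asynchronous, so writes acknowledged by a master shortly before it fails can be lost.

Always monitor replication lag (INFO replication) and test failover procedures regularly in staging.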
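
As a rough illustration of the lag monitoring mentioned above, the following Python sketch parses the text returned by INFO replication on a primary and reports how far each replica is behind. The sample output, the addresses, and the function names (parse_replication_info, replica_lag_report, good_replicas) are assumptions made for this example and are not part of Redis or any client library. The good_replicas helper only approximates the check behind min-replicas-to-write and min-replicas-max-lag; it is not Redis's actual implementation.

```python
from typing import Dict, List

# Hypothetical sample of what a primary might return for INFO replication.
SAMPLE_INFO = """\
# Replication
role:master
connected_slaves:2
slave0:ip=10.0.0.2,port=6379,state=online,offset=1000,lag=0
slave1:ip=10.0.0.3,port=6379,state=online,offset=880,lag=3
master_repl_offset:1000
"""


def parse_replication_info(text: str) -> Dict[str, object]:
    """Turn INFO replication text into a dict with a 'replicas' list."""
    info: Dict[str, object] = {"replicas": []}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and the section header
        key, _, value = line.partition(":")
        if key.startswith("slave") and key[5:].isdigit():
            # Per-replica line: comma-separated key=value pairs.
            fields = dict(pair.split("=", 1) for pair in value.split(","))
            info["replicas"].append(fields)
        else:
            info[key] = value
    return info


def replica_lag_report(info: Dict[str, object]) -> List[Dict[str, object]]:
    """For each replica, report bytes behind the primary and ack age in seconds."""
    master_offset = int(info["master_repl_offset"])
    return [
        {
            "addr": f"{r['ip']}:{r['port']}",
            "bytes_behind": master_offset - int(r["offset"]),
            "ack_age_s": int(r["lag"]),
        }
        for r in info["replicas"]
    ]


def good_replicas(info: Dict[str, object], max_lag_s: int) -> int:
    """Count online replicas whose last acknowledgement is within max_lag_s."""
    return sum(
        1
        for r in info["replicas"]
        if r["state"] == "online" and int(r["lag"]) <= max_lag_s
    )


if __name__ == "__main__":
    parsed = parse_replication_info(SAMPLE_INFO)
    for row in replica_lag_report(parsed):
        print(row)
    print("replicas within 2s:", good_replicas(parsed, max_lag_s=2))
```

In a real deployment the text would come from a client call rather than a constant, and an alert would typically fire when bytes_behind or ack_age_s stays above a threshold chosen for the workload.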