How does Redis handle high availability in production and what are the failure modes to understand?

Answer

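In production, Redis high availability is implemented through Sentinel or Cluster, each with distinct failure modes. In both, replication is asynchronous, so a failover can lose writes that the old primary acknowledged but had not yet propagated to a replica.

With Sentinel: when the primary fails, each Sentinel that receives no valid PING reply for down-after-milliseconds marks it subjectively down (SDOWN). Once at least quorum Sentinels agree, it is marked objectively down (ODOWN). The Sentinels then elect a leader, which needs votes from a majority of all Sentinels, and the leader promotes a replica chosen by replica-priority first, then by the most advanced replication offset, then by run ID. The remaining replicas are repointed at the new primary, and clients learn the new address by querying Sentinel (SENTINEL get-master-addr-by-name) or by subscribing to the +switch-master event. Key failure mode: split-brain. If the primary is network-partitioned so that some clients can reach it but the Sentinels cannot, it keeps accepting writes. Those writes are lost when the Sentinels promote a replica and the old primary later rejoins as a replica and discards its diverged data. Mitigation: min-replicas-to-write 1, together with min-replicas-max-lag (default 10 seconds), makes the primary reject writes when no replica has acknowledged it within that window. This bounds the window of lost writes; it does not eliminate it.

With Cluster: when a master fails, nodes that cannot reach it within cluster-node-timeout flag it as possibly failed (PFAIL) and share that through gossip. Once a majority of masters agree, the node is marked FAIL. Its replicas then ask the remaining masters for votes, with the most up-to-date replica favored to go first, and the winner is promoted and takes over the failed master's hash slots. Failure mode: if a shard's master and all its replicas fail simultaneously, that portion of the keyspace becomes unavailable. By default (cluster-require-full-coverage yes) the whole cluster then stops serving queries; setting cluster-require-full-coverage no allows the rest of the cluster to continue serving the slots that are still covered. A second failure mode: failover needs a majority of masters to be reachable, so the minority side of a partition cannot promote replicas and stops accepting writes.

Always monitor replication lag (INFO replication reports master_repl_offset plus each replica's offset and lag) and test failover procedures regularly in staging.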
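The lag monitoring and the min-replicas-to-write check described above can be sketched in a few lines of Python. This is a minimal sketch, not a production monitor: it assumes you already have the raw text of INFO replication from a primary (for example via redis-cli), the names ReplicaStatus, parse_replication_info, and writes_would_be_accepted are hypothetical helpers introduced here, and the acceptance check only approximates how Redis evaluates min-replicas-to-write and min-replicas-max-lag.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class ReplicaStatus:
    name: str           # e.g. "slave0", as printed by INFO replication
    state: str          # e.g. "online"
    offset_behind: int  # bytes behind the primary's master_repl_offset
    lag_seconds: int    # seconds since the replica's last acknowledgement


def parse_replication_info(info_text: str) -> Tuple[int, List[ReplicaStatus]]:
    """Parse the text of INFO replication as reported by a primary."""
    fields = {}
    for line in info_text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and the "# Replication" header
        key, _, value = line.partition(":")
        fields[key] = value

    master_offset = int(fields.get("master_repl_offset", 0))
    replicas = []
    for key, value in fields.items():
        # Replica lines look like:
        # slave0:ip=10.0.0.2,port=6379,state=online,offset=1000,lag=0
        if key.startswith("slave") and key[5:].isdigit():
            attrs = dict(pair.split("=", 1) for pair in value.split(","))
            replicas.append(ReplicaStatus(
                name=key,
                state=attrs.get("state", "unknown"),
                offset_behind=master_offset - int(attrs.get("offset", 0)),
                lag_seconds=int(attrs.get("lag", 0)),
            ))
    return master_offset, replicas


def writes_would_be_accepted(replicas: List[ReplicaStatus],
                             min_replicas: int, max_lag: int) -> bool:
    """Approximate the min-replicas-to-write / min-replicas-max-lag rule:
    at least min_replicas online replicas with lag <= max_lag seconds."""
    healthy = [r for r in replicas
               if r.state == "online" and r.lag_seconds <= max_lag]
    return len(healthy) >= min_replicas


# Illustrative sample only; the addresses and offsets are made up.
SAMPLE_INFO = """# Replication
role:master
connected_slaves:2
slave0:ip=10.0.0.2,port=6379,state=online,offset=1000,lag=0
slave1:ip=10.0.0.3,port=6379,state=online,offset=880,lag=12
master_repl_offset:1000
"""

if __name__ == "__main__":
    _, sample_replicas = parse_replication_info(SAMPLE_INFO)
    for replica in sample_replicas:
        print(replica.name, replica.offset_behind, replica.lag_seconds)
    print(writes_would_be_accepted(sample_replicas, min_replicas=1, max_lag=10))
```

In practice an alert on offset_behind or lag_seconds crossing a threshold is the useful part; the acceptance check mainly helps confirm that a chosen min-replicas setting would not block writes during normal operation.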
