Here is how Redis Cluster handles node failures:
Automatic Failover
When a Redis Cluster master node fails, the cluster promotes one of that master's replicas to take over its hash slots, with no manual intervention. A master that has no replica cannot be failed over.
The key steps in the failover process are:
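1. The other nodes detect the failure through the heartbeat (ping/pong) packets they exchange over the cluster bus. A node that gets no reply from the master within `cluster-node-timeout` flags it as possibly failed (`PFAIL`). Once a majority of masters report the same condition through gossip, the master is marked as failed (`FAIL`).
2. The failed master's replicas then compete for promotion. Each replica asks the remaining masters for votes, and the replica that collects votes from a majority of masters wins. Replicas with a higher replication offset start their election sooner, so the most up-to-date replica usually wins, although this is not guaranteed.
3. The winning replica promotes itself to master, takes over the failed master's hash slots, and announces the new configuration to the rest of the cluster. Any other replicas of the old master are reconfigured to replicate from the new master.
4. Clients learn about the change through redirections. A node that receives a command for a slot it does not serve replies with a `MOVED` error, which contains the slot number and the address of the node that now serves it. Cluster-aware clients update their slot map and retry the command.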
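As an illustration of step 4, here is a minimal sketch of how a client might handle a `MOVED` reply, assuming the error text has the form `MOVED <slot> <host>:<port>` with the leading `-` of the wire protocol already stripped. The names `MovedRedirect`, `parse_moved`, and `apply_moved` are hypothetical and do not come from any particular client library.

```python
from typing import NamedTuple, Optional


class MovedRedirect(NamedTuple):
    slot: int
    host: str
    port: int


def parse_moved(error: str) -> Optional[MovedRedirect]:
    """Parse 'MOVED <slot> <host>:<port>'; return None for any other error."""
    parts = error.split()
    if len(parts) != 3 or parts[0] != "MOVED":
        return None
    # Split on the last colon so that hosts containing colons still work.
    host, _, port = parts[2].rpartition(":")
    return MovedRedirect(int(parts[1]), host, int(port))


def apply_moved(slot_cache: dict, error: str) -> bool:
    """Update a local slot -> (host, port) cache from a MOVED error.

    Returns True when the cache was updated and the command should be
    retried against the new node.
    """
    redirect = parse_moved(error)
    if redirect is None:
        return False
    slot_cache[redirect.slot] = (redirect.host, redirect.port)
    return True


if __name__ == "__main__":
    cache = {3999: ("127.0.0.1", 6380)}  # stale entry for the old master
    apply_moved(cache, "MOVED 3999 127.0.0.1:6381")
    print(cache[3999])
```

Real clients typically go further and refresh their whole slot map after a redirection, since one moved slot often means others have moved too.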
Slot Coverage
By default, Redis Cluster accepts queries only while every hash slot is served by a reachable node. This is known as full "slot coverage", and it is controlled by the `cluster-require-full-coverage` setting, which defaults to `yes`.
Each master node in the cluster is responsible for a subset of the 16,384 hash slots. A master failure does not break coverage as long as one of its replicas is promoted and takes over its slots. The majority of masters matters for a different reason: it is needed to confirm that a node has failed and to authorize the promotion.
If a master fails and no replica can take over, its slots are left uncovered and the cluster stops serving requests until the node recovers or its slots are reassigned. With `cluster-require-full-coverage no`, the remaining nodes keep serving the slots that are still covered.
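To make the idea concrete, the sketch below checks whether a set of reachable masters still serves all 16,384 slots. The data layout (a mapping from node name to inclusive slot ranges) and the function name `uncovered_slots` are assumptions made for illustration. In a real deployment this information would come from the cluster itself, for example from `CLUSTER NODES`.

```python
TOTAL_SLOTS = 16384


def uncovered_slots(assignments: dict, reachable: set) -> list:
    """Return the inclusive (start, end) slot ranges no reachable master serves.

    assignments maps a node name to a list of inclusive (start, end) ranges.
    """
    covered = bytearray(TOTAL_SLOTS)
    for node, ranges in assignments.items():
        if node not in reachable:
            continue
        for start, end in ranges:
            for slot in range(start, end + 1):
                covered[slot] = 1

    # Collapse the uncovered slots into contiguous ranges.
    gaps = []
    gap_start = None
    for slot in range(TOTAL_SLOTS):
        if not covered[slot]:
            if gap_start is None:
                gap_start = slot
        elif gap_start is not None:
            gaps.append((gap_start, slot - 1))
            gap_start = None
    if gap_start is not None:
        gaps.append((gap_start, TOTAL_SLOTS - 1))
    return gaps


if __name__ == "__main__":
    # Hypothetical three-master layout with an even slot split.
    assignments = {
        "A": [(0, 5460)],
        "B": [(5461, 10922)],
        "C": [(10923, 16383)],
    }
    print(uncovered_slots(assignments, {"A", "B", "C"}))  # []
    print(uncovered_slots(assignments, {"A", "C"}))       # [(5461, 10922)]
```

An empty result means coverage is complete. Any non-empty result corresponds to the situation in which a cluster running with the default settings stops serving requests.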
Replica Promotion
Redis Cluster relies on replica nodes to provide high availability. When a master fails, one of its replica nodes is automatically promoted to become the new master.
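The election favors the replica with the highest replication offset, that is, the one that has processed the most data from the old master. This minimizes data loss but does not eliminate it: replication is asynchronous, so writes that the old master acknowledged but had not yet propagated to the promoted replica are lost during failover.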
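The preference for the most up-to-date replica comes from timing rather than from a direct comparison. According to the Redis Cluster specification, each replica waits 500 ms, plus a random 0 to 500 ms, plus 1000 ms multiplied by its rank before requesting votes, where rank 0 belongs to the replica with the highest replication offset. The sketch below reproduces that formula. The function names are hypothetical, and treating rank as the number of sibling replicas with a strictly larger offset is an assumption about how ties are handled.

```python
import random


def replica_ranks(offsets: dict) -> dict:
    """Rank each replica by how many siblings have a larger replication offset."""
    return {
        name: sum(1 for other in offsets.values() if other > offset)
        for name, offset in offsets.items()
    }


def election_delay_ms(rank: int, rng: random.Random) -> int:
    """Delay before a replica starts its election, in milliseconds."""
    return 500 + rng.randint(0, 500) + rank * 1000


if __name__ == "__main__":
    rng = random.Random()
    # Hypothetical replication offsets for three replicas of one master.
    offsets = {"replica-1": 1000, "replica-2": 950, "replica-3": 990}
    for name, rank in replica_ranks(offsets).items():
        print(name, rank, election_delay_ms(rank, rng))
```

Because the random component never exceeds 500 ms, a rank 0 replica always starts before a rank 1 replica. Masters grant only one vote per election epoch, so the earliest candidate is the most likely winner.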
Handling Split-Brain
Redis Cluster uses a quorum-based approach to handle network partitions and avoid "split-brain" scenarios where the cluster gets divided into multiple independent clusters.
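A failover needs votes from a majority of masters, so only the side of a partition that contains the majority can promote replicas. A master that ends up on the minority side stops accepting writes once it has been unable to reach the majority for `cluster-node-timeout`. This bounds the window in which writes can be lost, although writes accepted on the minority side before the timeout expires are discarded when the partition heals.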
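The majority rule is simple arithmetic, sketched below with hypothetical function names: with N masters in total, a side of the partition needs at least N // 2 + 1 of them to mark a node as failed or to authorize a promotion.

```python
def majority(total_masters: int) -> int:
    """Smallest number of masters that forms a majority."""
    return total_masters // 2 + 1


def side_can_fail_over(total_masters: int, reachable_masters: int) -> bool:
    """True if this side of a partition holds a majority of the masters."""
    return reachable_masters >= majority(total_masters)


if __name__ == "__main__":
    # Five masters split 3 / 2: only the larger side can promote replicas.
    print(side_can_fail_over(5, 3), side_can_fail_over(5, 2))
    # Three masters with two of them down: the survivor cannot fail anything over.
    print(side_can_fail_over(3, 1))
    # Four masters split 2 / 2: neither side has a majority.
    print(side_can_fail_over(4, 2))
```

This is also why an odd number of masters is the usual choice: an even split leaves both sides without a majority, so neither can act.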
Together, automatic failover and majority-based voting let Redis Cluster stay available through node failures and network partitions, provided that a majority of masters remains reachable and every failed master has a replica to promote.