What is the Raft Algorithm

Raft is a distributed consensus algorithm designed for managing a replicated log. It was proposed by Diego Ongaro and John Ousterhout in 2014 as an alternative that is easier to understand and implement than the Paxos family of algorithms.

Comparison with Paxos

Raft provides the same functionality and performance guarantees as Paxos, but has significant differences in algorithm structure and implementation:

  • More modular: Decomposes complex problems into multiple independent sub-problems
  • More intuitive: Simplifies understanding through explicit role division and state transitions
  • Better suited for engineering implementation: Provides complete algorithm details rather than theoretical frameworks

Core Modules of Raft

Raft decomposes the consensus algorithm into three key modules:

1. Leader Election

  • At any time, there can be at most one effective leader in the cluster
  • Nodes are divided into three roles: Leader, Follower, and Candidate
  • The election process uses a random timeout mechanism to avoid vote splitting
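The randomized timeout mentioned above can be sketched as follows; the constant names are illustrative, and the 150-300 ms range follows the values quoted later in this article:

```python
import random

# Illustrative constants; this article quotes a 150-300 ms election timeout.
ELECTION_TIMEOUT_MIN_MS = 150
ELECTION_TIMEOUT_MAX_MS = 300

def random_election_timeout():
    """Pick a fresh random timeout each time, so that nodes rarely time
    out simultaneously, which reduces the chance of split votes."""
    return random.uniform(ELECTION_TIMEOUT_MIN_MS, ELECTION_TIMEOUT_MAX_MS)
```

Because each node re-randomizes its timeout on every reset, one node usually times out first, wins the election, and suppresses the others with heartbeats before they become candidates.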

2. Log Replication

  • The leader receives client requests and writes them to logs
  • The leader replicates log entries to all follower nodes
  • When a majority of nodes confirm receipt of the log, it is considered committed
  • Committed logs are eventually applied by all nodes
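The majority-commit rule above can be sketched as a simple count; `match_index` is an illustrative name for the leader's per-server record of the highest replicated log index (the leader includes itself):

```python
def is_committed(match_index, cluster_size, index):
    """An entry at `index` counts as committed once it is stored on a
    majority of servers. `match_index` maps each server id to the
    highest log index known to be replicated on it."""
    replicated = sum(1 for m in match_index.values() if m >= index)
    return replicated > cluster_size // 2
```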

3. Safety

  • Election restriction: Only nodes containing all committed logs can become leaders
  • Log matching property: If two logs contain an entry with the same index and term, the two logs are identical in all entries up through that index
  • State machine safety property: Once a log entry is applied on one server, other servers cannot apply a different command at the same index
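The election restriction above is enforced during voting: a voter compares the candidate's last log entry against its own. A minimal sketch, with illustrative parameter names:

```python
def candidate_log_up_to_date(cand_last_term, cand_last_index,
                             my_last_term, my_last_index):
    """A voter grants its vote only if the candidate's log is at least
    as up to date as its own: a higher last term wins outright; equal
    last terms fall back to comparing log length."""
    if cand_last_term != my_last_term:
        return cand_last_term > my_last_term
    return cand_last_index >= my_last_index
```

Because a candidate must win a majority, and any committed entry is on a majority of servers, this check guarantees the winner's log contains every committed entry.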

Two-Phase Operation of Raft Algorithm

Phase One: Election Process

  1. Initially, all nodes are in follower state
  2. If a follower does not receive a leader heartbeat within the election timeout (usually 150-300ms), it transitions to candidate
  3. The candidate initiates an election and requests votes from other nodes
  4. The candidate receiving majority votes becomes the new leader
  5. The new leader starts sending heartbeats to followers to maintain authority
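Step 3 above hinges on each node granting at most one vote per term. A minimal sketch of the voter's side, using illustrative field names and omitting the log up-to-date check for brevity:

```python
def handle_request_vote(state, cand_id, cand_term):
    """Grant at most one vote per term. `state` holds the voter's
    `current_term` and `voted_for` (illustrative names)."""
    if cand_term < state["current_term"]:
        return False                       # stale candidate: reject
    if cand_term > state["current_term"]:
        state["current_term"] = cand_term  # newer term: reset our vote
        state["voted_for"] = None
    if state["voted_for"] in (None, cand_id):
        state["voted_for"] = cand_id
        return True
    return False
```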

Phase Two: Normal Operation

  • The leader processes all client requests
  • Each client request is first recorded as a log entry
  • The leader replicates the log to follower nodes in parallel
  • After the log is replicated to a majority of nodes, the leader applies the log and notifies followers to apply
  • The leader periodically sends heartbeats (usually every 50ms) to maintain leadership
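The "apply after majority replication" step above corresponds to the leader advancing its commit index. A sketch, with illustrative names; note Raft's rule that a leader only commits entries from its own current term by counting replicas:

```python
def advance_commit_index(commit_index, match_index, cluster_size,
                         log_terms, current_term):
    """Return the highest index replicated on a majority whose entry
    belongs to the current term; otherwise keep the old commit index.
    `log_terms[i - 1]` is the term of the entry at log index i."""
    for n in range(len(log_terms), commit_index, -1):
        replicated = sum(1 for m in match_index.values() if m >= n)
        if replicated > cluster_size // 2 and log_terms[n - 1] == current_term:
            return n
    return commit_index
```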

Practical Application Scenarios

The Raft algorithm is widely applied in various distributed systems:

  • Distributed key-value storage (such as etcd, Consul)
  • Distributed databases (such as CockroachDB, TiDB)
  • Container orchestration systems (such as Kubernetes)
  • Blockchain consensus mechanisms

Leader Election

Raft implements consensus by electing a leader and giving it full responsibility for managing the replicated log.

In Raft, a server can play one of the following roles at any time:

  • Leader: Handles client requests and log replication. There is at most one leader per term
  • Candidate: A node that nominates itself during an election. If the election succeeds, it becomes the leader
  • Follower: A completely passive role, similar to a voter. Such servers simply respond to requests from leaders and candidates

Elections drive the transitions between these roles.

Raft uses a heartbeat mechanism to trigger elections. When a server starts, it begins as a follower. Each server runs a timer with a randomized election timeout (typically 150-300 ms). If a server receives a valid message from the leader or a candidate before the timer fires, the timer restarts; if the timer fires first, the server starts an election.
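The timer behavior described above can be sketched as a small class; names and the use of seconds are illustrative:

```python
import random
import time

class FollowerTimer:
    """Sketch of a follower's election timer: any valid message from the
    leader or a candidate resets the deadline; once it expires, the node
    should transition to candidate and start an election."""

    def __init__(self, low=0.150, high=0.300):
        self.low, self.high = low, high
        self.reset()

    def reset(self):
        # Re-randomize on every reset to desynchronize the cluster's timers.
        self.deadline = time.monotonic() + random.uniform(self.low, self.high)

    def expired(self):
        return time.monotonic() >= self.deadline
```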

Node Exception Types and Handling Mechanisms

Leader Unavailable

When the Leader node in the cluster fails, it causes the entire cluster to temporarily be unable to process write requests. This situation may be caused by:

  1. Server hardware failure (such as CPU overload, memory exhaustion)
  2. Network partitioning causing the leader to lose connection with other nodes
  3. Leader process crash or forced termination

Typical Handling Process:

  1. Follower nodes detect that heartbeats from the leader have timed out (typically a 150-300 ms election timeout)
  2. Follower transitions to Candidate state and starts a new round of elections
  3. After the new leader is successfully elected, it takes over cluster management

If the previous leader later rejoins the cluster, the two leaders compare term numbers: the one with the lower term steps down to Follower. Any conflicting, uncommitted log entries on the old leader are then overwritten to match the current leader’s log.
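The step-down rule above applies to any node, not just a stale leader: whenever a message carries a higher term, the receiver adopts it and reverts to follower. A sketch with illustrative field names:

```python
def on_message_term(state, msg_term):
    """Any node (including a stale leader rejoining after a partition)
    that sees a higher term in an incoming message must adopt that term
    and step down to follower."""
    if msg_term > state["current_term"]:
        state["current_term"] = msg_term
        state["role"] = "follower"
        state["voted_for"] = None
    return state["role"]
```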

Follower Unavailable

When some Follower nodes fail, the cluster can still maintain basic operation, but it affects the consistency of data replication. Common scenarios include:

  • Short-term network jitter causing temporary disconnection of followers
  • Follower node disk space is insufficient to write logs
  • Configuration errors causing followers to fail to join the cluster

Recovery Strategies:

  1. Leader continuously attempts to establish connections with followers
  2. After the follower recovers, it catches up with the latest state through log replication
  3. If recovery fails for a long time, it may trigger automatic node replacement mechanism

When a follower node becomes unavailable, recovery is comparatively simple, because the cluster’s log is always replicated from the leader: once the node rejoins the cluster, it simply re-replicates the missing log entries from the leader until it has caught up.
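The catch-up process above is typically driven by the leader backing off a per-follower `next_index` (an illustrative name for the next log index the leader will send) until it finds the point where the logs agree:

```python
def handle_append_entries_reply(next_index, follower, success):
    """When a recovered follower rejects AppendEntries because its log
    diverges, the leader decrements that follower's next_index and
    retries from an earlier point until the logs match. This sketch
    backs off one entry at a time; real implementations often skip a
    whole conflicting term per round trip."""
    if not success:
        next_index[follower] = max(1, next_index[follower] - 1)
    return next_index
```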

Election Conflicts (Multiple Candidates/Leaders)

Election abnormalities may occur under specific network conditions:

  1. Network partitioning causes different partitions to each elect a leader
  2. Unreasonable election timeout settings causing frequent elections
  3. Clock drift between nodes distorting timeout behavior

Solutions:

  • Adopt the PreVote mechanism to prevent disruptive, repeated elections
  • Set reasonable timeout parameters (the election timeout should be several times the heartbeat interval)
  • Rely on the term mechanism to ensure at most one legitimate leader per term
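The PreVote idea mentioned above (implemented, for example, in etcd's raft library) has a would-be candidate first ask peers whether it could win, without incrementing its term. A simplified sketch of the peer's side, with illustrative names:

```python
def handle_pre_vote(my_term, heard_from_leader_recently, cand_term):
    """Grant a pre-vote only if we have not heard from a live leader
    recently and the candidate's proposed term is not stale. This stops
    a partitioned node from endlessly bumping its term and disrupting a
    healthy cluster when it rejoins."""
    if heard_from_leader_recently:
        return False
    return cand_term >= my_term
```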

The appearance of multiple Candidates or multiple Leaders in the cluster is usually caused by network problems such as partitions or message loss. Multiple leaders are relatively rare, but multiple Candidates are likely to appear during the initial period when cluster nodes have started but have not yet elected a leader.

The Candidate continues to request votes from other Followers. Followers that have already voted in this term reject the request, and when their terms are equal, a Candidate rejects another Candidate’s request. If no leader emerges in the first round, each Candidate waits a fresh random interval (150-300 ms) before initiating another vote.

If a Candidate receives votes from a majority of the cluster, it becomes the Leader. If it is instead rejected by a majority, or learns that a Leader already exists, it stops requesting votes, reverts to Follower, and synchronizes its log from the Leader.

New Node Joining Cluster

New nodes joining during cluster expansion require special handling:

1. Configuration Phase:

  • Prepare new node hardware resources
  • Install the same service software version
  • Configure initial cluster information

2. Join Process:

  • New node starts in Learner mode
  • Synchronize complete state data from leader
  • Transition to official Follower after reaching synchronization threshold

3. Data Synchronization Strategy:

  • Snapshot transmission (suitable for large states)
  • Incremental log replication
  • Consistency check mechanism
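The Learner-to-Follower promotion described above is usually gated on how far the new node lags behind the leader. A sketch of such a check; the names and the threshold value are illustrative, not from any specific implementation:

```python
def ready_for_promotion(learner_match_index, leader_commit_index,
                        lag_threshold=64):
    """A new node joins as a non-voting learner and is promoted to a
    voting follower only once its replicated log is close enough to the
    leader's commit index. `lag_threshold` is an illustrative tunable."""
    return leader_commit_index - learner_match_index <= lag_threshold
```

Gating promotion this way keeps a far-behind learner from counting toward the majority, so quorum availability is not degraded while it catches up via snapshot transfer and incremental log replication.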