Raft Algorithm

Raft is an algorithm for solving consensus problems in a network of unreliable computing. ^[1] Raft was developed taking into account the shortcomings of the older Paxos algorithm. When choosing key ideas, preference was given to simpler and more practical solutions. However, despite its relative simplicity, Raft provides a safe and efficient implementation of a state machine on top of a cluster computing system. There are many open source Raft implementations in various programming languages ^[2] .

Content

Features of the Raft Algorithm

Clear phase separation. Raft offers a decomposition of the cluster management task into several loosely coupled sub-tasks, the main of which are: leader selection (voting) and protocol replication. Each of these tasks allows for a more detailed separation. This simplifies the understanding of the algorithm and reduces the risk of errors in its implementation.
Clearly distinguished leader. Raft assumes that there is always a distinct leader on the cluster. Only this leader sends new records to other nodes in the cluster. Thus, the remaining nodes follow the leader and do not interact with each other (except for the voting phase). If an external client connects to the cluster through a regular node, then all its requests are redirected to the leader and only from there they come to the nodes.
Work protocols cannot contain omissions. That is, entries are added strictly sequentially. This imposes some limitations compared to Paxos, but it allows to greatly simplify the algorithm. In addition, the specifics of applied tasks, most often, do not allow working correctly with protocols containing omissions. The fact that Paxos allows such gaps to occur is often a disadvantage of Paxos, which is very difficult to deal with.
Resize a cluster. Raft makes it easy to change the cluster configuration without stopping work: add or remove nodes.

Raft Basics

Raft is built on top of a cluster, on each of the nodes of which a certain state machine works. Raft provides reliable signal delivery to all nodes in a predetermined order. This ensures the transition of all state machines along the same sequences of states. Thus, each node is guaranteed to come into agreement with other nodes.

An important circumstance is that Raft strictly numbers all entries in the protocol of work. Entries must be strictly sequential. These numbers play an important role in the operation of the algorithm. They determine the degree of relevance of the state of the node. When choosing a leader, the most relevant site always becomes a leader. The same numbers are used to number the voting sessions. A node can vote only once for each voting request.

Leader Choice

If a regular node does not receive messages from the leader for a long time, then it switches to the “candidate” state and sends a request for voting to other nodes. Other sites vote for the candidate from whom they received the first request. If the candidate receives a message from the leader, then he withdraws his candidacy and returns to normal. If a candidate receives the majority of votes, then he becomes a leader. If he did not receive a majority (this is the case when several candidates appeared on the cluster at once and the votes were divided), then the candidate waits for a random time and initiates a new voting procedure.

The voting procedure is repeated until a leader is selected.

Protocol Replication

The leader is fully responsible for the correct protocol replication. It sends a request to all cluster nodes to add a new record and considers the transaction successful only after most nodes confirm that the data has been applied and the result has been saved on a permanent medium (usually a hard disk).

Overhead

As can be seen from the description of the algorithm, voting relies on random expectations. In order for the algorithm to work efficiently, the following relationships must be satisfied:

the time for sending a request to all nodes of the cluster should be much less than the voting time. If this condition is not met, then it becomes difficult for the cluster to choose a leader due to the separation of votes between the candidates.
voting time should be much less than the normal cluster operation time. That is, the failure of nodes and voting should happen quite rarely.

In applied problems, these conditions are most often easily ensured. The interaction time of nodes is usually measured in milliseconds. Voting time - in seconds. Normal operation time between failures - months.

Interesting Facts

The algorithm, very similar to Raft, began to be used in Yandex key systems long before the appearance of the original article. ^[3]

Despite the common opposition between Raft and Paxos, in essence Raft is a protocol of the Paxos family, and has many common implementation details with MultiPaxos, ZAB (Zookeeper Atomic Broadcast) and others.

Notes

↑ Ongaro, Diego In Search of an Understandable Consensus Algorithm (unopened) (unavailable link) (2013). Date of treatment September 2, 2015. Archived on September 8, 2014.
↑ Raft Consensus Algorithm (Neopr.) (2014).
↑ YT - a new platform for distributed computing (neopr.) .

Links

lecture (Raft user study )

[paper-1] Ongaro, Diego In Search of an Understandable Consensus Algorithm (unopened) (unavailable link) (2013). Date of treatment September 2, 2015. Archived on September 8, 2014.

[website-2] Raft Consensus Algorithm (Neopr.) (2014).

[ya-3] YT - a new platform for distributed computing (neopr.) .