Weekly (Issue 14): Rereading the Cluster Membership Change Algorithm in Raft Papers (2): Practice


Introduction: When I read Raft’s big paper before, I didn’t understand the part of “cluster change”. So I recently re-read this part of the big thesis, and the following are some records I made during the re-reading. This part of the content is intended to be divided into two articles, the first one explains the theoretical basis of the member change process, and the second one explains the problems in practice.


Rereading the Cluster Membership Change Algorithm in Raft Papers (2): Practice

Problems with single-step member changes

correctness problem

Correctness issues can arise when changing members in a single step. As shown in the following example, at the beginning, the members of the system are a set of four nodes {a,b,c,d} . To add nodes u and v to the cluster, according to the method of changing members in a single step, the Experience: {a,b,c,d} -> {a,b,c,d,u} -> {a,b,c,d,u,v} changes, each time a node is added to the cluster inside.

The above steps look great, but consider the following example where the leader node has changed during the change process:

 C₀ = {a, b, c, d} Cᵤ = C₁ ∪ {u} Cᵥ = C₁ ∪ {v} Lᵢ: Leader in term `i` Fᵢ: Follower in term `i` ☒ : crash | u | Cᵤ F₂ Cᵤ --- | ---------------------------------- a | C₀ L₀ Cᵤ ☒ L₂ Cᵤ b | C₀ F₀ F₁ F₂ Cᵤ c | C₀ F₀ F₁ Cᵥ Cᵤ d | C₀ L₁ Cᵥ ☒ Cᵤ --- | ---------------------------------- v | Cᵥ time +--------------------------------------------> t₁ t₂ t₃ t₄ t₅ t₆ t₇ t₈

(Quoted from the pit that TiDB stepped on on Raft member changes – OpenACID Blog )

In the above process, the ordinate is the node in the cluster, the abscissa is the different time (note that it is not a term), Li represents the leader node during term i, and Fi represents the follower node during term i.

The above process is described as follows:

  • t1: Node a is elected as the leader of term 0, while nodes b and c are followers.
  • t2: The node u is added to the cluster, but this log of joining the cluster only reaches nodes a and u. Since more than half of the logs have not passed, node u has not yet successfully joined the cluster at this time.
  • t3: Node a is down.
  • t4: Since the original leader is down, the cluster needs to elect a new leader. The new leader selected is node d, which is the leader at term 1.
  • t5: Node v joins the cluster, and the log of joining the cluster reaches nodes c, d, and v. It can be seen that this log is on more than half of the nodes in the cluster at this time (because node a is down at this time, so there are only three 1 node is serving, so only 2 nodes agree that it can be submitted), so it has actually been submitted, that is, the operation of v joining the cluster is successful.
  • t6: leader d is down.
  • t7: The downed node a restores the service and sees that there is a log of adding node u to the cluster locally, so it considers that node u and b are follower nodes for this term.
  • t8: At this time, node d resumes service, and leader a will synchronize the logs that node u joined to the cluster to all nodes in the current cluster, which causes the logs that v joined the cluster and submitted before are lost.

The essence of this problem is because the change log of the previous leader has not been synchronized to more than half of the nodes in the cluster, and it goes down. At this time, the new leader changes members, which leads to the formation of two different clusters, resulting in split brain. The logs that have already been committed are overwritten.

The Raft authors describe this phenomenon in a bug in single-server membership changes .

The solution is also very simple: that is, each time the newly elected leader is not allowed to submit its local log directly, it must submit a no-op log before it can start synchronization. The description of this problem is described in the previous blog: Why can’t the Raft protocol submit the log of the previous term? – codedump web log

usability issues

In addition to the above correctness problems, single-step changes may also have availability problems: when the node to be replaced is in the same computer room, if the network of this computer room is disconnected from the network of other computer rooms in the cluster, the leader cannot be elected, so that the The cluster cannot provide services. Take a look at the example below.

availability

In the above figure, there are three nodes in the original cluster located in three computer rooms: node a in computer room 1, node b in computer room 2, and node c in computer room 3. Now due to various reasons, I want to take node a in computer room 1 offline and replace it with node d in the same computer room to continue serving in the cluster.

It can be seen that this replacement operation involves the addition of a node and the departure of a node. There may be two possible steps as follows:

  • First add a new node d and then delete node a: {a,b,c} -> {a,b,c,d} -> {b,c,d} .
  • First delete node a and then add new node d: {a,b,c} -> {b,c} -> {b,c,d} .

The two steps have their own advantages and disadvantages. The problem with the second solution is that only two nodes are serving in the middle. Once the downtime occurs again at this time, the cluster will be unavailable.

In the first solution, according to the example in the figure above, if the nodes a and d to be replaced are located in the same computer room, then if the network of this computer room is also isolated from other computer rooms, then only two nodes are serving, which At the same time, it cannot be serviced under the condition of four nodes (intermediate step).

These are two types of problems that can arise in a single-step change. As you can see, although the single-step change algorithm looks simple to implement, there are actually many details to pay attention to. Although the Raft paper argues that a single-step change is a simpler approach, mainstream implementations now use the Joint Consensus algorithm.

How the Joint Consensus Algorithm Solve Usability Problems

For the above-mentioned problem of replacing different nodes in the same computer room, the cluster may not be available (the leader cannot be selected) due to the isolation of the computer room by the network in the middle process. Let’s see how the Joint Consensus algorithm solves it.

Let’s review the steps first. If you use the Joint Consensus algorithm, you need to go through two stages of submission:

  • Commit C_Old $\bigcup$ C_New .
  • Then submit C_New .

Replacing the collection with the example here is:

  • Commit {a,b,c} $\bigcup$ {a,b,c,d} first.
  • Then commit {a,b,c,d} .

Let’s look at the possible downtime in these two phases:

  • In the first stage, the leader node is down. There are only two possible cases for this leader node. Its cluster configuration is still C_Old , or it has received C_Old $\bigcup$ C_New :
    • C_Old : Since the leader does not have the C_Old $\bigcup$ C_New node set change submitted in the first stage at this time, the logs of the followers who already have C_Old $\bigcup$ C_New node set will be truncated, and the member change will fail. Return to the C_Old collection.
    • C_Old $\bigcup$ C_New : This means that the leader has already changed the set of C_Old $\bigcup$ C_New nodes submitted in the first stage, and can continue to complete the unfinished member change process.

Similarly, when the leader node goes down in the second stage, it will not lead to the failure of the leader to be elected, which can be derived in a similar way.

It can be seen that using the Joint Consensus algorithm directly does not have the usability problem of single-step change.

Summarize

  • The single-step change algorithm of Raft cluster, although it looks “simple”, there are many details that need to be paid attention to in practice.
  • Although the paper mentions that the single-step change algorithm is simpler than the Joint Consensus algorithm, many open source Raft implementations already use the Joint Consensus algorithm as the default implementation.

(I wrote the implementation analysis of etcd 3.5 before, see: etcd 3.5 joint consensus implementation analysis – codedump’s network log )

References

This article is reprinted from: https://www.codedump.info/post/20220507-weekly-14/
This site is for inclusion only, and the copyright belongs to the original author.

Leave a Comment