Resiliency and Fault Tolerance
In this section, we will walk through different types of failure scenarios and discuss how MicroRaft handles each of them. We will use MicroRaft's local testing utilities to demonstrate those failure scenarios. These utilities are mainly used for testing MicroRaft itself extensively without requiring a distributed setting. Here, we will use them to run a Raft group in a single JVM process and inject different types of failures into the system.
In terms of safety, the fundamental guarantee of the Raft consensus algorithm, and hence of MicroRaft, is that operations are committed in a single global order and a committed operation is never lost, as long as there is no Byzantine failure in the system. In MicroRaft, restarting a Raft node that has no persistence layer with the same identity, or restarting it with corrupted persisted state, are examples of Byzantine failure.
The availability of a Raft group mainly depends on whether the majority (i.e., more than half) of the Raft nodes are alive and able to communicate with each other. The main rule is that 2f + 1 Raft nodes tolerate the failure of f Raft nodes. For instance, a 3-node Raft group can tolerate the failure of 1 Raft node, and a 5-node Raft group can tolerate the failure of 2 Raft nodes without losing availability.
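As a quick sanity check, the majority quorum size can be computed with a plain Java helper; this is just the arithmetic behind the rule above, not part of the MicroRaft API:

// Majority quorum size of an N-node Raft group: floor(N / 2) + 1.
// Hence a group of 2f + 1 Raft nodes tolerates the failure of f Raft nodes.
static int majorityQuorumSize(int groupSize) {
    return groupSize / 2 + 1;
}

// majorityQuorumSize(3) == 2 -> a 3-node Raft group tolerates 1 failure
// majorityQuorumSize(5) == 3 -> a 5-node Raft group tolerates 2 failures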
1. Handling high system load
Even if the majority of a Raft group is alive, we may encounter unavailability issues if the Raft group is under high load and cannot keep up with the request rate. In this case, the leader Raft node temporarily stops accepting new requests and completes the futures returned from the RaftNode methods with CannotReplicateException. This exception means that there are too many operations pending to be committed in the leader's local Raft log, or too many queries pending to be executed, so the leader temporarily rejects new requests. Clients should apply some backoff before retrying their requests.
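A minimal sketch of that backoff idea is below. It assumes we hold a RaftNode handle for the leader; the helper itself is hypothetical, the import paths are my best guess, and the exact return types of RaftNode.replicate() may differ between MicroRaft versions:

import java.util.concurrent.CompletionException;

import io.microraft.RaftNode;
import io.microraft.exception.CannotReplicateException;

public final class ReplicateWithBackoff {

    // Hypothetical helper: retries replicate() with exponential backoff while the
    // leader temporarily rejects new requests because its pending-entries buffer is full.
    public static Object replicate(RaftNode leader, Object operation) throws InterruptedException {
        long backoffMillis = 50;
        while (true) {
            try {
                // join() surfaces an exceptional completion as CompletionException
                return leader.replicate(operation).join().getResult();
            } catch (CompletionException e) {
                if (!(e.getCause() instanceof CannotReplicateException)) {
                    throw e; // other failures are not retried here
                }
                Thread.sleep(backoffMillis);
                backoffMillis = Math.min(backoffMillis * 2, 1_000); // cap the backoff at 1 second
            }
        }
    }
}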
We will demonstrate this scenario in a test below with a 3-node Raft group. In MicroRaft, a leader does not replicate log entries one by one. Instead, it keeps a buffer for incoming requests and replicates the log entries to the followers in batches in order to improve the throughput. Once this buffer is filled up, the leader stops accepting new requests. In this test, we allow the pending log entries buffer to keep at most 10 requests. We also slow down our followers synthetically by making them sleep for 3 seconds. Then, we start sending requests to the leader. After some time, our requests fail with CannotReplicateException.
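For reference, the size of this pending log entries buffer is a configuration knob. A sketch of setting it is below; the builder method name is my assumption based on the setting's name, so please check RaftConfig and the Configuration section for the exact spelling:

// Assumed builder method name for the "max pending log entry count" setting.
RaftConfig config = RaftConfig.newBuilder()
        .setMaxPendingLogEntryCount(10) // the leader rejects new requests beyond 10 pending entries
        .build();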
To run this test on your machine, try the following:
$ gh repo clone MicroRaft/MicroRaft
$ cd MicroRaft && ./mvnw clean test -Dtest=io.microraft.faulttolerance.HighLoadTest -DfailIfNoTests=false -Ptutorial
You can also see it in the MicroRaft GitHub repository.
2. Minority failure
Failure of the minority (i.e., less than half of the Raft nodes) may cause the Raft group to lose availability temporarily, but eventually the Raft group continues to accept and commit new requests. If we have a persistence implementation (i.e., RaftStore), we can recover failed Raft nodes. On the other hand, if we don't have persistence or cannot recover the persisted Raft data, we can remove failed Raft nodes from the Raft group. Please note that when we remove a Raft node from a Raft group, the majority quorum size is re-calculated based on the new size of the Raft group. In order to replace a non-recoverable Raft node without hurting the overall availability of the Raft group, we should remove the crashed Raft node first and then add a fresh new one.
If Raft nodes are
created without an actual RaftStore
implementation in the beginning,
restarting crashed Raft nodes with the same Raft endpoint identity breaks the
safety of the Raft consensus algorithm. Therefore, when there is no persistence
layer, the only recovery option for a failed Raft node is to remove it from the
Raft group, which is possible only if the majority of the Raft group is up and
running.
To restart a crashed or terminated Raft node, we can read its persisted state into a RestoredRaftState object and then use this object to restore the Raft node. Please note that terminating a Raft node without a persistence layer implementation is equivalent to a crash, since there is no way to restore the Raft node with its Raft state afterwards.
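A rough sketch of that restore flow is below. The RaftNodeBuilder method names (setRestoredState(), setStore(), and the rest of the wiring) are assumptions, and readPersistedState() / openRaftStore() are hypothetical helpers, so please check the RaftNodeBuilder javadoc for the exact API:

// Sketch only: read the persisted state back and build a new RaftNode from it.
RestoredRaftState restoredState = readPersistedState(); // e.g. via your RaftStore implementation
RaftStore raftStore = openRaftStore();                   // hypothetical helper

RaftNode restoredNode = RaftNode.newBuilder()
        .setGroupId(groupId)
        .setRestoredState(restoredState) // used instead of setInitialGroupMembers(...)
        .setStore(raftStore)             // keep persisting new Raft state
        // + transport, state machine, etc., wired exactly as in the initial setup
        .build();

restoredNode.start(); // discovers the current leader and replays committed operations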
MicroRaft provides a basic in-memory RaftStore implementation to enable crash-recovery testing. In the following code sample, we use this utility, i.e., InMemoryRaftStore, to demonstrate how to recover from Raft node failures.
To run this test on your machine, try the following:
$ gh repo clone MicroRaft/MicroRaft
$ cd MicroRaft && ./mvnw clean test -Dtest=io.microraft.faulttolerance.RestoreCrashedRaftNodeTest -DfailIfNoTests=false -Ptutorial
You can also see it in the MicroRaft GitHub repository.
This time we provide a factory object to enable LocalRaftGroup to create InMemoryRaftStore objects while configuring our Raft nodes. Hence, after we terminate our Raft nodes, we will be able to read their persisted state. Once we start the Raft group, we commit a value via the leader, observe that value with a local query on a follower, and crash a follower. Then, we read its persisted state via our InMemoryRaftStore object and restore the follower. Please ignore the details of RaftTestUtils.getRestoredState() and RaftTestUtils.getRaftStore(). Once the follower starts running again, it talks to the other Raft nodes, discovers the current leader Raft node and its commit index, and replays all committed operations on its state machine.
Our sysout
lines in this test print the following:
replicate result: value, commit index: 1
monotonic local query successful on follower. query result: value, commit index: 1
monotonic local query successful on restarted follower. query result: value, commit index: 1
When a Raft node starts with a restored Raft state, it discovers the current commit index and replays the Raft log, i.e., it automatically applies all the log entries up to the commit index. We should be careful about operations that have side effects, because the Raft log replay process triggers those side effects again. Please refer to the State Machine section for more details.
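For example, a state machine that triggers an external side effect could tag it with the commit index, so that the downstream system can deduplicate replayed operations. A hypothetical fragment is below; apply() and notifier are placeholders, and the runOperation() signature should be checked against the StateMachine interface:

// Hypothetical StateMachine fragment: the side effect carries the commit index,
// so replaying the log after a restore does not look like a brand-new event downstream.
@Override
public Object runOperation(long commitIndex, Object operation) {
    Object result = apply(operation);      // pure in-memory state change, safe to replay
    notifier.publish(commitIndex, result); // downstream deduplicates on the commit index
    return result;
}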
3. Raft leader failure
When a leader Raft node fails, the Raft group temporarily loses availability until the other Raft nodes notice the failure and elect a new leader. How quickly the leader's failure is detected depends on the leader heartbeat timeout configuration. Please refer to the Configuration section to learn more about the leader election timeout and leader heartbeat timeout configuration parameters.
If a client notices that the current leader is not responding, it can contact the other Raft nodes in the Raft group in a round-robin fashion and query the leader via the RaftNode.getReport() API. If the leader has actually crashed, the followers eventually notice its failure and elect a new leader. Then, our client will be able to discover the new leader Raft endpoint via this API. However, if a client cannot communicate with an alive leader because of an environmental issue, such as a network problem, it cannot replicate new operations or run QueryPolicy.LINEARIZABLE and QueryPolicy.LEADER_LEASE queries. It means that the Raft group is unavailable for this particular client. This is due to MicroRaft's simplicity-oriented design philosophy. In MicroRaft, when a follower Raft node receives an API call that requires the leadership role, it does not internally forward the call to the leader Raft node. Instead, it fails the call with NotLeaderException. Please note that this mechanism can also be used for leader discovery. When a client needs to discover the leader, it can try talking to any Raft node. If its call fails with NotLeaderException, the client can check if the exception points to the current leader Raft endpoint via NotLeaderException.getLeader(). Otherwise, it can try the same with another Raft node.
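A hedged sketch of that discovery loop is below. It assumes the client holds handles to all Raft nodes of the group (in a real deployment these would be RPC proxies); NotLeaderException.getLeader() comes from the text above, while getLocalEndpoint() and the import paths are my assumptions:

import java.util.List;
import java.util.concurrent.CompletionException;

import io.microraft.RaftEndpoint;
import io.microraft.RaftNode;
import io.microraft.exception.NotLeaderException;

// Hypothetical round-robin leader discovery based on NotLeaderException hints.
// probeOperation is typically the client's actual request or a harmless no-op.
static RaftEndpoint discoverLeader(List<RaftNode> nodes, Object probeOperation) {
    for (RaftNode node : nodes) {
        try {
            node.replicate(probeOperation).join();
            return node.getLocalEndpoint(); // the call succeeded, so this node is the leader
        } catch (CompletionException e) {
            if (e.getCause() instanceof NotLeaderException) {
                RaftEndpoint leader = ((NotLeaderException) e.getCause()).getLeader();
                if (leader != null) {
                    return leader; // the follower already knows the current leader
                }
            }
            // otherwise (e.g. the node is unreachable) just try the next Raft node
        }
    }
    return null; // no leader known yet; back off and retry
}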
If a Raft leader crashes before a client receives a response for an operation passed to RaftNode.replicate(), there are multiple possibilities:
- If the leader failed before replicating the operation to any follower, the operation certainly won't be committed.
- If the failed leader replicated the operation to at least one follower, the operation might be committed if a follower having that operation becomes the new leader. However, another follower could become the new leader and overwrite that operation if the crashed leader did not manage to replicate it to the majority.
- The good thing about queries is that they are idempotent, so clients can safely retry them on the new leader.
It is up to the client to decide whether to retry an operation whose result was not received, because a retry could cause the operation to be committed twice, depending on the actual failure scenario.
MicroRaft goes for simplicity and does not employ deduplication (I have plans to implement an opt-in deduplication mechanism in the future). If deduplication is needed, it can be done inside StateMachine implementations for now, as sketched below.
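A hypothetical sketch of such a deduplication scheme: each operation carries a client id and a per-client sequence number, and the state machine remembers the last applied sequence for every client. None of these types come from MicroRaft, and the dedup map would also have to be included in snapshots:

import java.util.HashMap;
import java.util.Map;

// Hypothetical operation type carrying client-provided identifiers for deduplication.
record ClientOperation(String clientId, long sequence, Object payload) {}

// Fragment of a StateMachine implementation that drops duplicate operations.
private final Map<String, Long> lastAppliedSequence = new HashMap<>();

public Object runOperation(long commitIndex, Object operation) {
    ClientOperation op = (ClientOperation) operation;
    Long last = lastAppliedSequence.get(op.clientId());
    if (last != null && op.sequence() <= last) {
        // Duplicate of an already committed operation; a real implementation would
        // return the previously computed result instead of re-applying the operation.
        return null;
    }
    lastAppliedSequence.put(op.clientId(), op.sequence());
    return apply(op.payload()); // hypothetical state mutation
}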
We will see the second scenario described above in a code sample. In the following test, we replicate an operation via the Raft leader but block the responses sent back from the followers. Even though the leader manages to replicate our operation to the majority, it is not able to commit it because it cannot learn that the followers have also appended the operation to their logs. At this point, we crash the leader. We won't get any response for our operation since the leader is gone, so we simply re-replicate it with the new leader. The catch is that the previous leader already replicated our first operation to the majority, so the new leader will commit it. Since we replicate it a second time with the new leader, we cause a duplicate commit. When we query the new leader, we see that there are 2 values applied to the state machine.
To run this test on your machine, try the following:
$ gh repo clone MicroRaft/MicroRaft
$ cd MicroRaft && ./mvnw clean test -Dtest=io.microraft.faulttolerance.RaftLeaderFailureTest -DfailIfNoTests=false -Ptutorial
You can also see it in the MicroRaft GitHub repository.
Another trick could be designing our operations in an idempotent way and retrying them automatically on leader failures, because duplicate commits do no harm for idempotent operations. However, it is not easy to make every type of operation idempotent.
4. Majority failure
Failure of the majority causes the Raft group to lose its availability and stop handling new requests. The only recovery option is to recover some of the failed Raft nodes so that the majority becomes available again. Otherwise, the Raft group cannot be recovered. MicroRaft does not support any unsafe recovery policy for now.
The duration of unavailability depends on how long the majority of the Raft nodes remain crashed. Clients won't be able to replicate any new operations or run linearizable queries in the meantime. However, we can still run local queries, because QueryPolicy.EVENTUAL_CONSISTENCY does not require the availability of the majority.
In MicroRaft, on each heartbeat tick, a leader Raft node checks if it is still in charge, i.e., whether it has received Append Entries RPC responses from at least majority quorum size - 1 followers (the majority quorum size minus the leader itself) in the last leader heartbeat timeout period. For instance, in a 3-node Raft group with a 5-second leader heartbeat timeout, a Raft leader keeps its leadership role as long as at least 1 follower has sent an Append Entries RPC response in the last 5 seconds. Otherwise, the leader Raft node demotes itself to the follower role and fails pending (i.e., locally appended but not yet committed) operations with IndeterminateStateException. This behaviour is due to the asynchronous nature of distributed systems. When the leader cannot get Append Entries RPC responses from some of its followers, it cannot accurately decide whether those followers have actually crashed or are just temporarily unreachable. If those unresponsive followers are actually alive and can form the majority, they can also elect a new leader among themselves and commit operations replicated by the previous leader. Hence, MicroRaft takes a defensive approach here and makes the leader Raft node step down from the leadership role.
It is up to the client to decide whether to retry an operation whose future is completed with IndeterminateStateException, because a retry could cause the operation to be committed twice. As noted above, MicroRaft does not employ deduplication; if deduplication is needed, it can be done inside StateMachine implementations for now.
We will see another code sample to demonstrate how to recover from a majority failure. In this part we use the InMemoryRaftStore utility from RestoreCrashedRaftNodeTest above. We start a 3-node Raft group, commit an operation, and terminate both followers. Then, we try to replicate a new operation. However, within a few seconds the leader notices that it has not received Append Entries RPC responses from the majority and steps down from the leadership role. Because of that, it also fails our operation with IndeterminateStateException. Since it is a follower now, it directly rejects new RaftNode.replicate() calls with NotLeaderException. At this point, our Raft group is unavailable for RaftNode.replicate() calls and for RaftNode.query() calls with QueryPolicy.LINEARIZABLE and QueryPolicy.LEADER_LEASE, but we can still perform a local query with QueryPolicy.EVENTUAL_CONSISTENCY, as sketched after this paragraph. If we want to make the Raft group available again, we don't need to restore all crashed Raft nodes. In this particular scenario, it is sufficient to restore only 1 Raft node so that we have the majority alive again. That is what we do in the last part of the test. Once we have 2 Raft nodes running again, they are able to elect a new leader.
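The local query mentioned above could look roughly like the following. The exact query() signature (in particular the minimum commit index argument) varies between MicroRaft versions, so treat this as an assumption and check the RaftNode javadoc; follower and queryOperation are placeholders:

// Assumed query() signature: operation, query policy, minimum commit index.
Ordered<Object> queryResult = follower
        .query(queryOperation, QueryPolicy.EVENTUAL_CONSISTENCY, 0)
        .join();

System.out.println("local query result: " + queryResult.getResult()
        + ", commit index: " + queryResult.getCommitIndex());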
In this example, we waited until the leader demoted itself to the follower role before restarting the crashed Raft nodes. This is not a requirement for restoring crashed Raft nodes; we did it here only for the sake of the example. We can restore a crashed Raft node at any time, and if the leader Raft node is still running, there may not even be a new leader election round; the restarted Raft node may simply discover the current leader. Our Raft group restores its availability as long as there is a leader Raft node talking to the majority (including itself).
To run this test on your machine, try the following:
$ gh repo clone MicroRaft/MicroRaft
$ cd MicroRaft && ./mvnw clean test -Dtest=io.microraft.faulttolerance.MajorityFailureTest -DfailIfNoTests=false -Ptutorial
You can also see it in the MicroRaft GitHub repository.
Please note that you need to have a persistence layer (i.e., a RaftStore implementation) to make this recovery option work. If a crashed Raft node is restarted with the same identity but an empty state, it turns into a Byzantine-failure scenario, where already committed operations can be lost and the consistency of the system can be broken. Please see the Corruption or loss of persistent Raft state part for more details.
5. Network partitions
Behaviour of a Raft group during a network partition depends on how the Raft nodes are divided between the different sides of the partition, and which Raft nodes our clients are interacting with. If any subset of the Raft nodes manages to form the majority, they remain available. If the Raft leader falls into the minority side, the Raft nodes on the majority side elect a new leader and restore their availability.
If our clients cannot talk to the majority side, it means that the Raft group is unavailable from the perspective of the clients.
Similar to the majority failure case described in the previous part, if the leader Raft node falls into a minority side of the network partition, it demotes itself to the follower role after the leader heartbeat timeout elapses and fails all pending operations with IndeterminateStateException. To reiterate, this exception means that the demoted leader cannot tell whether those operations have been committed or not.
When the network problem is resolved, the Raft nodes connect to each other again. The Raft nodes that were on the minority side of the network partition catch up with the other Raft nodes, and the Raft group continues its normal operation.
One of the key points of the Raft consensus algorithm's and hence MicroRaft's network partition behaviour is the absence of split-brain. In any network partition scenario, there can be at most one functional leader.
We will see how our Raft nodes behave in a network partitioning scenario in the following test. Again, we have a 3-node Raft group here, and we create an artificial network disconnection between the leader and the followers. Since the leader cannot talk to the majority anymore, after the leader heartbeat timeout duration elapses, it demotes itself to the follower role. The followers on the other side elect a new leader among themselves and even commit a new operation. Once we fix the network problem, we see that the old leader connects back to the other Raft nodes, discovers the new leader, and gets the newly committed operation. Phew!
To run this test on your machine, try the following:
$ gh repo clone MicroRaft/MicroRaft
$ cd MicroRaft && ./mvnw clean test -Dtest=io.microraft.faulttolerance.NetworkPartitionTest -DfailIfNoTests=false -Ptutorial
You can also see it in the MicroRaft GitHub repository.
6. Corruption or loss of persistent Raft state
If a RestoredRaftState object is created with corrupted or partially-restored Raft state, the safety guarantees of the Raft consensus algorithm no longer hold. For instance, if a flushed log entry is missing from the RestoredRaftState object, the restored RaftNode may lack a committed operation. If that Raft node becomes the leader, it may commit another operation for the same log index as the lost operation and break the safety property of the Raft consensus algorithm.
RaftStore documents all the durability and integrity guarantees required from its implementations. Hence, it is the responsibility of RaftStore implementations to ensure the durability and integrity of the persisted Raft state. RaftNode does not perform any error checks when it is restored with a RestoredRaftState object.