Implementing the log
November 12, 2022 | Ensar Basri Kahveci
This article is the second in the ins and outs of MicroRaft series. Here we dissect the log.
MicroRaft implements the log with 2 components: RaftLog
and RaftStore
.
RaftLog is an internal component that keeps log entries in memory. On the other
hand, RaftStore is a public API. It is implemented by users. RaftNode does not
interact with RaftStore directly. RaftNode manipulates the log via RaftLog, and
RaftLog calls RaftStore to reflect changes to disk.
RaftLog is implemented as a fixed-size ring-buffer. As shown in Figure 1, it consists of 3 sections from head to tail: snapshotted, committed, and in progress.
Appending new log entries
New log entries are appended to the in progress section at the tail. Its size
is specified by RaftConfig.getMaxPendingLogEntryCount()
. Log entries reside in
this section until they are committed or truncated. When a client calls
RaftNode.replicate()
, if the in progress section has no empty slots, the
client gets a CannotReplicateException
. This exception means that the
operation is not appended to the log because there are too many in progress log
entries at the moment. A client can retry its operation later upon receiving
this exception. By this way, we prevent OOME on a leader Raft node if it cannot
keep up with the request rate. The size of the in progress section should be
decided by taking the degree of the concurrency of the clients into account.
Log entries in the in progress section can be truncated before they are
committed in some failure scenarios. Consider the scenario where a leader
appends a new log entry to an index, but disconnects from the rest of the Raft
group before it replicates the new log entry to the other Raft nodes. After the
leader heartbeat timeout elapses, the other Raft nodes can elect a new leader
and the new leader can append a new log entry to the same log index. Once the
previous leader reconnects to the other Raft nodes, it notices that a new
leader has appended a new log entry to the same log index. In this case, the
client of the initial log entry gets a NotLeaderException
. This exception
means that the client's operation is not committed and the client can retry its
operation on the new leader safely.
Committing log entries
The commit index moves from head to tail. Once a log entry is committed, it is
moved to the committed section. A new snapshot is taken from StateMachine
when the committed section is full. The size of the committed section is
specified by RaftConfig.getCommitCountToTakeSnapshot()
. Since all Raft nodes
are created with the same config, they take snapshots at the same log indices.
For instance, when RaftConfig.getCommitCountToTakeSnapshot()
is 10000, Raft
nodes take snapshots at log indices: 10000, 20000, 30000, and so on. This
deterministic behaviour enables a snapshotting optimization which we will cover
later.
Truncating committed log entries after snapshots
Recall that log entries are committed once they are replicated to the majority.
When a leader Raft node decides to take a snapshot, there can be some Raft nodes
that have not received the recently committed log entries yet. If the leader
immediately deletes all log entries preceding the snapshot, it may need to send
a snapshot to those Raft nodes instead of the recent log entries. MicroRaft
applies a simple heuristic to prevent this situation. Upon taking a new
snapshot, the leader Raft node checks the smallest match index of the minority
followers. If the difference between the leader's commit index and the smallest
follower match index is less than 0.1 *
RaftConfig.getCommitCountToTakeSnapshot()
, the leader moves all log entries
after that match index to the snapshotted section. If there is no such
follower, i.e., all minority followers are far behind, the leader truncates all
log entries preceding the snapshotted log index.
Followers apply the same heuristic with a slight difference. Since followers do
not know each other's match indices, when a follower Raft node takes a snapshot,
it moves the 0.1 * RaftConfig.getCommitCountToTakeSnapshot()
log entries
preceding the snapshot index to the snapshotted section. This behaviour is
useful in case the current leader crashes and a follower becomes the new leader
just after it takes a snapshot.
Assume that RaftConfig.getCommitCountToTakeSnapshot()
is 1000 and the current
commit index is 5000. If there is any follower whose match index is greater than
or equal to 4500 when the leader takes a snapshot, the leader moves the log
entries after that match index into the snapshotted area. Otherwise, the
leader truncates all log entries preceding 5000. When a follower takes a
snapshot, it moves the log entries between 4500 and 5000 into the snapshotted
area.
Amortizing the cost of disk writes
RaftStore is designed to amortize the cost of disk writes. RaftStore has 2
methods to reflect changes in RaftLog to disk: RaftStore.persistLogEntry()
and
RaftStore.truncateLogEntriesFrom()
. Both of these methods are called before a log entry
is committed. RaftStore implementations are allowed to buffer disk writes caused
by these methods instead of immediately flushing them. There is also a third
method to flush all buffered disk writes: RaftStore.flush()
. When this method
is called by Raft node, the RaftStore implementation must guarantee the
durability of all buffered writes on the disk.
Since flushing typically involves costly fsync
calls on the kernel level,
MicroRaft amortizes the cost of disk writes by performing multiple
RaftStore.persistLogEntry()
calls before a RaftStore.flush()
call.
Wrap up
In this article, we explored the details of MicroRaft's log implementation. Next, we will investigate how MicroRaft realizes log replication.