Queries in MicroRaft
April 5, 2023 | Ensar Basri Kahveci
This article is the fourth in the ins and outs of MicroRaft series. Here we walk through the querying capabilities available in MicroRaft. You can see some code samples here.
For impatient readers, TLDR is as follows:
-
MicroRaft separates mutating and query operations via
RaftNode.replicate()
andRaftNode.query()
APIs. These APIs returnOrdered
objects.Ordered
wraps the actual result of an operation along with the commit index at which the operation is executed. MicroRaft offers 3 policies to execute queries with different consistency and performance characteristics viaQueryPolicy
:QueryPolicy.LINEARIZABLE
,QueryPolicy.LEADER_LEASE
, andQueryPolicy.EVENTUAL_CONSISTENCY
. -
QueryPolicy.LINEARIZABLE
achieves linearizable reads without growing to the internal Raft log. Before executing a query, the leader Raft node runs a round of AppendEntries RPC to ensure that it has not superseded by another Raft node. This approach prevents the log to grow because of queries and avoids fsync cost. -
QueryPolicy.LEADER_LEASE
enables the leader to locally execute queries without communicating with the other Raft nodes. The leader maintains a lease on top of the periodic heartbeating mechanism. In short, as long as the clock skews, network delays and processing delays do not break the timing assumptions around the leader heartbeat timeout parameter, MicroRaft ensures that a new leader is not elected while the current leader is alive. Thus, the leader Raft node can achive linearizable reads without talking to the other nodes for queries. -
QueryPolicy.EVENTUAL_CONSISTENCY
allows running a query operation locally on any Raft node. This policy helps clients to scale their query workload by distributing queries over multiple Raft nodes. In addition, clients can track commit indices they observe viaOrdered
objects, and pass them toRaftNode.query()
calls to achieve monotonic reads and read your writes.
Interested reads can keep reading to learn more details.
An operation passed to RaftNode.replicate()
on the leader is committed and
executed once it is appended to the majority of the Raft group.
If the operation is read-only, i.e., it performs a query on the state machine
instead of mutating it, we still pay the cost of a disk write and fsync since
a new log entry is appended to the log. Moreover, if too many queries are
appended, MicroRaft can trigger the snapshotting mechanism to truncate the log
even if there is little change in the state machine. This is definitely
a sub-optimal situation.
Section 6.4 of the Raft dissertation discusses a number of techniques to bypass the log and process queries with better performance. However, we should be careful about the consistency of the state we observe via queries. Performance and consistency is usually traded for each other, and Raft is no exception.
Here we explore how MicroRaft implements the techniques discussed in the Raft
dissertation. MicroRaft optimizes querying with 3 options, ranging from the one
with the smallest performance benefit and strongest consistency guarantee
to the one with the greatest performance benefit and weakest consistency
guarantee which allows achieving monotonic reads. For this, MicroRaft separates
queries from mutating operations with an API: RaftNode.query()
. This API
expects a QueryPolicy
, which is an enum with 3 possible values:
QueryPolicy.LINEARIZABLE
, QueryPolicy.LEADER_LEASE
, and
QueryPolicy.EVENTUAL_CONSISTENCY
.
QueryPolicy.LINEARIZABLE
We can use this policy achieve linearizability queries by bypassing the log.
When the leader Raft node receives a query, it saves the current commit index
and the query into an internal query state object. Then, it sends an
AppendEntriesRequest
to followers and waits for a majority acknowledgement. This request does not
contain the query itself and a small piece of information about waiting queries
are piggybacked into regular AppendEntriesRequest
objects via
AppendEntriesRequest.getQuerySequenceNumber()
.
Query sequence number works as a logic clock in each term and is incremented
every time a round of AppendEntries RPC is initiated for a query. Once
the majority Raft nodes acknowledge the current query sequence number,
the leader Raft node learns that there was no other leader at the moment
AppendEntriesRequest
s were sent, hence executes the query on the state
machine and replies to the client.
MicroRaft amortizes the cost of leadership confirmation for multiple queries.
The leader accumulates all new queries received while there is an ongoing
AppendEntries RPC round for another query and executes all queries once
it receives a majority ack. To prevent OOMs on the leader RaftNode, MicroRaft
limits the number of pending queries by
RaftConfig.getMaxPendingLogEntryCount()
.
If there is no available query slot, RaftNode.query()
fails with
CannotReplicateException
,
so that clients can retry their queries later.
The leader Raft node must have committed a log entry on its term to make this
mechanism work. This is because the leader may not know the current commit index
after it is elected. It appends a custom entry to the log and replicates this
entry to discover the current commit index. In MicroRaft, this log entry
contains the operation provided by StateMachine.getNewTermOperation()
.
RaftNode.query()
calls fail with CannotReplicateException
when this
operation is not committed yet, so that clients can retry their queries later.
It can happen that the leader is partitioned away and the other Raft nodes have
already elected a new leader. In this case, the old leader cannot succeed
leadership confirmation for inflight queries. In MicroRaft, a leader
automatically steps down from leadership [TODO LINK HERE]
if the leader heartbeat timeout elapses before it receives
AppendEntries RPC responses from the half of the Raft group. It also fails
all pending queries with NotLeaderException
to enable clients retry their queries on other Raft nodes.
This overall mechanism is more efficient than appending queries to the log, since it avoids disk writes and fsync for queries. However, the leader still waits for a round of AppendEntries RPC to preserve linearizability.
QueryPolicy.LEADER_LEASE
QueryPolicy.LINEARIZABLE
achieves linearizability for queries by performing a
round of AppendEntries RPC among the majority. This approach is still safe
when clocks go rogue or messages get arbitrarily delayed (i.e.,
the asynchronous system model). MicroRaft offers another policy,
QueryPolicy.LEADER_LOCAL
, to run queries in a linearizable manner without
communication between Raft nodes while a very strong timing assumption holds
(i.e., the synchronous system model).
In MicroRaft, a follower does not vote for another Raft node if it has received
an AppendEntriesRequest
in the last leader heartbeat timeout period. This
is called leader stickiness. Moreover, as we described above, a leader
automatically steps down from leadership if the leader heartbeat timeout
elapses before it receives AppendEntries RPC responses from the half of the
Raft group. These two techniques ensure that no new leader could be elected
while the current leader is alive, as long as clock drifts, network delays and
processing delays do not break the timing assumptions. Thus, the leader Raft
node can maintain a lease on top of the periodic heartbeating mechanism. It can
execute queries locally without communicating with the other Raft nodes and
still preserve linearizability for query results.
If the timing assumption around the leader heartbeat timeout is violated for
any reason, the leader could return stale results for locally executed queries.
Therefore, users of MicroRaft should carefully take operational challenges into
account while adjusting the leader heartbeat timeout parameter. If
linearizability is strictly required, it is better to use
QueryPolicy.LINEARIZABLE
instead of QueryPolicy.LEADER_LEASE
.
QueryPolicy.EVENTUAL_CONSISTENCY
QueryPolicy.LINEARIZABLE
and QueryPolicy.LEADER_LEASE
require a query to be
executed by the leader Raft node. We can also execute queries on followers to
distribute the read workload via the last query policy option:
QueryPolicy.EVENTUAL_CONSISTENCY
. Any Raft node can execute a query
with QueryPolicy.EVENTUAL_CONSISTENCY
and return the result back to caller.
This policy does not guarantee to run queries on the most up to date state
machine, hence queries can return stale results. However, clients can still
preserve monotonic reads and read your writes guarantees for queries.
For monotonic reads, a simple solution is to make a client send its queries to
the same Raft node. This would be sufficient to preserve monotonicity for
the client as long as the Raft node is alive. However, if the Raft node fails,
the client can switch to another Raft node that is not as up-to-date as
the failed one, and its further queries return non-monotonic results. MicroRaft
offers a solution to prevent this problem. RaftNode.replicate()
and
RaftNode.query()
APIs return Ordered
objects. Ordered
wraps the actual
result of an operation along with the commit index at which the operation is
executed. Clients can track commit indices reported via Ordered
objects
returned from queries and use them to ensure monotonic reads with this policy.
RaftNode.query()
contains two optional parameters: minCommitIndex
and
timeout
for this. If the commit index of a Raft node becomes greater than or
equal to the requested minimum commit index for the given timeout duration,
the Raft node locally executes the query. Otherwise, it fails the query with
LaggingCommitIndexException
so that the client can retry on another Raft
node. Clients can tune the timeout parameter to increase the likelihood of
a follower / learner Raft node to apply recently committed log entries.
Clients can achieve read your writes with a similar approach if they are also
doing mutations via RaftNode.replicate()
in addition to queries. A client
can send its mutations to the leader Raft node, learn the commit index of its
mutation via the returned Ordered
object, and pass this commit index to
queries it sends to the followers to ensure that it will read its own writes.
Note: Before the version 0.6, MicroRaft also had
QueryPolicy#BOUNDED_STALENESS
which was incorrectly named for the guarantees
it offered. This policy was deleted in this commit.
What is next?
In the next post, we will explore how MicroRaft implements membership changes.