Queries in MicroRaft

April 5, 2023 | Ensar Basri Kahveci

This article is the fourth in the ins and outs of MicroRaft series. Here we walk through the querying capabilities available in MicroRaft. You can see some code samples here.

For impatient readers, TLDR is as follows:

  • MicroRaft separates mutating and query operations via RaftNode.replicate() and RaftNode.query() APIs. These APIs return Ordered objects. Ordered wraps the actual result of an operation along with the commit index at which the operation is executed. MicroRaft offers 3 policies to execute queries with different consistency and performance characteristics via QueryPolicy: QueryPolicy.LINEARIZABLE, QueryPolicy.LEADER_LEASE, and QueryPolicy.EVENTUAL_CONSISTENCY.

  • QueryPolicy.LINEARIZABLE achieves linearizable reads without growing to the internal Raft log. Before executing a query, the leader Raft node runs a round of AppendEntries RPC to ensure that it has not superseded by another Raft node. This approach prevents the log to grow because of queries and avoids fsync cost.

  • QueryPolicy.LEADER_LEASE enables the leader to locally execute queries without communicating with the other Raft nodes. The leader maintains a lease on top of the periodic heartbeating mechanism. In short, as long as the clock skews, network delays and processing delays do not break the timing assumptions around the leader heartbeat timeout parameter, MicroRaft ensures that a new leader is not elected while the current leader is alive. Thus, the leader Raft node can achive linearizable reads without talking to the other nodes for queries.

  • QueryPolicy.EVENTUAL_CONSISTENCY allows running a query operation locally on any Raft node. This policy helps clients to scale their query workload by distributing queries over multiple Raft nodes. In addition, clients can track commit indices they observe via Ordered objects, and pass them to RaftNode.query() calls to achieve monotonic reads and read your writes.

Interested reads can keep reading to learn more details.

An operation passed to RaftNode.replicate() on the leader is committed and executed once it is appended to the majority of the Raft group. If the operation is read-only, i.e., it performs a query on the state machine instead of mutating it, we still pay the cost of a disk write and fsync since a new log entry is appended to the log. Moreover, if too many queries are appended, MicroRaft can trigger the snapshotting mechanism to truncate the log even if there is little change in the state machine. This is definitely a sub-optimal situation.

Section 6.4 of the Raft dissertation discusses a number of techniques to bypass the log and process queries with better performance. However, we should be careful about the consistency of the state we observe via queries. Performance and consistency is usually traded for each other, and Raft is no exception.

Here we explore how MicroRaft implements the techniques discussed in the Raft dissertation. MicroRaft optimizes querying with 3 options, ranging from the one with the smallest performance benefit and strongest consistency guarantee to the one with the greatest performance benefit and weakest consistency guarantee which allows achieving monotonic reads. For this, MicroRaft separates queries from mutating operations with an API: RaftNode.query(). This API expects a QueryPolicy, which is an enum with 3 possible values: QueryPolicy.LINEARIZABLE, QueryPolicy.LEADER_LEASE, and QueryPolicy.EVENTUAL_CONSISTENCY.

QueryPolicy.LINEARIZABLE

We can use this policy achieve linearizability queries by bypassing the log. When the leader Raft node receives a query, it saves the current commit index and the query into an internal query state object. Then, it sends an AppendEntriesRequest to followers and waits for a majority acknowledgement. This request does not contain the query itself and a small piece of information about waiting queries are piggybacked into regular AppendEntriesRequest objects via AppendEntriesRequest.getQuerySequenceNumber(). Query sequence number works as a logic clock in each term and is incremented every time a round of AppendEntries RPC is initiated for a query. Once the majority Raft nodes acknowledge the current query sequence number, the leader Raft node learns that there was no other leader at the moment AppendEntriesRequests were sent, hence executes the query on the state machine and replies to the client.

MicroRaft amortizes the cost of leadership confirmation for multiple queries. The leader accumulates all new queries received while there is an ongoing AppendEntries RPC round for another query and executes all queries once it receives a majority ack. To prevent OOMs on the leader RaftNode, MicroRaft limits the number of pending queries by RaftConfig.getMaxPendingLogEntryCount(). If there is no available query slot, RaftNode.query() fails with CannotReplicateException, so that clients can retry their queries later.

The leader Raft node must have committed a log entry on its term to make this mechanism work. This is because the leader may not know the current commit index after it is elected. It appends a custom entry to the log and replicates this entry to discover the current commit index. In MicroRaft, this log entry contains the operation provided by StateMachine.getNewTermOperation(). RaftNode.query() calls fail with CannotReplicateException when this operation is not committed yet, so that clients can retry their queries later.

It can happen that the leader is partitioned away and the other Raft nodes have already elected a new leader. In this case, the old leader cannot succeed leadership confirmation for inflight queries. In MicroRaft, a leader automatically steps down from leadership [TODO LINK HERE] if the leader heartbeat timeout elapses before it receives AppendEntries RPC responses from the half of the Raft group. It also fails all pending queries with NotLeaderException to enable clients retry their queries on other Raft nodes.

This overall mechanism is more efficient than appending queries to the log, since it avoids disk writes and fsync for queries. However, the leader still waits for a round of AppendEntries RPC to preserve linearizability.

QueryPolicy.LEADER_LEASE

QueryPolicy.LINEARIZABLE achieves linearizability for queries by performing a round of AppendEntries RPC among the majority. This approach is still safe when clocks go rogue or messages get arbitrarily delayed (i.e., the asynchronous system model). MicroRaft offers another policy, QueryPolicy.LEADER_LOCAL, to run queries in a linearizable manner without communication between Raft nodes while a very strong timing assumption holds (i.e., the synchronous system model).

In MicroRaft, a follower does not vote for another Raft node if it has received an AppendEntriesRequest in the last leader heartbeat timeout period. This is called leader stickiness. Moreover, as we described above, a leader automatically steps down from leadership if the leader heartbeat timeout elapses before it receives AppendEntries RPC responses from the half of the Raft group. These two techniques ensure that no new leader could be elected while the current leader is alive, as long as clock drifts, network delays and processing delays do not break the timing assumptions. Thus, the leader Raft node can maintain a lease on top of the periodic heartbeating mechanism. It can execute queries locally without communicating with the other Raft nodes and still preserve linearizability for query results.

If the timing assumption around the leader heartbeat timeout is violated for any reason, the leader could return stale results for locally executed queries. Therefore, users of MicroRaft should carefully take operational challenges into account while adjusting the leader heartbeat timeout parameter. If linearizability is strictly required, it is better to use QueryPolicy.LINEARIZABLE instead of QueryPolicy.LEADER_LEASE.

QueryPolicy.EVENTUAL_CONSISTENCY

QueryPolicy.LINEARIZABLE and QueryPolicy.LEADER_LEASE require a query to be executed by the leader Raft node. We can also execute queries on followers to distribute the read workload via the last query policy option: QueryPolicy.EVENTUAL_CONSISTENCY. Any Raft node can execute a query with QueryPolicy.EVENTUAL_CONSISTENCY and return the result back to caller. This policy does not guarantee to run queries on the most up to date state machine, hence queries can return stale results. However, clients can still preserve monotonic reads and read your writes guarantees for queries.

For monotonic reads, a simple solution is to make a client send its queries to the same Raft node. This would be sufficient to preserve monotonicity for the client as long as the Raft node is alive. However, if the Raft node fails, the client can switch to another Raft node that is not as up-to-date as the failed one, and its further queries return non-monotonic results. MicroRaft offers a solution to prevent this problem. RaftNode.replicate() and RaftNode.query() APIs return Ordered objects. Ordered wraps the actual result of an operation along with the commit index at which the operation is executed. Clients can track commit indices reported via Ordered objects returned from queries and use them to ensure monotonic reads with this policy. RaftNode.query() contains two optional parameters: minCommitIndex and timeout for this. If the commit index of a Raft node becomes greater than or equal to the requested minimum commit index for the given timeout duration, the Raft node locally executes the query. Otherwise, it fails the query with LaggingCommitIndexException so that the client can retry on another Raft node. Clients can tune the timeout parameter to increase the likelihood of a follower / learner Raft node to apply recently committed log entries.

Clients can achieve read your writes with a similar approach if they are also doing mutations via RaftNode.replicate() in addition to queries. A client can send its mutations to the leader Raft node, learn the commit index of its mutation via the returned Ordered object, and pass this commit index to queries it sends to the followers to ensure that it will read its own writes.

Note: Before the version 0.6, MicroRaft also had QueryPolicy#BOUNDED_STALENESS which was incorrectly named for the guarantees it offered. This policy was deleted in this commit.

What is next?

In the next post, we will explore how MicroRaft implements membership changes.