Main Abstractions

MicroRaft's main abstractions are listed below.

RaftConfig

RaftConfig contains configuration options related to the Raft consensus algorithm and MicroRaft's implementation. Please check the Configuration section for details.

RaftNode

RaftNode runs the Raft consensus algorithm as a member of a Raft group. A Raft group is a cluster of RaftNode instances that behave as a replicated state machine. RaftNode contains APIs for replicating operations, performing queries, applying membership changes in the Raft group, handling Raft RPCs and responses, etc.

Raft nodes are identified by [group id, node id] pairs. Multiple Raft groups can run in the same environment, distributed or even in a single JVM process, and they can be discriminated from each other with unique group ids. A single JVM process can run multiple Raft nodes that belong to different Raft groups or even the same Raft group.

RaftNodes execute the Raft consensus algorithm with the Actor model. In this model, each RaftNode runs in a single-threaded manner. It uses a RaftNodeExecutor to sequentially handle API calls and RaftMessage objects sent by other RaftNodes. The communication between RaftNodes are implemented with the message-passing approach and abstracted away with the Transport interface.

In addition to these interfaces that abstract away task execution and networking, RaftNodes use StateMachine objects to execute queries and committed operations, and RaftStore objects to persist internal Raft state to stable storage to be able to recover from crashes.

All of these abstractions are explained below.

RaftEndpoint

RaftEndpoint represents an endpoint that participates to at least one Raft group and executes the Raft consensus algorithm with a RaftNode.

RaftNode differentiates members of a Raft group with a unique id. MicroRaft users need to provide a unique id for each RaftEndpoint. Other than this information, RaftEndpoint implementations can contain custom fields, such as network addresses and tags, to be utilized by Transport implementations.

RaftRole

RaftRole denotes the roles of RaftNodes as specified in the Raft consensus algorithm. Currently, MicroRaft implements the main roles defined in the paper: LEADER, CANDIDATE, and FOLLOWER. Moreover, it also implements an extension adding non-voting members: LEARNER. The WITNESS role is not implemented yet, but it's on the roadmap.

RaftNodeStatus

RaftNodeStatus denotes the statuses of a RaftNode during its own and its Raft group's lifecycle. A RaftNode is in the INITIAL status when it is created, and moves to the ACTIVE status when it is started with a RaftNode.start() call. It stays in this status until either a membership change is triggered in the Raft group, or either the Raft group or Raft node is terminated.

StateMachine

StateMachine enables users to implement arbitrary services, such as an atomic register or a key-value store, and execute operations on them. Currently, MicroRaft supports memory-based state machines with datasets in the gigabytes.

RaftNode does not deal with the actual logic of committed operations. Once a given operation is committed with the Raft consensus algorithm, i.e., it is replicated to the majority of the Raft group, the operation is passed to the provided StateMachine implementation. It is the StateMachine implementation's responsibility to ensure deterministic execution of committed operations.

Since RaftNodeExecutor ensures the thread-safe execution of the tasks submitted by RaftNodes, which include actual execution of committed user-supplied operations, StateMachine implementations do not need to be thread-safe.

RaftModel and RaftModelFactory

RaftModel is the base interface for the objects that hit network and disk. There are 2 other interfaces extending this interface: BaseLogEntry and RaftMessage. BaseLogEntry is used for representing log and snapshot entries stored in the Raft log. RaftMessage is used for Raft RPCs and their responses. Please see the interfaces inside io.microraft.model for more details. In addition, there is a RaftModelFactory interface for creating RaftModel objects with the builder pattern.

MicroRaft comes with a default POJO-style implementation of these interfaces available under the io.microraft.model.impl package. Users of MicroRaft can define serialization & deserialization strategies for the default implementation objects, or implement the RaftModel interfaces with a serialization framework, such as Protocol Buffers.

Transport

Transport is used for communicating Raft nodes with each other. Transport implementations must be able to serialize RaftMessage objects created by RaftModelFactory. MicroRaft requires a minimum set of functionality for networking. There are only two methods in this interface, one for sending a RaftMessage to a RaftEndpoint and another one for checking reachability of a RaftEndpoint.

RaftStore and RestoredRaftState

RaftStore is used for persisting the internal state of the Raft consensus algorithm. Its implementations must provide the durability guarantees defined in the interface.

If a RaftNode crashes, its persisted state could be read back from stable storage into a RestoredRaftState object and the RaftNode could be restored back. RestoredRaftState contains all the necessary information to recover RaftNode instances from crashes.

RaftStore does not persist internal state of StateMachine implementations. Upon recovery, a RaftNode starts with an empty state of the state machine, discovers the current commit index and re-executes all the operations in the Raft log up to the commit index to re-populate the state machine.

RaftNodeExecutor

RaftNodeExecutor is used by RaftNode to execute the Raft consensus algorithm with the Actor model.

A RaftNode runs by submitting tasks to its RaftNodeExecutor. All tasks submitted by a RaftNode must be executed serially, with maintaining the happens-before relationship, so that the Raft consensus algorithm and the user-provided state machine logic could be executed without synchronization.

MicroRaft contains a default implementation, DefaultRaftNodeExecutor. It internally uses a single-threaded ScheduledExecutorService and should be suitable for most of the use-cases. Users of MicroRaft can provide their own RaftNodeExecutor implementations if they want to run RaftNodes in their own threading system according to the rules defined by MicroRaft.

RaftException

RaftException is the base class for Raft-related exceptions. MicroRaft defines a number of custom exceptions to report some certain failure scenarios to clients.


How to run a Raft group

In order to run a Raft group (i.e., Raft cluster) using MicroRaft, we need to:

  • implement the RaftEndpoint interface to identify RaftNodes we are going to run,

  • provide a StateMachine implementation for the actual state machine logic (key-value store, atomic register, etc.),

  • (optional) implement the RaftModel and RaftModelFactory interfaces to create Raft RPC request / response objects and Raft log entries, or simply use the default POJO-style implementation of these interfaces available under the io.microraft.model.impl package,

  • provide a Transport implementation to realize serialization of RaftModel objects and networking,

  • (optional) provide a RaftStore implementation if we want to restore crashed Raft nodes. We can persist the internal Raft node state to stable storage via the RaftStore interface and recover from Raft node crashes by restoring persisted Raft state. Otherwise, we could use the already-existing NopRaftStore utility which makes the internal Raft state volatile and disables crash-recovery. If we don't implement persistence, crashed Raft nodes cannot be restarted and need to be removed from the Raft group to not to damage availability. Note that the lack of persistence limits the overall fault tolerance capabilities of Raft groups. Please refer to the Resiliency and Fault Tolerance section to learn more about MicroRaft's fault tolerance capabilities.

  • build a discovery and RPC mechanism to replicate and commit operations on a Raft group. This is required because simplicity is the primary concern for MicroRaft's design philosophy. MicroRaft offers a minimum API set to cover the fundamental functionality and enables its users to implement higher-level abstractions, such as an RPC system with request routing, retries, and deduplication. For instance, MicroRaft neither broadcasts the Raft group members to any external discovery system, nor integrates with any observability tool, but it exposes all necessary information, such as Raft group members, leader Raft endpoint, term, commit index, via RaftNodeReport, which can be accessed via RaftNode.getReport() and RaftNodeReportListener. We can use that information to feed our discovery services and monitoring tools. Similarly, the public APIs on RaftNode do not employ request routing or retry mechanisms. For instance, if a client tries to replicate an operation via a follower or a candidate RaftNode, RaftNode responds back with a NotLeaderException, instead of internally forwarding the operation to the leader RaftNode. NotLeaderException contains RaftEndpoint of the current leader RaftNode so that clients can send their requests to it.

In the next section, we will build an atomic register on top of MicroRaft to demonstrate how to implement and use MicroRaft's main abstractions.


Architectural overview of a Raft group

The following figure depicts an architectural overview of a Raft group based on the main abstractions explained above. Clients talk to the leader RaftNode for replicating operations. They can talk to both the leader and follower RaftNodes for running queries with different consistency guarantees. RaftNode uses RaftModelFactory to create Raft log entries, snapshot entries, and Raft RPC request and response objects. Each RaftNode uses its Transport object to communicate with the other RaftNodes. It also uses RaftStore to persist Raft log entries and snapshots to stable storage. Last, once a log entry is committed, i.e, it is successfully replicated to the majority of the Raft group, its operation is passed to StateMachine for execution, and output of the execution is returned to the client by the leader RaftNode.

Architectural overview of a Raft group


What is next?

In the next section, we will build an atomic register on top of MicroRaft.