Main Abstractions
MicroRaft's main abstractions are listed below.
RaftConfig
RaftConfig contains configuration options related to the
Raft consensus algorithm and MicroRaft's implementation. Please check the
Configuration section for details.
RaftNode
RaftNode runs the Raft consensus algorithm as a member of
a Raft group. A Raft group is a cluster of RaftNode instances that behave as a
replicated state machine. RaftNode contains APIs for replicating operations,
performing queries, applying membership changes in the Raft group, handling Raft
RPCs and responses, etc.
Raft nodes are identified by [group id, node id] pairs. Multiple Raft groups
can run in the same environment, distributed or even in a single JVM process,
and they can be discriminated from each other with unique group ids. A single
JVM process can run multiple Raft nodes that belong to different Raft groups or
even the same Raft group.
RaftNodes execute the Raft consensus algorithm with the Actor
model. In this model, each RaftNode runs in a single-threaded manner. It
uses a RaftNodeExecutor to sequentially handle API calls and RaftMessage
objects sent by other RaftNodes. The communication between RaftNodes are
implemented with the message-passing approach and abstracted away with the
Transport interface.
In addition to these interfaces that abstract away task execution and
networking, RaftNodes use StateMachine objects to execute queries and
committed operations, and RaftStore objects to persist internal Raft state to
stable storage to be able to recover from crashes.
All of these abstractions are explained below.
RaftEndpoint
RaftEndpoint represents an endpoint that participates to
at least one Raft group and executes the Raft consensus algorithm with a
RaftNode.
RaftNode differentiates members of a Raft group with a unique id. MicroRaft
users need to provide a unique id for each RaftEndpoint. Other than this
information, RaftEndpoint implementations can contain custom fields, such as
network addresses and tags, to be utilized by Transport implementations.
RaftRole
RaftRole denotes the roles of RaftNodes as specified in
the Raft consensus algorithm. Currently, MicroRaft implements the main roles
defined in the paper: LEADER, CANDIDATE, and FOLLOWER. Moreover, it also
implements an extension adding non-voting members: LEARNER. The WITNESS role
is not implemented yet, but it's on the roadmap.
RaftNodeStatus
RaftNodeStatus denotes the statuses of a RaftNode during
its own and its Raft group's lifecycle. A RaftNode is in the INITIAL status
when it is created, and moves to the ACTIVE status when it is started with a
RaftNode.start() call. It stays in this status until either a membership
change is triggered in the Raft group, or either the Raft group or Raft node is
terminated.
StateMachine
StateMachine enables users to implement arbitrary
services, such as an atomic register or a key-value store, and execute
operations on them. Currently, MicroRaft supports memory-based state machines
with datasets in the gigabytes.
RaftNode does not deal with the actual logic of committed operations. Once a
given operation is committed with the Raft consensus algorithm, i.e., it is
replicated to the majority of the Raft group, the operation is passed to the
provided StateMachine implementation. It is the StateMachine
implementation's responsibility to ensure deterministic execution of committed
operations.
Since RaftNodeExecutor ensures the thread-safe execution of the tasks
submitted by RaftNodes, which include actual execution of committed
user-supplied operations, StateMachine implementations do not need to be
thread-safe.
RaftModel and RaftModelFactory
RaftModel is the base interface for the objects that hit
network and disk. There are 2 other interfaces extending this interface: BaseLogEntry and RaftMessage. BaseLogEntry is used for representing log
and snapshot entries stored in the Raft log. RaftMessage is used for Raft RPCs
and their responses. Please see the interfaces inside io.microraft.model for more details. In addition, there is
a RaftModelFactory interface for creating RaftModel
objects with the builder pattern.
MicroRaft comes with a default POJO-style implementation of these interfaces
available under the io.microraft.model.impl package. Users of MicroRaft can
define serialization & deserialization strategies for the default implementation
objects, or implement the RaftModel interfaces with a serialization framework,
such as Protocol Buffers.
Transport
Transport is used for communicating Raft nodes with each
other. Transport implementations must be able to serialize RaftMessage objects created by RaftModelFactory. MicroRaft requires a minimum set of
functionality for networking. There are only two methods in this interface, one
for sending a RaftMessage to a RaftEndpoint and another one for checking reachability of
a RaftEndpoint.
RaftStore and RestoredRaftState
RaftStore is used for persisting the internal state of the
Raft consensus algorithm. Its implementations must provide the durability
guarantees defined in the interface.
If a RaftNode crashes, its persisted state could be read back from stable
storage into a RestoredRaftState object and the RaftNode could be
restored back. RestoredRaftState contains all the necessary information to
recover RaftNode instances from crashes.
RaftStore does not
persist internal state of StateMachine implementations. Upon recovery, a
RaftNode starts with an empty state of the state machine, discovers the
current commit index and re-executes all the operations in the Raft log up to
the commit index to re-populate the state machine.
RaftNodeExecutor
RaftNodeExecutor is used by RaftNode to execute the Raft consensus algorithm with the
Actor
model.
A RaftNode runs by submitting tasks to its RaftNodeExecutor. All tasks submitted by a RaftNode must be executed serially, with maintaining the
happens-before relationship, so that the Raft consensus
algorithm and the user-provided state machine logic could be executed without
synchronization.
MicroRaft contains a default implementation, DefaultRaftNodeExecutor. It internally uses a
single-threaded ScheduledExecutorService and should be suitable for most of
the use-cases. Users of MicroRaft can provide their own RaftNodeExecutor implementations if they want to run RaftNodes in their own threading system according to the
rules defined by MicroRaft.
RaftException
RaftException is the base class for Raft-related
exceptions. MicroRaft defines a number of custom exceptions to report some certain failure scenarios
to clients.
How to run a Raft group
In order to run a Raft group (i.e., Raft cluster) using MicroRaft, we need to:
-
implement the
RaftEndpointinterface to identifyRaftNodes we are going to run, -
provide a
StateMachineimplementation for the actual state machine logic (key-value store, atomic register, etc.), -
(optional) implement the
RaftModelandRaftModelFactoryinterfaces to create Raft RPC request / response objects and Raft log entries, or simply use the default POJO-style implementation of these interfaces available under theio.microraft.model.implpackage, -
provide a
Transportimplementation to realize serialization ofRaftModelobjects and networking, -
(optional) provide a
RaftStoreimplementation if we want to restore crashed Raft nodes. We can persist the internal Raft node state to stable storage via theRaftStoreinterface and recover from Raft node crashes by restoring persisted Raft state. Otherwise, we could use the already-existingNopRaftStoreutility which makes the internal Raft state volatile and disables crash-recovery. If we don't implement persistence, crashed Raft nodes cannot be restarted and need to be removed from the Raft group to not to damage availability. Note that the lack of persistence limits the overall fault tolerance capabilities of Raft groups. Please refer to the Resiliency and Fault Tolerance section to learn more about MicroRaft's fault tolerance capabilities. -
build a discovery and RPC
mechanism to replicate and commit operations on a Raft group. This is required
because simplicity is the primary concern for MicroRaft's design philosophy.
MicroRaft offers a minimum API set to cover the fundamental functionality and
enables its users to implement higher-level abstractions, such as an RPC
system with request routing, retries, and deduplication. For instance,
MicroRaft neither broadcasts the Raft group members to any external discovery
system, nor integrates with any observability tool, but it exposes all
necessary information, such as Raft group members, leader Raft endpoint, term,
commit index, via RaftNodeReport, which can be accessed viaRaftNode.getReport()andRaftNodeReportListener. We can use that information to feed our discovery services and monitoring tools. Similarly, the public APIs onRaftNodedo not employ request routing or retry mechanisms. For instance, if a client tries to replicate an operation via a follower or a candidateRaftNode,RaftNoderesponds back with aNotLeaderException, instead of internally forwarding the operation to the leaderRaftNode.NotLeaderExceptioncontainsRaftEndpointof the current leaderRaftNodeso that clients can send their requests to it.
In the next section, we will build an atomic register on top of MicroRaft to demonstrate how to implement and use MicroRaft's main abstractions.
Architectural overview of a Raft group
The following figure depicts an architectural overview of a Raft group based on
the main abstractions explained above. Clients talk to the leader RaftNode for
replicating operations. They can talk to both the leader and follower
RaftNodes for running queries with different consistency guarantees.
RaftNode uses RaftModelFactory to create Raft log entries, snapshot entries,
and Raft RPC request and response objects. Each RaftNode uses its Transport
object to communicate with the other RaftNodes. It also uses RaftStore to
persist Raft log entries and snapshots to stable storage. Last, once a log entry
is committed, i.e, it is successfully replicated to the majority of the Raft
group, its operation is passed to StateMachine for execution, and output of
the execution is returned to the client by the leader RaftNode.

What is next?
In the next section, we will build an atomic register on top of MicroRaft.