Main Abstractions
MicroRaft's main abstractions are listed below.
RaftConfig
RaftConfig
contains configuration options related to the
Raft consensus algorithm and MicroRaft's implementation. Please check the
Configuration section for details.
RaftNode
RaftNode
runs the Raft consensus algorithm as a member of
a Raft group. A Raft group is a cluster of RaftNode
instances that behave as a
replicated state machine. RaftNode
contains APIs for replicating operations,
performing queries, applying membership changes in the Raft group, handling Raft
RPCs and responses, etc.
Raft nodes are identified by [group id, node id]
pairs. Multiple Raft groups
can run in the same environment, distributed or even in a single JVM process,
and they can be discriminated from each other with unique group ids. A single
JVM process can run multiple Raft nodes that belong to different Raft groups or
even the same Raft group.
RaftNode
s execute the Raft consensus algorithm with the Actor
model. In this model, each RaftNode
runs in a single-threaded manner. It
uses a RaftNodeExecutor
to sequentially handle API calls and RaftMessage
objects sent by other RaftNode
s. The communication between RaftNode
s are
implemented with the message-passing approach and abstracted away with the
Transport
interface.
In addition to these interfaces that abstract away task execution and
networking, RaftNode
s use StateMachine
objects to execute queries and
committed operations, and RaftStore
objects to persist internal Raft state to
stable storage to be able to recover from crashes.
All of these abstractions are explained below.
RaftEndpoint
RaftEndpoint
represents an endpoint that participates to
at least one Raft group and executes the Raft consensus algorithm with a
RaftNode
.
RaftNode
differentiates members of a Raft group with a unique id. MicroRaft
users need to provide a unique id for each RaftEndpoint
. Other than this
information, RaftEndpoint
implementations can contain custom fields, such as
network addresses and tags, to be utilized by Transport
implementations.
RaftRole
RaftRole
denotes the roles of RaftNode
s as specified in
the Raft consensus algorithm. Currently, MicroRaft implements the main roles
defined in the paper: LEADER
, CANDIDATE
, and FOLLOWER
. Moreover, it also
implements an extension adding non-voting members: LEARNER
. The WITNESS
role
is not implemented yet, but it's on the roadmap.
RaftNodeStatus
RaftNodeStatus
denotes the statuses of a RaftNode
during
its own and its Raft group's lifecycle. A RaftNode
is in the INITIAL
status
when it is created, and moves to the ACTIVE
status when it is started with a
RaftNode.start()
call. It stays in this status until either a membership
change is triggered in the Raft group, or either the Raft group or Raft node is
terminated.
StateMachine
StateMachine
enables users to implement arbitrary
services, such as an atomic register or a key-value store, and execute
operations on them. Currently, MicroRaft supports memory-based state machines
with datasets in the gigabytes.
RaftNode
does not deal with the actual logic of committed operations. Once a
given operation is committed with the Raft consensus algorithm, i.e., it is
replicated to the majority of the Raft group, the operation is passed to the
provided StateMachine
implementation. It is the StateMachine
implementation's responsibility to ensure deterministic execution of committed
operations.
Since RaftNodeExecutor
ensures the thread-safe execution of the tasks
submitted by RaftNode
s, which include actual execution of committed
user-supplied operations, StateMachine
implementations do not need to be
thread-safe.
RaftModel
and RaftModelFactory
RaftModel
is the base interface for the objects that hit
network and disk. There are 2 other interfaces extending this interface: BaseLogEntry
and RaftMessage
. BaseLogEntry
is used for representing log
and snapshot entries stored in the Raft log. RaftMessage
is used for Raft RPCs
and their responses. Please see the interfaces inside io.microraft.model
for more details. In addition, there is
a RaftModelFactory
interface for creating RaftModel
objects with the builder pattern.
MicroRaft comes with a default POJO-style implementation of these interfaces
available under the io.microraft.model.impl
package. Users of MicroRaft can
define serialization & deserialization strategies for the default implementation
objects, or implement the RaftModel
interfaces with a serialization framework,
such as Protocol Buffers.
Transport
Transport
is used for communicating Raft nodes with each
other. Transport
implementations must be able to serialize RaftMessage
objects created by RaftModelFactory
. MicroRaft requires a minimum set of
functionality for networking. There are only two methods in this interface, one
for sending a RaftMessage
to a RaftEndpoint
and another one for checking reachability of
a RaftEndpoint
.
RaftStore
and RestoredRaftState
RaftStore
is used for persisting the internal state of the
Raft consensus algorithm. Its implementations must provide the durability
guarantees defined in the interface.
If a RaftNode
crashes, its persisted state could be read back from stable
storage into a RestoredRaftState
object and the RaftNode
could be
restored back. RestoredRaftState
contains all the necessary information to
recover RaftNode
instances from crashes.
RaftStore
does not
persist internal state of StateMachine
implementations. Upon recovery, a
RaftNode
starts with an empty state of the state machine, discovers the
current commit index and re-executes all the operations in the Raft log up to
the commit index to re-populate the state machine.
RaftNodeExecutor
RaftNodeExecutor
is used by RaftNode
to execute the Raft consensus algorithm with the
Actor
model.
A RaftNode
runs by submitting tasks to its RaftNodeExecutor
. All tasks submitted by a RaftNode
must be executed serially, with maintaining the
happens-before relationship, so that the Raft consensus
algorithm and the user-provided state machine logic could be executed without
synchronization.
MicroRaft contains a default implementation, DefaultRaftNodeExecutor
. It internally uses a
single-threaded ScheduledExecutorService
and should be suitable for most of
the use-cases. Users of MicroRaft can provide their own RaftNodeExecutor
implementations if they want to run RaftNode
s in their own threading system according to the
rules defined by MicroRaft.
RaftException
RaftException
is the base class for Raft-related
exceptions. MicroRaft defines a number of custom exceptions to report some certain failure scenarios
to clients.
How to run a Raft group
In order to run a Raft group (i.e., Raft cluster) using MicroRaft, we need to:
-
implement the
RaftEndpoint
interface to identifyRaftNode
s we are going to run, -
provide a
StateMachine
implementation for the actual state machine logic (key-value store, atomic register, etc.), -
(optional) implement the
RaftModel
andRaftModelFactory
interfaces to create Raft RPC request / response objects and Raft log entries, or simply use the default POJO-style implementation of these interfaces available under theio.microraft.model.impl
package, -
provide a
Transport
implementation to realize serialization ofRaftModel
objects and networking, -
(optional) provide a
RaftStore
implementation if we want to restore crashed Raft nodes. We can persist the internal Raft node state to stable storage via theRaftStore
interface and recover from Raft node crashes by restoring persisted Raft state. Otherwise, we could use the already-existingNopRaftStore
utility which makes the internal Raft state volatile and disables crash-recovery. If we don't implement persistence, crashed Raft nodes cannot be restarted and need to be removed from the Raft group to not to damage availability. Note that the lack of persistence limits the overall fault tolerance capabilities of Raft groups. Please refer to the Resiliency and Fault Tolerance section to learn more about MicroRaft's fault tolerance capabilities. -
build a discovery and RPC mechanism to replicate and commit operations on a Raft group. This is required because simplicity is the primary concern for MicroRaft's design philosophy. MicroRaft offers a minimum API set to cover the fundamental functionality and enables its users to implement higher-level abstractions, such as an RPC system with request routing, retries, and deduplication. For instance, MicroRaft neither broadcasts the Raft group members to any external discovery system, nor integrates with any observability tool, but it exposes all necessary information, such as Raft group members, leader Raft endpoint, term, commit index, via
RaftNodeReport
, which can be accessed viaRaftNode.getReport()
andRaftNodeReportListener
. We can use that information to feed our discovery services and monitoring tools. Similarly, the public APIs onRaftNode
do not employ request routing or retry mechanisms. For instance, if a client tries to replicate an operation via a follower or a candidateRaftNode
,RaftNode
responds back with aNotLeaderException
, instead of internally forwarding the operation to the leaderRaftNode
.NotLeaderException
containsRaftEndpoint
of the current leaderRaftNode
so that clients can send their requests to it.
In the next section, we will build an atomic register on top of MicroRaft to demonstrate how to implement and use MicroRaft's main abstractions.
Architectural overview of a Raft group
The following figure depicts an architectural overview of a Raft group based on
the main abstractions explained above. Clients talk to the leader RaftNode
for
replicating operations. They can talk to both the leader and follower
RaftNode
s for running queries with different consistency guarantees.
RaftNode
uses RaftModelFactory
to create Raft log entries, snapshot entries,
and Raft RPC request and response objects. Each RaftNode
uses its Transport
object to communicate with the other RaftNode
s. It also uses RaftStore
to
persist Raft log entries and snapshots to stable storage. Last, once a log entry
is committed, i.e, it is successfully replicated to the majority of the Raft
group, its operation is passed to StateMachine
for execution, and output of
the execution is returned to the client by the leader RaftNode
.
What is next?
In the next section, we will build an atomic register on top of MicroRaft.