
# FOROS: Failover ROS Framework

FOROS is an open source ROS2 framework that provides redundancy for availability-critical nodes. It helps eliminate single points of failure in a system by using the Raft consensus algorithm to organize nodes with the same mission into a cluster. The framework can tolerate a number of fail-stop failures equal to the cluster size minus the quorum.

| Cluster size (N) | Quorum (Q = ⌊N/2⌋ + 1) | Fault-tolerant nodes (N - Q) |
|------------------|------------------------|------------------------------|
| 1                | 1                      | 0                            |
| 2                | 2                      | 0                            |
| 3                | 2                      | 1                            |
| 4                | 3                      | 1                            |
| 5                | 3                      | 2                            |

## Key Features

### Leader Election

This framework makes it easy to organize nodes with the same mission into a cluster. Once a cluster is configured, one active node is automatically elected through consensus-based leader election.

Specifically, a node created with FOROS is in one of three states: Follower, Candidate, or Leader, and starts as a Follower. When no Leader exists, a Follower becomes a Candidate, and the Candidate that receives a majority of votes becomes the Leader. The detailed transitions follow the state machine below.

*(Figure: leader election state machine)*
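
As a rough illustration (not the FOROS implementation), the Raft-style transitions can be sketched as below; the enum and function names here are hypothetical:

```cpp
// Illustrative sketch of Raft-style state transitions (hypothetical types,
// not part of the FOROS public API).
enum class RaftState { kFollower, kCandidate, kLeader };

enum class Event {
  kElectionTimeout,   // no heartbeat from a leader before the timeout
  kMajorityVotes,     // this candidate won a majority of votes
  kLeaderDiscovered,  // a valid leader (or higher term) was observed
};

RaftState on_event(RaftState state, Event event) {
  switch (state) {
    case RaftState::kFollower:
      // A follower that hears no leader starts an election.
      return event == Event::kElectionTimeout ? RaftState::kCandidate : state;
    case RaftState::kCandidate:
      if (event == Event::kMajorityVotes) return RaftState::kLeader;      // won
      if (event == Event::kLeaderDiscovered) return RaftState::kFollower; // lost
      return RaftState::kCandidate;  // split vote: retry the election
    case RaftState::kLeader:
      // A leader steps down if it observes a higher term.
      return event == Event::kLeaderDiscovered ? RaftState::kFollower : state;
  }
  return state;
}
```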

Fortunately, this complex leader-election process is handled entirely within the FOROS framework, and developers only need to consider the Active and Standby states. FOROS also provides a mechanism that filters all ROS2 topics published by Standby nodes and all ROS2 service requests received by Standby nodes.

*(Figure: Active/Standby topic and service filtering)*
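
In code, a node can hook these transitions. A minimal sketch, given a ClusterNode instance `node` (created as shown under How to use below) and assuming callback-registration methods named `register_on_activated` and `register_on_deactivated`; verify the exact names and signatures against the FOROS headers:

```cpp
// Assumed API: Active/Standby transition callbacks (names are assumptions).
node->register_on_activated([]() {
  // This node was elected leader and is now active.
});
node->register_on_deactivated([]() {
  // This node returned to standby; FOROS filters its topics and
  // service requests from here on.
});
```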

### Log Replication

The active node can use the FOROS API to store sequential runtime data in the cluster. This allows the active node to replicate its data across the cluster, so the data can easily be restored even if another node becomes the active node later.

The process of storing data is as follows:

1. When the active node requests to store data, it forwards the request to the other nodes.
2. If a majority of nodes accept the request, the request succeeds; otherwise, it fails. For example, in a 4-node cluster, a commit succeeds once 3 nodes (the quorum) have accepted it.

*(Figure: log replication sequence)*

## How to use

### Node Redundancy

ROS2 applications typically create node instances for messaging. With FOROS, you can use the ClusterNode class instead of the rclcpp::Node class to create nodes that belong to a specific cluster. Unlike rclcpp::Node, which takes a node name as an argument, ClusterNode takes a cluster name, a node ID, and the IDs of all nodes in the cluster.

```cpp
auto node = akit::failover::foros::ClusterNode::make_shared(
    "Test_cluster",                          // Cluster name
    0,                                       // Node ID
    std::initializer_list<uint32_t>{0, 1, 2} // Node IDs in the given cluster
);
```
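
A minimal, runnable sketch that spins such a node is shown below. The header path and the `get_node_base_interface()` accessor are assumptions (lifecycle-style nodes typically expose it); check them against your FOROS release:

```cpp
#include <initializer_list>

#include "akit/failover/foros/cluster_node.hpp"  // assumed header path
#include "rclcpp/rclcpp.hpp"

int main(int argc, char **argv) {
  rclcpp::init(argc, argv);

  // Node 0 of a 3-node cluster; run the same binary with IDs 1 and 2
  // to bring up the remaining replicas.
  auto node = akit::failover::foros::ClusterNode::make_shared(
      "Test_cluster", 0, std::initializer_list<uint32_t>{0, 1, 2});

  // Assumption: ClusterNode exposes its base interface for spinning.
  rclcpp::spin(node->get_node_base_interface());
  rclcpp::shutdown();
  return 0;
}
```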

### Log Replication

FOROS uses LevelDB to manage data internally and provides APIs for committing data, querying data, and registering data-change callbacks.

#### Data Class

Data is managed through a class called Command. Basic usage is as follows.

Example: create 1 byte of data with the value 1 and read it back with the getter

```cpp
auto command = akit::failover::foros::Command::make_shared(
    std::initializer_list<uint8_t>{1});
command->data(); // raw data getter
```

#### Data Commit API

The active node can request that byte-array data be saved using the commit_command API of the ClusterNode class, and it receives the result through a callback function.

Example: request to commit 1 byte of data with the value 1

```cpp
node->commit_command(
    // 1 byte of data with the value 1
    akit::failover::foros::Command::make_shared(
        std::initializer_list<uint8_t>{1}),
    // Response callback; "logger" is an rclcpp::Logger assumed to be in scope
    [&](akit::failover::foros::CommandCommitResponseSharedFuture response_future) {
      auto response = response_future.get();
      if (response->result()) { // on success, print the data ID and contents
        RCLCPP_INFO(logger, "commit completed: %lu %d",
                    response->id(), response->command()->data()[0]);
      }
    });
```

#### Data Query API

Any node can query the data with a specific ID using the get_command API of the ClusterNode class.

Example: query the data with ID 0

```cpp
auto command = node->get_command(0);
```
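
FOROS also supports the data-change callbacks mentioned above. A minimal sketch, assuming registration methods along the lines of `register_on_committed` and `register_on_reverted`; the names, signatures, and `get_logger()` accessor are assumptions to check against the FOROS headers:

```cpp
// Assumed API: data-change callbacks (names/signatures are assumptions).
node->register_on_committed(
    [&](uint64_t id, akit::failover::foros::Command::SharedPtr command) {
      // Called when data with the given ID is committed to the cluster.
      RCLCPP_INFO(node->get_logger(), "committed: %lu %d", id,
                  command->data()[0]);
    });
node->register_on_reverted([&](uint64_t id) {
  // Called when uncommitted data is rolled back.
  RCLCPP_INFO(node->get_logger(), "reverted: %lu", id);
});
```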

## Demo

Let's check leader election and log replication in the environment below through demos.

| Cluster size | Quorum | Fault tolerance | Node IDs       |
|--------------|--------|-----------------|----------------|
| 4            | 3      | 1               | { 0, 1, 2, 3 } |

### Leader Election

Let's check the leader election status by launching and shutting down redundant nodes.

#### Scenario

| Step | Action           | # of running nodes | Result |
|------|------------------|--------------------|--------|
| 1    | Launch node 0    | 1                  |        |
| 2    | Launch node 1    | 2                  |        |
| 3    | Launch node 2    | 3                  | Leader elected (node 1) |
| 4    | Launch node 3    | 4                  |        |
| 5    | Terminate node 1 | 3                  | Leader terminated -> leader re-elected (node 2) |
| 6    | Terminate node 2 | 2                  | Leader terminated -> more nodes down than the fault tolerance allows -> no leader |
| 7    | Launch node 2    | 3                  | Leader elected (node 3) |

#### Demo

*(Animation: leader election demo)*

### Log Replication

Let's set up all redundant nodes to periodically store one byte of data and check the data commit process.

#### Scenario

| Step | Action           | # of running nodes | Result | Data commit enabled |
|------|------------------|--------------------|--------|---------------------|
| 1    | Launch node 0    | 1                  |        | X                   |
| 2    | Launch node 1    | 2                  |        | X                   |
| 3    | Launch node 2    | 3                  | Leader elected (node 1) | O      |
| 4    | Launch node 3    | 4                  |        | O                   |
| 5    | Terminate node 1 | 3                  | Leader terminated -> leader re-elected (node 2) | O |
| 6    | Launch node 1    | 4                  | Node 1 syncs the data it missed while terminated | O |

#### Demo

*(Animation: log replication demo)*