RFC: Consensus Protocol for the proposed Network Architecture #40

shash256 · 2023-07-10T13:15:03Z

Goal: Propose a consensus protocol for the decentralized XMTP network that extends the conversation data layer through Merkle CRDT replication. The consensus protocol should ensure liveness, consistency, and overall reliability in the network by incorporating economic and non-economic incentives for the nodes to carry out their responsibilities.

Purpose: This post outlines detailed specifications of the consensus protocol for XMTP, and builds upon the Delegated Proof-of-Stake consensus mechanism idea introduced in the Network Architecture proposal.

Pre-requisite: The Consensus Protocol assumes that topic replication (message gossip) is already handled at the Data Layer.

Responsibilities of Nodes

The consensus mechanism and DPOS rewards are designed to incentivize nodes to carry out their responsibilities as stated below.

Active participation in replication.
Reliable relaying of messages from connected clients.
Validation of messages.
Maintaining consistent connectivity, avoiding prolonged periods of idleness or disconnection.
Honest construction of blocks and voting.
Abstain from network attacks such as takeovers.
Storing data for the designated retention period.
Managing network congestion and recovering from continuous consensus failures.

Terminology

Cut-off Time: The designated time when full nodes participating in consensus and seeking staking rewards are expected to synchronize their Merkle DAG state for consensus purposes.
Consensus Time: The time when the consensus nodes initiate the consensus process, selecting the validators for the block cycle and a block proposer. The consensus time follows the cut-off time by a fixed interval, measured in blocks of the external L1/L2 where XMTP state attestation happens.
Block Finalization Time: The time when a block is considered finalized.
Block Time: The time difference between the cut-off times of two consecutive blocks.
Block: A block consists of {topic identifier, DAG Merkle head} pairs for all active topics during the block time.
Validators: A predefined number of full nodes selected for each consensus cycle based on staked+delegated amounts, with some chosen randomly.
Consensus Nodes (CNs): Full nodes that actively participate in consensus by performing the required tasks outlined by the consensus protocol. They prepare themselves to be selected as validators to vote and earn block rewards.

Protocol

DAG Snapshot and Synchronization

At cut-off time, consensus nodes (CNs) prepare for consensus by creating a temporary snapshot of the Merkle DAG for each topic (aka “consensus-DAG”). The block time, a fixed interval, determines when this snapshot occurs across all nodes. This is measured as every Nth block of Ethereum or an L2 where the state attestation of XMTP happens. Relying on an external blockchain gives consistent time across the XMTP network.

To facilitate a clear explanation, let's assume the block time of XMTP is every 50th Ethereum block (roughly 10 minutes) since the genesis. Consequently, nodes create a consensus-DAG copy every 50th block. This synchronization ensures consistency among nodes and serves as a checkpoint to achieve finality.
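The cut-off rule in this example can be sketched as follows. This is a minimal illustration, not a specification: the names `BLOCK_INTERVAL`, `GENESIS_HEIGHT`, `is_cutoff`, and `xmtp_block_number` are hypothetical, and the genesis height of 0 is a placeholder.

```python
BLOCK_INTERVAL = 50  # N: external blocks per XMTP block (example value from the text)
GENESIS_HEIGHT = 0   # hypothetical external block height at XMTP genesis


def is_cutoff(external_height, genesis=GENESIS_HEIGHT, interval=BLOCK_INTERVAL):
    """Return True when a consensus-DAG snapshot is due at this external height."""
    offset = external_height - genesis
    return offset > 0 and offset % interval == 0


def xmtp_block_number(external_height, genesis=GENESIS_HEIGHT, interval=BLOCK_INTERVAL):
    """Return the number of XMTP cut-offs that have passed at this external height."""
    return (external_height - genesis) // interval
```

Because every node reads the same external chain, every node evaluates `is_cutoff` to the same answer for a given external block, which is what gives the network a shared clock.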

It's important to note that the DAG synchronization process varies across nodes due to factors like network topology. The synchronization allows nodes to agree on a consistent state while message sending, replication, and relaying continue unhindered—new messages are replicated for the upcoming block without interruption.

Following the consensus-DAG synchronization, nodes will also independently perform live-DAG sync. This process is separate from consensus-DAG synchronization and allows nodes to provide optimal service to clients and users. It's worth mentioning that further engineering optimizations will be explored to efficiently maintain and synchronize these two DAGs.

Block Construction & Validator Set

After the consensus-DAG synchronization, the CNs proceed to construct the block, and make it accessible to the network through an API endpoint.

For instance, the constructed block can be accessed through the URL format: "URL/2345/constructed_block", where "2345" represents the current block number, universally known across the network.
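A constructed block and its endpoint path might look like the sketch below. The JSON field names, the SHA-256 digest, and the helper names are assumptions made for illustration; the proposal only fixes the {topic identifier, DAG Merkle head} pairs and the URL shape.

```python
import hashlib
import json


def construct_block(block_number, topic_heads):
    """Build a block from a {topic identifier: DAG Merkle head} mapping.

    Topics are sorted so that nodes holding the same state produce the
    same block regardless of the order in which they saw the topics.
    """
    pairs = sorted(topic_heads.items())
    return {
        "number": block_number,
        "topics": [{"topic": t, "head": h} for t, h in pairs],
    }


def block_digest(block):
    """Hash a canonical JSON encoding of the block (assumed encoding)."""
    canonical = json.dumps(block, sort_keys=True, separators=(",", ":")).encode()
    return hashlib.sha256(canonical).hexdigest()


def constructed_block_url(base_url, block_number):
    """Return the endpoint path in the "URL/<block number>/constructed_block" format."""
    return f"{base_url}/{block_number}/constructed_block"
```

A validator could then compare `block_digest` of its own block with that of the leader's block instead of comparing every pair.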

As the consensus time approaches, that is, at external block N+Δ where N is the block at cut-off time and Δ is a fixed delta, the CNs run a validator set selection algorithm. They query the staking smart contract for up-to-date stake details and select the top-staked nodes plus a random selection of eligible CNs (irrespective of stake) that meet the minimum staking requirement. This requires uniform randomness across the network, which is obtained from the block hash of external block N+Δ. Another round of deterministic random selection among the validator set designates one node as the block proposer (leader). Participation in consensus is not mandatory for full nodes: nodes that have not completed synchronization by the consensus time can choose not to participate in that particular round. To ensure that an active leader exists, the validators check the selected leader's constructed-block API endpoint. If it is not active, an alternate node is chosen as the leader.
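One way to make the selection deterministic across nodes is to rank candidates by a hash of the shared seed and the node identifier, as in this sketch. The ranking function, the `top_k`/`random_k` parameters, and the rule of falling back to the next node in the leader order are assumptions; the proposal does not fix these details.

```python
import hashlib


def _score(seed, label, node_id):
    """Deterministic pseudo-random rank derived from the shared seed."""
    return hashlib.sha256(seed + label + node_id.encode()).digest()


def select_validators(stakes, seed, top_k, random_k, min_stake):
    """Pick the top_k staked nodes plus random_k other eligible nodes.

    stakes maps node id -> staked+delegated amount; seed is the hash of
    external block N+delta, so every node computes the same set.
    """
    eligible = {n: s for n, s in stakes.items() if s >= min_stake}
    by_stake = sorted(eligible, key=lambda n: (-eligible[n], n))
    top = by_stake[:top_k]
    rest = sorted(by_stake[top_k:], key=lambda n: _score(seed, b"validator", n))
    return top + rest[:random_k]


def leader_order(validators, seed):
    """Order validators by leader priority for this round."""
    return sorted(validators, key=lambda n: _score(seed, b"leader", n))


def pick_leader(validators, seed, is_active):
    """Return the first validator in leader order whose endpoint is active."""
    for node in leader_order(validators, seed):
        if is_active(node):
            return node
    return None
```

Here `is_active` stands in for the check against the leader's constructed-block endpoint. Note that an external block hash can be influenced by the external block producer, so a production design may need a stronger randomness source.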

The inclusion of a small number of random nodes in each validation cycle incentivizes all full nodes to actively participate in consensus, regardless of their stake. This encourages frequent synchronization in the hope of being selected as a leader or being able to vote when selected as a validator. Additionally, better performance metrics during consensus allow nodes with lower stake to demonstrate their capabilities to delegators, attracting more delegation and increasing the chance of becoming a top-staked validator and earning higher commission from delegation. The total number of validators will be determined based on the metrics from initial implementation and will be a governance parameter for future updates.

Voting & Commit-Reveal

Validators access the constructed block endpoint of the chosen leader node. If the constructed block matches their own block, they vote "yes" and make their vote available through the vote endpoint, following a commit-reveal scheme. A commit-reveal scheme is required to prevent nodes from simply voting the same as the majority they observe from the public endpoints of other validators.

The commit-reveal scheme hides voting information until it needs to be revealed on-chain. In the commit phase, the vote is hashed together with a random salt, and the hash is stored on-chain (or at the Oracle) as an attestation of the original data. In the reveal phase, the original data is disclosed on-chain along with the random salt, so anyone can check it against the commitment. Unless there is collusion, the committed vote remains confidential until the designated disclosure time. Validators that do not reveal their vote are considered to have abstained and are not eligible for rewards.
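A minimal sketch of the commit and reveal steps, assuming SHA-256 and a fixed-length 32-byte salt (the proposal does not specify the hash function or the salt size):

```python
import hashlib
import hmac
import secrets

SALT_BYTES = 32  # assumed fixed salt length, so salt + vote is unambiguous


def new_salt():
    """Generate a fresh random salt for one vote."""
    return secrets.token_bytes(SALT_BYTES)


def commit(vote, salt):
    """Commit phase: publish only this hash of the salt and the vote."""
    return hashlib.sha256(salt + vote.encode()).hexdigest()


def verify_reveal(commitment, vote, salt):
    """Reveal phase: check the disclosed vote and salt against the commitment."""
    return hmac.compare_digest(commitment, commit(vote, salt))
```

The salt matters because the vote space is tiny ("yes" or "no"): without it, anyone could hash both options and read the vote straight from the commitment.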

In the event of an inconsistency, CNs identify the topic(s) with the DAG head inconsistency or determine which topics are missing or in excess. They then proceed to sync again with different nodes on those specific topic(s) to rectify the issue. If a CN discovers that its own copy of the data is at fault during synchronization, it updates its copy accordingly.
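The comparison described above can be sketched as a diff over two {topic identifier: DAG Merkle head} mappings. The function name and the result keys are hypothetical.

```python
def diff_blocks(mine, leader):
    """Compare two {topic: head} mappings and list the topics to re-sync.

    mismatched: topics present in both blocks with different DAG heads
    missing:    topics the leader has that this node lacks
    excess:     topics this node has that the leader lacks
    """
    return {
        "mismatched": sorted(t for t in mine.keys() & leader.keys() if mine[t] != leader[t]),
        "missing": sorted(leader.keys() - mine.keys()),
        "excess": sorted(mine.keys() - leader.keys()),
    }
```

A CN would then re-sync only the listed topics with other peers rather than repeating a full synchronization.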

This approach allows CNs to identify misbehaving nodes. They can choose to block such nodes instantly, or when they observe repeated faulty data, ensuring consensus integrity. If, even after reasonable synchronization, a validator verifies that its own copy of the data is correct, it can infer that the leader is at fault and vote "no", accordingly.

Vote Aggregation

The staking smart contract collects aggregated votes from all validator node endpoints, either via an Oracle or by nodes submitting their inputs to the smart contract directly through transactions. The smart contract then verifies whether a supermajority of "yes" votes, defined as at least two-thirds (about 66.67%), has been reached.
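The threshold check is best done in integer arithmetic, which avoids rounding around 66.67% and matches how a smart contract would typically compute it. This sketch assumes votes are counted per validator and that the denominator is the full validator set, including validators that abstained; the proposal does not state either point explicitly.

```python
def has_supermajority(yes_votes, validator_count):
    """Return True when yes_votes is at least two-thirds of validator_count."""
    if validator_count <= 0:
        raise ValueError("validator_count must be positive")
    # 3 * yes >= 2 * total is equivalent to yes / total >= 2/3 without floats
    return 3 * yes_votes >= 2 * validator_count
```

For example, with 21 validators the threshold is 14 "yes" votes: 14 passes and 13 does not.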

If a supermajority is not reached (an inconclusive vote), the staking smart contract penalizes the leader and initiates another round of consensus for the same block, with a different leader and a different random selection of nodes. The penalty deters faulty or malicious leaders from significantly draining network resources. This approach lets CNs carry out honest and efficient work in block construction and synchronization, while still giving them the option to opt out of publishing the constructed block if they encounter any issues.

Nodes that voted differently from the supermajority or those that did not vote at all will not receive any block rewards. This incentivizes all validators to align with the consensus outcome and participate actively in the voting process.

Note - The Oracle approach as an alternative to transaction submission will be explained later in a separate post

Block Finalization

The smart contract proceeds to publish the finalized block. This serves as the single source of truth for that particular block. Nodes that did not participate in the consensus can compare their data against the finalized block and adjust their inconsistent topic syncs accordingly.

The finalized block is typically published using Ethereum's (or an L2's) 'Transaction Input Data' if gas costs permit. Alternatively, a combination of Chainlink, IPFS/Arweave can be utilized for persistence, with the corresponding URL of the block provided in the 'Input Data' field (details are outside the scope of this proposal).

Block rewards are distributed based on the stake-weight of all agreeing supermajority nodes. The rewarded nodes then share these rewards with their respective delegators, according to the predetermined distribution mechanism. This process ensures the dissemination of the finalized block, establishes a verifiable source of truth for all nodes, and enables the fair distribution of block rewards to participating nodes and their delegators.
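A stake-weighted split among the agreeing validators could be computed as below. This is a sketch under assumptions: the function name is hypothetical, exact fractions are used to avoid rounding, and the onward split between a node and its delegators is left out because the proposal does not define the distribution mechanism.

```python
from fractions import Fraction


def distribute_rewards(block_reward, stakes, votes, outcome):
    """Split block_reward among validators whose revealed vote matches outcome.

    stakes maps node id -> stake weight; votes maps node id -> revealed vote.
    Validators that voted differently or never revealed receive nothing.
    """
    agreeing = [n for n, v in votes.items() if v == outcome]
    total = sum(stakes[n] for n in agreeing)
    if total == 0:
        return {}
    return {n: Fraction(block_reward * stakes[n], total) for n in agreeing}
```

With stakes of 60, 30, and 10 where only the first two validators voted with the supermajority, a reward of 90 splits as 60 and 30, and the dissenting validator is absent from the result.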

Incentives for nodes to carry out their responsibilities

Active participation in replication
- In order for a node to successfully earn a block reward, the block needs to be finalized and match what they constructed. To achieve this, nodes need to actively participate in replication and ensure that the contents they have reach other nodes in the network.
- Nodes should also perform timely DAG synchronization without any delays so that they can participate in the consensus and potentially earn rewards.
Reliable relaying of messages from connected clients (first node)
- Nodes must forward messages in order to receive message rewards, as messages that do not reach the block will not release rewards from the pool.
- Additionally, holding out on relaying messages will cause the node's block to look different from other nodes, resulting in no block rewards.
- If a user/client notices that their messages are not being confirmed in the network, they may switch to another client, resulting in a loss of business for the connected node.
- The connected node not only loses the fees for messages sent, but also the fees it could have earned when messages are received by that user on that topic.
Validation of messages
- If a node detects a faulty message being forwarded, it must block the forwarding node from sending further messages, either immediately or after repeated offenses. If all nodes block the offender, no one accepts messages from it or sends messages to it. It remains isolated (i.e., it cannot do DAG sync and hence cannot take part in consensus). Clients also abandon this node when they notice their messages are not reaching the network. These risks incentivize nodes to validate messages.
- For cases other than self-validation of a message, nodes will eventually learn of such nodes' faulty behavior when they see that their block differs from the leader's or when they run another DAG sync with other nodes. Even full nodes not participating in consensus will notice discrepancies when they see the finalized block.
Maintaining consistent connectivity, avoiding prolonged periods of idleness or disconnection
- Prolonged lags or disconnections cause nodes to lose reputation with clients/users, who can query the finalized block and notice missing items. Such nodes are also unable to participate in consensus and lose out on block rewards. This incentivizes nodes to stay live.
Honest construction of blocks and voting (for Consensus Nodes)
- A Consensus Node can be selected as a leader at any time, and its constructed block can be accessed by validators. Any node attempting to earn rewards simply by voting yes for all blocks without properly constructing blocks risks penalization when it is selected as a leader. A false vote causes a Consensus Node to lose block rewards. Since a Consensus Node must eventually sync with the finalized block to self-validate its consensus-DAG anyway, it is better off doing so before voting, which lets it earn the reward.
Abstain from network attacks such as takeovers
- Proof of Stake secures the network. An attacker could gain maximum validating power (all of the top 67% of nodes) by buying $XMTP in large quantities and then validate fake blocks. However, this comes at the expense of the network. Any inconsistency observed by the community would likely trigger a drop in the $XMTP price, causing the attacker to lose the value of their stake.
Storing data for the designated retention period
- A node could participate in consensus to earn block rewards while skipping other activities, such as storing messages for the retention period or relaying, at the cost of forgoing messaging fees. Such nodes still earn rewards for their consensus work, so if we want to encourage storage beyond what fees provide, we propose requiring nodes to attach random old block/topic data, similar to Arweave's Proof of Access.
Managing network congestion and recovering from continuous consensus failures
- The dynamic fee ensures that the network can handle congestion and successfully produce blocks even during consensus failures, which can happen when a supermajority cannot be reached because of sync delays arising from congestion.

Enabling verifiability for clients/users:

When publishing the finalized block, the smart contract generates a Merkle tree where the leaves are the Merkle heads of all topics in that block. The block hash is the root of that Merkle tree. If there are multiple heads for a topic DAG, all are included as one leaf.
The block number is treated as universal truth: even clients can derive it from the time a message was sent. After the block is finalized, the client updates the block hash for the corresponding block number in the UI and can also show an identifier for the message state, such as ✔️ for sent, ✔️✔️ for received (read receipts), and ☑️☑️ for finalized and validated.
Clients or end users themselves can request the Merkle DAG of their topic at any time from any of the nodes/explorers, and self-verify what they have. They can also do this for historical states by trimming the DAG to the last message with a certain block number they want to verify and computing the Merkle DAG head for it. They can then verify the Merkle membership of this head in its corresponding block using the block hash and a Merkle Proof.
This provides trust guarantees of correctness for the client and the end user, similar to a blockchain. Malicious nodes can be detected by the clients, and malicious clients can be detected by the users. This avoids malicious clients who try to run a centralized server and lie to their users that they are powered by XMTP.
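The block-hash construction and the client-side membership check described above can be sketched as follows. Several details are assumptions not fixed by the proposal: SHA-256 as the hash, JSON as the leaf encoding, duplicating the last node on odd-sized levels, and the one-byte prefixes that separate leaf hashes from inner hashes. All function names are hypothetical.

```python
import hashlib
import json


def _h(data):
    return hashlib.sha256(data).digest()


def leaf_hash(topic, heads):
    """One leaf per topic; multiple DAG heads are sorted and hashed together."""
    return _h(b"\x00" + json.dumps([topic, sorted(heads)]).encode())


def _levels(leaves):
    """Return every tree level, from the leaves up to the single root."""
    if not leaves:
        raise ValueError("a block needs at least one topic")
    levels = [list(leaves)]
    while len(levels[-1]) > 1:
        cur = levels[-1]
        if len(cur) % 2:
            cur = cur + [cur[-1]]  # duplicate the last node on odd levels
        levels.append([_h(b"\x01" + cur[i] + cur[i + 1]) for i in range(0, len(cur), 2)])
    return levels


def merkle_root(leaves):
    """The block hash: root of the tree over all topic leaves."""
    return _levels(leaves)[-1][0]


def merkle_proof(leaves, index):
    """Sibling hashes from leaf to root, each flagged if it sits on the left."""
    proof = []
    for level in _levels(leaves)[:-1]:
        if len(level) % 2:
            level = level + [level[-1]]
        sibling = index ^ 1
        proof.append((level[sibling], sibling < index))
        index //= 2
    return proof


def verify_proof(leaf, proof, root):
    """Recompute the root from a leaf and its proof, then compare."""
    node = leaf
    for sibling, sibling_is_left in proof:
        node = _h(b"\x01" + (sibling + node if sibling_is_left else node + sibling))
    return node == root
```

A client would compute `leaf_hash` from the DAG head it derived locally, obtain the proof from any node or explorer, and compare the recomputed root with the published block hash, so it does not need to trust the node that served the proof.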

trebor-yatska · 2023-07-12T14:24:35Z

Thanks for this! I have a few questions.

Per the below, we are effectively allowing nodes to be offline and inactive at unpredictable times. Assuming there is not a deep bench of nodes, wouldn't this leave the network susceptible to finality delays? What downstream effects might this cause (e.g., difficulty wrt recognizing potential node client issues vs. swaths of nodes choosing to be inactive)?

It is important to note that participation in consensus is not mandatory for full nodes. Nodes that have not completed synchronization by the consensus time can choose to not participate in that particular round.

I see you have specifically addressed the above concern in the incentives section below, but I'm skeptical it's a strong enough incentive to ensure network availability and liveness. Is a small punitive measure necessary?

Any lags or disconnections for a long time will cause nodes to lose reputation with clients/users, as they can query the finalized block and notice missing items. Additionally, they will be unable to participate in consensus and lose out on block rewards. This ensures the liveness of the nodes.

The below bullets imply there are specific fees for nodes providing RPC services for client apps; is this the intent?

If a user/client notices that their messages are not being confirmed in the network, they may switch to another client, resulting in a loss of business for the connected node.
The connected node not only loses the fees for messages sent, but also the fees it could have earned when messages are received by that user on that topic.

How does the dynamic fee mentioned below work?

The dynamic fee ensures that the network can handle congestion and successfully produce blocks even during consensus failures that can happen due to the inability to reach supermajority because of sync delays arising from congestion.

I'm also curious about the live-DAG sync; how does it allow nodes to provide optimal service?
