diff --git a/openraft/src/docs/docs.md b/openraft/src/docs/docs.md index a7d1bf442..d1abf483e 100644 --- a/openraft/src/docs/docs.md +++ b/openraft/src/docs/docs.md @@ -27,8 +27,9 @@ To learn about the data structures used in Openraft and the commit protocol, see - [`Effective membership`](`crate::docs::data::effective_membership`) explains when membership config takes effect; - [`protocol`](crate::docs::protocol) : - [`replication`](`crate::docs::protocol::replication`); - - [`leader_lease`](`crate::docs::protocol::replication::leader_lease`); - - [`log_replication`](`crate::docs::protocol::replication::log_replication`); + - [`leader_lease`](`crate::docs::protocol::replication::leader_lease`) outlines the leader validity criteria for Leaders and Followers; + - [`log_replication`](`crate::docs::protocol::replication::log_replication`) provides an overview of the general replication protocol; + - [`log_stream`](`crate::docs::protocol::replication::log_stream`) discusses the core aspects of streaming log replication; - [`snapshot_replication`](`crate::docs::protocol::replication::snapshot_replication`); Contributors who want to understand the internals of Openraft can find relevant information in diff --git a/openraft/src/docs/protocol/log_stream.md b/openraft/src/docs/protocol/log_stream.md new file mode 100644 index 000000000..42f3c324f --- /dev/null +++ b/openraft/src/docs/protocol/log_stream.md @@ -0,0 +1,81 @@ +# Ensuring Consecutive Raft Log Replication + +To stream log entries to a remote node, +log entries are read in more than one local IO operations. +There are issues to address to provide consecutivity on the remote end. + +In a stream log replication protocol, Raft logs are read through multiple local +IO operations. +To ensure the integrity of log replication to remote nodes, addressing potential +disruptions in log entry sequence is crucial. + + +## Problem Description + +In a Raft-based system, there is no guarantee that sequential read-log +operations, such as `read log[i]` and `read log[i+1]`, will yield consecutive +log entries. The reason for this is that Raft logs can be truncated, which +disrupts the continuity of the log sequence. Consider the following scenario: + +- Node-1, as a leader, reads `log[4]`. +- Subsequently, Node-1 receives a higher term vote and demotes itself. +- In the interim, a new leader emerges and truncates the logs on Node-1. +- Node-1 regains leadership, appends new logs, and attempts to read `log[5]`. + +Now, the read operation for `log[5]` becomes invalid because the log at index 6, +term 5 (`6-5`) is no longer a direct successor to the log at index 4, term 4 +(`4-4`). This violates Raft's log continuity requirement for replication. + +To illustrate: + +```text +i-j: Raft log at term i and index j + +Node-1: | vote(term=4) + | 2-3 2-4 + ^ + '--- Read log[4] + +Node-1: | vote(term=5) // Higher term vote received + | 2-3 2-4 + +Node-1: | vote(term=5) // Logs are truncated, 4-4 is replicated + | 2-3 4-4 + ^ + '--- Truncation and replication by another leader + +Node-1: | vote(term=6) // Node-1 becomes leader again + | 2-3 4-4 6-5 // Append noop log + +Node-1: | vote(term=6) + | 2-3 4-4 6-5 + ^ + '--- Attempt to read log[5] +``` + + +## Solution + +To ensure consecutive log entries for safe replication, it is necessary to +verify that the term has not changed between log reads. The following updated +operations are proposed: + +1. `read vote` (returns `v1`) +2. `read log[i]` +3. `read log[i+1]` +4. `read vote` (returns `v2`) + +If `v1` is equal to `v2`, we can confidently say that `log[i]` and `log[i+1]` +are consecutive and the replication is safe. Therefore, `ReplicationCore` must +check the term (vote) as part of its replication process to ensure log +consecutivity. + +## Conclusion + +The current release(upto 0.9) mitigates this issue by immediately halting communication +with `ReplicationCore` to prevent any new replication commands from being +processed. + +However, future iterations of `ReplicationCore` may operate more proactively. +To maintain the integrity of replication, `ReplicationCore` +must ensure the consecutivity in the above method. diff --git a/openraft/src/docs/protocol/mod.rs b/openraft/src/docs/protocol/mod.rs index 9b88cb02e..964d4252a 100644 --- a/openraft/src/docs/protocol/mod.rs +++ b/openraft/src/docs/protocol/mod.rs @@ -15,6 +15,10 @@ pub mod replication { #![doc = include_str!("log_replication.md")] } + pub mod log_stream { + #![doc = include_str!("log_stream.md")] + } + pub mod snapshot_replication { #![doc = include_str!("snapshot_replication.md")] } diff --git a/openraft/src/storage/mod.rs b/openraft/src/storage/mod.rs index dbb10c6b8..fd3ffa7a4 100644 --- a/openraft/src/storage/mod.rs +++ b/openraft/src/storage/mod.rs @@ -122,7 +122,10 @@ pub struct LogState { /// A trait defining the interface for a Raft log subsystem. /// -/// This interface is accessed read-only from replica streams. +/// This interface is accessed read-only by replication sub task: `ReplicationCore`. +/// +/// A log reader must also be able to read the last saved vote by [`RaftLogStorage::save_vote`], +/// See: [log-stream](`crate::docs::protocol::replication::log_stream`). /// /// Typically, the log reader implementation as such will be hidden behind an `Arc` and /// this interface implemented on the `Arc`. It can be co-implemented with [`RaftLogStorage`] @@ -133,16 +136,23 @@ where C: RaftTypeConfig { /// Get a series of log entries from storage. /// - /// The start value is inclusive in the search and the stop value is non-inclusive: `[start, - /// stop)`. + /// ### Correctness requirements + /// + /// - The absence of an entry is tolerated only at the beginning or end of the range. Missing + /// entries within the range (i.e., holes) are not permitted and should result in a + /// `StorageError`. /// - /// Entry that is not found is allowed. + /// - The read operation must be transactional. That is, it should not reflect any state changes + /// that occur after the read operation has commenced. async fn try_get_log_entries + Clone + Debug + OptionalSend>( &mut self, range: RB, ) -> Result, StorageError>; /// Return the last saved vote by [`RaftLogStorage::save_vote`]. + /// + /// A log reader must also be able to read the last saved vote by [`RaftLogStorage::save_vote`], + /// See: [log-stream](`crate::docs::protocol::replication::log_stream`) async fn read_vote(&mut self) -> Result>, StorageError>; }