diff --git a/neps/nep-0568.md b/neps/nep-0568.md
index 9018649dd..9f536e9bc 100644
--- a/neps/nep-0568.md
+++ b/neps/nep-0568.md
@@ -63,8 +63,8 @@ post-processing, as long as the chain's view reflects a fully resharded state.
   stateless validation mechanisms.
 * State Sync: Nodes must be able to sync the states of the child shards
   post-resharding.
-* Cross-Shard Traffic: Receipts and buffered receipts sent to the parent shard may
-  need to be reassigned to one of the child shards.
+* Cross-Shard Traffic: Receipts sent to the parent shard may need to be
+  reassigned to one of the child shards.
 * Receipt Handling: Delayed, postponed, buffered, and promise-yield receipts
   must be correctly distributed between the child shards.
 * ShardId Semantics: The shard identifiers will become abstract identifiers
@@ -76,6 +76,8 @@ post-processing, as long as the chain's view reflects a fully resharded state.
 MemTrie is the in-memory representation of the trie that the runtime uses for all trie accesses. This is kept in sync with the Trie representation in state.
+As of today, nodes are not required to have the MemTrie feature enabled, but going forward, with ReshardingV3, all nodes will need to have MemTrie enabled for resharding to succeed.
+
 For the purposes of resharding, we need an efficient way to split the MemTrie into two child tries based on the boundary account. This splitting happens at the epoch boundary when the new epoch is expected to have the two child shards. The set of requirements around MemTrie splitting are:
 * MemTrie splitting needs to be "instant", i.e. happen efficiently within the span of one block. The child tries need to be available for the processing of the next block in the new epoch.
 * MemTrie splitting needs to be compatible with stateless validation, i.e. we need to generate a proof that the memtrie split proposed by the chunk producer is correct.
@@ -83,11 +85,11 @@ For the purposes of resharding, we need an efficient way to split the MemTrie in
 With ReshardingV3 design, there's no protocol change to the structure of MemTries, however the implementation constraints required us to introduce the concept of a Frozen MemTrie. More details are in the [implementation](#state-storage---memtrie-1) section below.
-Based on the requirements above, we came up with an algorithm to efficiently split the parent trie into two child tries. Trie entries can be divided into three categories based on whether the trie keys have an account_id prefix and based on the total number of such trie keys. Splitting of these keys are handled in different ways.
+Based on the requirements above, we came up with an algorithm to efficiently split the parent trie into two child tries. Trie entries can be divided into three categories based on whether the trie keys have an `account_id` prefix and based on the total number of such trie keys. Splitting of these keys is handled in different ways.
 #### TrieKey with AccountID prefix
-This category includes most of the trie keys like `TrieKey::Account`, `TrieKey::ContractCode`, `TrieKey::PostponedReceipt`. For these keys, we can efficiently split the trie based on the boundary account trie key. In the example below, "pass" was the split key, note that we only need to read all the intermediate nodes that form a part of the split key and nothing more. The accessed nodes form a part of the state witness. This limits the size of the witness to effectively O(depth) of trie.
+This category includes most of the trie keys, such as `TrieKey::Account`, `TrieKey::ContractCode`, and `TrieKey::PostponedReceipt`. For these keys, we can efficiently split the trie based on the boundary account trie key. Note that we only need to read the intermediate nodes that form part of the split key. In the example below, if "pass" is the split key, we access all the nodes along the path `root` -> `p` -> `a` -> `s` -> `s`, without touching any of the other intermediate nodes, such as `o` -> `s` -> `t` in the key "post". The accessed nodes form part of the state witness, as those are the only nodes that the validators need in order to verify that the resharding split is correct. This limits the size of the witness to effectively O(depth) of the trie for each trie key in this category.
 ![Splitting Trie diagram](assets/nep-0568/NEP-SplitState.png)
@@ -203,11 +205,16 @@ The solution to this problem was to introduce the concept of Frozen MemTrie (wit
 Along with `FrozenArena`, we also introduce a `HybridArena` which is effectively a base made of `FrozenArena` with a top layer of `STArena` where we support allocating and deallocating new nodes into the MemTrie. Newly allocated nodes can reference/point to nodes in the `FrozenArena`. We use this Hybrid MemTrie as a temporary MemTrie while the flat storage is being constructed in the background.
+While Frozen MemTries are compatible with instant resharding, they come at the cost of memory consumption. Since a frozen MemTrie doesn't support deallocation of memory, it continues to consume as much memory as it did at the time of freezing. If a node tracks only one of the child shards, its Frozen MemTrie would continue to use the same amount of memory as the parent trie. Due to this, Hybrid MemTries are only a temporary solution, and we rebuild the MemTrie for the children once the post-processing step for Flat Storage is completed.
+
+Additionally, a node would have to support 2x the memory footprint of a single trie because, after resharding, we would have two copies of the trie in memory: one from the temporary Hybrid MemTrie in use for block production, and the other from the background MemTrie under construction. Once the background MemTrie is fully constructed and caught up with the latest block, we do an in-place swap of the Hybrid MemTrie with the new child MemTrie and deallocate the memory from the Hybrid MemTrie.
+
 During a resharding event, at the boundary of the epoch, when we need to split the parent shard into the two child shards, we do the following steps:
 1. Freeze the parent MemTrie arena to create a read-only frozen arena that represents a snapshot of the state as of the time of freezing, i.e. after postprocessing last block of epoch. Note that we no longer require the parent MemTrie in runtime going forward.
 2. We cheaply clone the Frozen MemTrie for both the child MemTries to use. Note that this doesn't clone the parent arena memory, but just increases the refcount.
 3. We then create a new MemTrie with HybridArena for each of the children. The base of the MemTrie is the read-only FrozenArena while all new node allocations happens on a dedicated STArena memory pool for each child MemTrie. This is the temporary MemTrie that we use while Flat Storage is being built in the background.
-4. Once the Flat Storage is constructed in the post processing step of resharding, we use that to load a new MemTrie and discard the Hybrid MemTrie.
+4. Once the Flat Storage is constructed in the post-processing step of resharding, we use that to load a new MemTrie and catch up to the latest block.
+5. After the new child MemTrie has caught up to the latest block, we do an in-place swap in Client and discard the Hybrid MemTrie.
 ![Hybrid MemTrie diagram](assets/nep-0568/NEP-HybridMemTrie.png)
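
Reviewer sketch (not part of the patch): steps 1 to 3 of the list above might be modeled roughly as follows. This is a minimal Python illustration under stated assumptions, not the nearcore implementation. The class names `FrozenArena` and `HybridArena` are borrowed from the text, but their fields, the `alloc`/`dealloc`/`get` methods, and the `split_parent` helper are hypothetical. The rebuild from Flat Storage and the in-place swap (steps 4 and 5) are not modeled.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class FrozenArena:
    """Read-only snapshot of the parent's nodes (hypothetical layout)."""
    nodes: tuple

    def get(self, idx):
        return self.nodes[idx]


@dataclass
class HybridArena:
    """Frozen base shared by reference, plus a private mutable top layer."""
    base: FrozenArena
    top: dict = field(default_factory=dict)
    next_id: int = 0

    def __post_init__(self):
        # New node ids start after the frozen range so they never collide.
        self.next_id = len(self.base.nodes)

    def alloc(self, node):
        # Allocation only ever touches this child's own top layer.
        idx = self.next_id
        self.top[idx] = node
        self.next_id += 1
        return idx

    def dealloc(self, idx):
        # The frozen base cannot release memory, which is why it keeps
        # the parent's full footprint until the hybrid trie is discarded.
        if idx < len(self.base.nodes):
            raise ValueError("cannot deallocate a node in the frozen base")
        del self.top[idx]

    def get(self, idx):
        if idx < len(self.base.nodes):
            return self.base.get(idx)
        return self.top[idx]


def split_parent(parent_nodes):
    """Freeze the parent once, then hand the same base to both children."""
    frozen = FrozenArena(tuple(parent_nodes))  # step 1: freeze
    left = HybridArena(frozen)                 # steps 2-3: share the base,
    right = HybridArena(frozen)                # no copy of parent memory
    return left, right
```

In this sketch, both children hold a reference to the same `FrozenArena` object, so "cloning" costs a reference rather than a copy. Nodes allocated through one child are invisible to the other, and attempting to free a base node is rejected, mirroring the memory trade-off described in the added paragraphs.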