feat: lack of recovery methods in case of message consumption failure #94

adu-web3 · 2024-09-11T06:53:48Z

Description

In the worst case, a message may fail to be consumed (for example, a revert caused by a completely unexpected error). This would halt the protocol, but we lack recovery functionality, such as force-consuming a message, to recover from such a situation.

adu-web3 · 2025-01-24T12:50:17Z

After we ran into the nonce mismatch issue and @MaxMustermann2 fixed it using the governance role (upgrading the contract to manually write the nonce to the expected value), we realized that the current way of handling messages makes recovery very difficult.

Here are some facts:

A nonce mismatch is a protocol exception (unexpected behavior) that should not happen, so the message will definitely revert during lzReceive execution, and all following messages from the same source chain will be blocked.
Critical problems like a nonce mismatch are certainly not recoverable through normal contract interactions.

So @MaxMustermann2 asked why we maintain the nonce inside the gateway contract (the app contract in LayerZero's context), given that the LayerZero endpoint already maintains the nonce itself. The answer is that we have to maintain the nonce and check it against the expected nonce if we want ordered execution of messages (https://docs.layerzero.network/v2/developers/evm/oapp/message-design-patterns#ordered-delivery). Then @MaxMustermann2 realized that if any revert happens during the execution of a message, the contract gets stuck (blocked). This is true, but there are pros and cons:

pros:

Simple and good for the protocol's soundness, since a withdraw message cannot be executed before a deposit message as long as they are initiated in the correct order on the source chain.
No need to worry about a message going missing, since any missing message blocks the following workflow until we feed in the expected message.

cons:

Any revert during the execution of a message blocks the following workflow, and it is often difficult to recover from this situation.
Any missing message blocks the following workflow.
The throughput of cross-chain messaging might be limited.
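
To make the blocking behavior concrete, here is a minimal Python model of ordered execution. It is only a sketch, not the gateway contract: `OrderedReceiver`, `lz_receive`, `last_consumed_nonce` and the `"bad"` payload are hypothetical names chosen for illustration, and a Python exception stands in for an on-chain revert.

```python
class NonceMismatch(Exception):
    """Stands in for a revert caused by an unexpected nonce."""


class OrderedReceiver:
    """Toy model of an app contract that enforces ordered execution."""

    def __init__(self):
        self.last_consumed_nonce = 0  # assumption: nonces start at 1
        self.executed = []

    def lz_receive(self, nonce, payload):
        expected = self.last_consumed_nonce + 1
        if nonce != expected:
            raise NonceMismatch(f"expected {expected}, got {nonce}")
        # If execute() raises, the nonce below is never advanced,
        # which mirrors a reverted transaction.
        self.execute(payload)
        self.last_consumed_nonce = nonce

    def execute(self, payload):
        if payload == "bad":
            raise RuntimeError("unexpected failure during execution")
        self.executed.append(payload)


receiver = OrderedReceiver()
receiver.lz_receive(1, "deposit")

# Message 2 fails during execution, so its nonce is not consumed.
try:
    receiver.lz_receive(2, "bad")
except RuntimeError:
    pass

# Message 3 is valid on its own, but it is blocked behind message 2.
try:
    receiver.lz_receive(3, "withdraw")
    blocked = False
except NonceMismatch:
    blocked = True
```

In this model, every later message keeps failing with a nonce mismatch until nonce 2 is consumed somehow, which is the "stuck" state described above.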

So we can derive some best practices from the pros and cons:

Avoid reverting as far as possible when implementing lzReceive for the app contract, as long as the error is an expected one and poses no danger. For example, a withdrawal might fail for many reasons, such as insufficient balance. This is expected behavior and poses no danger to the protocol as long as the balance is left unchanged, so the ExocoreGateway.lzReceive function should not revert in this case. Instead, it should either return a response telling the source chain that the withdrawal failed, or simply finish execution without changing any state and emit an event indicating that the withdrawal failed (we chose the former approach). In both cases the message's nonce is consumed successfully without any revert.
If the operation carried by a message is retryable, let the message be consumed even if it does not change the state as expected, and use an event to indicate that it was not successful, as in the withdrawal case above and the upgrade logic of the Bootstrap contract.
Only revert on critical errors, like a nonce mismatch. If the message's nonce is smaller than the expected nonce, the message has already been consumed and should never be consumed again; if it is greater than the expected nonce, some earlier messages are missing and we should wait for them. So we should only revert on errors that are exceptional, unexpected and critical, because these errors tell us that something is truly wrong and that we should stop and fix it before letting the protocol continue running.
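
The practices above can be sketched with a small Python model. Again, this is an illustration under stated assumptions rather than the actual ExocoreGateway code: the `Gateway` class, the `WithdrawalResult` event name and the shape of the returned response are all hypothetical. An expected failure (insufficient balance) consumes the nonce and reports failure, while a nonce mismatch raises, modeling a revert.

```python
class NonceMismatch(Exception):
    """Stands in for a revert on a critical, unexpected error."""


class Gateway:
    """Toy model: expected failures consume the nonce instead of reverting."""

    def __init__(self, balances):
        self.balances = dict(balances)
        self.last_consumed_nonce = 0  # assumption: nonces start at 1
        self.events = []

    def lz_receive(self, nonce, user, amount):
        expected = self.last_consumed_nonce + 1
        if nonce != expected:
            # Critical: a replayed or out-of-order message must stop the flow.
            raise NonceMismatch(f"expected {expected}, got {nonce}")
        success = self._withdraw(user, amount)
        # The nonce is consumed whether or not the withdrawal succeeded.
        self.last_consumed_nonce = nonce
        self.events.append(("WithdrawalResult", nonce, success))
        # Response that would be sent back to the source chain.
        return {"nonce": nonce, "success": success}

    def _withdraw(self, user, amount):
        if self.balances.get(user, 0) < amount:
            return False  # expected failure: leave state untouched
        self.balances[user] -= amount
        return True


gateway = Gateway({"alice": 10})
failed = gateway.lz_receive(1, "alice", 50)  # insufficient balance, no revert
ok = gateway.lz_receive(2, "alice", 4)       # next message is not blocked
```

The failed withdrawal leaves the balance unchanged and still advances the nonce, so the second message is processed normally. Replaying nonce 1 afterwards would raise `NonceMismatch`.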

Besides, not all reverts during message execution are unrecoverable. There are plenty of transient errors that also result in a revert, such as an insufficient gas limit, insufficient balance for the LayerZero relayer, or the gateway contract being paused for some reason. These are typically not logic errors and can be resolved: e.g. the relayer can provide a higher gas limit, or we can unpause the contract to start receiving messages again. So when we talk about reverts here, we mostly mean reverts that are caused by business logic and cannot be resolved easily.

If we follow these best practices, reverting only on exceptional and critical signals and never on expected and retryable cases, we can make sure that reverts do not happen frequently, and we can rely on reverts to identify system bugs. But even when a revert is caused by an exceptional bug, recovery can be quite cumbersome, especially with multi-sig governance. So for the worst case, where a revert happens and it is difficult to recover the contract from the bad state, we should have privileged functions that let the contract governor force-consume the message (more specifically, its nonce) to recover the contract from being stuck.
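
One possible shape for such a privileged function, sketched as a Python model. Nothing here is an existing API: `RecoverableReceiver`, `force_consume`, the `MessageForceConsumed` event and the `"poison"` payload are hypothetical, and a simple caller check stands in for the governance role. The sketch assumes the governor may only skip the next expected nonce, so the function cannot be used to jump over arbitrary messages.

```python
class Unauthorized(Exception):
    """Caller is not the governor."""


class NonceMismatch(Exception):
    """Stands in for a revert caused by an unexpected nonce."""


class RecoverableReceiver:
    """Toy model of an ordered receiver with a governor-only escape hatch."""

    def __init__(self, governor):
        self.governor = governor
        self.last_consumed_nonce = 0  # assumption: nonces start at 1
        self.events = []

    def lz_receive(self, nonce, payload):
        self._check_nonce(nonce)
        if payload == "poison":
            # Models an unexpected bug that makes this message always revert.
            raise RuntimeError("unexpected failure during execution")
        self.last_consumed_nonce = nonce

    def force_consume(self, caller, nonce):
        """Consume a stuck nonce without executing its message."""
        if caller != self.governor:
            raise Unauthorized(caller)
        self._check_nonce(nonce)  # only the next expected nonce can be skipped
        self.last_consumed_nonce = nonce
        self.events.append(("MessageForceConsumed", nonce))

    def _check_nonce(self, nonce):
        expected = self.last_consumed_nonce + 1
        if nonce != expected:
            raise NonceMismatch(f"expected {expected}, got {nonce}")


receiver = RecoverableReceiver(governor="multisig")
receiver.lz_receive(1, "ok")
try:
    receiver.lz_receive(2, "poison")  # stuck: nonce 2 can never be consumed
except RuntimeError:
    pass
receiver.force_consume("multisig", 2)  # governor skips the stuck message
receiver.lz_receive(3, "ok")           # processing resumes
```

Emitting an event on every forced consumption keeps an auditable record of which messages were skipped, so their effects can be reconciled manually afterwards.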
