feat: lack of recovery methods in case of message consumption failure #94

Open
adu-web3 opened this issue Sep 11, 2024 · 1 comment

Description

In the worst case, a message could fail to be consumed (for example, a revert caused by a completely unexpected error). This would halt the protocol, but we lack recovery functionality, such as force-consuming a message, to recover from such a situation.

adu-web3 commented Jan 24, 2025

After we ran into the nonce mismatch issue and @MaxMustermann2 fixed it using the governance role (upgrading the contract to manually write the nonce to the expected value), we realized that the current way of handling messages makes recovery very difficult.

Here are some facts:

  1. a nonce mismatch is a protocol exception (unexpected behavior) that should not happen, so the message will definitely revert during lzReceive execution, and all following messages from the same source chain are blocked
  2. critical problems like a nonce mismatch are not recoverable through normal contract interactions.

So @MaxMustermann2 asked why we maintain the nonce inside the gateway contract (the app contract in LayerZero's context), given that the LayerZero endpoint already maintains the nonce itself. The answer is that we have to maintain the nonce and check it against the expected nonce if we want ordered execution of messages (https://docs.layerzero.network/v2/developers/evm/oapp/message-design-patterns#ordered-delivery). Then @MaxMustermann2 realized that if any revert happens during the execution of a message, the contract gets stuck (blocked). This is true, but there are pros and cons:

pros:

  1. simple and good for the protocol's soundness, since a withdraw message cannot be executed before a deposit message as long as they are initiated from the source chain in the correct order
  2. no need to worry about missing messages, since any missed message blocks the following workflow until we feed in the expected message

cons:

  1. any revert during the execution of a message blocks the following workflow, and it is often difficult to recover from this situation
  2. any missing message blocks the following workflow
  3. the throughput of cross-chain messaging might be limited, since messages must be executed one by one in nonce order
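
To make the blocking behavior concrete, here is a minimal Python sketch of ordered execution. It is a toy model, not the gateway code: `OrderedReceiver`, `deliver`, the `Revert` exception and the `"bad"` payload are all hypothetical names made up for illustration, and a raised exception stands in for an EVM revert.

```python
class Revert(Exception):
    """Stands in for an EVM revert: the call fails and no state is changed."""


class OrderedReceiver:
    """Toy model of an app contract that enforces ordered execution."""

    def __init__(self):
        self.inbound_nonce = 0   # nonce of the last consumed message
        self.handled = []        # payloads that executed successfully

    def lz_receive(self, nonce, payload):
        expected = self.inbound_nonce + 1
        if nonce != expected:
            raise Revert(f"unexpected nonce {nonce}, expected {expected}")
        if payload == "bad":
            # Hypothetical handler failure; raised before any state is written.
            raise Revert("handler failed")
        self.inbound_nonce = nonce
        self.handled.append(payload)


def deliver(receiver, messages):
    """Try each (nonce, payload) in order and report what happened."""
    results = []
    for nonce, payload in messages:
        try:
            receiver.lz_receive(nonce, payload)
            results.append((nonce, "consumed"))
        except Revert as err:
            results.append((nonce, f"reverted: {err}"))
    return results
```

Delivering `[(1, "deposit"), (2, "bad"), (3, "withdraw")]` consumes message 1, reverts on message 2, and then also rejects message 3, because the expected nonce is still 2. This is both the soundness guarantee from the pros and the blocking problem from the cons.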

So we can derive some best practices from these pros and cons:

  1. when implementing lzReceive for the app contract, avoid reverting as far as possible, as long as the error is an expected one and does not endanger the protocol. For example, a withdrawal might fail for many reasons, such as insufficient balance. This is expected behavior and is harmless as long as the balance is left unchanged, so inside the ExocoreGateway.lzReceive function we should not revert in this case. Instead we should either return a response telling the source chain that the withdrawal failed, or simply finish the execution without changing any state and emit an event saying that the withdrawal has failed (we chose the former approach). Both options consume the message's nonce without any revert.
  2. if the operation carried by a message is retryable, let the message be consumed even if it does not change the state as expected, and use an event to indicate that it was not successful, as in the withdrawal case above and the upgrade logic of the Bootstrap contract
  3. only revert on critical errors, like a nonce mismatch: if the message's nonce is less than the expected nonce, the message has already been consumed and should never be consumed again; if the message's nonce is greater than the expected nonce, some earlier messages are missing and we should wait for them. So we should only revert on errors that are exceptional, unexpected and critical, because these errors tell us that something is truly wrong and that we should stop and fix it before letting the protocol continue running
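
As a rough illustration of these three practices, here is a small Python sketch. It is a simplified model and not the actual ExocoreGateway logic: the `Gateway` class, its fields, the action names and the shape of the returned response are all assumptions made for the example. An expected failure (insufficient balance) consumes the nonce and reports `success: False`, while a nonce mismatch reverts.

```python
from dataclasses import dataclass, field


class Revert(Exception):
    """Stands in for an EVM revert: the call fails and no state is changed."""


@dataclass
class Gateway:
    """Hypothetical receiver: reverts only on critical errors."""
    balances: dict = field(default_factory=dict)
    inbound_nonce: int = 0
    events: list = field(default_factory=list)

    def lz_receive(self, nonce, action, account, amount):
        expected = self.inbound_nonce + 1
        if nonce != expected:
            # Critical: the message was already consumed or an earlier one is missing.
            raise Revert(f"unexpected nonce {nonce}, expected {expected}")
        if action not in ("deposit", "withdraw"):
            # Assumption: an unknown action is treated as a critical error.
            raise Revert(f"unknown action {action}")

        balance = self.balances.get(account, 0)
        if action == "deposit":
            self.balances[account] = balance + amount
            success = True
        elif balance >= amount:
            self.balances[account] = balance - amount
            success = True
        else:
            # Expected failure: leave the balance untouched, do not revert.
            success = False

        # The nonce is consumed whether or not the operation succeeded.
        self.inbound_nonce = nonce
        self.events.append((action, account, amount, success))
        # The response would be sent back to the source chain.
        return {"nonce": nonce, "success": success}
```

With this model, a withdrawal of 500 against a balance of 100 returns `success: False`, the balance stays at 100, and the next message is still accepted, so the channel is not blocked.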

Besides, not all reverts during message execution are unrecoverable. There are plenty of transient errors that also result in a revert, such as an insufficient gas limit, insufficient balance for the LayerZero relayer, or the gateway contract being paused for some reason. These are typically not logic errors and can be resolved: for example, the relayer could provide a higher gas limit, or we could unpause the contract so that it starts receiving messages again. So when we talk about reverts here, we mostly mean reverts that are caused by business logic and cannot be resolved easily.

If we follow these best practices, reverting only on exceptional and critical signals and never on expected or retryable cases, reverts will be rare and we can rely on them to identify system bugs. But when a revert does happen because of an exceptional bug, recovery can be quite cumbersome, especially with multi-sig governance. So for the worst case, where a revert happens and it is difficult to recover the contract from the bad state, we should have a privileged function that lets the contract governor force-consume the message (more specifically, its nonce) so that the contract is no longer stuck.
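
A possible shape for such a privileged function, again as a Python toy model rather than contract code. `RecoverableReceiver`, `force_consume`, the `governor` field and the `"poison"` payload are hypothetical names, and the rule that only the next expected nonce can be skipped is an assumption of this sketch, not a decided design.

```python
class Revert(Exception):
    """Stands in for an EVM revert: the call fails and no state is changed."""


class RecoverableReceiver:
    """Ordered receiver with a governor-only escape hatch (hypothetical)."""

    def __init__(self, governor):
        self.governor = governor
        self.inbound_nonce = 0   # nonce of the last consumed message
        self.handled = []        # payloads that executed successfully
        self.skipped = []        # nonces consumed by force, kept for auditing

    def lz_receive(self, nonce, payload):
        expected = self.inbound_nonce + 1
        if nonce != expected:
            raise Revert(f"unexpected nonce {nonce}, expected {expected}")
        if payload == "poison":
            # Hypothetical message that always hits an unexpected bug.
            raise Revert("unexpected error")
        self.inbound_nonce = nonce
        self.handled.append(payload)

    def force_consume(self, caller, nonce):
        """Mark the stuck message's nonce as consumed without executing it."""
        if caller != self.governor:
            raise Revert("caller is not the governor")
        if nonce != self.inbound_nonce + 1:
            # Assumption: only the message currently blocking the channel can be skipped.
            raise Revert("can only force-consume the next expected nonce")
        self.inbound_nonce = nonce
        self.skipped.append(nonce)
```

If message 2 carries the `"poison"` payload, message 3 stays blocked until the governor calls `force_consume(governor, 2)`, after which message 3 executes normally. Recording the skipped nonces keeps a trail of which messages were never executed, so their effects can be reconciled separately.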
