The Gravity Contract assigns every event a monotonically increasing event_nonce with no gaps. This nonce is the unique coordinating value for the Oracle. Every event carries its event_nonce; this is used to ensure that, when a validator submits a claim stating it has seen a specific event happen on Ethereum, the ordering is unambiguous.
This logic is indeed enforced in the Cosmos SDK module, in the function Attest(), that checks that the event nonces from a particular validator are contiguous:
The function check_for_events() is structured as follows around the processing of event nonces:
```rust
pub async fn check_for_events(..., starting_block: Uint256) -> Result<Uint256, GravityError> {
    let latest_block = get_block_number_with_retry(web3);
    let latest_block = latest_block - get_block_delay(web3);
    // Collect events from Ethereum between starting_block and latest_block
    let last_event_nonce = get_last_event_nonce_for_validator(...);
    // Filter events to be after last_event_nonce
    let valsets = ValsetUpdatedEvent::filter_by_event_nonce(last_event_nonce, &valsets);
    // ... in the same way for other event types
    // Send filtered events to Cosmos
    let res = send_ethereum_claims(...).await?;
    let new_event_nonce = get_last_event_nonce_for_validator(...).await?;
    if new_event_nonce == last_event_nonce {
        return Err(GravityError::InvalidBridgeStateError);
    }
    Ok(latest_block)
}
```
As can be seen, a Tendermint block is simultaneously subject to two constraints: a maximum size (200'000 bytes), and a maximum amount of gas (40'000'000) a block can use. As many Ethereum events are accumulated into a single Cosmos transaction, it seems inevitable that at least one of the above limits will eventually be hit. The problem with the code of send_ethereum_claims() and of its upstream call sites (in particular eth_oracle_main_loop()) is that when the transaction fails, processing at the current loop iteration simply aborts: no action is taken besides logging. At the same time, the conditions that caused the error (an excessive number of Ethereum events) cannot improve: they can only get worse. As a result, the failure will repeat over and over again, without end.
A further, minor issue is the last check: new_event_nonce != last_event_nonce only proves that some of the observed Ethereum events were transferred to Cosmos, not all of them. If the messages were processed on Cosmos individually, events belonging to the block interval between last_checked_block and latest_block, but with an event nonce such that last_event_nonce < new_event_nonce < nonce, could be lost. The current Cosmos transaction semantics processes either all messages belonging to a transaction or none, so the check appears safe today. It is still advisable to strengthen the check, and compare new_event_nonce to the latest submitted event nonce.
Problem Scenarios
The following scenario is possible:
A large number of unprocessed Ethereum events accumulates, e.g. due to bridge popularity, or because relaying had not been happening for some time due to network problems
The Ethereum oracle tries to relay the events but, due to their large number, hits one of the hard bounds on the Tendermint block (either size or gas)
The current loop iteration, as well as all subsequent iterations, will fail: this orchestrator will stop relaying Ethereum events
If a substantial number of orchestrators fail to submit Ethereum events, the Gravity bridge will halt, because there will not be enough Ethereum claims to pass attestation.
A particularly dangerous aspect of this issue is that it won't manifest in simple tests, or even under moderate load. But once the conditions are met (increased bridge popularity, or delayed event relaying), the effects will be sudden and severe.
Recommendation
Fix the logic of ethereum_event_watcher so that no Ethereum events can be lost; in particular, split the relayed Ethereum events into multiple Cosmos transactions when needed.
Original issue
Surfaced from @informalsystems audit of Althea Gravity Bridge at commit 19a4cfe
severity: High
type: Implementation bug
difficulty: Intermediate
Involved artifacts
Description
As outlined in the Ethereum oracle documentation:
The problematic code lives in the orchestrator's function check_for_events(), which is called from eth_oracle_main_loop() as follows:
Finally, the function send_ethereum_claims() works roughly as follows:
As can be seen, the combination of the above functions works as follows:

- collect the events from Ethereum between last_checked_block and latest_block;
- filter the events to keep only those after last_event_nonce;
- submit claims for the filtered events to Cosmos in a single transaction, and query new_event_nonce, which is the last event nonce observed on Cosmos for this validator;
- advance last_checked_block to latest_block if new_event_nonce != last_event_nonce; otherwise last_checked_block and last_event_nonce stay unaltered.

Querying the /consensus_params endpoint of an RPC node of Cosmos Hub 4 shows that a Tendermint block is simultaneously subject to a maximum size (200'000 bytes) and a maximum gas (40'000'000) limit, as discussed above.