
Ethereum event relaying may saturate Cosmos transactions #281

Closed

EricBolten opened this issue Nov 9, 2021 · 1 comment

Comments

@EricBolten (Contributor)

Original issue

Surfaced from the @informalsystems audit of Althea Gravity Bridge at commit 19a4cfe

severity: High
type: Implementation bug
difficulty: Intermediate

Description

As outlined in the Ethereum oracle documentation:

The Gravity Contract assigns every event a monotonically increasing event_nonce with no gaps. This nonce is the unique coordinating value for the Oracle. Every event has the event_nonce attached; this is used to ensure that when a validator submits a claim stating it has seen a specific event happening on Ethereum, the ordering is unambiguous.

This logic is indeed enforced in the Cosmos SDK module, in the function Attest(), which checks that the event nonces from a particular validator are contiguous:

lastEventNonce := k.GetLastEventNonceByValidator(ctx, valAddr)
if claim.GetEventNonce() != lastEventNonce+1 {
    return nil, types.ErrNonContiguousEventNonce
}

The problematic code lives in the orchestrator's function check_for_events(), which is called from eth_oracle_main_loop() as follows:

        // Relays events from Ethereum -> Cosmos
        match check_for_events(..., last_checked_block.clone())
        .await
        {
            Ok(new_block) => last_checked_block = new_block,
            Err(e) => ...
        }

The function check_for_events() is structured as follows around the processing of event nonces:

pub async fn check_for_events(..., starting_block: Uint256) -> Result<Uint256, GravityError> {
    let latest_block = get_block_number_with_retry(web3).await;
    let latest_block = latest_block - get_block_delay(web3).await;

    // Collect events from Ethereum between starting_block and latest_block

    let last_event_nonce = get_last_event_nonce_for_validator(...);

    // Filter events to be after last_event_nonce
    let valsets = ValsetUpdatedEvent::filter_by_event_nonce(last_event_nonce, &valsets);
    // in the same way for other event types

    // Send filtered events to cosmos
    let res = send_ethereum_claims(...).await?;

    let new_event_nonce = get_last_event_nonce_for_validator(...).await?;
    if new_event_nonce == last_event_nonce {
        return Err(GravityError::InvalidBridgeStateError);
    }
    Ok(latest_block)
}

Finally, the function send_ethereum_claims() works roughly as follows:

pub async fn send_ethereum_claims(..., deposits, withdraws, erc20_deploys, valsets,...) {
    let our_address = private_key.to_address(&contact.get_prefix()).unwrap();

    let mut unordered_msgs = HashMap::new();
    for deposit in deposits {
        let claim = ...;
        let msg = Msg::new("/gravity.v1.MsgSendToCosmosClaim", claim);
        unordered_msgs.insert(deposit.event_nonce, msg);
    }
    // same for other types of events
    ...
    // sort according to event nonces
    keys.sort_unstable();
    let mut msgs = Vec::new();
    for i in keys {
        msgs.push(unordered_msgs.remove_entry(&i).unwrap().1);
    }
    let fee = Fee { ...,  gas_limit: 500_000_000u64, ... };
    let args = contact.get_message_args(our_address, fee).await?;
    ...
    let msg_bytes = private_key.sign_std_msg(&msgs, args, MEMO)?;
    let response = contact
        .send_transaction(msg_bytes, BroadcastMode::Sync)
        .await?;
    contact.wait_for_tx(response, TIMEOUT).await
}

As can be seen, the combination of the above functions works as follows:

  • collect Ethereum events between last_checked_block and latest_block
  • filter events and keep only those with nonces greater than last_event_nonce
  • send filtered events to Cosmos; abort if the send was not successful
  • get new_event_nonce, which is the last event nonce observed on Cosmos for this validator
  • update last_checked_block to latest_block if new_event_nonce != last_event_nonce
  • if any step in the above sequence fails, last_checked_block and last_event_nonce stay unaltered.
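To make the failure mode concrete, here is a condensed, hypothetical simplification of the loop shown above (the error handling and loop pacing are assumptions): on error, last_checked_block is never advanced, so the next iteration rebuilds and resubmits the same, ever-growing batch.

loop {
    match check_for_events(..., last_checked_block.clone()).await {
        Ok(new_block) => last_checked_block = new_block,
        Err(e) => {
            // Only logged; the oversized batch is rebuilt and
            // rejected again on every subsequent iteration.
            error!("Failed to relay Ethereum events: {:?}", e);
        }
    }
    // sleep until the next iteration
}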

Querying the /consensus_params endpoint of a Cosmos Hub 4 RPC node yields, in part, the following response:

"consensus_params": {
      "block": {
        "max_bytes": "200000",
        "max_gas": "40000000",
        "time_iota_ms": "1000"
      }
}

As can be seen, a Tendermint block is subject to two simultaneous constraints: a maximum size (200,000 bytes) and a maximum gas (40,000,000) per block. Because many Ethereum events are accumulated into a single Cosmos transaction, it seems inevitable that at least one of these limits will eventually be hit. The problem with the code of send_ethereum_claims() and its upstream call sites (in particular eth_oracle_main_loop()) is that when the transaction fails, processing in the current loop iteration simply aborts: no action is taken besides logging. At the same time, the condition that caused the error (an excessive number of Ethereum events) cannot improve; it can only get worse. The failure will therefore repeat over and over, without end.
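For a rough sense of when these limits bite, consider a back-of-the-envelope calculation; the per-claim size and gas figures below are illustrative assumptions, not measured values:

// All per-claim figures are illustrative assumptions.
let max_block_bytes: u64 = 200_000;   // from /consensus_params
let max_block_gas: u64 = 40_000_000;  // from /consensus_params

let bytes_per_claim: u64 = 500;       // assumed encoded size of one claim
let gas_per_claim: u64 = 100_000;     // assumed gas cost of one claim

let by_size = max_block_bytes / bytes_per_claim; // 400 claims
let by_gas = max_block_gas / gas_per_claim;      // 400 claims

// Whichever limit is hit first bounds the batch: a backlog of a few
// hundred unprocessed events already makes the transaction unfit
// for any block.
let claims_per_block = by_size.min(by_gas);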

A further, minor issue is the last step: new_event_nonce may differ from last_event_nonce even though not all events observed on Ethereum were successfully transferred to Cosmos. If the messages were processed on Cosmos individually, events in the block interval between last_checked_block and latest_block with a nonce such that last_event_nonce < new_event_nonce < nonce could be lost. Current Cosmos transaction semantics processes either all messages belonging to a transaction or none, so this check appears safe. It is still advisable to strengthen the check and compare new_event_nonce to the latest submitted event nonce.
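A minimal sketch of that stricter check, assuming a collection submitted_events holding the filtered events with their event_nonce fields (the names are hypothetical):

// Highest nonce among the events actually submitted in this iteration.
let highest_submitted_nonce = submitted_events
    .iter()
    .map(|e| e.event_nonce)
    .max()
    .expect("at least one event was submitted");

let new_event_nonce = get_last_event_nonce_for_validator(...).await?;
if new_event_nonce != highest_submitted_nonce {
    // Some submitted events were not recorded on Cosmos; do not
    // advance last_checked_block.
    return Err(GravityError::InvalidBridgeStateError);
}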

Problem Scenarios

The following scenario is possible:

  1. A large number of unprocessed Ethereum events accumulates, e.g. due to the bridge's popularity, or because relaying was interrupted for some time by network problems
  2. The Ethereum oracle tries to relay the events, but due to their large number hits one of the hard bounds on a Tendermint block (either size or gas)
  3. The current loop iteration, as well as all subsequent iterations, will fail: this orchestrator will stop relaying Ethereum events
  4. If a substantial number of orchestrators fail to submit Ethereum events, the Gravity bridge will halt, because there will not be enough Ethereum claims to pass attestation.

A particular problem with this issue is that it won't manifest in simple tests, or even under moderate load. But when the conditions are met (increased bridge popularity or delayed event relaying), the effects will be sudden and severe.

Recommendation

Fix the logic of ethereum_event_watcher so that no Ethereum event can be lost; in particular, split relayed Ethereum events into multiple Cosmos transactions when needed.
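One possible shape of such a fix, sketched against the send_ethereum_claims() structure shown above (MAX_MSGS_PER_TX is a hypothetical tuning knob and must keep each transaction under both block limits):

const MAX_MSGS_PER_TX: usize = 100; // assumed; must respect block limits

// Submit the nonce-ordered messages in bounded chunks, each as its
// own Cosmos transaction, instead of one unbounded transaction.
for chunk in msgs.chunks(MAX_MSGS_PER_TX) {
    let args = contact.get_message_args(our_address, fee.clone()).await?;
    let msg_bytes = private_key.sign_std_msg(chunk, args, MEMO)?;
    let response = contact
        .send_transaction(msg_bytes, BroadcastMode::Sync)
        .await?;
    contact.wait_for_tx(response, TIMEOUT).await?;
}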

@EricBolten (Contributor Author)

Our implementation sends queued Cosmos messages in a configurable batch size, so we can avoid saturating the Cosmos block size:

while let Some(messages) = rx.recv().await {
    // Send the queued messages in chunks of at most msg_batch_size,
    // so a single Cosmos transaction never grows without bound.
    for msg_chunk in messages.chunks(msg_batch_size) {
        match send_messages(
            contact,
            cosmos_key,
            gas_price.to_owned(),
            msg_chunk.to_vec(),
            gas_adjustment,
        )
        .await
        {
            Ok(_) => { /* transaction accepted */ }
            Err(err) => { /* log the error and continue */ }
        }
    }
}

Closing this issue.
