EndBlocker slashing may halt the chain #347

Open
andrey-kuprianov opened this issue Oct 5, 2021 · 0 comments

Surfaced from @informalsystems audit of Althea Gravity Bridge at commit 19a4cfe

severity: High
type: Implementation bug
difficulty: Intermediate

Involved artifacts

Description

As specified in the Cosmos SDK documentation:

BeginBlocker and EndBlocker are a way for module developers to add automatic execution of logic to their module. This is a powerful tool that should be used carefully, as complex automatic functions can slow down or even halt the chain.

Also here:

...it is important to remember that application-specific blockchains are deterministic. Developers must be careful not to introduce non-determinism in BeginBlocker or EndBlocker, and must also be careful not to make them too computationally expensive, as gas does not constrain the cost of BeginBlocker and EndBlocker execution.

The Gravity Bridge EndBlocker performs a lot of computation:

// EndBlocker is called at the end of every block
func EndBlocker(ctx sdk.Context, k keeper.Keeper) {
	params := k.GetParams(ctx)
	slashing(ctx, k)
	attestationTally(ctx, k)
	cleanupTimedOutBatches(ctx, k)
	cleanupTimedOutLogicCalls(ctx, k)
	createValsets(ctx, k)
	pruneValsets(ctx, k, params)
	pruneAttestations(ctx, k)
}

The slashing functions, in particular, all contain the same pattern of nested loops (shown here using the example of BatchSlashing, shortened):

for _, batch := range unslashedBatches {
    confirms := k.GetBatchConfirmByNonceAndTokenContract(ctx, batch.BatchNonce, batch.TokenContract)
    for _, val := range currentBondedSet {
        for _, conf := range confirms {
            confVal, _ := sdk.AccAddressFromBech32(conf.Orchestrator)
            valAddr, foundValidator := k.GetOrchestratorValidator(ctx, confVal)
        }
    }
}

Let's try to estimate the numbers here. Current Ethereum handles around 30 transactions per second (TPS). Vitalik Buterin, one of the founders of Ethereum, has suggested that Ethereum 2.0 may eventually scale to as many as 100,000 TPS using sharding and other techniques. As the motivation for the Gravity Bridge is to bring transactions from the current, slow Ethereum to the much faster Cosmos, suppose the bridge becomes moderately successful and handles 1,000 TPS flowing from Cosmos to Ethereum.

Multiplying the following factors:

  • 10 batches per second (as 1 batch includes 100 transactions)
  • 7 seconds per block
  • 125 bonded validators
  • 125 confirmations (each batch needs to be confirmed by each validator)

This gives us 10 * 7 * 125 * 125 = 70 * 125^2 = 1,093,750 innermost operations per block for a single nested slashing loop; there are 3 such nested loop constructs, which gives 3,281,250 innermost operations per block in total. Suppose the innermost loop on average breaks out halfway through, which leaves ~ 1,600,000 operations per block. The innermost operation is this:

    confVal, _ := sdk.AccAddressFromBech32(conf.Orchestrator)
    valAddr, foundValidator := k.GetOrchestratorValidator(ctx, confVal)
    if foundValidator && valAddr.GetOperator().Equals(val.GetOperator()) {
        found = true
        break
    }

Of these, sdk.AccAddressFromBech32 is relatively simple, though it still involves non-trivial validation logic. GetOrchestratorValidator, on the other hand, spans ~80 lines of code, involving calls to the store and to the staking keeper, which are themselves non-trivial. Let's be highly speculative and estimate each line as costing 1 clock cycle, with ~100 lines in total per innermost operation. This gives 160,000,000 clock cycles per block -- that's a lot of processing only for the slashing in the EndBlocker.
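
To make the arithmetic easy to check, here is a small back-of-envelope sketch (not part of the codebase); all constants are the illustrative assumptions from this issue, not measured values:

package main

import "fmt"

// Back-of-envelope reproduction of the estimate above.
func main() {
	const (
		txPerSecond      = 1000 // assumed bridge throughput, Cosmos -> Ethereum
		txPerBatch       = 100  // assumed batch size
		secondsPerBlock  = 7
		bondedValidators = 125
		nestedLoops      = 3   // slashing constructs sharing the same loop pattern
		cyclesPerInnerOp = 100 // ~1 clock cycle per line, ~100 lines per innermost operation
	)

	batchesPerBlock := txPerSecond / txPerBatch * secondsPerBlock       // 70
	opsPerLoop := batchesPerBlock * bondedValidators * bondedValidators // 1,093,750
	opsPerBlock := opsPerLoop * nestedLoops / 2                         // ~1.6M: inner loop breaks out halfway on average
	cyclesPerBlock := opsPerBlock * cyclesPerInnerOp                    // ~160M clock cycles, for slashing alone

	fmt.Println(batchesPerBlock, opsPerLoop, opsPerBlock, cyclesPerBlock)
}

With the assumptions above this prints 70, 1093750, 1640625 and 164062500, matching the ~1,600,000 operations and ~160,000,000 cycles per block.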

This is a highly speculative estimate, but it gives the general idea: the computations in the EndBlocker are too heavy. What is particularly worrisome is the combination of a linear dependency on the number of transactions with a quadratic dependency on the number of validators. This means, e.g., that if the bridge becomes successful and sees a 10x influx of users, the computational load of executing Cosmos transactions also increases 10x. The situation can become dramatic quickly, because batch creation can be triggered by anyone, even for a single transaction. Going from batches of 100 transactions to batches of 1 transaction means an instant 100x increase in computational requirements; this may happen both due to legitimate users trying to get their transactions through faster and due to deliberate attacks on the Cosmos chain. Each of these increases is further multiplied by the quadratic validator dependency.

Problem Scenarios

The root of the problem lies in the linear and quadratic dependency of the block computational load on the number of transactions / validators, respectively. The following scenarios seem feasible:

  • Gravity Bridge starts working fine and attracts more users. At some point it becomes sufficiently popular that the computational load on full nodes is no longer bearable, and the performance of the Cosmos chain deteriorates.
  • With the deterioration of chain performance, desperate users will try to get their tokens through the bridge and will start to request batches with a single transaction. This suddenly increases the computational load on full nodes by a factor of 100. Alternatively, the 100x increase can be triggered maliciously by submitting a lot of 1-transaction batches. The chain halts.
  • Another likely consequence is that when the chain recovers, all validators will be slashed by Gravity Bridge in one sweep, because they were not able to confirm transactions in time.

Recommendation

  • Reduce the cost of the computations done in the EndBlocker as much as possible; try to move most computations out of the EndBlocker.
  • Re-think the slashing algorithm. Get rid of the quadratic dependency of the computational cost on the number of validators; the dependency should be at most linear.
  • The simplest and most straightforward approach is, e.g., to store the set of awaited confirmations and to remove an element from this set as soon as the corresponding confirmation arrives. This distributes the load to the time when a confirmation arrives, and makes slashing in the EndBlocker a straightforward linear pass over expected but not yet received confirmations (see the sketch after this list).
  • Other optimizations are possible and should be used; e.g., keeping the awaited confirmations in a hash set keyed by validator address makes each lookup amortized constant time (or logarithmic with a tree-based set).
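
To illustrate the recommended direction, here is a minimal sketch, assuming plain in-memory Go maps stand in for the module's KVStore; recordAwaitedConfirms, confirmReceived, and validatorsToSlash are hypothetical helpers, not existing Gravity Bridge functions:

package keeper // hypothetical placement; all names below are illustrative only

// recordAwaitedConfirms is called when a batch is created: every currently
// bonded validator is recorded as an awaited confirmer for that batch.
func recordAwaitedConfirms(awaiting map[uint64]map[string]bool, batchNonce uint64, bondedVals []string) {
	pending := make(map[string]bool, len(bondedVals))
	for _, val := range bondedVals {
		pending[val] = true
	}
	awaiting[batchNonce] = pending
}

// confirmReceived is called from the message handler (not from the EndBlocker)
// when a confirmation arrives, and costs O(1) per confirmation.
func confirmReceived(awaiting map[uint64]map[string]bool, batchNonce uint64, valAddr string) {
	if pending, ok := awaiting[batchNonce]; ok {
		delete(pending, valAddr)
	}
}

// validatorsToSlash is the only work left for the EndBlocker: a linear pass
// over confirmations that were expected but never received for timed-out batches.
func validatorsToSlash(awaiting map[uint64]map[string]bool, timedOutNonces []uint64) []string {
	var toSlash []string
	for _, nonce := range timedOutNonces {
		for val := range awaiting[nonce] {
			toSlash = append(toSlash, val)
		}
	}
	return toSlash
}

With this bookkeeping the per-block slashing cost is linear in the number of missing confirmations, instead of quadratic in the number of validators.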