
Txn latency between websockets and http #3074

Open
avinashbo opened this issue Oct 4, 2024 · 11 comments

@avinashbo

Problem

We are noticing some latency with txn responses on our mainnet nodes running v1.18.23

  1. We subscribe to all transactions mentioning the pubkey ComputeBudget111111111111111111111111111111 via the logsSubscribe method, using commitment level finalized
  2. For each txn hash received, we call getTransaction on that hash, this time using commitment level confirmed

Now, some of the hashes return null, and the rate of nulls we witness decreases when we introduce a sleep between logsSubscribe and getTransaction. We are trying to understand why this would happen when both the websocket and HTTP calls are being served by the same node.
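For reference, the two calls above can be expressed as raw JSON-RPC payloads. This is a minimal sketch; the method names and parameter shapes follow the public Solana JSON-RPC API, and the signature value is a placeholder, not a real hash:

```python
# Build the JSON-RPC payloads for the two calls described above.

def logs_subscribe_payload(pubkey: str, commitment: str = "finalized") -> dict:
    """Payload sent over the websocket to subscribe to logs mentioning `pubkey`."""
    return {
        "jsonrpc": "2.0",
        "id": 1,
        "method": "logsSubscribe",
        "params": [{"mentions": [pubkey]}, {"commitment": commitment}],
    }

def get_transaction_payload(signature: str, commitment: str = "confirmed") -> dict:
    """Payload sent over HTTP to fetch transaction details for `signature`."""
    return {
        "jsonrpc": "2.0",
        "id": 2,
        "method": "getTransaction",
        "params": [signature, {"commitment": commitment}],
    }

sub = logs_subscribe_payload("ComputeBudget111111111111111111111111111111")
tx = get_transaction_payload("<signature from the notification>")
```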

Proposed Solution

@ceciEstErmat

It's not really surprising. The only question is: why do you need to refetch the hash afterwards? If you subscribe to tx events, you should already have all the info.

@akegaviar

@ceciEstErmat can you please elaborate on why it's not surprising? I'd really appreciate that as I'm not well versed in what's going on under the hood in this scenario.

As for why: it's useful for sniping bots in various cases, and it doesn't have to be getTransaction specifically, but any call that you need to make ASAP on the data acquired through logsSubscribe.

Example:

  1. A bot is running logsSubscribe to get newly minted token addresses.
  2. Once a minted token address is detected through logsSubscribe over the websocket, the bot needs to create an associated token account for the token.
  3. The bot can't make this transaction right away, as the node is not yet aware of the token address acquired through the event emitted by logsSubscribe. This is pretty much the same scenario as sending a getTransaction request.
  4. Introducing a sleep always fixes the issue, as apparently it gives the token address enough time to propagate.

It doesn't have to be a bot scenario; it's just an example from the field.
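A bounded retry loop is usually a safer workaround than a fixed sleep, since the delay needed varies with node load. A minimal sketch, where the `fetch` callable is a stand-in for whatever follow-up call the notification triggers:

```python
import time

def poll_until_available(fetch, attempts: int = 5, initial_delay: float = 0.05):
    """Call `fetch()` until it returns a non-None result, backing off
    exponentially between attempts. Returns None if every attempt fails."""
    delay = initial_delay
    for _ in range(attempts):
        result = fetch()
        if result is not None:
            return result
        time.sleep(delay)
        delay *= 2  # exponential backoff between retries
    return None
```

In the bot scenario above, `fetch` would wrap the follow-up RPC call (e.g. getTransaction) and return None for as long as the node has not yet caught up.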

@bw-solana

@akegaviar - In your example, are the subscription (e.g. logsSubscribe) and follow-up query (e.g. getTransaction) being sent to the same node?

I'm trying to understand how there might be a race condition between nodes' chain state. Going from finalized to confirmed commitment level should provide a really wide window allowing all nodes to be able to observe chain state.

@ceciEstErmat

ceciEstErmat commented Oct 7, 2024

@akegaviar To me it is not surprising, as most RPC clusters are "eventually consistent": you might be using an RPC provider and be assigned the event subscription on machine A, but have your request served by machine B. Machine B might not yet be at the same state as A.

Distributed systems are, for the most part, "eventually consistent"; this is why it did not surprise me, and having built such a thing, I would expect this behavior.

Unsure if this clarifies what I meant.

@steviez

steviez commented Oct 7, 2024

In your example, are the subscription (e.g. logsSubscribe) and follow-up query (e.g. getTransaction) being sent to the same node?

This is a critical question before going any further. The two previous comments stitch together a potential problem scenario:

  • Event subscription from node A
  • getTransaction request from node B

Suppose that node A is caught up with the tip of the cluster and node B is ~50 slots behind the tip. You could get a finalized tx hash from node A. Then, when you go to query that tx from node B, it could be the case that node B has not yet replayed that slot/tx. This could lead to node B being unable to return anything about that tx.
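The race described above can be illustrated with a toy model, where each node can only serve transactions in slots it has already replayed. The slot numbers here are made up for illustration:

```python
def lookup_tx(tx_slot: int, node_replayed_slot: int):
    """Toy model of getTransaction: a node can only return a transaction
    once it has replayed the slot containing it; otherwise it returns None."""
    return {"slot": tx_slot} if tx_slot <= node_replayed_slot else None

TIP = 1000
node_a = TIP       # caught up with the tip of the cluster
node_b = TIP - 50  # lagging ~50 slots behind the tip

# Node A notified us about a finalized tx in slot 980 ...
assert lookup_tx(980, node_a) is not None
# ... but node B has not replayed slot 980 yet, so the query returns null.
assert lookup_tx(980, node_b) is None
```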

Depending on configuration, RPC nodes might be doing a lot of extra work and be under heavier load than "regular" nodes. We have definitely seen scenarios where heavy RPC traffic (requests) can cause an RPC node to fall behind. Rate limiting of nodes was left as something to perform outside of the validator client.

@avinashbo
Author

@bw-solana We were able to reproduce the issue when both logsSubscribe and getTransaction were sent to the same node.

I'm trying to understand how there might be a race condition between nodes' chain state

Wondering the same, given that we are hitting the same node, and moreover that logsSubscribe is using the finalized commitment level, which takes a little longer than confirmed.

@bw-solana

Makes me think the Transaction Status Service (TSS) must be backed up. I believe that if you request tx details before TSS has processed the updates, the RPC server will return null or an error.

@avinashbo
Author

I still cannot understand why introducing a sleep would be necessary to fix this issue when it is not two different nodes serving the websocket and HTTP RPC

@bw-solana

I still cannot understand why introducing a sleep would be necessary to fix this issue when it is not two different nodes serving the websocket and HTTP RPC

I agree. The sleep is more of a workaround. We need to figure out what's going on.

My best guess is TSS being slow, but we'll need to confirm.

@lijunwangs lijunwangs self-assigned this Oct 9, 2024
@lijunwangs

I have root-caused the issue. The problem is caused by the asynchronous relationship between the logsSubscribe notification and the persistence of TransactionStatusMeta by the TransactionStatusService in the ReplayStage.

When a transaction is executed, the transaction logs are collected synchronously, and the transaction status batch is sent to the TransactionStatusService, which performs the persistence asynchronously in a separate thread. The logsSubscribe event (via CommitmentAggregationData --> NotificationEntry::Bank) is then sent. Due to the asynchronous nature of these two processes, it can happen that when getTransaction is called, the TransactionStatusMeta has not yet been written to the blockstore. We have had this mechanism for a long time, so I do not think this is a regression.

Recommendations:

  1. Check whether the TransactionLogInfo delivered with the logsSubscribe notification is sufficient for your use case
  2. Consider using geyser to stream the transaction events to your plugin, which is guaranteed to be notified only after the transaction has been written to the blockstore by the TransactionStatusService.
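For recommendation 1, the logsNotification message already carries the signature, error status, and log messages, so many use cases can skip the getTransaction round trip entirely. A sketch of pulling those fields out of the stream message; the sample notification below is illustrative, shaped like the logsSubscribe notifications described in the Solana RPC documentation, not real data:

```python
def extract_log_info(notification: dict) -> dict:
    """Pull the fields of interest out of a logsNotification message."""
    result = notification["params"]["result"]
    value = result["value"]
    return {
        "slot": result["context"]["slot"],
        "signature": value["signature"],
        "err": value["err"],
        "logs": value["logs"],
    }

# Illustrative notification with made-up values.
sample = {
    "jsonrpc": "2.0",
    "method": "logsNotification",
    "params": {
        "result": {
            "context": {"slot": 12345},
            "value": {
                "signature": "<base58 tx signature>",
                "err": None,
                "logs": ["Program ComputeBudget111111111111111111111111111111 invoke [1]"],
            },
        },
        "subscription": 0,
    },
}

info = extract_log_info(sample)
```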

@bw-solana

This might help a little #3026
