
Txn latency between websockets and http #3074

Open
avinashbo opened this issue Oct 4, 2024 · 11 comments

@avinashbo

Problem

We are noticing some latency with txn responses on our mainnet nodes running v1.18.23

  1. We subscribe to all transactions mentioning the pubkey ComputeBudget111111111111111111111111111111 via the logsSubscribe method, using commitment level finalized
  2. For each txn hash received, we call getTransaction on that hash, this time using commitment level confirmed

Now, some of the hashes return null, and the rate of nulls we witness decreases when we introduce a sleep between logsSubscribe and getTransaction. We are trying to understand why this would happen when both the websocket and HTTP calls are being served by the same node.
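For reference, the two calls above can be expressed as raw JSON-RPC payloads. This is a minimal sketch; the method names and parameter shapes follow the public Solana JSON-RPC API, and the signature value is a placeholder, not a real hash:

```python
# Build the JSON-RPC payloads for the two calls described above.

def logs_subscribe_payload(pubkey: str, commitment: str = "finalized") -> dict:
    """Payload sent over the websocket to subscribe to logs mentioning `pubkey`."""
    return {
        "jsonrpc": "2.0",
        "id": 1,
        "method": "logsSubscribe",
        "params": [{"mentions": [pubkey]}, {"commitment": commitment}],
    }

def get_transaction_payload(signature: str, commitment: str = "confirmed") -> dict:
    """Payload sent over HTTP to fetch transaction details for `signature`."""
    return {
        "jsonrpc": "2.0",
        "id": 2,
        "method": "getTransaction",
        "params": [signature, {"commitment": commitment}],
    }

sub = logs_subscribe_payload("ComputeBudget111111111111111111111111111111")
tx = get_transaction_payload("<signature from the notification>")
```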

Proposed Solution

@ceciEstErmat

It's not really surprising. The only question is: why do you need to refetch the hash afterwards? If you subscribe to tx events, you should already have all the info.

@akegaviar

@ceciEstErmat can you please elaborate on why it's not surprising? I'd really appreciate that as I'm not well versed in what's going on under the hood in this scenario.

As for why: it's useful for sniping bots in various cases, and it doesn't have to be getTransaction specifically, but any call that you need to make ASAP on the data acquired through logsSubscribe.

Example:

  1. A bot is running logsSubscribe to get newly minted token addresses.
  2. Once a minted token address is detected through logsSubscribe over the websocket, the bot needs to create an associated token account for the token.
  3. The bot can't make this transaction right away, as the node is not yet aware of the token address acquired through the event emitted by logsSubscribe. This is pretty much the same scenario as sending a getTransaction request.
  4. Introducing a sleep always fixes the issue, as apparently it gives the token address enough time to propagate.

It doesn't have to be a bot scenario; it's just an example from the field.
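A bounded retry loop is usually a safer workaround than a fixed sleep, since the delay needed varies with node load. A minimal sketch, where the `fetch` callable is a stand-in for whatever follow-up call the notification triggers:

```python
import time

def poll_until_available(fetch, attempts: int = 5, initial_delay: float = 0.05):
    """Call `fetch()` until it returns a non-None result, backing off
    exponentially between attempts. Returns None if every attempt fails."""
    delay = initial_delay
    for _ in range(attempts):
        result = fetch()
        if result is not None:
            return result
        time.sleep(delay)
        delay *= 2  # exponential backoff between retries
    return None
```

In the bot scenario above, `fetch` would wrap the follow-up RPC call (e.g. getTransaction) and return None for as long as the node has not yet caught up.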

@bw-solana

@akegaviar - In your example, are the subscription (e.g. logsSubscribe) and follow-up query (e.g. getTransaction) being sent to the same node?

I'm trying to understand how there might be a race condition between nodes' chain state. Going from finalized to confirmed commitment level should provide a really wide window allowing all nodes to be able to observe chain state.

@ceciEstErmat

ceciEstErmat commented Oct 7, 2024

@akegaviar To me it is not surprising, as most RPC clusters are "eventually consistent": you might be using an RPC provider and be assigned the event subscription on machine A, but have your request served by machine B. Machine B might not yet be at the same state as A.

Distributed systems are, for the most part, "eventually consistent"; this is why it did not surprise me, and having built such a thing, I would expect this behavior.

Unsure if this clarifies what I meant.

@steviez

steviez commented Oct 7, 2024

In your example, are the subscription (e.g. logsSubscribe) and follow-up query (e.g. getTransaction) being sent to the same node?

This is a critical question before going any further. The two previous comments stitch together a potential problem scenario:

  • Event subscription from node A
  • getTransaction request from node B

Suppose that node A is caught up with the tip of the cluster and node B is ~50 slots behind the tip. You could get a finalized tx hash from node A. Then, when you go to query that tx from node B, it could be the case that node B has not yet replayed that slot/tx. This could lead to node B being unable to return anything about that tx.
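The race described above can be illustrated with a toy model, where each node can only serve transactions in slots it has already replayed. The slot numbers here are made up for illustration:

```python
def lookup_tx(tx_slot: int, node_replayed_slot: int):
    """Toy model of getTransaction: a node can only return a transaction
    once it has replayed the slot containing it; otherwise it returns None."""
    return {"slot": tx_slot} if tx_slot <= node_replayed_slot else None

TIP = 1000
node_a = TIP       # caught up with the tip of the cluster
node_b = TIP - 50  # lagging ~50 slots behind the tip

# Node A notified us about a finalized tx in slot 980 ...
assert lookup_tx(980, node_a) is not None
# ... but node B has not replayed slot 980 yet, so the query returns null.
assert lookup_tx(980, node_b) is None
```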

Depending on configuration, RPC nodes might be doing a lot of extra work and be under heavier load than "regular" nodes. We have definitely seen scenarios where heavy RPC traffic (requests) can cause an RPC node to fall behind. Rate limiting of nodes was left as something to perform outside of the validator client.

@avinashbo
Author

@bw-solana We were able to reproduce the issue when both logsSubscribe and getTransaction were sent to the same node.

I'm trying to understand how there might be a race condition between nodes' chain state

Wondering the same, given that we are hitting the same node, and moreover that logsSubscribe is using the finalized commitment level, which takes a little longer than confirmed.

@bw-solana

Makes me think the Transaction Status Service (TSS) must be backed up. I believe that if you request tx details before TSS has processed the updates, the RPC server will return null or an error.

@avinashbo
Author

I still cannot understand why introducing a sleep would be necessary to fix this issue when it is not two different nodes serving the websocket and HTTP RPC

@bw-solana

I still cannot understand why introducing a sleep would be necessary to fix this issue when it is not two different nodes serving the websocket and HTTP RPC

I agree. The sleep is more of a workaround. We need to figure out what's going on.

My best guess is TSS being slow, but we'll need to confirm.

@lijunwangs lijunwangs self-assigned this Oct 9, 2024
@lijunwangs

I have root-caused the issue. The problem is caused by the asynchronous relationship between the logsSubscribe notification and the persistence of TransactionStatusMeta by the TransactionStatusService in the ReplayStage.

When a transaction is executed, the transaction logs are collected synchronously, and the transaction status batch is sent to the TransactionStatusService, which performs the persistence asynchronously in a separate thread. The logsSubscribe event (via CommitmentAggregationData --> NotificationEntry::Bank) is then sent. Due to the asynchronous nature of these two processes, it can happen that when getTransaction is called, the TransactionStatusMeta has not yet been written to the blockstore. We have had this mechanism for a long time, so I do not think this is a regression.

Recommendations:

  1. Check whether the TransactionLogInfo delivered with the logsSubscribe notification is sufficient for your use case
  2. Consider using geyser to stream the transaction events to your plugin, which is guaranteed to be notified only after the transaction has been written to the blockstore by the TransactionStatusService.
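For recommendation 1, the logsNotification message already carries the signature, error status, and log messages, so many use cases can skip the getTransaction round trip entirely. A sketch of pulling those fields out of the stream message; the sample notification below is illustrative, shaped like the logsSubscribe notifications described in the Solana RPC documentation, not real data:

```python
def extract_log_info(notification: dict) -> dict:
    """Pull the fields of interest out of a logsNotification message."""
    result = notification["params"]["result"]
    value = result["value"]
    return {
        "slot": result["context"]["slot"],
        "signature": value["signature"],
        "err": value["err"],
        "logs": value["logs"],
    }

# Illustrative notification with made-up values.
sample = {
    "jsonrpc": "2.0",
    "method": "logsNotification",
    "params": {
        "result": {
            "context": {"slot": 12345},
            "value": {
                "signature": "<base58 tx signature>",
                "err": None,
                "logs": ["Program ComputeBudget111111111111111111111111111111 invoke [1]"],
            },
        },
        "subscription": 0,
    },
}

info = extract_log_info(sample)
```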

@bw-solana

This might help a little #3026
