While testing the upcoming 0.3 release, which includes VLS 0.11.1 to address #431 and #476, we noticed that some of the signer responses aren't being accepted, because they appear to cause concurrent updates to shared state.
The state is split into smaller domains, as key-value pairs that can be updated individually. This allows us to attach the signer state to requests early, queue them up, and avoid dependencies between requests. The goal is to make each request and its associated state as self-contained as possible, which depends on how we split the signer state into smaller domains.
Our current theory is that the upgrade to VLS 0.11.1 causes some global state to also be updated in a shared domain. The second request's response is then based on a stale snapshot, so applying it produces a stale state update, which is rejected.
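To illustrate the versioning scheme described above, here is a minimal Rust sketch; the names (Entry, SignerState, apply) are hypothetical and not taken from the actual implementation. Each domain is a key-value pair guarded by a monotonically increasing version, and an update computed against an older snapshot is rejected:

```rust
use std::collections::HashMap;

/// A single state domain: a versioned value that can only advance monotonically.
#[derive(Clone, Debug, PartialEq)]
struct Entry {
    version: u64,
    value: Vec<u8>,
}

#[derive(Default)]
struct SignerState {
    domains: HashMap<String, Entry>,
}

#[derive(Debug)]
enum UpdateError {
    /// The incoming update was computed against an older snapshot.
    StaleVersion { key: String, current: u64, proposed: u64 },
}

impl SignerState {
    /// Apply a signer-produced update. If two in-flight requests were handed the
    /// same snapshot and both touch the same (shared) domain, the second response
    /// arrives with a version that is no longer newer and gets rejected here.
    fn apply(&mut self, key: &str, proposed: Entry) -> Result<(), UpdateError> {
        if let Some(current) = self.domains.get(key) {
            if proposed.version <= current.version {
                return Err(UpdateError::StaleVersion {
                    key: key.to_string(),
                    current: current.version,
                    proposed: proposed.version,
                });
            }
        }
        self.domains.insert(key.to_string(), proposed);
        Ok(())
    }
}
```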
Roadmap (WIP)
Instrument the tower to see the updates in their entirety, allowing us to diff the states and verify that the collision is indeed an access to shared state.
Re-split the domains such that requests no longer overlap
(maybe) Change the stream_hsm_requests method to only ever have one request in flight at a time, and bind the state late; this will have some performance impact, since it no longer allows pipelining requests (see the sketch after this list).
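To make the last option concrete, here is a rough Rust sketch of single-in-flight dispatch with late state binding. HsmRequest, HsmResponse, send_to_signer, and SerializedDispatcher are hypothetical placeholders, not the actual stream_hsm_requests implementation:

```rust
use std::collections::HashMap;
use tokio::sync::Mutex;

/// Placeholder request/response types; the real protocol types look different.
struct HsmRequest(Vec<u8>);
struct HsmResponse {
    payload: Vec<u8>,
    /// (domain key, version, value) triples reported back by the signer.
    state_updates: Vec<(String, u64, Vec<u8>)>,
}

/// Stand-in for the actual round-trip to the signer.
async fn send_to_signer(
    _request: HsmRequest,
    _snapshot: HashMap<String, (u64, Vec<u8>)>,
) -> HsmResponse {
    HsmResponse { payload: vec![], state_updates: vec![] }
}

/// Serialize signer requests: only one request is in flight at a time, and the
/// state snapshot is bound immediately before dispatch ("late binding"). Two
/// requests can no longer be computed from the same snapshot, at the cost of
/// losing request pipelining.
struct SerializedDispatcher {
    state: Mutex<HashMap<String, (u64, Vec<u8>)>>,
}

impl SerializedDispatcher {
    async fn dispatch(&self, request: HsmRequest) -> HsmResponse {
        // Holding the lock across the whole round-trip means no concurrent
        // request can invalidate the snapshot we just attached.
        let mut state = self.state.lock().await;
        let snapshot = state.clone();
        let response = send_to_signer(request, snapshot).await;
        for (key, version, value) in &response.state_updates {
            state.insert(key.clone(), (*version, value.clone()));
        }
        response
    }
}
```

The trade-off is exactly the one noted above: the lock is held for the full round-trip, so requests can no longer be pipelined.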
What to do for now?
The issue manifests randomly, but it also solves itself via a restart. Either let the node be preempted, which happens after at most 15 minutes once the app is closed and no new commands are issued, or call stop() to stop the node; the next RPC call will schedule it again.
Yep, that is to be expected: since the gRPC connection is shared, the signer disconnecting may affect the RPC connection.
We have since stopped forwarding the collision errors to the signer, which stops the disconnect loop. It does not fix the underlying issue, in that we cannot return a signer response to CLN if the associated state update failed, but from what I can tell most (if not all) occurrences are actually no-op collisions (i.e., signer0 sends back state version=100, value=a, and signer1 sends back state version=100, value=a, the same values), in which case we can forward the response to CLN.
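For clarity, here is a minimal sketch of that no-op classification in Rust; StateUpdate, Collision, and classify are hypothetical names for illustration, not the code we ship:

```rust
/// A versioned value for a single state domain, as reported by a signer.
#[derive(Clone, PartialEq, Eq, Debug)]
struct StateUpdate {
    key: String,
    version: u64,
    value: Vec<u8>,
}

/// Outcome of checking an incoming update against the stored entry.
#[derive(Debug)]
enum Collision {
    /// The update advances the domain: apply it and forward the response.
    Advance,
    /// Same version and identical value: nothing is lost, so log the
    /// occurrence and forward the response to CLN anyway.
    NoOp,
    /// Same or older version with a *different* value: the response was
    /// computed from genuinely stale state and must not be forwarded.
    Conflict,
}

fn classify(current: Option<&StateUpdate>, incoming: &StateUpdate) -> Collision {
    match current {
        None => Collision::Advance,
        Some(cur) if incoming.version > cur.version => Collision::Advance,
        Some(cur) if incoming.version == cur.version && incoming.value == cur.value => {
            Collision::NoOp
        }
        Some(_) => Collision::Conflict,
    }
}
```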
We are logging the occurrences and monitoring for any non-no-op collisions, in which case we should also have an idea of where the collision occurs and how to slice the state to avoid these collisions.
Closing this one in a month if no such collisions present themselves.