track: Concurrent signer state updates cause collisions #494

Closed
1 of 3 tasks
cdecker opened this issue Aug 9, 2024 · 3 comments
cdecker commented Aug 9, 2024

While testing the upcoming 0.3 release, which includes VLS 0.11.1 to address #431 and #476, we noticed that some signer responses are being rejected because they appear to cause concurrent updates to shared state.

The state is split into smaller domains, stored as key-value pairs that can be updated individually. This lets us attach the signer state to requests early, queue them up, and avoid dependencies between requests. The goal is to make each request, together with its associated state, as self-contained as possible; how well this works depends on how we split the signer state into domains.

Our current theory is that the upgrade to VLS 0.11.1 causes some global state to also be written to a shared domain. When two requests are in flight concurrently, the second response is computed against a stale copy of that domain, so its state update is detected as stale and rejected.
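To make the failure mode concrete, here is a minimal sketch of the mechanism (hypothetical names, not the actual gl-client/VLS code): the signer state is a set of versioned key-value domains, and an update is accepted only if it was computed against the current version of its domain. Two concurrent responses touching the same domain means the second one is rejected as stale.

```rust
use std::collections::HashMap;

/// Illustrative versioned key-value store; names are made up for this
/// sketch and do not correspond to the real implementation.
struct SignerState {
    entries: HashMap<String, (u64, String)>, // key -> (version, value)
}

impl SignerState {
    fn new() -> Self {
        Self { entries: HashMap::new() }
    }

    /// Apply an update that was computed against `base_version`.
    /// If another update landed in the meantime, this update is based
    /// on stale state and is rejected.
    fn apply(&mut self, key: &str, base_version: u64, value: String) -> Result<u64, String> {
        let current = self.entries.get(key).map(|(v, _)| *v).unwrap_or(0);
        if base_version != current {
            return Err(format!(
                "stale update for {key}: based on v{base_version}, current is v{current}"
            ));
        }
        let new_version = current + 1;
        self.entries.insert(key.to_string(), (new_version, value));
        Ok(new_version)
    }
}
```

With this model, splitting the state into finer domains means concurrent requests rarely touch the same key, so both updates can be accepted; a shared domain makes every concurrent pair collide.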

Roadmap (WIP)

  • Instrument the tower to capture the updates in their entirety, allowing us to diff the states and verify that the collisions are indeed caused by access to shared state.
  • Re-split the domains such that requests no longer overlap.
  • (maybe) Change the stream_hsm_requests method to only ever have one request in flight at a time, and bind the state late (this will have some performance impact, since it no longer allows pipelining requests).
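The last option can be sketched as follows (a toy model with hypothetical names, not the actual stream_hsm_requests implementation): requests are handled strictly one at a time, and the state snapshot is bound only at dispatch time, so each request sees the state left behind by its predecessor and stale-state collisions cannot occur.

```rust
/// Toy model of "one request in flight, bind state late".
/// The single counter stands in for the shared signer state.
struct Hsm {
    state_version: u64,
}

impl Hsm {
    /// Handle requests sequentially. Each request is bound to the
    /// current state version at dispatch time, then advances it, so
    /// no request ever carries a stale snapshot.
    fn handle_all(&mut self, requests: &[&str]) -> Vec<(String, u64)> {
        requests
            .iter()
            .map(|req| {
                // Late binding: read the state only now, at dispatch.
                let bound = self.state_version;
                // Handling the request advances the shared state.
                self.state_version += 1;
                (req.to_string(), bound)
            })
            .collect()
    }
}
```

The trade-off mentioned in the roadmap is visible here: because the next request cannot be dispatched until the previous one has updated the state, requests can no longer be pipelined.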

What to do for now?

The issue manifests randomly, but it also resolves itself via a restart. Either let the node be preempted (closing the app and not issuing new commands will let it be preempted after at most 15 minutes), or call stop() to stop the node; the next RPC call will schedule it again.


cdecker commented Aug 9, 2024

@roeierez

@cdecker when that happened to me, the stop command didn't help. Only after waiting some time did the signer go back to normal again.


cdecker commented Aug 13, 2024

Yep, that is to be expected: since the gRPC connection is shared, the signer disconnecting may affect the RPC connection.

We have since stopped forwarding the collision errors to the signer, which stops the disconnect loop. It does not fix the underlying issue, in that we cannot return a signer response to CLN if the associated state update failed, but from what I can tell most (if not all) occurrences are actually no-op collisions (i.e., signer0 sends back state version=100, value=a, and signer1 sends back version=100, value=a, the same values), in which case we can safely forward the response to CLN.
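The no-op check described above amounts to something like the following sketch (hypothetical names, not the actual tower code): a conflicting update is harmless exactly when it carries the same version and value that are already stored, in which case the response can still be forwarded.

```rust
/// A state entry as returned by a signer: (version, value).
type Entry = (u64, Vec<u8>);

/// Illustrative classification of a conflicting update: if the second
/// signer wrote exactly the same version and value that are already
/// stored, the collision is a no-op and the response can be forwarded
/// to CLN; any difference in version or value is a real conflict.
fn is_noop_collision(stored: &Entry, incoming: &Entry) -> bool {
    stored == incoming
}
```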

We are logging the occurrences and monitoring for any non-no-op collisions; if one shows up, we should also get an idea of where the collision occurs and how to slice the state to avoid it.

Closing this one in a month if no such collisions present themselves.

@cdecker cdecker closed this as completed Oct 25, 2024