track: Concurrent signer state updates cause collisions #494

Closed
1 of 3 tasks
cdecker opened this issue Aug 9, 2024 · 3 comments
cdecker commented Aug 9, 2024

While testing the upcoming 0.3 release, which includes VLS 0.11.1 to address #431 and #476, we noticed that some signer responses are being rejected because they appear to cause concurrent updates to shared state.

The state is split into smaller domains, stored as key-value pairs that can be updated individually. This lets us attach the signer state to requests early, queue them up, and avoid dependencies between requests. The goal is to make each request, together with its associated state, as self-contained as possible; how well this works depends on how we split the signer state into domains.

Our current theory is that the upgrade to VLS 0.11.1 causes some global state to also be written to a shared domain. When two requests are in flight concurrently, the second response is computed against a stale copy of that domain, so its state update is detected as stale and rejected.
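To make the failure mode concrete, here is a minimal sketch of the mechanism (hypothetical names, not the actual gl-client/VLS code): the signer state is a set of versioned key-value domains, and an update is accepted only if it was computed against the current version of its domain. Two concurrent responses touching the same domain means the second one is rejected as stale.

```rust
use std::collections::HashMap;

/// Illustrative versioned key-value store; names are made up for this
/// sketch and do not correspond to the real implementation.
struct SignerState {
    entries: HashMap<String, (u64, String)>, // key -> (version, value)
}

impl SignerState {
    fn new() -> Self {
        Self { entries: HashMap::new() }
    }

    /// Apply an update that was computed against `base_version`.
    /// If another update landed in the meantime, this update is based
    /// on stale state and is rejected.
    fn apply(&mut self, key: &str, base_version: u64, value: String) -> Result<u64, String> {
        let current = self.entries.get(key).map(|(v, _)| *v).unwrap_or(0);
        if base_version != current {
            return Err(format!(
                "stale update for {key}: based on v{base_version}, current is v{current}"
            ));
        }
        let new_version = current + 1;
        self.entries.insert(key.to_string(), (new_version, value));
        Ok(new_version)
    }
}
```

With this model, splitting the state into finer domains means concurrent requests rarely touch the same key, so both updates can be accepted; a shared domain makes every concurrent pair collide.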

Roadmap (WIP)

  • Instrument the tower to capture the updates in their entirety, allowing us to diff the states and verify that the collisions are indeed caused by access to shared state.
  • Re-split the domains such that requests no longer overlap.
  • (maybe) Change the stream_hsm_requests method to only ever have one request in flight at a time, and bind the state late (this will have some performance impact, since it no longer allows pipelining requests).
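The last option can be sketched as follows (a toy model with hypothetical names, not the actual stream_hsm_requests implementation): requests are handled strictly one at a time, and the state snapshot is bound only at dispatch time, so each request sees the state left behind by its predecessor and stale-state collisions cannot occur.

```rust
/// Toy model of "one request in flight, bind state late".
/// The single counter stands in for the shared signer state.
struct Hsm {
    state_version: u64,
}

impl Hsm {
    /// Handle requests sequentially. Each request is bound to the
    /// current state version at dispatch time, then advances it, so
    /// no request ever carries a stale snapshot.
    fn handle_all(&mut self, requests: &[&str]) -> Vec<(String, u64)> {
        requests
            .iter()
            .map(|req| {
                // Late binding: read the state only now, at dispatch.
                let bound = self.state_version;
                // Handling the request advances the shared state.
                self.state_version += 1;
                (req.to_string(), bound)
            })
            .collect()
    }
}
```

The trade-off mentioned in the roadmap is visible here: because the next request cannot be dispatched until the previous one has updated the state, requests can no longer be pipelined.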

What to do for now?

The issue manifests randomly, but it also resolves itself via a restart. Either let the node be preempted (closing the app and not issuing new commands will let it be preempted after at most 15 minutes), or call stop() to stop the node; the next RPC call will schedule it again.


cdecker commented Aug 9, 2024

@roeierez

@cdecker when that happened to me, the stop command didn't help. Only after waiting some time did the signer go back to normal again.


cdecker commented Aug 13, 2024

Yep, that is to be expected: since the gRPC connection is shared, the signer disconnecting may affect the RPC connection.

We have since stopped forwarding the collision errors to the signer, which stops the disconnect loop. It does not fix the underlying issue, in that we cannot return a signer response to CLN if the associated state update failed, but from what I can tell most (if not all) occurrences are actually no-op collisions (i.e., signer0 sends back state version=100, value=a, and signer1 sends back version=100, value=a, the same values), in which case we can safely forward the response to CLN.
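The no-op check described above amounts to something like the following sketch (hypothetical names, not the actual tower code): a conflicting update is harmless exactly when it carries the same version and value that are already stored, in which case the response can still be forwarded.

```rust
/// A state entry as returned by a signer: (version, value).
type Entry = (u64, Vec<u8>);

/// Illustrative classification of a conflicting update: if the second
/// signer wrote exactly the same version and value that are already
/// stored, the collision is a no-op and the response can be forwarded
/// to CLN; any difference in version or value is a real conflict.
fn is_noop_collision(stored: &Entry, incoming: &Entry) -> bool {
    stored == incoming
}
```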

We are logging the occurrences and monitoring for any non-no-op collisions; if one shows up, we should also get an idea of where the collision occurs and how to slice the state to avoid it.

Closing this one in a month if no such collisions present themselves.

@cdecker cdecker closed this as completed Oct 25, 2024