Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Restore projections when a snapshot is installed #259

Merged
merged 3 commits into from
Jun 27, 2024

Conversation

the-mikedavis
Copy link
Member

This is the companion PR to rabbitmq/ra#448

The new ra_machine:snapshot_installed/2 optional callback can be used to sync the Khepri tree and projection tables. Without syncing these, if a member is caught up by the leader with a snapshot, any changes between where the member was in the log and the new snapshot index are not reflected in projection tables. The second commit has a reproduction case and explanation of a bug that is possible if we do not sync the projection tables when a snapshot is installed.

@the-mikedavis the-mikedavis added the bug Something isn't working label Jun 12, 2024
@the-mikedavis the-mikedavis self-assigned this Jun 12, 2024
Copy link

codecov bot commented Jun 12, 2024

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 89.79%. Comparing base (4621805) to head (28bd581).

Additional details and impacted files
@@            Coverage Diff             @@
##             main     #259      +/-   ##
==========================================
+ Coverage   89.64%   89.79%   +0.15%     
==========================================
  Files          21       21              
  Lines        3091     3098       +7     
==========================================
+ Hits         2771     2782      +11     
+ Misses        320      316       -4     
Flag Coverage Δ
erlang-24 88.86% <100.00%> (+0.15%) ⬆️
erlang-25 88.79% <100.00%> (+0.15%) ⬆️
erlang-26 89.50% <100.00%> (+0.18%) ⬆️
os-ubuntu-latest 89.79% <100.00%> (+0.15%) ⬆️
os-windows-latest 89.54% <100.00%> (+0.15%) ⬆️

Flags with carried forward coverage won't be shown. Click here to find out more.

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

@the-mikedavis the-mikedavis force-pushed the md/projections/restore-on-snapshot-installed branch 7 times, most recently from f2ad654 to 0356e20 Compare June 19, 2024 16:12
Previously the `snapshot_interval` config option only controlled when
`khepri_machine` should emit the `{release_cursor, RaftIndex, State}`
effect. Ra itself will ignore those effects though if they are emitted
too often - an option controlled with the `min_snapshot_interval`
(recently renamed from `snapshot_interval`) log init option in Ra.
To make Ra emit snapshots as often as the specified `snapshot_interval`
option, we need to set the `min_snapshot_interval` log init option to
a value smaller than `snapshot_interval`. Ra handles `release_cursor`
effects when the number of commands applied is greater than but not
equal to the interval, so `min_snapshot_interval` must be smaller than
`snapshot_interval`.
The new version of Ra adds a `ra_machine:snapshot_installed/2` callback
that we will use in the child commit.
@the-mikedavis the-mikedavis force-pushed the md/projections/restore-on-snapshot-installed branch from 0356e20 to 24b2814 Compare June 19, 2024 16:41
@the-mikedavis
Copy link
Member Author

a808303 is not fully related to the snapshot_installed change, just something I noticed while setting up the new test case (in 24b2814)

@the-mikedavis the-mikedavis marked this pull request as ready for review June 19, 2024 16:58
Copy link
Member

@dumbbell dumbbell left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Great patch, thank you!

Just two minor comments:

A typo in the commit message of the last commit: "the machine state maches the leader", "maches" -> "matches"

Still in the commit message, in the steps to reproduce the problem with RabbitMQ, could you please clarify in the entire paragraph when you talk about RabbitMQ? Because this is Khepri repository, RabbitMQ is kind of out of the context. For example, the first line of that section says "This fixes a bug reproducible in the server with the following reproduction steps:". It should be explicit that you are talking about the "RabbitMQ server". Likewise for the rest of the section when you talk about servers, nodes and so on.

@the-mikedavis the-mikedavis force-pushed the md/projections/restore-on-snapshot-installed branch from 24b2814 to 963bd5c Compare June 26, 2024 16:49
This hooks into the new `ra_machine:snapshot_installed/2` callback
introduced in the Ra version update in the parent commit.

We want to restore projections when a snapshot is installed the same
way as we do when the Khepri machine recovers. A cluster member may be
far behind other members in the cluster causing the cluster leader to
try to "catch up" that member by sending it a snapshot. Once the
snapshot is installed the machine state matches the leader at the
snapshot index, but this 'jump' forward doesn't trigger any changes to
the projections. So we need to sync the new machine state (the tree) and
the projection tables by using the existing `restore_projections` aux
effect.

This fixes a bug reproducible in the RabbitMQ 3.13.3 with the following
reproduction steps:

* Clone and enter the `rabbitmq/rabbitmq-server` repository.
* Start a 3-node RabbitMQ cluster with `make start-cluster`.
* Enable the `khepri_db` feature flag with
  `./sbin/rabbitmqctl enable_feature_flag khepri_db`.
* Start a definitions import with a very large data set, for example
  the `100-queues-with-1000-bindings-each.json` case from the
  `rabbitmq/sample-configs` repo.
    * Un-gzip the case:
      `tar xzf topic-bindings/100-queues-with-1000-bindings-each.json.gz`
      in the sample-configs directory.
    * `./sbin/rabbitmqctl import_definitions \
       path/to/sample-configs/topic-bindings/100-queues-with-1000-bindings-each.json.gz`
      in the rabbitmq-server directory.
* Part-way through the import, perform a rolling restart of the cluster
  with `make restart-cluster` in the rabbitmq-server directory.
* Examine a projection table affected by the definition import. Note the
  discrepancy between numbers of bindings in the `rabbit_khepri_bindings`
  table:

```
for i in 1 2 3; printf "rabbit-$i: "; ./sbin/rabbitmqctl -n rabbit-$i eval 'length(ets:tab2list(rabbit_khepri_bindings)).'; end
rabbit-1: 49003
rabbit-2: 49003
rabbit-3: 23370
```

RabbitMQ uses Khepri with this setup and uses the projections feature to
store some data - like bindings in this example - for fast access. The
rolling restart done by `make restart-cluster` causes RabbitMQ node
rabbit-3's Khepri store to fall behind its peers rabbit-1 and rabbit-2
until they are restarted. Then a new leader catches up rabbit-3 via a
snapshot.

The definition file in this example contains very many bindings which
should end up in the `rabbit_khepri_bindings` table. We see a
discrepancy in the number of bindings in the projection table between
the RabbitMQ nodes because rabbit-3 is caught up by a new leader (either
rabbit-1 or rabbit-2) via a snapshot installation. Before this change
this meant that all bindings which were inserted between rabbit-3's raft
index before and after the snapshot installation are not projected by
Khepri into the `rabbit_khepri_bindings` table. By restoring projections
after the snapshot is installed, all nodes reflect the same numbers of
bindings. Retrying the reproduction steps above after this change in
Khepri causes all `rabbit_khepri_bindings` tables to contain the same
data.
@the-mikedavis the-mikedavis force-pushed the md/projections/restore-on-snapshot-installed branch from 963bd5c to 28bd581 Compare June 26, 2024 18:04
Copy link
Member

@dumbbell dumbbell left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks!

@dumbbell dumbbell merged commit 2835f90 into main Jun 27, 2024
12 checks passed
@dumbbell dumbbell deleted the md/projections/restore-on-snapshot-installed branch June 27, 2024 15:35
@dumbbell dumbbell added this to the v0.15.0 milestone Jun 27, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants