Fix flaky half-finish importing cluster test #1532
base: unstable
Conversation
test "Half-finish importing" { | ||
# Now we half finish 'importing' node | ||
assert_equal {OK} [$nodeto(link) cluster setslot 609 node $nodeto(id)] | ||
assert_equal {OK} [$nodefrom(link) cluster setslot 609 node $nodeto(id)] |
Isn't the point of "Half-finish importing" that we should only finish the importing on the target node, not on the source node, and check that the cluster can handle this? Or what does "half-finish" mean here?
@hpatro @zuiderkwast I was not sure whether half-finish importing refers to the setslot calls themselves or not. Without this change, the key was lost in more than a few cases: the output of keys * was empty. I tried adding extra timeouts, retries, and wait_for_cluster_propagation, but none of that reliably fixed it. I can mark this PR as a draft and dig further behind the scenes.
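For reference, a retry of that kind could look roughly like the sketch below (hedged, not the exact code that was tried; it uses the suite's wait_for_condition helper and the key aga / value xyz from the output further down):

# Poll the importing node for the migrated key instead of asserting once.
# In practice this only narrowed the failure window; it did not remove the flakiness.
wait_for_condition 50 100 {
    [$nodeto(link) get aga] eq {xyz}
} else {
    fail "key 'aga' never showed up on the importing node"
}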
Thanks, yes, it would be good to understand better why the keys are lost. $nodeto(link) cluster setslot 609 node $nodeto(id) should make this node bump the epoch, and the keys have already been migrated to this node, so it seems strange that they are lost. Do you know what the cluster_fix line below does?
I dove a bit deeper into this.
"Do you know what the cluster_fix line below does?"
It seems to be running the valkey-cli --cluster fix command. Based on the documentation, it fixes the cluster so that keys get migrated according to which hash slots each node is authoritative for. It must have run successfully, though, or else the test would have failed with fail "Cluster could not settle with configuration".
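For context, a fix helper along those lines could look roughly like this sketch (hypothetical, not the suite's actual fix_cluster helper; it assumes valkey-cli is built at src/valkey-cli and that --cluster fix with --cluster-yes prints the usual slot-coverage summary):

# Repair slot coverage so that every slot ends up on one authoritative owner.
proc my_fix_cluster {addr} {
    set result [exec src/valkey-cli --cluster fix --cluster-yes $addr]
    # If the repair did not converge, abort the test right away.
    if {![string match "*All 16384 slots covered*" $result]} {
        fail "--cluster fix could not repair the slot layout for $addr"
    }
}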
I was checking this documentation: https://valkey.io/commands/cluster-setslot/
"The order of step 4 and 5 is important. The destination node is responsible for propagating the change to the rest of the cluster."
It looks like one of the nodes was not informed about the slot migration (which is exactly the state the test wants to assert on). With this half-finish importing we are at step 4 when we assert whether the key aga is present or not. When I replicated this locally, the request does indeed get redirected from the non-slot-owning node to the slot-owning node.
$ ./src/valkey-cli -c -h 127.0.0.1 -p 7000
127.0.0.1:7000> get aga
-> Redirected to slot [609] located at 127.0.0.1:6379
"xyz"
127.0.0.1:6379>
It looks like $cluster needs to be refreshed to know which node is the master for slot 609. I am not sure whether there is a way for client redirection to fail here; it's hard for me to replicate this manually with commands, and so far I have not been able to, even after moving the key a few times. The latest change, refreshing the cluster before we inform the nodes about the migrated key, seems to reliably fix the test.
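Roughly, the change amounts to something like this sketch (hedged; it assumes the Tcl cluster client exposes a refresh_nodes_map method and reuses the key aga / value xyz from the output above):

# Re-read the slot map so the cluster-aware client routes slot 609 to its
# new owner before we assert on the migrated key.
$cluster refresh_nodes_map
assert_equal {xyz} [$cluster get aga]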
"The order of step 4 and 5 is important. The destination node is responsible for propagating the change to the rest of the cluster."
This is interesting. I wonder what can happen if this is not followed. This test case is obviously not following this rule, so maybe the test case is invalid? Shall we just delete it?
@PingXie Do you remember what happens if you send SETSLOT NODE to the source node only? Do the other nodes gossip back the ownership of the slot to the old owner?
Also, for commandstats, could you reset it after the half-migration step completes and then capture it? It's a bit difficult to process the information when it covers everything since server start.
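For example, something along these lines (a sketch of the idea, not a concrete patch; it only uses the stock CONFIG RESETSTAT and INFO commandstats commands):

# Clear the per-command counters right after the half-migration step ...
$nodeto(link) config resetstat
$nodefrom(link) config resetstat
# ... run the assertions, then capture only what happened since the reset.
puts "Nodeto command stats: [$nodeto(link) info commandstats]"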
DEBUG HTSTATS 0 is empty when the test fails.
Output:
[Dictionary HT]
[Expires HT]
Command Stats after reset:
Nodeto command stats: # Commandstats
cmdstat_restore-asking:calls=1,usec=12,usec_per_call=12.00,rejected_calls=0,failed_calls=0
cmdstat_debug:calls=1,usec=3,usec_per_call=3.00,rejected_calls=0,failed_calls=0
cmdstat_keys:calls=3,usec=10,usec_per_call=3.33,rejected_calls=0,failed_calls=0
cmdstat_dbsize:calls=2,usec=0,usec_per_call=0.00,rejected_calls=0,failed_calls=0
cmdstat_get:calls=1,usec=1,usec_per_call=1.00,rejected_calls=0,failed_calls=0
cmdstat_asking:calls=1,usec=0,usec_per_call=0.00,rejected_calls=0,failed_calls=0
cmdstat_config|resetstat:calls=1,usec=36,usec_per_call=36.00,rejected_calls=0,failed_calls=0
cmdstat_cluster|slots:calls=1,usec=10,usec_per_call=10.00,rejected_calls=0,failed_calls=0
cmdstat_cluster|nodes:calls=4,usec=93,usec_per_call=23.25,rejected_calls=0,failed_calls=0
cmdstat_cluster|setslot:calls=2,usec=12,usec_per_call=6.00,rejected_calls=0,failed_calls=0
cmdstat_cluster|shards:calls=1,usec=11,usec_per_call=11.00,rejected_calls=0,failed_calls=0
cmdstat_cluster|countkeysinslot:calls=2,usec=0,usec_per_call=0.00,rejected_calls=0,failed_calls=0
cmdstat_cluster|info:calls=1,usec=14,usec_per_call=14.00,rejected_calls=0,failed_calls=0
I don't see a delete call :(
While experimenting with the test, this seemed intriguing. These lines are executed after fix_cluster $nodefrom(addr):
puts $nodeto(addr)
puts $nodefrom(addr)
puts [$nodeto(link) get aga]
I got this response
127.0.0.1:21143
127.0.0.1:21144
[exception]: Executing test client: MOVED 609 127.0.0.1:21144.
MOVED 609 127.0.0.1:21144
But HTSTATS for both nodes isn't showing the key.
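For what it's worth, a way to log that redirect instead of letting the test client abort would be to wrap the direct-link GET in catch (hypothetical, not part of this PR):

# A plain (non -c) link does not follow redirects, so a MOVED reply surfaces
# as a Tcl error; catch it and print it rather than killing the test client.
if {[catch {$nodeto(link) get aga} res]} {
    puts "GET aga on $nodeto(addr) errored: $res"   ;# e.g. MOVED 609 127.0.0.1:21144
} else {
    puts "GET aga on $nodeto(addr) returned: $res"
}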
I tried running it on macOS and Linux, and it never fails for me. Do we need to build with any particular flags to reproduce it?
No! Just make, and I run the test in a loop a few times. I am on an Apple M3 chip. Are you also checking aarch64/ARM? (Not sure if this makes a difference.)
With this, the source node is also informed that the slot ownership has changed. The test passed 100 iterations, where it was flakily failing before.
Fixes #1526.