Original leader gets reelected after node restart #472

luos · 2024-09-18T14:58:37Z

Hi,

I am wondering whether this is intentional, a consequence of following the protocol strictly, or a bug.

Sometimes we notice in RabbitMQ that a node which was restarted starts hosting leaders again. This is quite unexpected, since new leaders are elected when the node shuts down. I am testing this with RabbitMQ 3.13.7 but have noticed it with earlier versions as well.

Reproduction:

Start a three-node RabbitMQ 3.13.7 cluster.
Create 1000 queues.
Rebalance (docker compose exec rmq1 bash -c "rabbitmq-queues rebalance all").
Verify that the queue leaders are evenly distributed:

docker compose exec rmq2 bash -c "rabbitmqctl list_queues leader --silent" | sort | uniq -c
 333 [email protected]_cluster.local
 334 [email protected]_cluster.local
 333 [email protected]_cluster.local

Stop node rmq1 (docker compose stop rmq1). The same happens with stop_app.
Verify that new leaders are elected:

 docker compose exec rmq2 bash -c "rabbitmqctl list_queues leader --silent" | sort | uniq -c
 553 [email protected]_cluster.local
 447 [email protected]_cluster.local

Restart node rmq1 (docker compose start rmq1).
See that some queue leaders are back on rmq1:

docker compose exec rmq2 bash -c "rabbitmqctl list_queues leader --silent" | sort | uniq -c
 150 [email protected]_cluster.local
 498 [email protected]_cluster.local
 352 [email protected]_cluster.local
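
The `sort | uniq -c` tally used in the steps above can also be done in a small script, which makes it easier to compare the leader distribution before and after the restart. This is a minimal Python sketch, assuming the command prints one leader node name per line. The helper names `tally_leaders` and `leader_shift` and the node names in the sample are hypothetical.

```python
from collections import Counter


def tally_leaders(output: str) -> Counter:
    """Count how many queues each node leads.

    `output` is assumed to hold one leader node name per line, as printed by
    `rabbitmqctl list_queues leader --silent`.
    """
    names = (line.strip() for line in output.splitlines())
    return Counter(name for name in names if name)


def leader_shift(before: Counter, after: Counter) -> dict:
    """Return the per-node change in leader count between two tallies."""
    nodes = sorted(set(before) | set(after))
    return {node: after[node] - before[node] for node in nodes}


if __name__ == "__main__":
    # Hypothetical node names; substitute the names your cluster reports.
    before = tally_leaders("node-b\nnode-b\nnode-c\n")
    after = tally_leaders("node-a\nnode-b\nnode-c\n")
    print(leader_shift(before, after))
```

A positive value for the restarted node in the result means leaders moved back to it.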

The difference seems to be the following: on startup rmq1 went into the follower state and increased its term by one, so it is now on the same term as the new leader. Because there was no traffic, is_candidate_log_up_to_date returns true (Idx >= LastIdx), and the leader grants the vote to the restarted node.
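
The comparison behind that vote can be sketched with the standard Raft rule: the candidate's log counts as up to date if its last term is higher, or if the last terms are equal and its last index is at least the voter's. The Python below is an illustration of that rule only, not Ra's Erlang code. `LogPosition` and `candidate_log_up_to_date` are hypothetical names, and the (7, 2) case is an assumed example of a voter that received one more entry.

```python
from typing import NamedTuple


class LogPosition(NamedTuple):
    index: int  # index of the last log entry
    term: int   # term in which that entry was written


def candidate_log_up_to_date(candidate: LogPosition, voter: LogPosition) -> bool:
    """Raft's voting rule: a higher last term wins; equal terms compare indexes."""
    if candidate.term != voter.term:
        return candidate.term > voter.term
    return candidate.index >= voter.index


if __name__ == "__main__":
    # Idle queue: both replicas end at index 6, term 2, as in the log below.
    print(candidate_log_up_to_date(LogPosition(6, 2), LogPosition(6, 2)))  # True
    # Hypothetical: the voter appended one more entry while the node was down.
    print(candidate_log_up_to_date(LogPosition(6, 2), LogPosition(7, 2)))  # False
```

Under this rule a single extra entry on the surviving replicas is enough for them to refuse the restarted node's vote request.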

It seems odd to me that this happens at all, and if it is expected, why does it not happen to all queues?

If the queue process receives even one message, no leader reversal happens.

Thanks for your answers.

Here is the log for one such queue. One other difference from the other queues seems to be that no "Leader monitor down with shutdown" message is visible for this queue, but I am not yet sure whether that is a problem with the logs or with the queue. (I just noticed that it was actually a follower before the restart, so that's probably why.)


rmq1.simple_cluster.local  | 2024-09-18 11:54:36.265246+00:00 [debug] <0.2010.0> queue 'quorum-queue-13' in vhost '/': terminating with shutdown in state follower
rmq1.simple_cluster.local  | 2024-09-18 11:54:36.270570+00:00 [debug] <0.2010.0> queue 'quorum-queue-13' in vhost '/': terminating with reason 'shutdown'



rmq1.simple_cluster.local  | 2024-09-18 11:55:21.755638+00:00 [debug] <0.951.0> queue 'quorum-queue-13' in vhost '/': ra_log:init recovered last_index_term {6,2} first index 0
rmq1.simple_cluster.local  | 2024-09-18 11:55:21.756203+00:00 [debug] <0.951.0> queue 'quorum-queue-13' in vhost '/': post_init -> recover in term: 2 machine version: 3
rmq1.simple_cluster.local  | 2024-09-18 11:55:21.756311+00:00 [debug] <0.951.0> queue 'quorum-queue-13' in vhost '/': recovering state machine version 0:3 from index 0 to 6
rmq1.simple_cluster.local  | 2024-09-18 11:55:21.756864+00:00 [debug] <0.951.0> queue 'quorum-queue-13' in vhost '/': applying new machine version 3 current 0
rmq1.simple_cluster.local  | 2024-09-18 11:55:21.757027+00:00 [debug] <0.951.0> queue 'quorum-queue-13' in vhost '/': enabling ra cluster changes in 2, index 6
rmq1.simple_cluster.local  | 2024-09-18 11:55:21.757380+00:00 [debug] <0.951.0> queue 'quorum-queue-13' in vhost '/': recovery of state machine version 3:3 from index 0 to 6 took 1ms
rmq1.simple_cluster.local  | 2024-09-18 11:55:21.757445+00:00 [debug] <0.951.0> queue 'quorum-queue-13' in vhost '/': scanning for cluster changes 7:6
rmq1.simple_cluster.local  | 2024-09-18 11:55:21.757673+00:00 [debug] <0.951.0> queue 'quorum-queue-13' in vhost '/': recover -> recovered in term: 2 machine version: 3
rmq1.simple_cluster.local  | 2024-09-18 11:55:21.757798+00:00 [debug] <0.951.0> queue 'quorum-queue-13' in vhost '/': recovered -> follower in term: 2 machine version: 3
rmq1.simple_cluster.local  | 2024-09-18 11:55:21.757912+00:00 [debug] <0.951.0> queue 'quorum-queue-13' in vhost '/': is not new, setting election timeout.
rmq1.simple_cluster.local  | 2024-09-18 11:55:22.101938+00:00 [debug] <0.951.0> queue 'quorum-queue-13' in vhost '/': pre_vote election called for in term 2
rmq1.simple_cluster.local  | 2024-09-18 11:55:22.113779+00:00 [debug] <0.951.0> queue 'quorum-queue-13' in vhost '/': follower -> pre_vote in term: 2 machine version: 3
rmq2.simple_cluster.local  | 2024-09-18 11:55:22.113976+00:00 [debug] <0.1583.0> queue 'quorum-queue-13' in vhost '/': granting pre-vote for {'%2F_quorum-queue-13','[email protected]_cluster.local'} machine version (their:ours:effective) 3:3:3 with last indexterm {6,2} for term 2 previous term 2
rmq1.simple_cluster.local  | 2024-09-18 11:55:22.113840+00:00 [debug] <0.951.0> queue 'quorum-queue-13' in vhost '/': pre_vote granted #Ref<0.2203666297.2324955138.60231> for term 2 votes 1
rmq1.simple_cluster.local  | 2024-09-18 11:55:22.114730+00:00 [debug] <0.951.0> queue 'quorum-queue-13' in vhost '/': pre_vote granted #Ref<0.2203666297.2324955138.60231> for term 2 votes 2
rmq1.simple_cluster.local  | 2024-09-18 11:55:22.114969+00:00 [debug] <0.951.0> queue 'quorum-queue-13' in vhost '/': election called for in term 3
rmq3.simple_cluster.local  | 2024-09-18 11:55:22.129895+00:00 [info] <0.1407.0> queue 'quorum-queue-13' in vhost '/': leader saw request_vote_rpc from {'%2F_quorum-queue-13','[email protected]_cluster.local'} for term 3 abdicates term: 2!
rmq3.simple_cluster.local  | 2024-09-18 11:55:22.134775+00:00 [notice] <0.1407.0> queue 'quorum-queue-13' in vhost '/': leader -> follower in term: 3 machine version: 3
rmq2.simple_cluster.local  | 2024-09-18 11:55:22.134956+00:00 [info] <0.1583.0> queue 'quorum-queue-13' in vhost '/': granting vote for {'%2F_quorum-queue-13','[email protected]_cluster.local'} with last indexterm {6,2} for term 3 previous term was 2
rmq3.simple_cluster.local  | 2024-09-18 11:55:22.134931+00:00 [debug] <0.1407.0> queue 'quorum-queue-13' in vhost '/': is not new, setting election timeout.
rmq3.simple_cluster.local  | 2024-09-18 11:55:22.135349+00:00 [info] <0.1407.0> queue 'quorum-queue-13' in vhost '/': granting vote for {'%2F_quorum-queue-13','[email protected]_cluster.local'} with last indexterm {6,2} for term 3 previous term was 3
rmq1.simple_cluster.local  | 2024-09-18 11:55:22.129463+00:00 [debug] <0.951.0> queue 'quorum-queue-13' in vhost '/': pre_vote -> candidate in term: 3 machine version: 3
rmq1.simple_cluster.local  | 2024-09-18 11:55:22.129521+00:00 [debug] <0.951.0> queue 'quorum-queue-13' in vhost '/': vote granted for term 3 votes 1
rmq3.simple_cluster.local  | 2024-09-18 11:55:22.151333+00:00 [info] <0.1407.0> queue 'quorum-queue-13' in vhost '/': detected a new leader {'%2F_quorum-queue-13','[email protected]_cluster.local'} in term 3
rmq2.simple_cluster.local  | 2024-09-18 11:55:22.151601+00:00 [info] <0.1583.0> queue 'quorum-queue-13' in vhost '/': detected a new leader {'%2F_quorum-queue-13','[email protected]_cluster.local'} in term 3
rmq1.simple_cluster.local  | 2024-09-18 11:55:22.141515+00:00 [debug] <0.951.0> queue 'quorum-queue-13' in vhost '/': vote granted for term 3 votes 2
rmq1.simple_cluster.local  | 2024-09-18 11:55:22.152438+00:00 [notice] <0.951.0> queue 'quorum-queue-13' in vhost '/': candidate -> leader in term: 3 machine version: 3
rmq2.simple_cluster.local  | 2024-09-18 11:55:22.166349+00:00 [debug] <0.1583.0> queue 'quorum-queue-13' in vhost '/': enabling ra cluster changes in 3, index 7
rmq3.simple_cluster.local  | 2024-09-18 11:55:22.166342+00:00 [debug] <0.1407.0> queue 'quorum-queue-13' in vhost '/': enabling ra cluster changes in 3, index 7
rmq1.simple_cluster.local  | 2024-09-18 11:55:22.165845+00:00 [debug] <0.951.0> queue 'quorum-queue-13' in vhost '/': enabling ra cluster changes in 3, index 7

Answered by michaelklishin

michaelklishin · 2024-09-18T18:17:50Z

Maintainer

If there were no Raft log updates (e.g. no messages published, or consumed and acknowledged) between the moment you stop node A and the moment it comes back, both replicas will have the same log state, and the primary difference should be the election term.

Raft's leader election voting (candidate selection) has a random component to it and can fail (end up in a split vote) and retry. When an older leader shows up in the process, that can produce the behavior you are observing.

The randomness in candidate selection during a vote probably also explains why this happens to some queues but not all.

Most importantly: the data in those QQs should still be safe.
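
To illustrate the random component: in Raft as originally described, each node waits a randomized election timeout before standing as a candidate (the Raft paper suggests 150 to 300 ms), so which node moves first differs from one Raft group to the next. The Python sketch below simulates that race once per queue. It is a toy model under those assumptions: the node names and function names are hypothetical, and it does not model Ra's actual timers, pre-vote phase, or failure detection.

```python
import random
from collections import Counter

NODES = ("node-a", "node-b", "node-c")  # hypothetical node names


def first_to_time_out(rng: random.Random, low_ms: int = 150, high_ms: int = 300) -> str:
    """Give every node a random election timeout and return the one that fires first."""
    timeouts = {node: rng.uniform(low_ms, high_ms) for node in NODES}
    return min(timeouts, key=timeouts.get)


def simulate(groups: int, seed: int = 0) -> Counter:
    """Run one independent timeout race per Raft group (one group per queue)."""
    rng = random.Random(seed)
    return Counter(first_to_time_out(rng) for _ in range(groups))


if __name__ == "__main__":
    # Each queue is its own Raft group, so each gets its own race.
    print(simulate(1000))
```

Because every queue runs its own race, a given node moves first for only a fraction of the queues, which is consistent with only some leaders returning to the restarted node.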

luos · 2024-09-19T08:57:55Z

Author

Thank you for the answer, makes sense.

I was thinking it's OK to have this, but was wondering if maybe I am missing something.
