ZOOKEEPER-2789: Reassign `ZXID` for solving 32bit overflow problem #262

asdf2014 · 2017-05-23T02:04:39Z

If it is 1k/s ops, then as long as $2^{32} / (86400 * 1000) \approx 49.7$ days ZXID will exhausted. But, if we reassign the ZXID into 16bit for epoch and 48bit for counter, then the problem will not occur until after $Math.min(2^{16} / 365, 2^{48} / (86400 * 1000 * 365)) \approx Math.min(179.6, 8925.5) = 179.6$ years.

However, i thought the ZXID is long type, reading and writing the long type (and double type the same) in JVM, is divided into high 32bit and low 32bit part of the operation, and because the ZXID variable is not modified with volatile and is not boxed for the corresponding reference type (Long / Double), so it belongs to non-atomic operation. Thus, if the lower 16 bits of the upper 32 bits are divided into the low 32 bits of the entire long and become 48 bits low, there may be a concurrent problem.

asdf2014 · 2017-05-23T03:00:48Z

Thinking about some abnormal situations, maybe 24 bit for epoch and 40 bit for counter is more better choice: $Math.min(2^{24} / (24 * 365), 2^{40} / (86400 * 1000 * 365)) \approx Math.min(1915.2, 34.9) = 34.9$ years.

asdf2014 · 2017-05-23T09:27:55Z

Seems like all test cases passed, but some problems happened in Zookeeper_operations :: testOperationsAndDisconnectConcurrently1:

     [exec] /home/jenkins/jenkins-slave/workspace/PreCommit-ZOOKEEPER-github-pr-build/build.xml:1298: The following error occurred while executing this line:
     [exec] /home/jenkins/jenkins-slave/workspace/PreCommit-ZOOKEEPER-github-pr-build/build.xml:1308: exec returned: 2
     [exec] 
     [exec] Total time: 15 minutes 45 seconds
     [exec] /bin/kill -9 16911 
     [exec]      [exec] Zookeeper_operations::testAsyncWatcher1 : assertion : elapsed 1044
     [exec]      [exec] Zookeeper_operations::testAsyncGetOperation : elapsed 4 : OK
     [exec]      [exec] Zookeeper_operations::testOperationsAndDisconnectConcurrently1FAIL: zktest-mt
     [exec]      [exec] ==========================================
     [exec]      [exec] 1 of 2 tests failed
     [exec]      [exec] Please report to [email protected]
     [exec]      [exec] ==========================================
     [exec]      [exec] make[1]: Leaving directory `/home/jenkins/jenkins-slave/workspace/PreCommit-ZOOKEEPER-github-pr-build/build/test/test-cppunit`
     [exec]      [exec] /bin/bash: line 5: 15116 Segmentation fault      ZKROOT=/home/jenkins/jenkins-slave/workspace/PreCommit-ZOOKEEPER-github-pr-build/src/c/../.. CLASSPATH=$CLASSPATH:$CLOVER_HOME/lib/clover.jar ${dir}$tst
     [exec]      [exec] make[1]: *** [check-TESTS] Error 1
     [exec]      [exec] make: *** [check-am] Error 2
     [exec] 
     [exec]     Running contrib tests.
     [exec] ======================================================================
     [exec] 
     [exec] /home/jenkins/tools/ant/apache-ant-1.9.9/bin/ant -DZookeeperPatchProcess= -Dtest.junit.output.format=xml -Dtest.output=yes test-contrib
     [exec] Buildfile: /home/jenkins/jenkins-slave/workspace/PreCommit-ZOOKEEPER-github-pr-build/build.xml
     [exec] 
     [exec] test-contrib:
     [exec] 
     [exec] BUILD SUCCESSFUL
     [exec] Total time: 0 seconds

nerdyyatrice · 2017-05-24T05:50:37Z

src/contrib/loggraph/src/java/org/apache/zookeeper/graph/JsonGenerator.java

@@ -121,8 +121,8 @@ public JsonGenerator(LogIterator iter) {
 		} else if ((m = newElectionP.matcher(e.getEntry())).find()) {
 		    Iterator<Integer> iterator = servers.iterator();
 		    long zxid = Long.valueOf(m.group(2));
-		    int count = (int)zxid;// & 0xFFFFFFFFL;
-		    int epoch = (int)Long.rotateRight(zxid, 32);// >> 32;
+		    long count = zxid & 0xffffffffffL;


How can this be all over the code base instead of a function somewhere in a util file

Yeah, you are right!

nerdyyatrice · 2017-05-24T05:52:30Z

src/contrib/loggraph/src/java/org/apache/zookeeper/graph/JsonGenerator.java

-		    int count = (int)zxid;// & 0xFFFFFFFFL;
-		    int epoch = (int)Long.rotateRight(zxid, 32);// >> 32;
+		    long count = zxid & 0xffffffffffL;
+		    int epoch = (int)Long.rotateRight(zxid, 40);// >> 40;


same, 40 shouldn't fly around in the code base like this

Already unify all code those processing ZXID into using ZixdUtils.

asdf2014 · 2017-05-24T07:28:45Z

Due to this jvm bug, JDK7 cannot recognition import static... I will use fully qualified name replace of it.

    [javac] /home/jenkins/jenkins-slave/workspace/PreCommit-ZOOKEEPER-github-pr-build/src/contrib/loggraph/src/java/org/apache/zookeeper/graph/JsonGenerator.java:129: error: cannot find symbol
    [javac] 		    long epoch = getEpochFromZxid(zxid);

maoling · 2017-05-24T11:38:38Z

1.change the Zxid from high32_low32 to high24_low40 can avoid concurrent problem you hava referred to?
2.because the frequency of leading epoch change is much less than the change of counter,so you choose high24_low40?
3.when Zxid has exhausted,it will start from 0 ? or force to a new leader election ? or something else?

asdf2014 · 2017-05-27T01:29:51Z

Hi, @maoling. Thanks for your discussion. Maybe due to my description is problematic, so make you confused.

I am worry about if the lower 8 bits of the upper 32 bits are divided into the low 32 bits of the entire long and become 40 bits low, there may be a concurrent problem. Actually, it shouldn't be worried, all operation about ZXID is bit operation rather than = assignment operation. So, it cann't be a concurrent problem in JVM level.
Yep, it is. Especially, if it is 1k/s ops, then as long as $2^{32} / (86400 * 1000) \approx 49.7$ days ZXID will exhausted. And more terrible situation will make the re-election process comes early. At the same time, the "re-election" process could take half a minute. And it will be cannot acceptable.
As so far, it will throw a XidRolloverException to force re-election process and reset the counter to zero.

maoling · 2017-06-05T10:47:13Z

Hi, @asdf2014 .Thanks for your explanation! But I still have some confusions about the question one:
look at code like this :

int epoch = (int)Long.rotateRight(zxid, 32);// >> 32;
long count = zxid & 0xffffffffffL;

it all depends on that zxid can not be altered(no write operation after zxid has generated at the first time) in the multithread situation,otherwise epoch and count isn't idempotent.should zxid be decorated by final?

asdf2014 · 2017-06-05T11:08:44Z

@maoling You are welcome. Already changed it into following code. Then, i think it can still guarantee its idempotency.

long count = ZxidUtils.getCounterFromZxid(zxid);
long epoch = ZxidUtils.getEpochFromZxid(zxid);

yunfan123 · 2017-06-15T08:14:35Z

Hi, I think 48 bits low is better for large throughput zk cluster.
Another benefits is when use 48 bits low we assuming the epoch low than (1<<16), so we can 16 bits high to judge whether it is old version or new version.
So use 48 bits low we can make the upgrade progress smoothly

asdf2014 · 2017-06-15T08:35:15Z

Hi, @yunfan123 . Thank you for your suggestion. As you said in the opinion, so that it can guarantee a smooth upgrade. However, if the 16-bit epoch overflow rather than the counter overflow, it will make Zookeeper cannot keep provide services by re-election anymore. So, i thought we should keep enough space for epoch. What you think?

yunfan123 · 2017-06-16T07:43:29Z

Hi, @asdf2014
In most cases, I don't think the epoch can overflow 16-bit.
In general, zookeeper leader election is very rare, and it may take several seconds even several minutes to finish leader election.
And zookeeper is totally unavailable during leader election.
If the zookeeper that you use can overflow 16-bits, it turns out the zookeeper you used is totally unreliable.
Finally, compatible with old version is really important.
If not compatible with old versions, I must restart all my zookeeper nodes.
All of nodes need reload snapshot and log from disk, it will cost a lot of time.
I believe this upgrade process is unacceptable by most zookeeper users.

JarenGlover · 2017-12-07T17:49:46Z

@yunfan123 @asdf2014 i have seen this issue a twice over a month period.

is there anything one can do to prevent this from happening? maybe allowing for leader restarts at "off peak hours" weekly?(yuck i know)

it sound like if we can move forward with this if we move to 48 bits low correct?

note version: 3.4.10

asdf2014 · 2017-12-08T01:32:29Z

@JarenGlover It's a good idea, but not the best solution. Still we can use the restart operation to solve this problem without any changes to Zookeeper for now. BTW, you are welcome to get more details of this problem in my blog yuzhouwan.com :-)

phunt · 2017-12-12T18:26:16Z

Are you seeing this behavior with ZOOKEEPER-1277 applied? If so it's a bug in that change, because after that's applied the leader should shutdown as we approach the rollover.

It would be nice to address this by changing the zxid semantics, but I don't believe that's a great idea. Instead I would rather see us address any shortcoming in my original fix (1277)

fwiw - what I have seen people do in this situation is to monitor the zxid and when it gets close (say within 10%) of the rollover they have an automated script which restarts the leader, which forces a re-election. However 1277 should be doing this for you.

Given you are seeing this issue perhaps you can help with resolving any bugs in 1277? thanks!

asdf2014 · 2017-12-13T02:55:27Z

Hi, @phunt . Thank you for your comment. Yeah, we discuss here is due to the ZOOKEEPER-1277 solution is not very well. It causes so many times leader restart. And the restart process even could spend few minutes, which is some situations cannot tolerate it.

phunt · 2017-12-14T05:27:17Z

Ok, thanks for the update. fwiw restarting taking a few minutes is going to be an issue regardless, no? Any regular type issue, such as a temporary network outage, could cause the quorum to be lost and leader election triggered.

asdf2014 · 2017-12-14T06:27:11Z

Hi, @phunt . Indeed, the FastLeaderElection algorithm is very efficient. Most of the leader election situation would finished in hundreds milliseconds. However, some real-time stream frameworks such as Apache Kafka and Apache Storm etc, could make lots of pressures into Zookeeper cluster when they carry on too many business data or processing logic. So maybe, the leader election will be triggered very frequently and the process becomes time consuming.

breed · 2017-12-14T16:13:50Z

i think it would be much better to extend ZOOKEEPER-1277 to more transparently do the rollover without a full leader election.

the main issue i have with shortening the epoch size is that once the epoch hits the maximum value the ensemble is stuck, nothing can proceed, so we really need to keep the epoch size big enough that we would never hit that condition. i don't think a 16-bit epoch satisfies that requirement.

asdf2014 · 2017-12-15T01:26:09Z

Hi, @breed . Thanks for your comment. You are right, we should keep the enough epoch value to avoid meet the epoch overflow. So i offered a better solution is 24-bit epoch in second comment. Even if the frequency of leader election is once by every single hours, we will not experience the epoch overflow until 1915.2 years later.

maoling · 2019-03-06T11:13:18Z

@asdf2014
look at the latest patch,any solution for backward compatibility ?

for backward compatibility, the version info of zk server who generates this zxid may be carried in the zxid?
the design of current zxid benefits from generating a global unique incremental id, however also subject to this.
look at the desigin of etcd for your reference and inspiration,her main-revision is 64 bits which is not mixed up with any info of epoch/term.

// A revision indicates modification of the key-value space.
// The set of changes that share same main revision changes the key-value space atomically.
type revision struct {
	// main is the main revision of a set of changes that happen atomically.
	main int64
	 
	// sub is the the sub revision of a change in a set of changes that happen
	// atomically. Each change has different increasing sub revision in that
	// set.
	sub int64
}

asdf2014 · 2019-03-07T01:54:26Z

Hi, @maoling. Thank you for your comments. As you said, if we cannot carry the version of server, it will be too difficult to maintain backward compatibility. The reversion in Etcd is to implement the MVCC feature, which seems to be equivalent to the Zookeeper counter, not the entire ZXID. If we consider that design, then maybe we should use more 64bits, convert ZXID from epoch(32bits)+counter(32bits) to epoch(64bits)+main_counter(32bits)+sub_counter(32bits). What do you think?

happenking · 2021-06-18T15:32:46Z

I think it is a serious problem. Too frequent re-elections make things complex and make stability indeterminacy.

happenking · 2021-06-27T03:36:52Z

Hi, @asdf2014. How about zxid skipping 0x0 value of low 32 bit, and epoch ++ when zxid is rolling over, instead of re-election.

happenking

@asdf2014 Hi,
Maybe ByteBuffer for epoch is better to be Long.
E.g. when epoch is written into Leader.ACKEPOCH message and read out

But epoch is unlikely to grow to more than 32 bits.

                byte ver[] = new byte[4];
                ByteBuffer.wrap(ver).putInt(0x10000);
                QuorumPacket newEpochPacket = new QuorumPacket(Leader.LEADERINFO, newLeaderZxid, ver, null);
                oa.writeRecord(newEpochPacket, "packet");
                bufferedOutput.flush();
                QuorumPacket ackEpochPacket = new QuorumPacket();
                ia.readRecord(ackEpochPacket, "packet");
                if (ackEpochPacket.getType() != Leader.ACKEPOCH) {
                    LOG.error(ackEpochPacket.toString()
                            + " is not ACKEPOCH");
                    return;
				}
                ByteBuffer bbepoch = ByteBuffer.wrap(ackEpochPacket.getData());

                // ByteBuffer for epoch
                ss = new StateSummary(bbepoch.getInt(), ackEpochPacket.getZxid());

                learnerMaster.waitForEpochAck(this.getSid(), ss);

wg1026688210 · 2023-01-05T06:15:48Z

@asdf2014 @phunt Is there a solution in the rolling over of counter bit now , we have this problem at version 3.4.13

alexfernandez · 2023-10-23T09:57:33Z

Hi, any plans to merge this? We have seen frequent rollouts due to 32-bit zxid.

asdf2014 · 2023-10-24T03:07:21Z

Hi @alexfernandez @wg1026688210 @happenking , thanks for your comments, I would like to merge this which could be a big step forward, maybe Apache Kafka wouldn't need to create KRaft, and ClickHouse wouldn't need to establish ClickHouse Keeper as well. Recently, Apache Druid also began supporting Etcd as an optional choice. 😅

alexfernandez · 2023-10-25T15:55:02Z

Hi @asdf2014 , please do! It would be a great improvement. Do you need any help?

barebu · 2024-02-03T07:26:35Z

Hi! @asdf2014 please tell me when you are going to do this?

zichen-gan · 2024-05-16T03:20:31Z

@happenking
Since I have not yet received a response from the author and have not been granted co-editor permissions, I have created a new PR(#2164) that supports rolling updates based on that PR. Regarding the frequent zxid causing restarts, the proposed solution is to adjust the zxid, prevent re-election after zxid overflow, or just proceed with this current PR, what do you suggest?

nerdyyatrice reviewed May 24, 2017

View reviewed changes

ZOOKEEPER-2789: Reassign ZXID for solving 32bit overflow problem

a78d326

happenking reviewed Aug 15, 2021

View reviewed changes

zichen-gan mentioned this pull request May 16, 2024

ZOOKEEPER-2789 Reassign ZXID for solving 32bit overflow problem #2164

Open

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

ZOOKEEPER-2789: Reassign `ZXID` for solving 32bit overflow problem #262

ZOOKEEPER-2789: Reassign `ZXID` for solving 32bit overflow problem #262

asdf2014 commented May 23, 2017 •

edited

Loading

asdf2014 commented May 23, 2017 •

edited

Loading

asdf2014 commented May 23, 2017 •

edited

Loading

nerdyyatrice May 24, 2017

asdf2014 May 24, 2017

nerdyyatrice May 24, 2017

asdf2014 May 24, 2017

asdf2014 commented May 24, 2017 •

edited

Loading

maoling commented May 24, 2017 •

edited

Loading

asdf2014 commented May 27, 2017

maoling commented Jun 5, 2017 •

edited

Loading

asdf2014 commented Jun 5, 2017

yunfan123 commented Jun 15, 2017

asdf2014 commented Jun 15, 2017

yunfan123 commented Jun 16, 2017

JarenGlover commented Dec 7, 2017

asdf2014 commented Dec 8, 2017 •

edited

Loading

phunt commented Dec 12, 2017

asdf2014 commented Dec 13, 2017

phunt commented Dec 14, 2017

asdf2014 commented Dec 14, 2017 •

edited

Loading

breed commented Dec 14, 2017

asdf2014 commented Dec 15, 2017 •

edited

Loading

maoling commented Mar 6, 2019

asdf2014 commented Mar 7, 2019

happenking commented Jun 18, 2021 •

edited

Loading

happenking commented Jun 27, 2021 •

edited

Loading

happenking left a comment •

edited

Loading

wg1026688210 commented Jan 5, 2023

alexfernandez commented Oct 23, 2023

asdf2014 commented Oct 24, 2023 •

edited

Loading

alexfernandez commented Oct 25, 2023 •

edited

Loading

barebu commented Feb 3, 2024

zichen-gan commented May 16, 2024

ZOOKEEPER-2789: Reassign ZXID for solving 32bit overflow problem #262

Are you sure you want to change the base?

ZOOKEEPER-2789: Reassign ZXID for solving 32bit overflow problem #262

Conversation

asdf2014 commented May 23, 2017 • edited Loading

asdf2014 commented May 23, 2017 • edited Loading

asdf2014 commented May 23, 2017 • edited Loading

nerdyyatrice May 24, 2017

Choose a reason for hiding this comment

asdf2014 May 24, 2017

Choose a reason for hiding this comment

nerdyyatrice May 24, 2017

Choose a reason for hiding this comment

asdf2014 May 24, 2017

Choose a reason for hiding this comment

asdf2014 commented May 24, 2017 • edited Loading

maoling commented May 24, 2017 • edited Loading

asdf2014 commented May 27, 2017

maoling commented Jun 5, 2017 • edited Loading

asdf2014 commented Jun 5, 2017

yunfan123 commented Jun 15, 2017

asdf2014 commented Jun 15, 2017

yunfan123 commented Jun 16, 2017

JarenGlover commented Dec 7, 2017

asdf2014 commented Dec 8, 2017 • edited Loading

phunt commented Dec 12, 2017

asdf2014 commented Dec 13, 2017

phunt commented Dec 14, 2017

asdf2014 commented Dec 14, 2017 • edited Loading

breed commented Dec 14, 2017

asdf2014 commented Dec 15, 2017 • edited Loading

maoling commented Mar 6, 2019

asdf2014 commented Mar 7, 2019

happenking commented Jun 18, 2021 • edited Loading

happenking commented Jun 27, 2021 • edited Loading

happenking left a comment • edited Loading

Choose a reason for hiding this comment

wg1026688210 commented Jan 5, 2023

alexfernandez commented Oct 23, 2023

asdf2014 commented Oct 24, 2023 • edited Loading

alexfernandez commented Oct 25, 2023 • edited Loading

barebu commented Feb 3, 2024

zichen-gan commented May 16, 2024

ZOOKEEPER-2789: Reassign `ZXID` for solving 32bit overflow problem #262

ZOOKEEPER-2789: Reassign `ZXID` for solving 32bit overflow problem #262

asdf2014 commented May 23, 2017 •

edited

Loading

asdf2014 commented May 23, 2017 •

edited

Loading

asdf2014 commented May 23, 2017 •

edited

Loading

asdf2014 commented May 24, 2017 •

edited

Loading

maoling commented May 24, 2017 •

edited

Loading

maoling commented Jun 5, 2017 •

edited

Loading

asdf2014 commented Dec 8, 2017 •

edited

Loading

asdf2014 commented Dec 14, 2017 •

edited

Loading

asdf2014 commented Dec 15, 2017 •

edited

Loading

happenking commented Jun 18, 2021 •

edited

Loading

happenking commented Jun 27, 2021 •

edited

Loading

happenking left a comment •

edited

Loading

asdf2014 commented Oct 24, 2023 •

edited

Loading

alexfernandez commented Oct 25, 2023 •

edited

Loading