ZOOKEEPER-2789: Reassign ZXID for solving 32bit overflow problem #262
Thinking about some abnormal situations, maybe 24 bit for |
Seems like all test cases passed, but some problems happened in [exec] /home/jenkins/jenkins-slave/workspace/PreCommit-ZOOKEEPER-github-pr-build/build.xml:1298: The following error occurred while executing this line:
[exec] /home/jenkins/jenkins-slave/workspace/PreCommit-ZOOKEEPER-github-pr-build/build.xml:1308: exec returned: 2
[exec]
[exec] Total time: 15 minutes 45 seconds
[exec] /bin/kill -9 16911
[exec] [exec] Zookeeper_operations::testAsyncWatcher1 : assertion : elapsed 1044
[exec] [exec] Zookeeper_operations::testAsyncGetOperation : elapsed 4 : OK
[exec] [exec] Zookeeper_operations::testOperationsAndDisconnectConcurrently1FAIL: zktest-mt
[exec] [exec] ==========================================
[exec] [exec] 1 of 2 tests failed
[exec] [exec] Please report to [email protected]
[exec] [exec] ==========================================
[exec] [exec] make[1]: Leaving directory `/home/jenkins/jenkins-slave/workspace/PreCommit-ZOOKEEPER-github-pr-build/build/test/test-cppunit`
[exec] [exec] /bin/bash: line 5: 15116 Segmentation fault ZKROOT=/home/jenkins/jenkins-slave/workspace/PreCommit-ZOOKEEPER-github-pr-build/src/c/../.. CLASSPATH=$CLASSPATH:$CLOVER_HOME/lib/clover.jar ${dir}$tst
[exec] [exec] make[1]: *** [check-TESTS] Error 1
[exec] [exec] make: *** [check-am] Error 2
[exec]
[exec] Running contrib tests.
[exec] ======================================================================
[exec]
[exec] /home/jenkins/tools/ant/apache-ant-1.9.9/bin/ant -DZookeeperPatchProcess= -Dtest.junit.output.format=xml -Dtest.output=yes test-contrib
[exec] Buildfile: /home/jenkins/jenkins-slave/workspace/PreCommit-ZOOKEEPER-github-pr-build/build.xml
[exec]
[exec] test-contrib:
[exec]
[exec] BUILD SUCCESSFUL
[exec] Total time: 0 seconds |
@@ -121,8 +121,8 @@ public JsonGenerator(LogIterator iter) {
 } else if ((m = newElectionP.matcher(e.getEntry())).find()) {
 Iterator<Integer> iterator = servers.iterator();
 long zxid = Long.valueOf(m.group(2));
-int count = (int)zxid;// & 0xFFFFFFFFL;
-int epoch = (int)Long.rotateRight(zxid, 32);// >> 32;
+long count = zxid & 0xffffffffffL;
How can this be all over the code base instead of a function somewhere in a util file?
Yeah, you are right!
-int count = (int)zxid;// & 0xFFFFFFFFL;
-int epoch = (int)Long.rotateRight(zxid, 32);// >> 32;
+long count = zxid & 0xffffffffffL;
+int epoch = (int)Long.rotateRight(zxid, 40);// >> 40;
same, 40 shouldn't fly around in the code base like this
I have already unified all the code that processes ZXID to use ZxidUtils.
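As a rough illustration of what the reviewer is asking for, the split could live in one helper so that the constant 40 appears in a single place. The class below is a hypothetical sketch, not the project's actual ZxidUtils; the class name, method names, and the 24-bit epoch / 40-bit counter layout are assumptions taken from the diff under review. It uses an unsigned shift rather than `(int)Long.rotateRight(zxid, 40)`, because a rotation by 40 followed by an `int` cast would keep the lowest 8 counter bits in the result.

```java
/**
 * Hypothetical helper (not the project's ZxidUtils) that centralizes the
 * epoch/counter split, assuming a 24-bit epoch and a 40-bit counter.
 */
final class ZxidLayoutSketch {
    /** Number of low-order bits used by the counter (assumed layout). */
    static final int COUNTER_BITS = 40;
    /** Mask selecting the counter bits: 0xffffffffffL for 40 bits. */
    static final long COUNTER_MASK = (1L << COUNTER_BITS) - 1;

    private ZxidLayoutSketch() {
    }

    /** Returns the epoch stored in the high-order bits of the zxid. */
    static long epochOf(long zxid) {
        return zxid >>> COUNTER_BITS;
    }

    /** Returns the counter stored in the low-order bits of the zxid. */
    static long counterOf(long zxid) {
        return zxid & COUNTER_MASK;
    }

    /** Combines an epoch and a counter into a single zxid. */
    static long makeZxid(long epoch, long counter) {
        return (epoch << COUNTER_BITS) | (counter & COUNTER_MASK);
    }
}
```

With a helper like this, call sites such as the JsonGenerator change above would call `epochOf(zxid)` and `counterOf(zxid)` instead of repeating the mask and shift.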
Due to this JVM bug, JDK7 cannot recognize [javac] /home/jenkins/jenkins-slave/workspace/PreCommit-ZOOKEEPER-github-pr-build/src/contrib/loggraph/src/java/org/apache/zookeeper/graph/JsonGenerator.java:129: error: cannot find symbol
[javac] long epoch = getEpochFromZxid(zxid); |
1. Can changing the zxid from high32_low32 to high24_low40 avoid the concurrency problem you have referred to? |
Hi, @maoling. Thanks for the discussion. My description may have been unclear, which may have confused you.
|
Hi, @asdf2014. Thanks for your explanation! But I am still confused about question one:
It all depends on the zxid not being altered (no write after the zxid is first generated) in a multithreaded situation; otherwise the derived epoch and count are not idempotent. Should zxid be declared final? |
Hi, I think a 48-bit low part is better for high-throughput ZK clusters. |
Hi, @yunfan123. Thank you for your suggestion. As you said, that would guarantee a smooth upgrade. However, if the 16-bit |
Hi, @asdf2014 |
@yunfan123 @asdf2014 I have seen this issue twice over a one-month period. Is there anything one can do to prevent this from happening? Maybe allowing for leader restarts at "off peak hours" weekly? (Yuck, I know.) It sounds like we can move forward with this if we move to 48 bits low, correct? note version: |
@JarenGlover It's a good idea, but not the best solution. Still we can use the |
Are you seeing this behavior with ZOOKEEPER-1277 applied? If so it's a bug in that change, because after that's applied the leader should shutdown as we approach the rollover. It would be nice to address this by changing the zxid semantics, but I don't believe that's a great idea. Instead I would rather see us address any shortcoming in my original fix (1277) fwiw - what I have seen people do in this situation is to monitor the zxid and when it gets close (say within 10%) of the rollover they have an automated script which restarts the leader, which forces a re-election. However 1277 should be doing this for you. Given you are seeing this issue perhaps you can help with resolving any bugs in 1277? thanks! |
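The monitoring approach described in the comment above (restart the leader once the zxid is within roughly 10% of the rollover) could be checked along these lines. This is only a sketch under assumptions: the class and method names are made up, the current 32-bit counter layout is assumed, and how the zxid is obtained and how the leader is restarted are left out.

```java
/**
 * Hypothetical check for an external monitoring script: reports whether the
 * low 32-bit counter of a zxid is close to rolling over.
 */
final class RolloverMonitorSketch {
    /** Mask for the low 32 bits (the counter in the current layout). */
    static final long COUNTER_MASK = 0xffffffffL;

    private RolloverMonitorSketch() {
    }

    /**
     * @param zxid              the zxid last observed on the leader
     * @param remainingFraction threshold such as 0.10 for "within 10%"
     * @return true if the remaining counter space is at or below the threshold
     */
    static boolean nearRollover(long zxid, double remainingFraction) {
        long counter = zxid & COUNTER_MASK;
        long remaining = COUNTER_MASK - counter;
        return remaining <= (long) (COUNTER_MASK * remainingFraction);
    }
}
```

A script would call `nearRollover(zxid, 0.10)` periodically and trigger its own restart procedure when it returns true.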
Hi, @phunt. Thank you for your comment. Yes, we are discussing this here because the ZOOKEEPER-1277 solution does not work very well: it causes too many leader restarts, and a restart can take a few minutes, which some situations cannot tolerate. |
Ok, thanks for the update. fwiw restarting taking a few minutes is going to be an issue regardless, no? Any regular type issue, such as a temporary network outage, could cause the quorum to be lost and leader election triggered. |
Hi, @phunt . Indeed, the |
i think it would be much better to extend ZOOKEEPER-1277 to more transparently do the rollover without a full leader election. the main issue i have with shortening the epoch size is that once the epoch hits the maximum value the ensemble is stuck, nothing can proceed, so we really need to keep the epoch size big enough that we would never hit that condition. i don't think a 16-bit epoch satisfies that requirement. |
Hi, @breed. Thanks for your comment. You are right, we should keep the epoch large enough to avoid epoch overflow. That is why I offered a 24-bit epoch as a better solution in the second comment. Even if a leader election happened once every hour, we would not experience epoch overflow for about 1915.2 years. |
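The 1915.2-year figure follows from dividing the number of available epochs by the election rate. A small sketch of that arithmetic, with a hypothetical class and method name and a 365-day year assumed:

```java
/** Hypothetical calculator for how long an epoch field of a given width lasts. */
final class EpochLifetimeSketch {
    private EpochLifetimeSketch() {
    }

    /**
     * @param epochBits        width of the epoch field in bits
     * @param electionsPerHour assumed leader-election rate
     * @return years until the epoch field overflows (365-day years)
     */
    static double yearsUntilEpochOverflow(int epochBits, double electionsPerHour) {
        double epochs = Math.pow(2, epochBits);   // number of distinct epochs
        double hours = epochs / electionsPerHour; // hours until they are used up
        return hours / (24.0 * 365.0);
    }
}
```

Under the same one-election-per-hour assumption, `yearsUntilEpochOverflow(16, 1.0)` comes out at roughly 7.5 years, which is the concern raised about a 16-bit epoch.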
@asdf2014
|
Hi, @maoling. Thank you for your comments. As you said, if we cannot carry the server version, it will be too difficult to maintain backward compatibility. The revision in etcd exists to implement the MVCC feature and seems to be equivalent to the ZooKeeper counter, not the entire ZXID. If we consider that design, then maybe we should use more 64bits, convert ZXID from |
I think it is a serious problem. Overly frequent re-elections make things complex and make stability unpredictable. |
Hi, @asdf2014. How about having the zxid skip the 0x0 value of the low 32 bits and incrementing the epoch when the zxid rolls over, instead of triggering a re-election? |
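The suggestion above can be pictured as a "next zxid" function that bumps the epoch in place when the counter is exhausted. The sketch below only shows the bit arithmetic; the class and method names are hypothetical, and it deliberately ignores the protocol side (how followers would learn about an epoch that was not established through an election), which is the hard part of this idea.

```java
/**
 * Hypothetical illustration of rolling the counter over into the epoch
 * without a re-election, using the current 32/32 layout.
 */
final class RolloverInPlaceSketch {
    /** Mask for the low 32 bits (the counter). */
    static final long COUNTER_MASK = 0xffffffffL;

    private RolloverInPlaceSketch() {
    }

    /** Returns the zxid that would follow the given one under this scheme. */
    static long nextZxid(long zxid) {
        long epoch = zxid >>> 32;
        long counter = zxid & COUNTER_MASK;
        if (counter == COUNTER_MASK) {
            // Counter exhausted: move to the next epoch and restart at 1,
            // skipping the 0x0 counter value as proposed in the comment.
            return ((epoch + 1) << 32) | 1L;
        }
        return zxid + 1;
    }
}
```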
@asdf2014 Hi,
Maybe the epoch carried in the ByteBuffer should be a long rather than an int, e.g. when the epoch is written into the Leader.ACKEPOCH message and read back out.
That said, the epoch is unlikely to grow to more than 32 bits.
byte ver[] = new byte[4];
ByteBuffer.wrap(ver).putInt(0x10000);
QuorumPacket newEpochPacket = new QuorumPacket(Leader.LEADERINFO, newLeaderZxid, ver, null);
oa.writeRecord(newEpochPacket, "packet");
bufferedOutput.flush();
QuorumPacket ackEpochPacket = new QuorumPacket();
ia.readRecord(ackEpochPacket, "packet");
if (ackEpochPacket.getType() != Leader.ACKEPOCH) {
LOG.error(ackEpochPacket.toString()
+ " is not ACKEPOCH");
return;
}
ByteBuffer bbepoch = ByteBuffer.wrap(ackEpochPacket.getData());
// ByteBuffer for epoch
ss = new StateSummary(bbepoch.getInt(), ackEpochPacket.getZxid());
learnerMaster.waitForEpochAck(this.getSid(), ss);
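One way to picture the reviewer's suggestion is a small codec that writes the epoch as 8 bytes and still accepts the 4-byte form used in the snippet above. This is a sketch under assumptions, not code from the patch: the class and method names are hypothetical, and treating a 4-byte payload as an unsigned value is a choice made here for illustration.

```java
import java.nio.ByteBuffer;

/** Hypothetical encoder/decoder for an epoch carried in a packet's data field. */
final class EpochCodecSketch {
    private EpochCodecSketch() {
    }

    /** Encodes the epoch as an 8-byte big-endian long. */
    static byte[] encodeEpoch(long epoch) {
        return ByteBuffer.allocate(Long.BYTES).putLong(epoch).array();
    }

    /** Decodes an epoch from either an 8-byte (long) or 4-byte (int) payload. */
    static long decodeEpoch(byte[] data) {
        ByteBuffer bb = ByteBuffer.wrap(data);
        if (data.length >= Long.BYTES) {
            return bb.getLong();
        }
        // Older 4-byte payload: widen without sign extension (assumption).
        return bb.getInt() & 0xffffffffL;
    }
}
```

Whether both sides of a connection can agree on the wider encoding during a rolling upgrade is a separate compatibility question that this sketch does not address.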
Hi, any plans to merge this? We have seen frequent rollouts due to 32-bit zxid. |
Hi @alexfernandez @wg1026688210 @happenking , thanks for your comments, I would like to merge this which could be a big step forward, maybe Apache Kafka wouldn't need to create KRaft, and ClickHouse wouldn't need to establish ClickHouse Keeper as well. Recently, Apache Druid also began supporting Etcd as an optional choice. 😅 |
Hi @asdf2014 , please do! It would be a great improvement. Do you need any help? |
Hi! @asdf2014 please tell me when you are going to do this? |
@happenking |
If it is `1k/s` ops, then the ZXID will be exhausted after $2^{32} / (86400 \times 1000) \approx 49.7$ days. But if we reassign the `ZXID` into 16 bits for `epoch` and 48 bits for `counter`, then the problem will not occur until after $\min(2^{16} / 365,\ 2^{48} / (86400 \times 1000 \times 365)) \approx \min(179.6, 8925.5) = 179.6$ years.

However, the ZXID is of `long` type, and in the JVM reading and writing a `long` (and likewise a `double`) may be split into separate operations on the high 32 bits and the low 32 bits. Because the `ZXID` variable is not declared `volatile` and is not boxed into the corresponding reference type (`Long` / `Double`), such access is non-atomic. Thus, if the lower 16 bits of the upper 32 bits are reassigned to join the low 32 bits of the `long`, making the low part 48 bits, there may be a concurrency problem.
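The two estimates above can be reproduced with a few lines of arithmetic. The sketch below uses hypothetical names and reads the $2^{16} / 365$ term as assuming one leader election per day, which is an interpretation of the formula rather than something the description states.

```java
/** Hypothetical calculator for how long a given ZXID layout lasts. */
final class ZxidLifetimeSketch {
    private static final double SECONDS_PER_DAY = 86_400.0;
    private static final double DAYS_PER_YEAR = 365.0;

    private ZxidLifetimeSketch() {
    }

    /** Days until a counter of the given width overflows at a steady write rate. */
    static double daysUntilCounterOverflow(int counterBits, double opsPerSecond) {
        return Math.pow(2, counterBits) / (opsPerSecond * SECONDS_PER_DAY);
    }

    /**
     * Years until either the epoch or the counter overflows, whichever is first.
     *
     * @param electionsPerDay assumed leader-election rate (1.0 matches 2^16 / 365)
     */
    static double yearsForLayout(int epochBits, int counterBits,
                                 double opsPerSecond, double electionsPerDay) {
        double epochYears = Math.pow(2, epochBits) / electionsPerDay / DAYS_PER_YEAR;
        double counterYears = daysUntilCounterOverflow(counterBits, opsPerSecond) / DAYS_PER_YEAR;
        return Math.min(epochYears, counterYears);
    }
}
```

For example, `daysUntilCounterOverflow(32, 1000.0)` corresponds to the 49.7-day figure, and `yearsForLayout(16, 48, 1000.0, 1.0)` to the 179.6-year figure.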