Istanbul: bigger blockperiod value cause blockchain stuck #583
Comments
@yjfcn For BFT you need to have |
Thanks for your reply, I will add more nodes to test again. |
After adding some new nodes, I now have 7 nodes. However, three of them show the same error message after running for several hours: The strange thing is: |
Hi @yjfcn, can you give more details for the below
The error itself is returned while trying to recover the public key of the node which proposed the block |
Hi @jbhurat,
I searched the log files on all 7 nodes, and every node shows the same error:
cat logs/geth.log | grep "not a valid address"
ERROR[12-06|07:23:11.577] not a valid address err="recovery failed"
Because I have 7 nodes, the blockchain gets stuck if 3 of them show this error at the same time. If only 1 or 2 nodes show it, the blockchain keeps working. |
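The observation above (7 nodes survive 1 or 2 failing validators but stall at 3) is consistent with the classical BFT bound N >= 3f + 1. Here is a minimal sketch of that arithmetic. The function names are hypothetical, and the formula is the textbook bound, assumed rather than taken from Quorum's source.

```python
def max_faulty(n: int) -> int:
    """Largest f with n >= 3f + 1 (classical BFT bound, assumed here)."""
    if n < 1:
        raise ValueError("need at least one validator")
    return (n - 1) // 3


def can_progress(n: int, failing: int) -> bool:
    """True if the number of failing validators is within the tolerated bound."""
    return failing <= max_faulty(n)


if __name__ == "__main__":
    for n in (3, 4, 7):
        print(f"validators={n} tolerated_faults={max_faulty(n)}")
    # 7 validators: 2 failing is tolerated, 3 failing is not.
    print(can_progress(7, 2), can_progress(7, 3))
```

Under this bound a 3-node network tolerates no faulty validator at all, which may explain why the original 3-node setup stalled as soon as a single node misbehaved.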
Looks like this is related to getamis/istanbul-tools#108. I'm playing with the 7nodes example and trying to reproduce the issue. @yjfcn, could you please tell me how to reproduce it? Did you create any smart contracts or send any transactions while testing? I have run it a few times but have not encountered this issue yet. Thanks. I'm just curious whether this is related to using |
I set up 7 nodes (on different PCs), then used istanbul-tools to generate the configuration files: After that, I copied each nodekey to its corresponding node, copied the other configuration files to every node, and modified the static-nodes.json and tm.conf files (replacing the IP addresses). When all of this was finished, I started the blockchain: After running for several hours, all nodes show this error message. But the blockchain still works because there are 2 spare nodes. |
I didn't create any smart contracts or send any transactions while testing. You can add the parameter --istanbul.blockperiod 10 when you test. |
I tried the 7nodes example (with istanbul.blockperiod set to 10), and it shows the same error message: Also, there is another warning message in the 7nodes example: But this node actually is a proposer, except that there are some uppercase characters in the address:
|
@yjfcn do you also see |
Can you also increase the verbosity and post the logs from all the nodes |
The node is a validator, but nothing in the logs suggests that the node is also the proposer. Also, the proposer changes every round. Can you increase the verbosity and paste the logs? That might give us hints about what the issue is. |
Thanks @yjfcn for detailed guide. I set
I haven't seen the deadlock yet, still waiting. My wild guess is that if the above issue |
Good news is I was able to reproduce the dead-lock after merely ~2 hours (block 778).
Just what I suspected: Here's the latest logs of the 7nodes group:
|
After running the 7nodes example (--istanbul.blockperiod 10) for more than 6 hours, 3 nodes showed the error err="recovery failed", which caused the blockchain to deadlock.
cat 1.log|grep "not a valid address"
cat 2.log|grep "not a valid address"
cat 3.log|grep "not a valid address"
cat 4.log|grep "not a valid address"
ERROR[12-07|07:53:44.048] not a valid address err="recovery failed"
cat 5.log|grep "not a"
ERROR[12-07|13:39:44.045] not a valid address err="recovery failed"
cat 6.log|grep "not a"
ERROR[12-07|13:39:44.041] not a valid address err="recovery failed" |
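The per-node grep above can be automated. The following is a rough sketch that reports which node logs contain the "recovery failed" error. The function name and the in-memory sample are hypothetical; the log names and the three error lines are copied from the comment above, and the empty strings stand for logs where grep printed nothing.

```python
import re
from typing import Dict, List

# Matches the error line quoted in this thread.
ERROR_PATTERN = re.compile(r'not a valid address\s+err="recovery failed"')


def nodes_with_error(logs: Dict[str, str]) -> List[str]:
    """Return the sorted names of logs that contain the error at least once."""
    return sorted(name for name, text in logs.items() if ERROR_PATTERN.search(text))


# Log name -> log text. Empty text means grep found no match in that log.
SAMPLE_LOGS = {
    "1.log": "",
    "2.log": "",
    "3.log": "",
    "4.log": 'ERROR[12-07|07:53:44.048] not a valid address err="recovery failed"',
    "5.log": 'ERROR[12-07|13:39:44.045] not a valid address err="recovery failed"',
    "6.log": 'ERROR[12-07|13:39:44.041] not a valid address err="recovery failed"',
}

if __name__ == "__main__":
    affected = nodes_with_error(SAMPLE_LOGS)
    print(f"{len(affected)} affected logs: {affected}")
```

For real log files, the dictionary could be built by reading each file, for example with pathlib.Path(name).read_text().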
OK, I will post the logs when the deadlock appears. |
@yjfcn are the nodes running under Docker? If yes, can you try a non-dockerized version? Also, can you confirm that there is no time drift between the nodes? |
@jbhurat These nodes are not running under Docker. They are distributed across different locations. There may be some time drift between the nodes, but it's no more than 30ms. |
@jbhurat Most likely it's --istanbul.blockperiod 10 that causes this problem. If I use another value, all nodes work fine! |
@yjfcn I have been running a 7 node network locally for over 24 hours, I do see |
@yjfcn I have reviewed the logs, and one thing I have noticed is that the issue always happens when there is a round change; depending on what state that particular node is in, it might fail to validate the header. |
@jbhurat I tried setting --istanbul.requesttimeout 15000 and --istanbul.blockperiod 10 for 24 hours, and there appears to be no problem. |
Good to hear. I see similar results in my tests. I am closing the issue. Please reopen if you need any further assistance. |
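One way to read this result: --istanbul.blockperiod is given in seconds and --istanbul.requesttimeout in milliseconds, so a block period of 10 leaves no slack if the request timeout is also 10000 ms. The sketch below computes that slack. The helper name is hypothetical, the 10000 ms default is an assumption based on the flag's help text as I recall it, and the "headroom" idea is a heuristic reading of this thread, not a documented Quorum rule.

```python
# Assumed default for --istanbul.requesttimeout, in milliseconds.
ASSUMED_DEFAULT_REQUEST_TIMEOUT_MS = 10000


def timeout_headroom_ms(blockperiod_s: int,
                        requesttimeout_ms: int = ASSUMED_DEFAULT_REQUEST_TIMEOUT_MS) -> int:
    """Milliseconds by which the round timeout exceeds the block period."""
    return requesttimeout_ms - blockperiod_s * 1000


if __name__ == "__main__":
    # Settings mentioned in this thread: block periods 1, 5 and 10 with the
    # assumed default timeout, then block period 10 with a 15000 ms timeout.
    for period, timeout in ((1, 10000), (5, 10000), (10, 10000), (10, 15000)):
        print(f"blockperiod={period}s requesttimeout={timeout}ms "
              f"headroom={timeout_headroom_ms(period, timeout)}ms")
```

With zero headroom, a round could time out at about the moment the next block is due, which would fit the round-change observation in the previous comment. That link is a guess, not something confirmed in this thread.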
Geth
Version: 1.8.12-stable
Git Commit: 8c4aea5
Quorum Version: 2.1.1
Architecture: amd64
Protocol Versions: [63 62]
Network Id: 1337
Go Version: go1.10.1
Operating System: linux
GOPATH=
GOROOT=/usr/lib/go-1.10
I configured 3 Quorum nodes with Istanbul consensus. If I set --istanbul.blockperiod 10, the whole blockchain gets stuck after mining some blocks. The default blockperiod value of 1 works fine. I also tried blockperiod 5, and it works too.
Here is the log:
INFO [11-30|09:43:01.105] Committed address=0x525ad36D11aA5E8139FC21D82485EF66B54AddB2 hash=5efda5…a129aa number=4
INFO [11-30|09:43:01.107] Imported new chain segment blocks=1 txs=0 mgas=0.000 elapsed=618.701µs mgasps=0.000 number=4 hash=5efda5…a129aa cache=0.00B
INFO [11-30|09:43:01.107] Commit new mining work number=5 txs=0 uncles=0 elapsed=159.2µs
INFO [11-30|09:43:11.438] Imported new chain segment blocks=1 txs=0 mgas=0.000 elapsed=645.6µs mgasps=0.000 number=5 hash=f1a986…4a93c9 cache=0.00B
INFO [11-30|09:43:11.439] Commit new mining work number=6 txs=0 uncles=0 elapsed=276.6µs
INFO [11-30|09:43:21.002] Committed address=0x525ad36D11aA5E8139FC21D82485EF66B54AddB2 hash=e785ef…ddc5ac number=6
INFO [11-30|09:43:21.002] Successfully sealed new block number=6 hash=e785ef…ddc5ac
INFO [11-30|09:43:21.002] block reached canonical chain number=1 hash=ca631e…874dd2
INFO [11-30|09:43:21.002] mined potential block number=6 hash=e785ef…ddc5ac
INFO [11-30|09:43:21.003] Commit new mining work number=7 txs=0 uncles=0 elapsed=111.2µs
INFO [11-30|09:43:31.301] Imported new chain segment blocks=1 txs=0 mgas=0.000 elapsed=623.101µs mgasps=0.000 number=7 hash=ea45e4…679276 cache=0.00B
INFO [11-30|09:43:31.301] Commit new mining work number=8 txs=0 uncles=0 elapsed=150.2µs
INFO [11-30|09:43:41.108] Committed address=0x525ad36D11aA5E8139FC21D82485EF66B54AddB2 hash=b7ab7a…547115 number=8
INFO [11-30|09:43:41.109] Successfully sealed new block number=8 hash=b7ab7a…547115
INFO [11-30|09:43:41.109] block reached canonical chain number=3 hash=967955…995fd8
INFO [11-30|09:43:41.109] mined potential block number=8 hash=b7ab7a…547115
INFO [11-30|09:43:41.109] Commit new mining work number=9 txs=0 uncles=0 elapsed=167.7µs
INFO [11-30|09:43:41.438] Imported new chain segment blocks=1 txs=0 mgas=0.000 elapsed=539.7µs mgasps=0.000 number=8 hash=70df0c…773024 cache=0.00B
It looks like the blockchain gets stuck if there happen to be two blocks with the same block number but different hashes.
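The conflict described here can be spotted mechanically by grouping log lines by block number and collecting the hashes seen for each. This is a small sketch with a hypothetical function name. The sample lines are taken from the log above, and the parsing assumes the number=... and hash=... fields appear exactly as shown there.

```python
import re
from collections import defaultdict
from typing import Dict, Iterable, List

NUMBER_RE = re.compile(r"\bnumber=(\d+)")
HASH_RE = re.compile(r"\bhash=(\S+)")


def conflicting_blocks(log_lines: Iterable[str]) -> Dict[int, List[str]]:
    """Map block number -> sorted hashes, for numbers seen with more than one hash."""
    seen = defaultdict(set)
    for line in log_lines:
        number = NUMBER_RE.search(line)
        block_hash = HASH_RE.search(line)
        if number and block_hash:  # skip lines without both fields
            seen[int(number.group(1))].add(block_hash.group(1))
    return {num: sorted(hashes) for num, hashes in seen.items() if len(hashes) > 1}


SAMPLE_LINES = [
    "INFO [11-30|09:43:31.301] Imported new chain segment blocks=1 txs=0 mgas=0.000 elapsed=623.101µs mgasps=0.000 number=7 hash=ea45e4…679276 cache=0.00B",
    "INFO [11-30|09:43:41.108] Committed address=0x525ad36D11aA5E8139FC21D82485EF66B54AddB2 hash=b7ab7a…547115 number=8",
    "INFO [11-30|09:43:41.109] Successfully sealed new block number=8 hash=b7ab7a…547115",
    "INFO [11-30|09:43:41.109] Commit new mining work number=9 txs=0 uncles=0 elapsed=167.7µs",
    "INFO [11-30|09:43:41.438] Imported new chain segment blocks=1 txs=0 mgas=0.000 elapsed=539.7µs mgasps=0.000 number=8 hash=70df0c…773024 cache=0.00B",
]

if __name__ == "__main__":
    print(conflicting_blocks(SAMPLE_LINES))
```

On these lines the only block number reported with two different hashes is 8: one hash was sealed locally and the other was imported from a peer.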