Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Istanbul: bigger blockperiod value cause blockchain stuck #583

Closed
yjfcn opened this issue Nov 30, 2018 · 24 comments
Closed

Istanbul: bigger blockperiod value cause blockchain stuck #583

yjfcn opened this issue Nov 30, 2018 · 24 comments
Assignees

Comments

@yjfcn
Copy link

yjfcn commented Nov 30, 2018

Geth
Version: 1.8.12-stable
Git Commit: 8c4aea5
Quorum Version: 2.1.1
Architecture: amd64
Protocol Versions: [63 62]
Network Id: 1337
Go Version: go1.10.1
Operating System: linux
GOPATH=
GOROOT=/usr/lib/go-1.10

I configured 3 quorum nodes with Istanbul consensus. If I set --istanbul.blockperiod 10, the whole blockchain will stuck after mining some blocks. The default blockperiod value which is 1 works fine. I tried to set blockperiod 5, it also works.

Here is the log:
INFO [11-30|09:43:01.105] Committed address=0x525ad36D11aA5E8139FC21D82485EF66B54AddB2 hash=5efda5…a129aa number=4
INFO [11-30|09:43:01.107] Imported new chain segment blocks=1 txs=0 mgas=0.000 elapsed=618.701µs mgasps=0.000 number=4 hash=5efda5…a129aa cache=0.00B
INFO [11-30|09:43:01.107] Commit new mining work number=5 txs=0 uncles=0 elapsed=159.2µs
INFO [11-30|09:43:11.438] Imported new chain segment blocks=1 txs=0 mgas=0.000 elapsed=645.6µs mgasps=0.000 number=5 hash=f1a986…4a93c9 cache=0.00B
INFO [11-30|09:43:11.439] Commit new mining work number=6 txs=0 uncles=0 elapsed=276.6µs
INFO [11-30|09:43:21.002] Committed address=0x525ad36D11aA5E8139FC21D82485EF66B54AddB2 hash=e785ef…ddc5ac number=6
INFO [11-30|09:43:21.002] Successfully sealed new block number=6 hash=e785ef…ddc5ac
INFO [11-30|09:43:21.002] block reached canonical chain number=1 hash=ca631e…874dd2
INFO [11-30|09:43:21.002] mined potential block number=6 hash=e785ef…ddc5ac
INFO [11-30|09:43:21.003] Commit new mining work number=7 txs=0 uncles=0 elapsed=111.2µs
INFO [11-30|09:43:31.301] Imported new chain segment blocks=1 txs=0 mgas=0.000 elapsed=623.101µs mgasps=0.000 number=7 hash=ea45e4…679276 cache=0.00B
INFO [11-30|09:43:31.301] Commit new mining work number=8 txs=0 uncles=0 elapsed=150.2µs
INFO [11-30|09:43:41.108] Committed address=0x525ad36D11aA5E8139FC21D82485EF66B54AddB2 hash=b7ab7a…547115 number=8
INFO [11-30|09:43:41.109] Successfully sealed new block number=8 hash=b7ab7a…547115
INFO [11-30|09:43:41.109] block reached canonical chain number=3 hash=967955…995fd8
INFO [11-30|09:43:41.109] mined potential block number=8 hash=b7ab7a…547115
INFO [11-30|09:43:41.109] Commit new mining work number=9 txs=0 uncles=0 elapsed=167.7µs
INFO [11-30|09:43:41.438] Imported new chain segment blocks=1 txs=0 mgas=0.000 elapsed=539.7µs mgasps=0.000 number=8 hash=70df0c…773024 cache=0.00B

Looks like if there happens to be two blocks with same block number but different hash, the blockchain will stuck.

@jbhurat jbhurat self-assigned this Dec 3, 2018
@vietlq
Copy link

vietlq commented Dec 4, 2018

I configured 3 quorum nodes with Istanbul consensus

@yjfcn For BFT you need to have N = 3f + 1, means you need at least 4 nodes to have minimum valid number of nodes to run Istanbul BFT. In this case you need N - f = 4 - 1 = 3 nodes to agree on one hash to make it viable. Anything less will cause deadlock.

@fixanoid
Copy link
Contributor

fixanoid commented Dec 4, 2018

#305

@yjfcn
Copy link
Author

yjfcn commented Dec 5, 2018

I configured 3 quorum nodes with Istanbul consensus

@yjfcn For BFT you need to have N = 3f + 1, means you need at least 4 nodes to have minimum valid number of nodes to run Istanbul BFT. In this case you need N - f = 4 - 1 = 3 nodes to agree on one hash to make it viable. Anything less will cause deadlock.

Thanks for your reply, I will add more nodes to test again.

@yjfcn
Copy link
Author

yjfcn commented Dec 6, 2018

After adding some new nodes, now I have 7 nodes. But three of them have the same error message after running several hours:
INFO [12-05|14:54:49.226] Committed address=0x4e595f8d4968372E19A36C5DD87597B76f9ece96 hash=875255…b64481 number=1867
INFO [12-05|14:54:49.230] Imported new chain segment blocks=1 txs=0 mgas=0.000 elapsed=1.727ms mgasps=0.000 number=1867 hash=875255…b64481 cache=0.00B
INFO [12-05|14:54:49.231] Commit new mining work number=1868 txs=0 uncles=0 elapsed=321.996µs
INFO [12-05|14:54:59.165] Committed address=0x4e595f8d4968372E19A36C5DD87597B76f9ece96 hash=cf1ab9…434ef1 number=1868
INFO [12-05|14:54:59.165] Successfully sealed new block number=1868 hash=cf1ab9…434ef1
INFO [12-05|14:54:59.166] mined potential block number=1868 hash=cf1ab9…434ef1
INFO [12-05|14:54:59.166] Commit new mining work number=1869 txs=0 uncles=0 elapsed=202.497µs
INFO [12-05|14:55:09.333] Imported new chain segment blocks=1 txs=0 mgas=0.000 elapsed=1.906ms mgasps=0.000 number=1869 hash=37ce2d…094408 cache=0.00B
INFO [12-05|14:55:09.333] Commit new mining work number=1870 txs=0 uncles=0 elapsed=277.096µs
INFO [12-05|14:55:19.168] Committed address=0x4e595f8d4968372E19A36C5DD87597B76f9ece96 hash=c046bc…6b8491 number=1870
INFO [12-05|14:55:19.172] Imported new chain segment blocks=1 txs=0 mgas=0.000 elapsed=1.690ms mgasps=0.000 number=1870 hash=c046bc…6b8491 cache=0.00B
INFO [12-05|14:55:19.172] Commit new mining work number=1871 txs=0 uncles=0 elapsed=160.697µs
INFO [12-05|14:55:29.204] Imported new chain segment blocks=1 txs=0 mgas=0.000 elapsed=1.781ms mgasps=0.000 number=1871 hash=84cf74…259397 cache=0.00B
INFO [12-05|14:55:29.204] Commit new mining work number=1872 txs=0 uncles=0 elapsed=618.992µs
INFO [12-05|14:55:39.169] Committed address=0x4e595f8d4968372E19A36C5DD87597B76f9ece96 hash=0bf719…7dc254 number=1872
INFO [12-05|14:55:39.172] Imported new chain segment blocks=1 txs=0 mgas=0.000 elapsed=1.633ms mgasps=0.000 number=1872 hash=0bf719…7dc254 cache=0.00B
INFO [12-05|14:55:39.173] Commit new mining work number=1873 txs=0 uncles=0 elapsed=188.598µs
INFO [12-05|14:55:49.361] Imported new chain segment blocks=1 txs=0 mgas=0.000 elapsed=2.026ms mgasps=0.000 number=1873 hash=183c2a…ca2083 cache=0.00B
INFO [12-05|14:55:49.362] Commit new mining work number=1874 txs=0 uncles=0 elapsed=334.095µs
INFO [12-05|14:55:49.362] block reached canonical chain number=1868 hash=cf1ab9…434ef1
INFO [12-05|14:55:59.226] Committed address=0x4e595f8d4968372E19A36C5DD87597B76f9ece96 hash=c0686f…02defe number=1874
INFO [12-05|14:55:59.230] Imported new chain segment blocks=1 txs=0 mgas=0.000 elapsed=2.101ms mgasps=0.000 number=1874 hash=c0686f…02defe cache=0.00B
INFO [12-05|14:55:59.231] Commit new mining work number=1875 txs=0 uncles=0 elapsed=213.197µs
INFO [12-05|14:56:09.177] Committed address=0x4e595f8d4968372E19A36C5DD87597B76f9ece96 hash=40783e…9cd824 number=1875
INFO [12-05|14:56:09.177] Successfully sealed new block number=1875 hash=40783e…9cd824
INFO [12-05|14:56:09.178] mined potential block number=1875 hash=40783e…9cd824
INFO [12-05|14:56:09.178] Commit new mining work number=1876 txs=0 uncles=0 elapsed=253.196µs
INFO [12-05|14:56:19.518] Committed address=0x4e595f8d4968372E19A36C5DD87597B76f9ece96 hash=e59aa4…281e80 number=1876
ERROR[12-05|14:56:19.519] not a valid address err="recovery failed"
INFO [12-05|15:18:51.667] Regenerated local transaction journal transactions=0 accounts=0
INFO [12-05|16:18:51.667] Regenerated local transaction journal transactions=0 accounts=0
INFO [12-05|17:18:51.667] Regenerated local transaction journal transactions=0 accounts=0
INFO [12-05|18:18:51.667] Regenerated local transaction journal transactions=0 accounts=0

The strange thing is:
I tested several times, it's always these 3 nodes which have this error message.
Any one knows why?

@jbhurat
Copy link
Contributor

jbhurat commented Dec 6, 2018

Hi @yjfcn, can you give more details for the below

  • Are you using istanbul-tools or any other tools to generate the network configuration
  • Each time you are testing, are you starting afresh (from block 0)
  • Does the issue occurs around the same block numbers
  • Do you see the above error message in all the 3 nodes

The error itself is returned while trying to recover the public key of the node which proposed the block

@yjfcn
Copy link
Author

yjfcn commented Dec 7, 2018

Hi @yjfcn, can you give more details for the below

  • Are you using istanbul-tools or any other tools to generate the network configuration
  • Each time you are testing, are you starting afresh (from block 0)
  • Does the issue occurs around the same block numbers
  • Do you see the above error message in all the 3 nodes

The error itself is returned while trying to recover the public key of the node which proposed the block

Hi @jbhurat ,

  • Yes, I am using istanbul-tools to generate nodekey, genesis.json, static-nodes.json
  • Yes, each time I reset the blockchain and start from block 0
  • Not around the same block numbers
  • Yes, I saw the same error message on all the 3 nodes, this cause other 4 nodes stuck

I searched log file on all 7 nodes, all nodes have the same error:

cat logs/geth.log | grep "not a valid address"

ERROR[12-06|07:23:11.577] not a valid address err="recovery failed"
ERROR[12-06|08:24:41.587] not a valid address err="recovery failed"
ERROR[12-06|08:37:11.567] not a valid address err="recovery failed"
ERROR[12-06|09:08:31.551] not a valid address err="recovery failed"
ERROR[12-06|09:47:11.564] not a valid address err="recovery failed"
ERROR[12-06|09:47:12.302] not a valid address err="recovery failed"
ERROR[12-06|09:47:20.321] not a valid address err="recovery failed"
ERROR[12-06|09:50:11.630] not a valid address err="recovery failed"
ERROR[12-06|12:02:11.556] not a valid address err="recovery failed"
ERROR[12-06|13:04:11.497] not a valid address err="recovery failed"
ERROR[12-06|13:10:11.618] not a valid address err="recovery failed"
ERROR[12-06|13:13:31.607] not a valid address err="recovery failed"
ERROR[12-06|13:22:01.579] not a valid address err="recovery failed"
ERROR[12-06|13:56:01.519] not a valid address err="recovery failed"
ERROR[12-06|14:24:03.595] not a valid address err="recovery failed"
ERROR[12-06|14:33:13.591] not a valid address err="recovery failed"
ERROR[12-06|14:53:03.557] not a valid address err="recovery failed"
ERROR[12-06|15:14:03.567] not a valid address err="recovery failed"
ERROR[12-06|15:30:13.568] not a valid address err="recovery failed"
ERROR[12-06|15:45:13.645] not a valid address err="recovery failed"
ERROR[12-06|15:48:13.576] not a valid address err="recovery failed"
ERROR[12-06|15:51:43.530] not a valid address err="recovery failed"
ERROR[12-06|16:35:33.558] not a valid address err="recovery failed"
ERROR[12-06|16:43:13.562] not a valid address err="recovery failed"
ERROR[12-06|17:03:23.510] not a valid address err="recovery failed"
ERROR[12-06|17:57:13.531] not a valid address err="recovery failed"
ERROR[12-06|18:00:43.530] not a valid address err="recovery failed"
ERROR[12-06|18:02:13.643] not a valid address err="recovery failed"
ERROR[12-06|18:51:53.495] not a valid address err="recovery failed"
ERROR[12-06|19:05:53.480] not a valid address err="recovery failed"
ERROR[12-06|19:22:13.481] not a valid address err="recovery failed"
ERROR[12-06|19:29:13.484] not a valid address err="recovery failed"
ERROR[12-06|19:35:23.561] not a valid address err="recovery failed"
ERROR[12-06|20:38:33.559] not a valid address err="recovery failed"
ERROR[12-06|21:23:53.542] not a valid address err="recovery failed"
ERROR[12-06|21:56:23.543] not a valid address err="recovery failed"
ERROR[12-06|22:15:13.500] not a valid address err="recovery failed"
ERROR[12-06|22:49:53.490] not a valid address err="recovery failed"
ERROR[12-06|23:10:43.539] not a valid address err="recovery failed"
ERROR[12-06|23:13:23.574] not a valid address err="recovery failed"
ERROR[12-06|23:16:23.502] not a valid address err="recovery failed"
ERROR[12-06|23:19:53.489] not a valid address err="recovery failed"
ERROR[12-06|23:23:43.579] not a valid address err="recovery failed"
ERROR[12-06|23:25:53.641] not a valid address err="recovery failed"
ERROR[12-06|23:46:23.497] not a valid address err="recovery failed"
ERROR[12-07|01:56:23.570] not a valid address err="recovery failed"
ERROR[12-07|02:13:53.572] not a valid address err="recovery failed"
ERROR[12-07|02:31:23.552] not a valid address err="recovery failed"

Because I have 7 nodes, if 3 nodes of them show this error message at the same time, the blockchain will stuck. If only 1 or 2 shows, the blockchain still works.
The latest test continued for a whole night, it seems OK for now although I saw this error message occasionally. I am not sure how long they can work because the frequency of this error occurrence is still high.

@vietlq
Copy link

vietlq commented Dec 7, 2018

Looks like this is related to getamis/istanbul-tools#108

I'm playing with 7nodes example and trying to reproduce the issue.

@yjfcn could you please tell me how to reproduce the issue? Did you create any smart contract & sent any tx while testing? I have ran for a few times but still have not encountered this issue. Thanks.

I'm just curious if this related to using istanbul-tools in certain way :)

@yjfcn
Copy link
Author

yjfcn commented Dec 7, 2018

Looks like this is related to getamis/istanbul-tools#108

@yjfcn could you please tell me how to reproduce the issue? I have ran for a few times but still have not encountered this issue. Thanks.

I installed 7 nodes( different PCs), then use istanbul-tools to generate the configuration files:
istanbul setup --num 7 --nodes --quorum --verbose --save

After that, copy nodekey to corresponding node, copy other configuration files to each node and modify the static-nodes.json and tm.conf files(replace the IP address).

When all these finished, I start the blockchain:
geth --datadir /qdata/dd init /qdata/genesis.json
geth --datadir /qdata/dd --syncmode full --mine --minerthreads 1 --istanbul.blockperiod 10 --rpc --rpcaddr 0.0.0.0 --rpcapi eth,net,web3,quorum --rpccorsdomain=* --nodiscover

After running for several hours, all nodes will have this error message. But the blockchain still works because there are 2 spare nodes.

@yjfcn
Copy link
Author

yjfcn commented Dec 7, 2018

Looks like this is related to getamis/istanbul-tools#108

I'm playing with 7nodes example and trying to reproduce the issue.

@yjfcn could you please tell me how to reproduce the issue? Did you create any smart contract & sent any tx while testing? I have ran for a few times but still have not encountered this issue. Thanks.

I'm just curious if this related to using istanbul-tools in certain way :)

I didn't create any smart contract & sent any tx while testing. You can add the parameter --istanbul.blockperiod 10 when you test.
Maybe you can run for several hours then check the log.

@yjfcn
Copy link
Author

yjfcn commented Dec 7, 2018

Looks like this is related to getamis/istanbul-tools#108

I'm playing with 7nodes example and trying to reproduce the issue.

@yjfcn could you please tell me how to reproduce the issue? Did you create any smart contract & sent any tx while testing? I have ran for a few times but still have not encountered this issue. Thanks.

I'm just curious if this related to using istanbul-tools in certain way :)

I tried 7nodes example(set istanbul.blockperiod 10) , it also has the same error message:
ERROR[12-07|07:53:44.048] not a valid address err="recovery failed"

Also, there is another warning message in 7nodes example:
WARN [12-07|07:39:14.030] Ignore preprepare messages from non-proposer address=0xb131288F355BC27090E542aE0be213c20350B767 from=0xb912De287F9b047B4228436E94B5b78E3Ee16171 state="Accept request"

But this node actually is a proposer except that there are some upper case characters in the address:

istanbul.getValidators()
["0x6571d97f340c8495b661a823f2c2145ca47d63c2", "0x8157d4437104e3b8df4451a85f7b2438ef6699ff", "0xb131288f355bc27090e542ae0be213c20350b767", "0xb912de287f9b047b4228436e94b5b78e3ee16171", "0xd8dba507e85f116b1f7e231ca8525fc9008a6966", "0xe36cbeb565b061217930767886474e3cde903ac5", "0xf512a992f3fb749857d758ffda1330e590fa915e"]

@jbhurat
Copy link
Contributor

jbhurat commented Dec 7, 2018

@yjfcn do you also see ERROR[12-07|02:31:23.552] not a valid address err="recovery failed" messages when block period is set to 1 second

@jbhurat
Copy link
Contributor

jbhurat commented Dec 7, 2018

Can you also increase the verbosity and post the logs from all the nodes

@jbhurat
Copy link
Contributor

jbhurat commented Dec 7, 2018

But this node actually is a proposer except that there are some upper case characters in the address:

istanbul.getValidators()
["0x6571d97f340c8495b661a823f2c2145ca47d63c2", "0x8157d4437104e3b8df4451a85f7b2438ef6699ff", "0xb131288f355bc27090e542ae0be213c20350b767", "0xb912de287f9b047b4228436e94b5b78e3ee16171", "0xd8dba507e85f116b1f7e231ca8525fc9008a6966", "0xe36cbeb565b061217930767886474e3cde903ac5", "0xf512a992f3fb749857d758ffda1330e590fa915e"]

The node is a validator, but nothing in the logs suggests that the node is also the proposer. Also, the proposer changes every round. Can you increase the verbosity and paste the logs, that might give us hints on what the issue is

@vietlq
Copy link

vietlq commented Dec 9, 2018

Thanks @yjfcn for detailed guide. I set --istanbul.blockperiod 10 in the 7nodes example and after ~8 mins of running I could see the issue on Node 7:

ERROR[12-09|22:05:07] not a valid address                      err="recovery failed"

I haven't seen the dead-lock yet, still waiting. My wild guess that if the above issue err="recovery failed" happens across >= f + 1 validators during the same block the chain could see dead-lock. I'm looking at the code trying to understand why the error err="recovery failed" occurred in the first place.

https://github.com/jpmorganchase/quorum/blob/4b44bea1d7f38382d22f810e598f5f15aba5856e/crypto/secp256k1/secp256.go#L118

https://github.com/jpmorganchase/quorum/blob/4b44bea1d7f38382d22f810e598f5f15aba5856e/crypto/secp256k1/ext.h#L30

@vietlq
Copy link

vietlq commented Dec 10, 2018

Good news is I was able to reproduce the dead-lock after merely ~2 hours (block 778).

Because I have 7 nodes, if 3 nodes of them show this error message at the same time, the blockchain will stuck. If only 1 or 2 shows, the blockchain still works.

Just what I suspected: >= f + 1 validators with the error err="recovery failed" during the same block will dead-lock the chain. For my case I got issue on 3 nodes 2, 3 & 5. IBFT can tolerate up to f nodes (max 2 in this case).

Here's the latest logs of the 7nodes group:

node3_1       | ERROR[12-10|06:52:50] not a valid address                      err="recovery failed"
node3_1       | DEBUG[12-10|06:52:50] No more headers available                peer=ac6b1096ca56b9f6
node3_1       | DEBUG[12-10|06:52:50] Header download terminated               peer=ac6b1096ca56b9f6
node3_1       | ERROR[12-10|06:52:50]
node3_1       | ########## BAD BLOCK #########
node3_1       | Chain config: {ChainID: 10 Homestead: <nil> DAO: <nil> DAOSupport: false EIP150: 2 EIP155: 0 EIP158: 3 Byzantium: 1 IsQuorum: true Engine: istanbul}
node3_1       |
node3_1       | Number: 778
node3_1       | Hash: 0x4e49285f6da14ba6b7cfde5d09de26f38076df91a9374aafbb54ab02b3ed3a4b
node3_1       |
node3_1       |
node3_1       | Error: invalid signature
node3_1       | ##############################

node2_1       | ERROR[12-10|06:52:52] not a valid address                      err="recovery failed"
node2_1       | ERROR[12-10|06:52:52]
node2_1       | ########## BAD BLOCK #########
node2_1       | Chain config: {ChainID: 10 Homestead: <nil> DAO: <nil> DAOSupport: false EIP150: 2 EIP155: 0 EIP158: 3 Byzantium: 1 IsQuorum: true Engine: istanbul}
node2_1       |
node2_1       | Number: 778
node2_1       | Hash: 0x4e49285f6da14ba6b7cfde5d09de26f38076df91a9374aafbb54ab02b3ed3a4b
node2_1       |
node2_1       |
node2_1       | Error: invalid signature
node2_1       | ##############################

node5_1       | ERROR[12-10|06:52:54] not a valid address                      err="recovery failed"
node5_1       | ERROR[12-10|06:52:54]
node5_1       | ########## BAD BLOCK #########
node5_1       | Chain config: {ChainID: 10 Homestead: <nil> DAO: <nil> DAOSupport: false EIP150: 2 EIP155: 0 EIP158: 3 Byzantium: 1 IsQuorum: true Engine: istanbul}
node5_1       |
node5_1       | Number: 778
node5_1       | Hash: 0x4e49285f6da14ba6b7cfde5d09de26f38076df91a9374aafbb54ab02b3ed3a4b
node5_1       |
node5_1       |
node5_1       | Error: invalid signature
node5_1       | ##############################

node3_1       | ERROR[12-10|06:53:00] not a valid address                      err="recovery failed"
node3_1       | ERROR[12-10|06:53:00]
node3_1       | ########## BAD BLOCK #########
node3_1       | Chain config: {ChainID: 10 Homestead: <nil> DAO: <nil> DAOSupport: false EIP150: 2 EIP155: 0 EIP158: 3 Byzantium: 1 IsQuorum: true Engine: istanbul}
node3_1       |
node3_1       | Number: 778
node3_1       | Hash: 0x4e49285f6da14ba6b7cfde5d09de26f38076df91a9374aafbb54ab02b3ed3a4b
node3_1       |
node3_1       |
node3_1       | Error: invalid signature
node3_1       | ##############################

node2_1       | ERROR[12-10|06:53:02] not a valid address                      err="recovery failed"
node2_1       | ERROR[12-10|06:53:02]
node2_1       | ########## BAD BLOCK #########
node2_1       | Chain config: {ChainID: 10 Homestead: <nil> DAO: <nil> DAOSupport: false EIP150: 2 EIP155: 0 EIP158: 3 Byzantium: 1 IsQuorum: true Engine: istanbul}
node2_1       |
node2_1       | Number: 778
node2_1       | Hash: 0x4e49285f6da14ba6b7cfde5d09de26f38076df91a9374aafbb54ab02b3ed3a4b
node2_1       |
node2_1       |
node2_1       | Error: invalid signature
node2_1       | ##############################

node5_1       | ERROR[12-10|06:57:04] not a valid address                      err="recovery failed"
node5_1       | ERROR[12-10|06:57:04]
node5_1       | ########## BAD BLOCK #########
node5_1       | Chain config: {ChainID: 10 Homestead: <nil> DAO: <nil> DAOSupport: false EIP150: 2 EIP155: 0 EIP158: 3 Byzantium: 1 IsQuorum: true Engine: istanbul}
node5_1       |
node5_1       | Number: 778
node5_1       | Hash: 0x4e49285f6da14ba6b7cfde5d09de26f38076df91a9374aafbb54ab02b3ed3a4b
node5_1       |
node5_1       |
node5_1       | Error: invalid signature
node5_1       | ##############################

@yjfcn
Copy link
Author

yjfcn commented Dec 10, 2018

Thanks @yjfcn for detailed guide. I set --istanbul.blockperiod 10 in the 7nodes example and after ~8 mins of running I could see the issue on Node 7:

ERROR[12-09|22:05:07] not a valid address                      err="recovery failed"

I haven't seen the dead-lock yet, still waiting. My wild guess that if the above issue err="recovery failed" happens across >= f + 1 validators during the same block the chain could see dead-lock. I'm looking at the code trying to understand why the error err="recovery failed" occurred in the first place.

quorum/crypto/secp256k1/secp256.go

Line 118 in 4b44bea

if C.secp256k1_ext_ecdsa_recover(context, (*C.uchar)(unsafe.Pointer(&pubkey[0])), sigdata, msgdata) == 0 {
quorum/crypto/secp256k1/ext.h

Line 30 in 4b44bea

static int secp256k1_ext_ecdsa_recover(

After running 7nodes example(--istanbul.blockperiod 10) for more than 6 hours, 3 nodes appeared the error err="recovery failed" which caused the blockchain to dead-lock.

cat 1.log|grep "not a valid address"

cat 2.log|grep "not a valid address"

cat 3.log|grep "not a valid address"

cat 4.log|grep "not a valid address"

ERROR[12-07|07:53:44.048] not a valid address err="recovery failed"
ERROR[12-07|12:37:14.034] not a valid address err="recovery failed"
ERROR[12-07|13:39:44.044] not a valid address err="recovery failed"

cat 5.log|grep "not a"

ERROR[12-07|13:39:44.045] not a valid address err="recovery failed"

cat 6.log|grep "not a"

ERROR[12-07|13:39:44.041] not a valid address err="recovery failed"

@yjfcn
Copy link
Author

yjfcn commented Dec 10, 2018

Can you also increase the verbosity and post the logs from all the nodes

OK, will post logs when dead lock appear.

@jbhurat
Copy link
Contributor

jbhurat commented Dec 12, 2018

@yjfcn are the nodes running under docker. If yes, can you try using a non dockerized version. Also, can you confirm if there is no time drift between the nodes.
I have tried to replicate the deadlock in a non virtualized environment, but so far have not been able to

@yjfcn
Copy link
Author

yjfcn commented Dec 13, 2018

@yjfcn are the nodes running under docker. If yes, can you try using a non dockerized version. Also, can you confirm if there is no time drift between the nodes.
I have tried to replicate the deadlock in a non virtualized environment, but so far have not been able to

@jbhurat These nodes are not running under docker. They are decentralized in different place. Maybe there is time drift between the nodes, but it's not more than 30ms.
And I found another thing which is very interesting:
If I set --istanbul.blockperiod 10, the error "not a valid address err="recovery failed" will appear on each nodes after running several hours. If using other number, such as 5, 9, 12 etc, this error will not appear on all nodes. 10 is a magic number!!!
Can you try to set --istanbul.blockperiod 10 for testing?

@yjfcn
Copy link
Author

yjfcn commented Dec 17, 2018

@jbhurat Most likely it's --istanbul.blockperiod 10 cause this problem. If I use other number, all nodes work fine!

@jbhurat
Copy link
Contributor

jbhurat commented Dec 17, 2018

@yjfcn I have been running a 7 node network locally for over 24 hours, I do see not a valid address err="recovery failed" error message in the logs, but no deadlock. I am going through the verbose logs to see what is triggering that error.
Yes, I see that message only when setting --istanbul.blockperiod 10 and no other number

@jbhurat
Copy link
Contributor

jbhurat commented Dec 17, 2018

@yjfcn I have reviewed the logs, and one thing I have noticed that the issue always happens, when there is a round change and depending on what state that particular node is in, it might fail to validate the header.
Also, the default round change timeout istanbul.requesttimeout is 10000 milliseconds which is 10 seconds. This explains why we only see the error when istanbul.blockperiod is set to 10. Can you try setting --istanbul.requesttimeout 15000 and see if this fixes the issue. I am trying to do the same on my end and will let you know what I find.

@yjfcn
Copy link
Author

yjfcn commented Dec 19, 2018

@jbhurat Tried setting --istanbul.requesttimeout 15000 and --istanbul.blockperiod 10 for 24 hours, looks no problem.

@jbhurat
Copy link
Contributor

jbhurat commented Dec 19, 2018

Good to hear. I see the similar results on my tests.

I am closing the issue. Please reopen if you need any further assistance.

@jbhurat jbhurat closed this as completed Dec 19, 2018
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

4 participants