-
Notifications
You must be signed in to change notification settings - Fork 8.9k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Improve gossip handling of in-memory expired certs #5110
base: main
Are you sure you want to change the base?
Conversation
d17a06e
to
80a77ef
Compare
-Test renew cert before expiration (passes) -Test renew cert after expiration (previously failed) Signed-off-by: David Enyeart <[email protected]>
80a77ef
to
8fb837e
Compare
Dave, FYI - the test is kind of flaky, at least on my laptop. It fails often before the expiration part. Now, to the point - Here is what is happening in the "failing" part of the test (after the certificate expires):
A simple fix would be to naively assume that if the peer's identity was purged from the identity store, to take the new identity returned from the handshake:
This fix, however, would expose the peer to a network spoofing attack in case the identity expires and an adversary routes the communication to the endpoint to itself. [0] It doesn't remember its organization ID, because... well, when this code was written, we didn't have organizations in Fabric (I feel so old writing this...) and we added organizations to gossip only later on. Since the PKI-ID indirectly is also bound to the organization, it's safe to only use the PKI-ID. [1] This was put as a safeguard against a network spoofing attack, because the peer gives gossip all TLS CAs across all organizations as a single bundle and not on a per-org basis. Another fix which I think is more natural is to fail fast in case the identity is no longer present in the identity store, and let the second peer with the renewed identity connect to the first peer instead with its new PKI-ID and identity. The fix is as follows:
|
@yacovm The last fix you mention sounds reasonable. I think the integration test is flaky because the time for gossip membership establishment varies greatly, sometimes it is almost instant and sometimes it takes a minute. Do you have any ideas on how to make the membership establishment time quicker or more consistent, either thru a config change or a code change? |
I think it's something with the nwo, because when it flakes, the logs clearly show that both peers are known, yet it continues querying from some odd reason. Something with the matchers is probably wrong. It's also weird that only one of the peers is queried in |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@yacovm I made progress on the fix and the test based on your suggestions but still a few issues remain...
integration/gossip/gossip_test.go
Outdated
//By("verifying membership after cert renewed") | ||
// Discovery check fails since discovered peer's cert is not identical to expected peer's cert (content is the same but signature has changed, perhaps due to how cert is re-created in expireCertificate) | ||
// Eventually( | ||
// nwo.DiscoverPeers(network, peer0Org1, "User1", "testchannel"), | ||
// 60*time.Second, | ||
// 100*time.Millisecond).Should(ContainElements(network.DiscoveredPeer(network.Peer("Org2", "peer0"), "_lifecycle")) | ||
// ) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I checked the weird Discovery check behavior... it appears that discovered peer's cert is not identical to expected peer's cert and therefore the matcher fails. The cert content matches expectations, but signature has changed, perhaps due to how cert is re-created in expireCertificate() below.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Could it be that the cert needs to be sanitized?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You're right again @yacovm , I updated the test to force Low-S (sanitized) signatures for the renewed certs and now it works consistently.
if len(expectedPKIID) > 0 { | ||
identity, _ := c.idMapper.Get(expectedPKIID) | ||
if len(identity) == 0 { | ||
c.logger.Warningf("Identity of %x is no longer found in the identity store, aborting connection", expectedPKIID) | ||
return nil, errors.New("identity no longer recognized") | ||
} | ||
} | ||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This came close to fixing the problem... membership is indeed established now, but the new warning and one of the existing warnings keeps getting repeated in the logs forever:
2025-01-23 00:59:56.269 EST 00a4 WARN [gossip.gossip] disclosurePolicy -> Cannot determine organization of Endpoint: 127.0.0.1:22510, InternalEndpoint: , PKI-ID: fd8548f3b6fb8164a711b6d6dfa2a55f5ae013f512015f884e9af282ef553aa5, Metadata:
2025-01-23 00:59:56.273 EST 00a5 WARN [gossip.comm] createConnection -> Identity of 66643835343866336236666238313634613731316236643664666132613535663561653031336635313230313566383834653961663238326566353533616135 is no longer found in the identity store, aborting connection
2025-01-23 00:59:56.273 EST 00a6 WARN [gossip.comm] sendToEndpoint -> Failed obtaining connection for 127.0.0.1:22510, PKIid:fd8548f3b6fb8164a711b6d6dfa2a55f5ae013f512015f884e9af282ef553aa5 reason: identity no longer recognized
2025-01-23 00:59:57.778 EST 00a7 WARN [gossip.gossip] func3 -> Unable to determine org of message tag:EMPTY alive_msg:{membership:{endpoint:"127.0.0.1:22510" pki_id:"\xfd\x85H\xf3\xb6\xfb\x81d\xa7\x11\xb6\xd6ߢ\xa5_Z\xe0\x13\xf5\x12\x01_\x88N\x9a\xf2\x82\xefU:\xa5"} timestamp:{inc_num:1737611884416490000 seq_num:28}}
2025-01-23 01:00:01.780 EST 00a8 WARN [gossip.gossip] func3 -> Unable to determine org of message tag:EMPTY alive_msg:{membership:{endpoint:"127.0.0.1:22510" pki_id:"\xfd\x85H\xf3\xb6\xfb\x81d\xa7\x11\xb6\xd6ߢ\xa5_Z\xe0\x13\xf5\x12\x01_\x88N\x9a\xf2\x82\xefU:\xa5"} timestamp:{inc_num:1737611884416490000 seq_num:28}}
2025-01-23 01:00:05.781 EST 00a9 WARN [gossip.gossip] func3 -> Unable to determine org of message tag:EMPTY alive_msg:{membership:{endpoint:"127.0.0.1:22510" pki_id:"\xfd\x85H\xf3\xb6\xfb\x81d\xa7\x11\xb6\xd6ߢ\xa5_Z\xe0\x13\xf5\x12\x01_\x88N\x9a\xf2\x82\xefU:\xa5"} timestamp:{inc_num:1737611884416490000 seq_num:28}}
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This came close to fixing the problem... membership is indeed established now, but the new warning and one of the existing warnings keeps getting repeated in the logs forever:
2025-01-23 00:59:56.269 EST 00a4 WARN [gossip.gossip] disclosurePolicy -> Cannot determine organization of Endpoint: 127.0.0.1:22510, InternalEndpoint: , PKI-ID: fd8548f3b6fb8164a711b6d6dfa2a55f5ae013f512015f884e9af282ef553aa5, Metadata: 2025-01-23 00:59:56.273 EST 00a5 WARN [gossip.comm] createConnection -> Identity of 66643835343866336236666238313634613731316236643664666132613535663561653031336635313230313566383834653961663238326566353533616135 is no longer found in the identity store, aborting connection 2025-01-23 00:59:56.273 EST 00a6 WARN [gossip.comm] sendToEndpoint -> Failed obtaining connection for 127.0.0.1:22510, PKIid:fd8548f3b6fb8164a711b6d6dfa2a55f5ae013f512015f884e9af282ef553aa5 reason: identity no longer recognized 2025-01-23 00:59:57.778 EST 00a7 WARN [gossip.gossip] func3 -> Unable to determine org of message tag:EMPTY alive_msg:{membership:{endpoint:"127.0.0.1:22510" pki_id:"\xfd\x85H\xf3\xb6\xfb\x81d\xa7\x11\xb6\xd6ߢ\xa5_Z\xe0\x13\xf5\x12\x01_\x88N\x9a\xf2\x82\xefU:\xa5"} timestamp:{inc_num:1737611884416490000 seq_num:28}} 2025-01-23 01:00:01.780 EST 00a8 WARN [gossip.gossip] func3 -> Unable to determine org of message tag:EMPTY alive_msg:{membership:{endpoint:"127.0.0.1:22510" pki_id:"\xfd\x85H\xf3\xb6\xfb\x81d\xa7\x11\xb6\xd6ߢ\xa5_Z\xe0\x13\xf5\x12\x01_\x88N\x9a\xf2\x82\xefU:\xa5"} timestamp:{inc_num:1737611884416490000 seq_num:28}} 2025-01-23 01:00:05.781 EST 00a9 WARN [gossip.gossip] func3 -> Unable to determine org of message tag:EMPTY alive_msg:{membership:{endpoint:"127.0.0.1:22510" pki_id:"\xfd\x85H\xf3\xb6\xfb\x81d\xa7\x11\xb6\xd6ߢ\xa5_Z\xe0\x13\xf5\x12\x01_\x88N\x9a\xf2\x82\xefU:\xa5"} timestamp:{inc_num:1737611884416490000 seq_num:28}}
This happens because the peer with the renewed certificate sends the first peer a membership request ("give me all the peers you know about, alive or dead").
The first peer still has the old incarnation of the second peer in its memory. It then asks itself whether it can disclose knowledge of that peer in its membership response, but it cannot determine that because it no longer knows what is the identity corresponding to the PKI-ID of its previous incarnation.
If you run the peer for long enough (I think up to 2 hours?) it will garbage collect that peer's alive messages and it will no longer print that.
Ideally we would purge everything we know about a peer once its certificate has expired.
However I consider this a separate problem and it's orthogonal to this issue.
The fact that the peer issues warnings in the log doesn't mean something is necessarily wrong.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There exists functionality to update PKI-ID and to purge as can be seen in the regular (non-expired) cert renewal scenario, e.g.:
INFO [gossip.comm] createConnection -> Peer 127.0.0.1:22510 changed its PKI-ID from 3257ef1cb73517a1c07b06c0c8f06903cb537199ffb11ecd23643b900f3f43d2 to 9f3e40a456e9f1a22c73e1206bd78b3b1b0ed7276ffbb37b9de00d23530f0877
INFO [gossip.discovery] purge -> Purging 3257ef1cb73517a1c07b06c0c8f06903cb537199ffb11ecd23643b900f3f43d2 from membership
Why wouldn't we do the same in the case of an expired cert renewal?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There exists functionality to update PKI-ID and to purge as can be seen in the regular (non-expired) cert renewal scenario, e.g.:
INFO [gossip.comm] createConnection -> Peer 127.0.0.1:22510 changed its PKI-ID from 3257ef1cb73517a1c07b06c0c8f06903cb537199ffb11ecd23643b900f3f43d2 to 9f3e40a456e9f1a22c73e1206bd78b3b1b0ed7276ffbb37b9de00d23530f0877 INFO [gossip.discovery] purge -> Purging 3257ef1cb73517a1c07b06c0c8f06903cb537199ffb11ecd23643b900f3f43d2 from membership
Why wouldn't we do the same in the case of an expired cert renewal?
Yeah I know the code :-)
I didn't have the time to look whether it's the best option from a code change perspective.
I'll look at it more carefully when I have time
f1c1274
to
8f7ec47
Compare
8f7ec47
to
490dca6
Compare
If a peer's cert expires while it is still in gossip memory for another peer, membership cannot be re-established after the cert is renewed. The fix is to acknowledge that the expired cert's peer is no longer in gossip identity store and accept new connection with the new PKI-ID and identity. Signed-off-by: David Enyeart <[email protected]>
490dca6
to
6142f16
Compare
If a peer's cert expires while it is still in gossip memory
for another peer, membership cannot be re-established
after the cert is renewed.
The fix is to acknowledge that the expired cert's peer
is no longer in gossip identity store and accept
new connection with the new PKI-ID and identity.
Add new integration test:
-Test renew cert before expiration (passes)
-Test renew cert after expiration (previously failed)
Resolves #5111