Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Improve gossip handling of in-memory expired certs #5110

Draft
wants to merge 2 commits into
base: main
Choose a base branch
from

Conversation

denyeart
Copy link
Contributor

@denyeart denyeart commented Jan 20, 2025

If a peer's cert expires while it is still in gossip memory
for another peer, membership cannot be re-established
after the cert is renewed.

The fix is to acknowledge that the expired cert's peer
is no longer in gossip identity store and accept
new connection with the new PKI-ID and identity.

Add new integration test:
-Test renew cert before expiration (passes)
-Test renew cert after expiration (previously failed)

Resolves #5111

-Test renew cert before expiration (passes)
-Test renew cert after expiration (previously failed)

Signed-off-by: David Enyeart <[email protected]>
@denyeart denyeart force-pushed the gossip_renew_cert_test branch from 80a77ef to 8fb837e Compare January 20, 2025 05:22
@yacovm
Copy link
Contributor

yacovm commented Jan 22, 2025

Dave, FYI - the test is kind of flaky, at least on my laptop. It fails often before the expiration part.
If you merge a fix with this test, it will make the integration tests suite flaky.

Now, to the point - Here is what is happening in the "failing" part of the test (after the certificate expires):

  1. The second peer's identity gets expired and the first peer purges its from its identity store once its certificate expires.
  2. The next time the first peer connects to the second peer, it does so because it remembers its endpoint and its PKI-ID [0].
  3. When a connection is established to the second peer, gossip's communication layer expects to connect to the former PKI-ID. The code has a safeguard that agrees to recognize the change in PKI-ID, as long as the peer (or rather, the endpoint we connect to) didn't change its organization affiliation [1]. However as mentioned before, the peer's membership layer doesn't remember the second peer's organization, only its PKI-ID. It therefore asks the identity store for the organization of the old PKI-ID, but it doesn't have it, as it has purged it from its memory once its x509 certificate has expired. The peer therefore refuses to connect to the new incarnation of the second peer.

A simple fix would be to naively assume that if the peer's identity was purged from the identity store, to take the new identity returned from the handshake:

diff --git a/gossip/comm/comm_impl.go b/gossip/comm/comm_impl.go
index 6a443ae0c..57341bf1f 100644
--- a/gossip/comm/comm_impl.go
+++ b/gossip/comm/comm_impl.go
@@ -173,6 +173,11 @@ func (c *commImpl) createConnection(endpoint string, expectedPKIID common.PKIidT
                                // If the identity isn't present, it's nil - therefore OrgByPeerIdentity would
                                // return nil too and thus would be different than the actual organization
                                identity, _ := c.idMapper.Get(expectedPKIID)
+                               // Identity is no longer present in the identity store, it was probably purged due to being invalidated
+                               if len(identity) == 0 {
+                                       // Optimistically use the new identity of the peer, to merge with the logic flow of this program.
+                                       identity = connInfo.Identity
+                               }
                                oldOrg := c.sa.OrgByPeerIdentity(identity)
                                if !bytes.Equal(actualOrg, oldOrg) {
                                        c.logger.Warning("Remote endpoint claims to be a different peer, expected", expectedPKIID, "but got", pkiID)

This fix, however, would expose the peer to a network spoofing attack in case the identity expires and an adversary routes the communication to the endpoint to itself.

[0] It doesn't remember its organization ID, because... well, when this code was written, we didn't have organizations in Fabric (I feel so old writing this...) and we added organizations to gossip only later on. Since the PKI-ID indirectly is also bound to the organization, it's safe to only use the PKI-ID.

[1] This was put as a safeguard against a network spoofing attack, because the peer gives gossip all TLS CAs across all organizations as a single bundle and not on a per-org basis.

Another fix which I think is more natural is to fail fast in case the identity is no longer present in the identity store, and let the second peer with the renewed identity connect to the first peer instead with its new PKI-ID and identity.

The fix is as follows:

diff --git a/gossip/comm/comm_impl.go b/gossip/comm/comm_impl.go
index 6a443ae0c..f06c1bb84 100644
--- a/gossip/comm/comm_impl.go
+++ b/gossip/comm/comm_impl.go
@@ -162,6 +162,15 @@ func (c *commImpl) createConnection(endpoint string, expectedPKIID common.PKIidT
                return nil, errors.WithStack(err)
        }
 
+
+       if len(expectedPKIID) > 0 {
+               identity, _ := c.idMapper.Get(expectedPKIID)
+               if len(identity) == 0 {
+                       c.logger.Warningf("Identity of %x is no longer found in the identity store, aborting connection", expectedPKIID)
+                       return nil, errors.New("identity no longer recognized")
+               }
+       }
+
        ctx, cancel = context.WithCancel(context.Background())
        if stream, err = cl.GossipStream(ctx); err == nil {
                connInfo, err = c.authenticateRemotePeer(stream, true, false)
diff --git a/integration/gossip/gossip_test.go b/integration/gossip/gossip_test.go
index 5d4c3ce09..ad5365588 100644
--- a/integration/gossip/gossip_test.go
+++ b/integration/gossip/gossip_test.go
@@ -350,7 +350,8 @@ var _ = Describe("Gossip State Transfer and Membership", func() {
                startPeers(nwprocs, false, peer0Org2)
 
                By("ensuring that peer0Org1 replaces peer0Org2 PKI-ID after it expired")
-               Eventually(peer0Org1Runner.Err(), network.EventuallyTimeout).Should(gbytes.Say("changed its PKI-ID from"))
+               time.Sleep(5 * time.Second)
+               network.VerifyMembership(bothPeers, channelName)
        })
 })

@denyeart
Copy link
Contributor Author

@yacovm The last fix you mention sounds reasonable.

I think the integration test is flaky because the time for gossip membership establishment varies greatly, sometimes it is almost instant and sometimes it takes a minute. Do you have any ideas on how to make the membership establishment time quicker or more consistent, either thru a config change or a code change?

@yacovm
Copy link
Contributor

yacovm commented Jan 22, 2025

@yacovm The last fix you mention sounds reasonable.

I think the integration test is flaky because the time for gossip membership establishment varies greatly, sometimes it is almost instant and sometimes it takes a minute. Do you have any ideas on how to make the membership establishment time quicker or more consistent, either thru a config change or a code change?

I think it's something with the nwo, because when it flakes, the logs clearly show that both peers are known, yet it continues querying from some odd reason. Something with the matchers is probably wrong.

It's also weird that only one of the peers is queried in DiscoverPeers. I would expect both of them to be queried... this needs to be checked out.

Copy link
Contributor Author

@denyeart denyeart left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@yacovm I made progress on the fix and the test based on your suggestions but still a few issues remain...

Comment on lines 337 to 343
//By("verifying membership after cert renewed")
// Discovery check fails since discovered peer's cert is not identical to expected peer's cert (content is the same but signature has changed, perhaps due to how cert is re-created in expireCertificate)
// Eventually(
// nwo.DiscoverPeers(network, peer0Org1, "User1", "testchannel"),
// 60*time.Second,
// 100*time.Millisecond).Should(ContainElements(network.DiscoveredPeer(network.Peer("Org2", "peer0"), "_lifecycle"))
// )
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I checked the weird Discovery check behavior... it appears that discovered peer's cert is not identical to expected peer's cert and therefore the matcher fails. The cert content matches expectations, but signature has changed, perhaps due to how cert is re-created in expireCertificate() below.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could it be that the cert needs to be sanitized?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You're right again @yacovm , I updated the test to force Low-S (sanitized) signatures for the renewed certs and now it works consistently.

Comment on lines +165 to +172
if len(expectedPKIID) > 0 {
identity, _ := c.idMapper.Get(expectedPKIID)
if len(identity) == 0 {
c.logger.Warningf("Identity of %x is no longer found in the identity store, aborting connection", expectedPKIID)
return nil, errors.New("identity no longer recognized")
}
}

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This came close to fixing the problem... membership is indeed established now, but the new warning and one of the existing warnings keeps getting repeated in the logs forever:

      2025-01-23 00:59:56.269 EST 00a4 WARN [gossip.gossip] disclosurePolicy -> Cannot determine organization of Endpoint: 127.0.0.1:22510, InternalEndpoint: , PKI-ID: fd8548f3b6fb8164a711b6d6dfa2a55f5ae013f512015f884e9af282ef553aa5, Metadata: 
      2025-01-23 00:59:56.273 EST 00a5 WARN [gossip.comm] createConnection -> Identity of 66643835343866336236666238313634613731316236643664666132613535663561653031336635313230313566383834653961663238326566353533616135 is no longer found in the identity store, aborting connection
      2025-01-23 00:59:56.273 EST 00a6 WARN [gossip.comm] sendToEndpoint -> Failed obtaining connection for 127.0.0.1:22510, PKIid:fd8548f3b6fb8164a711b6d6dfa2a55f5ae013f512015f884e9af282ef553aa5 reason: identity no longer recognized
      2025-01-23 00:59:57.778 EST 00a7 WARN [gossip.gossip] func3 -> Unable to determine org of message tag:EMPTY alive_msg:{membership:{endpoint:"127.0.0.1:22510" pki_id:"\xfd\x85H\xf3\xb6\xfb\x81d\xa7\x11\xb6\xd6ߢ\xa5_Z\xe0\x13\xf5\x12\x01_\x88N\x9a\xf2\x82\xefU:\xa5"} timestamp:{inc_num:1737611884416490000 seq_num:28}}
      2025-01-23 01:00:01.780 EST 00a8 WARN [gossip.gossip] func3 -> Unable to determine org of message tag:EMPTY alive_msg:{membership:{endpoint:"127.0.0.1:22510" pki_id:"\xfd\x85H\xf3\xb6\xfb\x81d\xa7\x11\xb6\xd6ߢ\xa5_Z\xe0\x13\xf5\x12\x01_\x88N\x9a\xf2\x82\xefU:\xa5"} timestamp:{inc_num:1737611884416490000 seq_num:28}}
      2025-01-23 01:00:05.781 EST 00a9 WARN [gossip.gossip] func3 -> Unable to determine org of message tag:EMPTY alive_msg:{membership:{endpoint:"127.0.0.1:22510" pki_id:"\xfd\x85H\xf3\xb6\xfb\x81d\xa7\x11\xb6\xd6ߢ\xa5_Z\xe0\x13\xf5\x12\x01_\x88N\x9a\xf2\x82\xefU:\xa5"} timestamp:{inc_num:1737611884416490000 seq_num:28}}

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This came close to fixing the problem... membership is indeed established now, but the new warning and one of the existing warnings keeps getting repeated in the logs forever:

      2025-01-23 00:59:56.269 EST 00a4 WARN [gossip.gossip] disclosurePolicy -> Cannot determine organization of Endpoint: 127.0.0.1:22510, InternalEndpoint: , PKI-ID: fd8548f3b6fb8164a711b6d6dfa2a55f5ae013f512015f884e9af282ef553aa5, Metadata: 
      2025-01-23 00:59:56.273 EST 00a5 WARN [gossip.comm] createConnection -> Identity of 66643835343866336236666238313634613731316236643664666132613535663561653031336635313230313566383834653961663238326566353533616135 is no longer found in the identity store, aborting connection
      2025-01-23 00:59:56.273 EST 00a6 WARN [gossip.comm] sendToEndpoint -> Failed obtaining connection for 127.0.0.1:22510, PKIid:fd8548f3b6fb8164a711b6d6dfa2a55f5ae013f512015f884e9af282ef553aa5 reason: identity no longer recognized
      2025-01-23 00:59:57.778 EST 00a7 WARN [gossip.gossip] func3 -> Unable to determine org of message tag:EMPTY alive_msg:{membership:{endpoint:"127.0.0.1:22510" pki_id:"\xfd\x85H\xf3\xb6\xfb\x81d\xa7\x11\xb6\xd6ߢ\xa5_Z\xe0\x13\xf5\x12\x01_\x88N\x9a\xf2\x82\xefU:\xa5"} timestamp:{inc_num:1737611884416490000 seq_num:28}}
      2025-01-23 01:00:01.780 EST 00a8 WARN [gossip.gossip] func3 -> Unable to determine org of message tag:EMPTY alive_msg:{membership:{endpoint:"127.0.0.1:22510" pki_id:"\xfd\x85H\xf3\xb6\xfb\x81d\xa7\x11\xb6\xd6ߢ\xa5_Z\xe0\x13\xf5\x12\x01_\x88N\x9a\xf2\x82\xefU:\xa5"} timestamp:{inc_num:1737611884416490000 seq_num:28}}
      2025-01-23 01:00:05.781 EST 00a9 WARN [gossip.gossip] func3 -> Unable to determine org of message tag:EMPTY alive_msg:{membership:{endpoint:"127.0.0.1:22510" pki_id:"\xfd\x85H\xf3\xb6\xfb\x81d\xa7\x11\xb6\xd6ߢ\xa5_Z\xe0\x13\xf5\x12\x01_\x88N\x9a\xf2\x82\xefU:\xa5"} timestamp:{inc_num:1737611884416490000 seq_num:28}}

This happens because the peer with the renewed certificate sends the first peer a membership request ("give me all the peers you know about, alive or dead").

The first peer still has the old incarnation of the second peer in its memory. It then asks itself whether it can disclose knowledge of that peer in its membership response, but it cannot determine that because it no longer knows what is the identity corresponding to the PKI-ID of its previous incarnation.

If you run the peer for long enough (I think up to 2 hours?) it will garbage collect that peer's alive messages and it will no longer print that.

Ideally we would purge everything we know about a peer once its certificate has expired.
However I consider this a separate problem and it's orthogonal to this issue.
The fact that the peer issues warnings in the log doesn't mean something is necessarily wrong.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There exists functionality to update PKI-ID and to purge as can be seen in the regular (non-expired) cert renewal scenario, e.g.:

INFO [gossip.comm] createConnection -> Peer 127.0.0.1:22510 changed its PKI-ID from 3257ef1cb73517a1c07b06c0c8f06903cb537199ffb11ecd23643b900f3f43d2 to 9f3e40a456e9f1a22c73e1206bd78b3b1b0ed7276ffbb37b9de00d23530f0877
INFO [gossip.discovery] purge -> Purging 3257ef1cb73517a1c07b06c0c8f06903cb537199ffb11ecd23643b900f3f43d2 from membership

Why wouldn't we do the same in the case of an expired cert renewal?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There exists functionality to update PKI-ID and to purge as can be seen in the regular (non-expired) cert renewal scenario, e.g.:

INFO [gossip.comm] createConnection -> Peer 127.0.0.1:22510 changed its PKI-ID from 3257ef1cb73517a1c07b06c0c8f06903cb537199ffb11ecd23643b900f3f43d2 to 9f3e40a456e9f1a22c73e1206bd78b3b1b0ed7276ffbb37b9de00d23530f0877
INFO [gossip.discovery] purge -> Purging 3257ef1cb73517a1c07b06c0c8f06903cb537199ffb11ecd23643b900f3f43d2 from membership

Why wouldn't we do the same in the case of an expired cert renewal?

Yeah I know the code :-)

I didn't have the time to look whether it's the best option from a code change perspective.

I'll look at it more carefully when I have time

@denyeart denyeart force-pushed the gossip_renew_cert_test branch from f1c1274 to 8f7ec47 Compare January 23, 2025 06:23
@denyeart denyeart marked this pull request as draft January 23, 2025 06:26
@denyeart denyeart force-pushed the gossip_renew_cert_test branch from 8f7ec47 to 490dca6 Compare January 24, 2025 02:23
@denyeart denyeart changed the title Add gossip integration test for renewing expired cert Improve gossip handling of in-memory expired certs Jan 24, 2025
If a peer's cert expires while it is still in gossip memory
for another peer, membership cannot be re-established
after the cert is renewed.

The fix is to acknowledge that the expired cert's peer
is no longer in gossip identity store and accept
new connection with the new PKI-ID and identity.

Signed-off-by: David Enyeart <[email protected]>
@denyeart denyeart force-pushed the gossip_renew_cert_test branch from 490dca6 to 6142f16 Compare January 24, 2025 02:29
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Gossip doesn't replace PKI-ID of a peer that has renewed an expired certificate
3 participants