Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

gh-116810: fix memory leak in ssl module #123249

Merged
merged 15 commits into from
Sep 30, 2024

Conversation

jeffvanvoorst
Copy link
Contributor

@jeffvanvoorst jeffvanvoorst commented Aug 23, 2024

PySSL_get_session leaks a session object each time it is called. For programs like cherrypy (cheroot), a leak occurs with each handled web request.

PySSL_get_session leaks a session object each time
it is called.  For programs like cherrypy (cheroot),
a leak occurs with each handled web request.
Copy link

cpython-cla-bot bot commented Aug 23, 2024

All commit authors signed the Contributor License Agreement.
CLA signed

@bedevere-app
Copy link

bedevere-app bot commented Aug 23, 2024

Most changes to Python require a NEWS entry. Add one using the blurb_it web app or the blurb command-line tool.

If this change has little impact on Python users, wait for a maintainer to apply the skip news label instead.

Modules/_ssl.c Outdated
Comment on lines 2856 to 2859
session = SSL_get1_session(self->ssl);
if (session == NULL) {
Py_RETURN_NONE;
}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Judging by no test failures, this probably does not have any coverage. Any chance of including the tests in this PR?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I can check the tests this week. I doubt I will be able to finish that work today.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Since this change still results in returning a valid copy of the ssl session, I doubt tests would fail.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, agreeing there. I think it would be difficult to test this. With that being said, are there any existing tests for checking that the session attribute is valid? If not, then this PR should add them (otherwise, I think this looks good).

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I had some life issues come up. I will try to get back to this in the next few weeks.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There's a cacophony of mistakes here on the CPython side. Let me try to correct things:

I don't really understand how this code was nor always a memory leak in the past.

The memory leak seems to have been introduced by 39258d3. _ssl_session_dup returns an SSL_SESSION that needs to be freed by someone and then you all just immediately clobber the variable with something else. But test_ssl.py clearly accesses session. Does CPython have no way to catch memory leaks in its tests? Might I suggest you all look into LeakSanitizer?

Anyway, the problem with that commit is that you all turned #if ... #else ... #endif into including both branches instead of just one.

That commit not only introduced a memory leak, but it also mostly broke your workaround. I haven't tried it, but I suspect you would see this if you uncleanly destroyed the other connection before calling set_session instead of after. Now, I see uncleanly because it seems CPython never followed up on openssl/openssl#1550 (comment) because you all have been misunderstanding the nature of the workaround. It was not working around an OpenSSL bug, but a CPython bug / broken expectation.

Based on my understanding of https://docs.openssl.org/master/man3/SSL_get_session/ I believe this will work, but it does raise the question of why we have the SSL_get0_session + _ssl_session_dup combo in the first place; with modern OpenSSL 3+ why would this SSL_get1_session not just do what we wanted?

That is not what is going on here. There were, as far as I know, no changes in OpenSSL 3+ relating to this.

SSL_get0_session + _ssl_session_dup and SSL_get1_session do not and have never same thing in any version of OpenSSL. SSL_get0_session and SSL_get1_session are the exact same accessor. Just one of them increments the refcount for you and the other doesn't. _ssl_session_dup is a (misguided) CPython function to make a deep copy of the SSL_SESSION object.

Although the code was, prior to the regression, gated on a OPENSSL_VERSION_1_1 ifdef, that does not imply was specific to OpenSSL 1.1.x. OPENSSL_VERSION_1_1 is a CPython-invented macro (that squats OpenSSL's namespace) which means > 1.1. That is, OpenSSL 3 was also OPENSSL_VERSION_1_1:

#if (OPENSSL_VERSION_NUMBER >= 0x10100000L) && !defined(LIBRESSL_VERSION_NUMBER)
#  define OPENSSL_VERSION_1_1 1
#  define PY_OPENSSL_1_1_API 1
#endif

While I have not tested it (have you all?), based on openssl/openssl#1550 (comment), I highly suspect that CPython "needs" the workaround in OpenSSL 3.x too. I say "needs" because...

I think the issue @.davidben was saying Python has in there was describing more the way Python is interacting with the available APIs (which at the time had to span a wide range of OpenSSL versions) and not using some other features instead.

No, that's not what I was saying in openssl/openssl#1550 (comment). See the immediately preceding comment:

FYI, I poked at this briefly and don't see any evidence that SSL_shutdown is being called by Python. It seems the pseudocode in the bug report is wrong and this is a Python bug.

The bug report was very confused because the minimal reproducer that CPython provided did not actually reflect what CPython does. With all the meandering with the bad reproducer, I suspect it was very confusing. But the CPython folks on the bug never followed up, so my past self assumed you all understood, were planning on fixing it, and I didn't have time to check on you all. 😞

What is actually going on here is that OpenSSL will invalidate sessions on connection destruction if you haven't called SSL_shutdown first. To be honest, I don't think this is good behavior, but it has been OpenSSL's intentional behavior since day 1. I believe this is descended from this guidance:

close_notify
This message notifies the recipient that the sender will not send
any more messages on this connection. The session becomes
unresumable if any connection is terminated without proper
close_notify messages with level equal to warning.

https://www.rfc-editor.org/rfc/rfc2246.html#section-7.2.1

Now, I personally think this behavior is unhelpful. If a connection just got closed, that doesn't make the old key material broken or anything. Also if we were getting any security properties out of this, it wouldn't work anyway. There are loads of cases where you cannot reliably knock out the other session.

  • Stateless resumption ("session tickets" in TLS 1.2) cannot possibly invalidate the session on the server
  • Maybe your application serializes the session, e.g., on disk, and then deserializes it later

Interestingly, OpenSSL doesn't actually look for a full bidirectional shutdown to preserve your session. It only looks for whether you attempted to send close_notify. Kind of pointless, to be honest, but IMO this whole thing is pointless.

TLS 1.1, incidentally, already softened this recommendation:

close_notify
This message notifies the recipient that the sender will not send
any more messages on this connection. Note that as of TLS 1.1,
failure to properly close a connection no longer requires that a
session not be resumed. This is a change from TLS 1.0 to conform
with widespread implementation practice.

https://www.rfc-editor.org/rfc/rfc4346.html#section-7.2.1

But OpenSSL retains this requirement, for better or worse. This requirement was not new in 1.1.x, nor was it removed in 3.x. The only reason you all needed to work around it later on is that, due to other quirks of how CPython was using the API (arguably incorrectly), as detailed in my comment on the OpenSSL issue, CPython was sneaking through a gap in this OpenSSL policy. OpenSSL fixed that gap, and now CPython was affected.

Now, contrary to the claim in openssl/openssl#1550 (comment), CPython did not call SSL_shutdown. You all have a separate method to exchange close_notify and weren't calling it in that test.

So, what to do given this? I think you have two options:

  1. Agree with OpenSSL's policy and pass it up to the callers. Without calling SSLSocket.shutdown(), sessions get invalidated. Callers have to call the shutdown method if they care. Fix your tests to reflect this.
  2. Disagree with OpenSSL's policy and effectively turn it off. To do that, call SSL_shutdown when destroying an SSL. Now, that will cause OpenSSL to do I/O, so you might want to SSL_set_quiet_shutdown. Alternatively, there is an SSL_set_shutdown API to just fake a shutdown state.

In either case, you can remove the workaround because this workaround has been self-inflicted from day one.

If you pick option 2, please write some clear comments in the code so the next time this happens, I don't have to write another essay for you all. :-P

The OpenSSL 1.1.1 fork's I do think we should care somewhat about are BoringSSL and it's AWS-LC derivative. @WillChilds-Klein and @davidben can represent what those might need for these specific APIs.

So, per the above, this question is based on several incorrect premises:

  • BoringSSL is not an OpenSSL 1.1.1 fork. We actually diverged from OpenSSL 1.0.2. We just report a version of 1.1.1 because that's what's most convenient for compatibility.
  • This issue has nothing to do with OpenSSL 1.1.1. I expect you all still "need" this workaround in OpenSSL 3.x too.
  • This issue has nothing to do with these specific APIs

Since this "workaround" is really a workaround for CPython's own mishaps, if you all do the thing I suggest above, it should work just fine in BoringSSL.

But also this is all moot in BoringSSL because BoringSSL does not invalidate sessions on SSL_free in the first place. We removed it because it's just kinda silly.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is where OpenSSL documents their behavior, by the way:

A session will be automatically removed from the session cache and marked as non-resumable if the connection is not closed down cleanly, e.g. if a fatal error occurs on the connection or SSL_shutdown(3) is not called prior to SSL_free(3).

https://docs.openssl.org/master/man3/SSL_get_session/#notes:~:text=A%20session%20will%20be%20automatically%20removed%20from%20the%20session%20cache%20and%20marked%20as%20non%2Dresumable%20if%20the%20connection%20is%20not%20closed%20down%20cleanly%2C%20e.g.%20if%20a%20fatal%20error%20occurs%20on%20the%20connection%20or%20SSL_shutdown(3)%20is%20not%20called%20prior%20to%20SSL_free(3).

For completeness, there were two different bits of spec text that led to this OpenSSL behavior, and I only cited one of them. The bits I cited were about unclean shutdown and removed very early. Fatal error lasted all the way to TLS 1.2:

Error handling in the TLS Handshake protocol is very simple. When an
error is detected, the detecting party sends a message to the other
party. Upon transmission or receipt of a fatal alert message, both
parties immediately close the connection. Servers and clients MUST
forget any session-identifiers, keys, and secrets associated with a
failed connection. Thus, any connection terminated with a fatal
alert MUST NOT be resumed.

https://www.rfc-editor.org/rfc/rfc5246.html#section-7.2.2

It wasn't removed until TLS 1.3, in tlswg/tls13-spec@dc668dc, for the same reasons. In practice, you cannot actually do this consistently, for the reasons I listed above.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks!

the problem with that commit is that you all turned #if ... #else ... #endif into including both branches instead of just one.

Hah, that explains the code smell here. I'd only been glancing at current version of the code in git blame view to try and understand what this was even for without following the individual commits (git blame doing its job and not exposing history of merely deleted lines)... I agree completely on getting rid of that _ssl_session_dup hack. (and more, but we need to examine the SOTW and pick an amenable way out of the past API sins in here)

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, definitely makes sense to just start with something minimal (probably this PR) to fix the immediate leak. I think probably the thing to do is:

  1. This PR
  2. Remove _ssl_session_dup and confirm tests indeed now fail in OpenSSL 3.x. My expectation is that they'll still fail, but I've no idea if there were other changes that further complicated this mess.
  3. No-op the goofy OpenSSL behavior. Probably something like SSL_set_quiet_shutdown(ssl); SSL_shutdown(ssl); SSL_free(ssl); would work? Confirm the tests now pass again.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Calling SSL_set_shutdown in the destructor sounds fine to me. Is there a reason to believe that the SSL_set_quiet_shutdown(ssl); SSL_shutdown(ssl); dance would be better?

Oh and thanks for the detailed explanation @davidben .

@vstinner
Copy link
Member

I confirm that the change works as expected: it fixes a significant memory leak.

I modified the reproducer to dump the RSS memory usage on Linux:

import os
import socket
import ssl
import gc

host = '142.251.163.99' # some server we can connect with https as example
port = 443

session = None
context = ssl._create_unverified_context(protocol=ssl.PROTOCOL_TLSv1_2)
with socket.create_connection((host, port)) as sock:
    ssock = context.wrap_socket(sock, server_hostname=host, session=session)
    with ssock:
        for i in range(300000):
            session = ssock.session
            if (i % 1000) == 0:
                gc.collect()
                os.system(f"grep ^VmRSS /proc/{os.getpid()}/status")

I can easily reproduce the bug:

VmRSS:	   19840 kB
VmRSS:	   26496 kB
VmRSS:	   33280 kB
VmRSS:	   40064 kB
VmRSS:	   46848 kB
VmRSS:	   53632 kB
VmRSS:	   60416 kB
VmRSS:	   67200 kB
VmRSS:	   74112 kB
VmRSS:	   80896 kB
VmRSS:	   87680 kB
VmRSS:	   94464 kB
VmRSS:	  101248 kB
(...)

With the PR, there is no leak anymore:

VmRSS:	   20436 kB
VmRSS:	   20436 kB
VmRSS:	   20436 kB
VmRSS:	   20436 kB
VmRSS:	   20436 kB
VmRSS:	   20436 kB
VmRSS:	   20436 kB
VmRSS:	   20436 kB
VmRSS:	   20436 kB
VmRSS:	   20436 kB
VmRSS:	   20436 kB
VmRSS:	   20436 kB
VmRSS:	   20436 kB
VmRSS:	   20436 kB
VmRSS:	   20436 kB
(...)

Copy link
Member

@vstinner vstinner left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@vstinner vstinner added needs backport to 3.12 bug and security fixes needs backport to 3.13 bugs and security fixes labels Sep 18, 2024
@vstinner
Copy link
Member

@gpshead @pitrou: Do you want to review this change? (I don't recall if @pitrou worked on ssl or not, ignore me if I'm wrong.)

@pitrou
Copy link
Member

pitrou commented Sep 18, 2024

Yes, I worked on ssl, but that was long ago.

Modules/_ssl.c Outdated
Comment on lines 2856 to 2859
session = SSL_get1_session(self->ssl);
if (session == NULL) {
Py_RETURN_NONE;
}
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think it's possible to test anything, other than manually checking the leak is fixed as @vstinner did.

The issue was apparently that we were first duplicating the session using _ssl_session_dup, but then we immediately forgot the duplicated session by calling SSL_get1_session.

Modules/_ssl.c Outdated
Comment on lines 2856 to 2859
session = SSL_get1_session(self->ssl);
if (session == NULL) {
Py_RETURN_NONE;
}
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That said, a potential alternate fix would be to remove the SSL_get0_session + _ssl_session_dup dance and keep the SSL_get1_session call. @jeffvanvoorst Can you try to see if it works?

@pitrou pitrou changed the title gh-116810: fixed memory leak in ssl module gh-116810: fix memory leak in ssl module Sep 18, 2024
@gpshead gpshead added extension-modules C modules in the Modules dir topic-SSL labels Sep 19, 2024
@pitrou
Copy link
Member

pitrou commented Sep 19, 2024

Ok, I've tried to removed the session duplication entirely and it works fine locally. I've pushed the change and we can wait for CI.

@pitrou pitrou added the 🔨 test-with-buildbots Test PR w/ buildbots; report in status section label Sep 19, 2024
@bedevere-bot
Copy link

🤖 New build scheduled with the buildbot fleet by @pitrou for commit 46126b4 🤖

If you want to schedule another build, you need to add the 🔨 test-with-buildbots label again.

@bedevere-bot bedevere-bot removed the 🔨 test-with-buildbots Test PR w/ buildbots; report in status section label Sep 19, 2024
@pitrou pitrou added the 🔨 test-with-buildbots Test PR w/ buildbots; report in status section label Sep 19, 2024
@bedevere-bot
Copy link

🤖 New build scheduled with the buildbot fleet by @pitrou for commit 5ced40b 🤖

If you want to schedule another build, you need to add the 🔨 test-with-buildbots label again.

@bedevere-bot bedevere-bot removed the 🔨 test-with-buildbots Test PR w/ buildbots; report in status section label Sep 19, 2024
@pitrou
Copy link
Member

pitrou commented Sep 19, 2024

  1. It doesn't leak memory anymore: the fix works as expected.
  2. It looks way faster!

Here as well. The session duplication hack was serializing the session object to ASN1, which is apparently quite expensive (looking at the i2d_SSL_SESSION source code, there's a lot of stuff being serialized).

Copy link
Member

@vstinner vstinner left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM.

I'm not an OpenSSL expert but I trust other developers who were involved in this large change :-)

Copy link
Member

@gpshead gpshead left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we're still in "2. Disagree with OpenSSL's policy and effectively turn it off." state with this PR as is due to SSL_set_shutdown being used to mark it as "shutdown" so that the OpenSSL implementation does cache it? There's a nice comment at least. :)

My main goal was get rid of the obvious memory leak. I left the set_session code and obviously gross slow serialize/deserialized stuff in place last night because simply removing that without figuring out the shutdown bits did lead to test failures. Thanks for picking that up and doing the rest!

@gpshead
Copy link
Member

gpshead commented Sep 19, 2024

Lets wait until @sethmlarson is back online (currently traveling) and can look at this before merging.

It isn't going to make 3.13.0 as it clearly is not a blocker (given 3.10 and 3.11 have always had this flaw and it was only reported this year). I gather session caching for clients is not super common.

Modules/_ssl.c Outdated Show resolved Hide resolved
Copy link
Contributor

@sethmlarson sethmlarson left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @davidben for all the explanation, this change LGTM.

@gpshead gpshead merged commit 7e7223e into python:main Sep 30, 2024
37 checks passed
@miss-islington-app
Copy link

Thanks @jeffvanvoorst for the PR, and @gpshead for merging it 🌮🎉.. I'm working now to backport this PR to: 3.12, 3.13.
🐍🍒⛏🤖

miss-islington pushed a commit to miss-islington/cpython that referenced this pull request Sep 30, 2024
Resolve a memory leak introduced in CPython 3.10's :mod:`ssl` when the :attr:`ssl.SSLSocket.session` property was accessed. Speeds up read and write access to said property by no longer unnecessarily cloning session objects via serialization.

(cherry picked from commit 7e7223e)

Co-authored-by: Jeffrey R. Van Voorst <[email protected]>
Co-authored-by: blurb-it[bot] <43283697+blurb-it[bot]@users.noreply.github.com>
Co-authored-by: Peter Bierma <[email protected]>
Co-authored-by: Gregory P. Smith <[email protected]>
Co-authored-by: Antoine Pitrou <[email protected]>
@bedevere-app
Copy link

bedevere-app bot commented Sep 30, 2024

GH-124800 is a backport of this pull request to the 3.13 branch.

miss-islington pushed a commit to miss-islington/cpython that referenced this pull request Sep 30, 2024
Resolve a memory leak introduced in CPython 3.10's :mod:`ssl` when the :attr:`ssl.SSLSocket.session` property was accessed. Speeds up read and write access to said property by no longer unnecessarily cloning session objects via serialization.

(cherry picked from commit 7e7223e)

Co-authored-by: Jeffrey R. Van Voorst <[email protected]>
Co-authored-by: blurb-it[bot] <43283697+blurb-it[bot]@users.noreply.github.com>
Co-authored-by: Peter Bierma <[email protected]>
Co-authored-by: Gregory P. Smith <[email protected]>
Co-authored-by: Antoine Pitrou <[email protected]>
@bedevere-app bedevere-app bot removed the needs backport to 3.13 bugs and security fixes label Sep 30, 2024
@bedevere-app
Copy link

bedevere-app bot commented Sep 30, 2024

GH-124801 is a backport of this pull request to the 3.12 branch.

@bedevere-app bedevere-app bot removed the needs backport to 3.12 bug and security fixes label Sep 30, 2024
gpshead added a commit that referenced this pull request Sep 30, 2024
gh-116810: fix memory leak in ssl module (GH-123249)

Resolve a memory leak introduced in CPython 3.10's :mod:`ssl` when the :attr:`ssl.SSLSocket.session` property was accessed. Speeds up read and write access to said property by no longer unnecessarily cloning session objects via serialization.

(cherry picked from commit 7e7223e)

Co-authored-by: Jeffrey R. Van Voorst <[email protected]>
Co-authored-by: blurb-it[bot] <43283697+blurb-it[bot]@users.noreply.github.com>
Co-authored-by: Peter Bierma <[email protected]>
Co-authored-by: Gregory P. Smith <[email protected]>
Co-authored-by: Antoine Pitrou <[email protected]>
@jeffvanvoorst
Copy link
Contributor Author

Thanks everyone. I appreciate the cpython community support.

vstinner pushed a commit that referenced this pull request Oct 7, 2024
gh-116810: fix memory leak in ssl module (GH-123249)

Resolve a memory leak introduced in CPython 3.10's :mod:`ssl` when the :attr:`ssl.SSLSocket.session` property was accessed. Speeds up read and write access to said property by no longer unnecessarily cloning session objects via serialization.

(cherry picked from commit 7e7223e)

Co-authored-by: Jeffrey R. Van Voorst <[email protected]>
Co-authored-by: blurb-it[bot] <43283697+blurb-it[bot]@users.noreply.github.com>
Co-authored-by: Peter Bierma <[email protected]>
Co-authored-by: Gregory P. Smith <[email protected]>
Co-authored-by: Antoine Pitrou <[email protected]>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
extension-modules C modules in the Modules dir topic-SSL
Projects
None yet
Development

Successfully merging this pull request may close these issues.

9 participants