Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[C API] Change PyUnicode_AsUTF8() to return NULL on embedded null characters #111089

Closed
vstinner opened this issue Oct 19, 2023 · 22 comments
Closed
Assignees
Labels
topic-C-API type-feature A feature request or enhancement

Comments

@vstinner
Copy link
Member

vstinner commented Oct 19, 2023

I propose to change the PyUnicode_AsUTF8() API to raise an exception and return NULL if the string contains embedded null characters.

If the string contains an embedded null character, the UTF-8 encoded string can be truncated if used with C functions using char* since a null byte is treated as the terminator: marker of the string end. Truncating a string silently is a bad practice and can lead to different bugs including security vulnerabilities.

In practice, the minority of impacted C extensions and impacted users should benefit of such backward incompatible change, since truncating a string silently is a bad practice. Impacted users can use PyUnicode_AsUTF8AndSize(obj, NULL) and just ignore the size if they want to truncate on purpose.

It would address the following "hidden" comment on PyUnicode_AsUTF8():

Use of this API is DEPRECATED since no size information can be
extracted from the returned data.

PyUnicode_AsUTF8String() is part of the limited C API, whereas PyUnicode_AsUTF8() is not.

In the recently added PyUnicode_EqualToUTF8(obj, str), str is treated as not equal if obj contains embedded null characters.

The folllowing functions already raise an exception if the string contains embedded null characters or bytes:

  • PyUnicode_AsWideCharString()
  • PyUnicode_EncodeLocale()
  • PyUnicode_EncodeFSDefault()
  • PyUnicode_DecodeLocale(), PyUnicode_DecodeLocaleAndSize()
  • PyUnicode_DecodeFSDefaultAndSize()
  • PyUnicode_FSConverter()
  • PyUnicode_FSDecoder()

PyUnicode_AsUTF8String() returns a bytes object and so the length, so it doesn't raise the exception.

PyUnicode_AsUTF8AndSize() also returns the size and so don't raise on embedded null characters.

Linked PRs

vstinner added a commit to vstinner/cpython that referenced this issue Oct 19, 2023
* PyUnicode_AsUTF8() now raises an exception if the string contains
  embedded null characters.
* PyUnicode_AsUTF8AndSize() now sets `*size` to 0 on error to avoid
  undefined variable value.
* Update related C API tests (test_capi.test_unicode).
vstinner added a commit to vstinner/cpython that referenced this issue Oct 19, 2023
* PyUnicode_AsUTF8() now raises an exception if the string contains
  embedded null characters.
* PyUnicode_AsUTF8AndSize() now sets `*size` to 0 on error to avoid
  undefined variable value.
* Update related C API tests (test_capi.test_unicode).
* type_new_set_doc() uses PyUnicode_AsUTF8AndSize() to silently
  truncate doc containing null bytes.
@vstinner vstinner self-assigned this Oct 19, 2023
vstinner added a commit to vstinner/cpython that referenced this issue Oct 20, 2023
On error, PyUnicode_AsUTF8AndSize() now sets the size argument to 0,
to avoid undefined value.
vstinner added a commit to vstinner/cpython that referenced this issue Oct 20, 2023
* PyUnicode_AsUTF8() now raises an exception if the string contains
  embedded null characters.
* Update related C API tests (test_capi.test_unicode).
* type_new_set_doc() uses PyUnicode_AsUTF8AndSize() to silently
  truncate doc containing null bytes.
vstinner added a commit to vstinner/cpython that referenced this issue Oct 20, 2023
On error, PyUnicode_AsUTF8AndSize() now sets the size argument to -1,
to avoid undefined value.
vstinner added a commit that referenced this issue Oct 20, 2023
* PyUnicode_AsUTF8() now raises an exception if the string contains
  embedded null characters.
* Update related C API tests (test_capi.test_unicode).
* type_new_set_doc() uses PyUnicode_AsUTF8AndSize() to silently
  truncate doc containing null bytes.

Co-authored-by: Serhiy Storchaka <[email protected]>
vstinner added a commit to vstinner/cpython that referenced this issue Oct 20, 2023
Add PyUnicode_AsUTF8() function to the limited C API.

multiprocessing posixshmem now uses PyUnicode_AsUTF8() instead of
PyUnicode_AsUTF8AndSize(): the extension is built with the limited C
API. The function now raises an exception if the filename contains an
embedded null character instead of truncating silently the filename.
vstinner added a commit to vstinner/cpython that referenced this issue Oct 20, 2023
PyUnicode_AsUTF8() now raises an exception if the string contains
embedded null characters.
vstinner added a commit that referenced this issue Oct 20, 2023
Add PyUnicode_AsUTF8() function to the limited C API.

multiprocessing posixshmem now uses PyUnicode_AsUTF8() instead of
PyUnicode_AsUTF8AndSize(): the extension is built with the limited C
API. The function now raises an exception if the filename contains an
embedded null character instead of truncating silently the filename.
vstinner added a commit that referenced this issue Oct 20, 2023
On error, PyUnicode_AsUTF8AndSize() now sets the size argument to -1,
to avoid undefined value.
vstinner added a commit that referenced this issue Oct 20, 2023
PyUnicode_AsUTF8() now raises an exception if the string contains
embedded null characters.
@vstinner
Copy link
Member Author

I added PyUnicode_AsUTF8() to the limited C API and it now raises an exception if the string contains embedded null characters.

@encukou
Copy link
Member

encukou commented Oct 23, 2023

This makes PyUnicode_AsUTF8 O(N), as it needs to scan the entire string. I don't think that's a change to make lightly.
I also strongly believe that it's not a good fit for the limited API. The better alternative -- PyUnicode_AsUTF8AndSize is already there.

@encukou encukou reopened this Oct 23, 2023
@vstinner
Copy link
Member Author

This makes PyUnicode_AsUTF8 O(N), as it needs to scan the entire string.

If it's a performance issue, we can cache the "contains embedded null characters/bytes" information in PyASCIIObject structure.

I also strongly believe that it's not a good fit for the limited API.

Would you mind to elaborate?

The better alternative -- PyUnicode_AsUTF8AndSize is already there.

It's a different API. In C, char* with NUL terminator is used everywhere. Embedded NUL bytes is truncating strings which caused many bugs in the past.

In C, it's rare that C code uses a length with a char* string.

@encukou
Copy link
Member

encukou commented Oct 28, 2023

This makes PyUnicode_AsUTF8 O(N), as it needs to scan the entire string.
If it's a performance issue, we can cache the "contains embedded null characters/bytes" information in PyASCIIObject structure.

OK! Will you do that?
What are the performance trade-offs of that solution?

I also strongly believe that it's not a good fit for the limited API.

Would you mind to elaborate?

The API that existed with this name since before Python 3.0 is -- as you claim -- unsafe. The new semantics are not proven; usually we wait before the API stabilizes before declaring it stable for the long term.

vstinner added a commit to vstinner/cpython that referenced this issue Nov 1, 2023
Replace PyUnicode_AsUTF8AndSize() with PyUnicode_AsUTF8() to remove
the explicit check for embedded null characters.

The change avoids to have to include explicitly <string.h> to get the
strlen() function when using a recent version of the limited C API.
vstinner added a commit to vstinner/cpython that referenced this issue Nov 1, 2023
Add PyASCIIObject.state.embed_null member to Python str objects. It
is used as a cache by PyUnicode_AsUTF8() to only check once if a
string contains a null character. Strings created by
PyUnicode_FromString() initializes *embed_null* since the string
cannot contain a null character.

Global static strings now also initialize the *embed_null* member.
The chr(0) singleton ("\0" string) is the only static string which
contains a null character.
vstinner added a commit to vstinner/cpython that referenced this issue Nov 1, 2023
Add PyASCIIObject.state.embed_null member to Python str objects. It
is used as a cache by PyUnicode_AsUTF8() to only check once if a
string contains a null character. Strings created by
PyUnicode_FromString() initializes *embed_null* since the string
cannot contain a null character.

Global static strings now also initialize the *embed_null* member.
The chr(0) singleton ("\0" string) is the only static string which
contains a null character.
vstinner added a commit to vstinner/cpython that referenced this issue Nov 1, 2023
Add PyASCIIObject.state.embed_null member to Python str objects. It
is used as a cache by PyUnicode_AsUTF8() to only check once if a
string contains a null character. Strings created by
PyUnicode_FromString() initializes *embed_null* since the string
cannot contain a null character.

Global static strings now also initialize the *embed_null* member.
The chr(0) singleton ("\0" string) is the only static string which
contains a null character.
vstinner added a commit to vstinner/cpython that referenced this issue Nov 1, 2023
vstinner added a commit to vstinner/cpython that referenced this issue Nov 7, 2023
Revert PyUnicode_AsUTF8() change: it no longer rejects embedded null
characters: the PyUnicode_AsUTF8Safe() function should be used
instead.
vstinner added a commit to vstinner/cpython that referenced this issue Nov 7, 2023
vstinner added a commit to vstinner/cpython that referenced this issue Nov 7, 2023
vstinner added a commit to vstinner/cpython that referenced this issue Nov 7, 2023
vstinner added a commit to vstinner/cpython that referenced this issue Nov 7, 2023
vstinner added a commit to vstinner/cpython that referenced this issue Nov 7, 2023
vstinner added a commit that referenced this issue Nov 7, 2023
* Revert "gh-111089: Use PyUnicode_AsUTF8() in Argument Clinic (#111585)"

This reverts commit d9b606b.

* Revert "gh-111089: Use PyUnicode_AsUTF8() in getargs.c (#111620)"

This reverts commit cde1071.

* Revert "gh-111089: PyUnicode_AsUTF8() now raises on embedded NUL (#111091)"

This reverts commit d731579.

* Revert "gh-111089: Add PyUnicode_AsUTF8() to the limited C API (#111121)"

This reverts commit d8f32be.

* Revert "gh-111089: Use PyUnicode_AsUTF8() in sqlite3 (#111122)"

This reverts commit 37e4e20.
@vstinner
Copy link
Member Author

vstinner commented Nov 7, 2023

There seems to be multiple disagreements on multiple things around this issue:

  • Disagreement if truncating a string on embedded null character is a bug, the expected behavior, or it's just ok we don't care.
  • Disagreement if truncating a string can lead to security issue or not -- see issue Embedded null characters can lead to bugs or even security vulnerabilities #111656 discussion.
  • Disagreement if PyUnicode_AsUTF8() should reject or not embedded null characters.
  • Disagreement if a new function rejecting embedded null characters is needed.
  • Disagreement if anything should be done, since C developers must be aware of security, must be aware of issues with embedded null characters, truncated is not the only security concern so why caring about this one anyway, etc.
  • Disagreement on who take decisions: @encukou is now asking the C API Working Group to take the decision, but the working group doesn't exist yet.

When I made these changes, they were approved by my peers and I thought that we were all on the same page on embedded null characters, since Python is already rejecting them in many places such as PyUnicode_AsWideCharString() for many years. But well, then read the discussions for details about disagreements.

I reverted all changes (except of PR #111106 which is unrelated) in PR #111833.

I close the issue for now.

@vstinner vstinner closed this as completed Nov 7, 2023
@gpshead
Copy link
Member

gpshead commented Nov 8, 2023

For reference I did attempt a test run of our huge "everything" style codebase at work over the weekend with AsUTF8 and AsUTF8AndSize(p, NULL) both set to return an error upon an embedded '\0' character. That test attempt didn't get far (auto-cancelled before it could run more due to the # of failures). Too many things were failing to even build because several Python tools used at build time triggered the error. https://github.com/eliben/pyelftools/ and https://github.com/fonttools/fonttools/ at least, I didn't try to construct a list and am not going to test further.

That various things depend on the current (and now restored, thank you!) behavior, regardless of what any of us think of that, was what I wanted it to show. It did. :)

@vstinner
Copy link
Member Author

vstinner commented Nov 8, 2023

because several Python tools used at build time triggered the error. https://github.com/eliben/pyelftools/ and https://github.com/fonttools/fonttools/ at least

Any idea what is their use case for embedded null characters? How are such strings build? By decoding bytes from UTF-8?

@vstinner
Copy link
Member Author

vstinner commented Nov 8, 2023

Thanks for that feedback/test, it's very useful 👍

hugovk pushed a commit to hugovk/cpython that referenced this issue Nov 8, 2023
* Revert "pythongh-111089: Use PyUnicode_AsUTF8() in Argument Clinic (python#111585)"

This reverts commit d9b606b.

* Revert "pythongh-111089: Use PyUnicode_AsUTF8() in getargs.c (python#111620)"

This reverts commit cde1071.

* Revert "pythongh-111089: PyUnicode_AsUTF8() now raises on embedded NUL (python#111091)"

This reverts commit d731579.

* Revert "pythongh-111089: Add PyUnicode_AsUTF8() to the limited C API (python#111121)"

This reverts commit d8f32be.

* Revert "pythongh-111089: Use PyUnicode_AsUTF8() in sqlite3 (python#111122)"

This reverts commit 37e4e20.
@encukou
Copy link
Member

encukou commented Nov 8, 2023

Thank you for the revert. Sorry that the issues were only discovered after merge, and after such heated discussion. Maybe there's some process improvement to be made.

I filed this issue for the C-API-WG-to-be: capi-workgroup/api-evolution#39
Feel free to re-frame the issue or propose other solutions.

aisk pushed a commit to aisk/cpython that referenced this issue Feb 11, 2024
…n#111091)

* PyUnicode_AsUTF8() now raises an exception if the string contains
  embedded null characters.
* Update related C API tests (test_capi.test_unicode).
* type_new_set_doc() uses PyUnicode_AsUTF8AndSize() to silently
  truncate doc containing null bytes.

Co-authored-by: Serhiy Storchaka <[email protected]>
aisk pushed a commit to aisk/cpython that referenced this issue Feb 11, 2024
…111121)

Add PyUnicode_AsUTF8() function to the limited C API.

multiprocessing posixshmem now uses PyUnicode_AsUTF8() instead of
PyUnicode_AsUTF8AndSize(): the extension is built with the limited C
API. The function now raises an exception if the filename contains an
embedded null character instead of truncating silently the filename.
aisk pushed a commit to aisk/cpython that referenced this issue Feb 11, 2024
…#111106)

On error, PyUnicode_AsUTF8AndSize() now sets the size argument to -1,
to avoid undefined value.
aisk pushed a commit to aisk/cpython that referenced this issue Feb 11, 2024
PyUnicode_AsUTF8() now raises an exception if the string contains
embedded null characters.
aisk pushed a commit to aisk/cpython that referenced this issue Feb 11, 2024
…1585)

Replace PyUnicode_AsUTF8AndSize() with PyUnicode_AsUTF8() to remove
the explicit check for embedded null characters.

The change avoids to have to include explicitly <string.h> to get the
strlen() function when using a recent version of the limited C API.
aisk pushed a commit to aisk/cpython that referenced this issue Feb 11, 2024
Replace PyUnicode_AsUTF8AndSize() with PyUnicode_AsUTF8() to remove
the explicit check for embedded null characters.
aisk pushed a commit to aisk/cpython that referenced this issue Feb 11, 2024
* Revert "pythongh-111089: Use PyUnicode_AsUTF8() in Argument Clinic (python#111585)"

This reverts commit d9b606b.

* Revert "pythongh-111089: Use PyUnicode_AsUTF8() in getargs.c (python#111620)"

This reverts commit cde1071.

* Revert "pythongh-111089: PyUnicode_AsUTF8() now raises on embedded NUL (python#111091)"

This reverts commit d731579.

* Revert "pythongh-111089: Add PyUnicode_AsUTF8() to the limited C API (python#111121)"

This reverts commit d8f32be.

* Revert "pythongh-111089: Use PyUnicode_AsUTF8() in sqlite3 (python#111122)"

This reverts commit 37e4e20.
Glyphack pushed a commit to Glyphack/cpython that referenced this issue Sep 2, 2024
…n#111091)

* PyUnicode_AsUTF8() now raises an exception if the string contains
  embedded null characters.
* Update related C API tests (test_capi.test_unicode).
* type_new_set_doc() uses PyUnicode_AsUTF8AndSize() to silently
  truncate doc containing null bytes.

Co-authored-by: Serhiy Storchaka <[email protected]>
Glyphack pushed a commit to Glyphack/cpython that referenced this issue Sep 2, 2024
…111121)

Add PyUnicode_AsUTF8() function to the limited C API.

multiprocessing posixshmem now uses PyUnicode_AsUTF8() instead of
PyUnicode_AsUTF8AndSize(): the extension is built with the limited C
API. The function now raises an exception if the filename contains an
embedded null character instead of truncating silently the filename.
Glyphack pushed a commit to Glyphack/cpython that referenced this issue Sep 2, 2024
…#111106)

On error, PyUnicode_AsUTF8AndSize() now sets the size argument to -1,
to avoid undefined value.
Glyphack pushed a commit to Glyphack/cpython that referenced this issue Sep 2, 2024
PyUnicode_AsUTF8() now raises an exception if the string contains
embedded null characters.
Glyphack pushed a commit to Glyphack/cpython that referenced this issue Sep 2, 2024
…1585)

Replace PyUnicode_AsUTF8AndSize() with PyUnicode_AsUTF8() to remove
the explicit check for embedded null characters.

The change avoids to have to include explicitly <string.h> to get the
strlen() function when using a recent version of the limited C API.
Glyphack pushed a commit to Glyphack/cpython that referenced this issue Sep 2, 2024
Replace PyUnicode_AsUTF8AndSize() with PyUnicode_AsUTF8() to remove
the explicit check for embedded null characters.
Glyphack pushed a commit to Glyphack/cpython that referenced this issue Sep 2, 2024
* Revert "pythongh-111089: Use PyUnicode_AsUTF8() in Argument Clinic (python#111585)"

This reverts commit d9b606b.

* Revert "pythongh-111089: Use PyUnicode_AsUTF8() in getargs.c (python#111620)"

This reverts commit cde1071.

* Revert "pythongh-111089: PyUnicode_AsUTF8() now raises on embedded NUL (python#111091)"

This reverts commit d731579.

* Revert "pythongh-111089: Add PyUnicode_AsUTF8() to the limited C API (python#111121)"

This reverts commit d8f32be.

* Revert "pythongh-111089: Use PyUnicode_AsUTF8() in sqlite3 (python#111122)"

This reverts commit 37e4e20.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
topic-C-API type-feature A feature request or enhancement
Projects
None yet
Development

No branches or pull requests

6 participants