[C API] Change PyUnicode_AsUTF8() to return NULL on embedded null characters #111089

vstinner · 2023-10-19T23:28:24Z

I propose to change the PyUnicode_AsUTF8() API to raise an exception and return NULL if the string contains embedded null characters.

If the string contains an embedded null character, the UTF-8 encoded string can be truncated if used with C functions using char* since a null byte is treated as the terminator: marker of the string end. Truncating a string silently is a bad practice and can lead to different bugs including security vulnerabilities.

In practice, the minority of impacted C extensions and impacted users should benefit of such backward incompatible change, since truncating a string silently is a bad practice. Impacted users can use PyUnicode_AsUTF8AndSize(obj, NULL) and just ignore the size if they want to truncate on purpose.

It would address the following "hidden" comment on PyUnicode_AsUTF8():

Use of this API is DEPRECATED since no size information can be
extracted from the returned data.

PyUnicode_AsUTF8String() is part of the limited C API, whereas PyUnicode_AsUTF8() is not.

In the recently added PyUnicode_EqualToUTF8(obj, str), str is treated as not equal if obj contains embedded null characters.

The folllowing functions already raise an exception if the string contains embedded null characters or bytes:

PyUnicode_AsWideCharString()
PyUnicode_EncodeLocale()
PyUnicode_EncodeFSDefault()
PyUnicode_DecodeLocale(), PyUnicode_DecodeLocaleAndSize()
PyUnicode_DecodeFSDefaultAndSize()
PyUnicode_FSConverter()
PyUnicode_FSDecoder()

PyUnicode_AsUTF8String() returns a bytes object and so the length, so it doesn't raise the exception.

PyUnicode_AsUTF8AndSize() also returns the size and so don't raise on embedded null characters.

Linked PRs

The text was updated successfully, but these errors were encountered:

* PyUnicode_AsUTF8() now raises an exception if the string contains embedded null characters. * PyUnicode_AsUTF8AndSize() now sets `*size` to 0 on error to avoid undefined variable value. * Update related C API tests (test_capi.test_unicode).

* PyUnicode_AsUTF8() now raises an exception if the string contains embedded null characters. * PyUnicode_AsUTF8AndSize() now sets `*size` to 0 on error to avoid undefined variable value. * Update related C API tests (test_capi.test_unicode). * type_new_set_doc() uses PyUnicode_AsUTF8AndSize() to silently truncate doc containing null bytes.

On error, PyUnicode_AsUTF8AndSize() now sets the size argument to 0, to avoid undefined value.

* PyUnicode_AsUTF8() now raises an exception if the string contains embedded null characters. * Update related C API tests (test_capi.test_unicode). * type_new_set_doc() uses PyUnicode_AsUTF8AndSize() to silently truncate doc containing null bytes.

On error, PyUnicode_AsUTF8AndSize() now sets the size argument to -1, to avoid undefined value.

* PyUnicode_AsUTF8() now raises an exception if the string contains embedded null characters. * Update related C API tests (test_capi.test_unicode). * type_new_set_doc() uses PyUnicode_AsUTF8AndSize() to silently truncate doc containing null bytes. Co-authored-by: Serhiy Storchaka <[email protected]>

Add PyUnicode_AsUTF8() function to the limited C API. multiprocessing posixshmem now uses PyUnicode_AsUTF8() instead of PyUnicode_AsUTF8AndSize(): the extension is built with the limited C API. The function now raises an exception if the filename contains an embedded null character instead of truncating silently the filename.

PyUnicode_AsUTF8() now raises an exception if the string contains embedded null characters.

Add PyUnicode_AsUTF8() function to the limited C API. multiprocessing posixshmem now uses PyUnicode_AsUTF8() instead of PyUnicode_AsUTF8AndSize(): the extension is built with the limited C API. The function now raises an exception if the filename contains an embedded null character instead of truncating silently the filename.

On error, PyUnicode_AsUTF8AndSize() now sets the size argument to -1, to avoid undefined value.

PyUnicode_AsUTF8() now raises an exception if the string contains embedded null characters.

vstinner · 2023-10-20T18:06:03Z

I added PyUnicode_AsUTF8() to the limited C API and it now raises an exception if the string contains embedded null characters.

encukou · 2023-10-23T09:39:59Z

This makes PyUnicode_AsUTF8 O(N), as it needs to scan the entire string. I don't think that's a change to make lightly.
I also strongly believe that it's not a good fit for the limited API. The better alternative -- PyUnicode_AsUTF8AndSize is already there.

vstinner · 2023-10-27T15:19:53Z

This makes PyUnicode_AsUTF8 O(N), as it needs to scan the entire string.

If it's a performance issue, we can cache the "contains embedded null characters/bytes" information in PyASCIIObject structure.

I also strongly believe that it's not a good fit for the limited API.

Would you mind to elaborate?

The better alternative -- PyUnicode_AsUTF8AndSize is already there.

It's a different API. In C, char* with NUL terminator is used everywhere. Embedded NUL bytes is truncating strings which caused many bugs in the past.

In C, it's rare that C code uses a length with a char* string.

encukou · 2023-10-28T10:24:09Z

This makes PyUnicode_AsUTF8 O(N), as it needs to scan the entire string.
If it's a performance issue, we can cache the "contains embedded null characters/bytes" information in PyASCIIObject structure.

OK! Will you do that?
What are the performance trade-offs of that solution?

I also strongly believe that it's not a good fit for the limited API.

Would you mind to elaborate?

The API that existed with this name since before Python 3.0 is -- as you claim -- unsafe. The new semantics are not proven; usually we wait before the API stabilizes before declaring it stable for the long term.

Replace PyUnicode_AsUTF8AndSize() with PyUnicode_AsUTF8() to remove the explicit check for embedded null characters. The change avoids to have to include explicitly <string.h> to get the strlen() function when using a recent version of the limited C API.

Add PyASCIIObject.state.embed_null member to Python str objects. It is used as a cache by PyUnicode_AsUTF8() to only check once if a string contains a null character. Strings created by PyUnicode_FromString() initializes *embed_null* since the string cannot contain a null character. Global static strings now also initialize the *embed_null* member. The chr(0) singleton ("\0" string) is the only static string which contains a null character.

Fix unicode_subtype_new

Revert PyUnicode_AsUTF8() change: it no longer rejects embedded null characters: the PyUnicode_AsUTF8Safe() function should be used instead.

…ython#111585)" This reverts commit d9b606b.

…111620)" This reverts commit cde1071.

python#111091)" This reverts commit d731579.

…ython#111121)" This reverts commit d8f32be.

…1122)" This reverts commit 37e4e20.

* Revert "gh-111089: Use PyUnicode_AsUTF8() in Argument Clinic (#111585)" This reverts commit d9b606b. * Revert "gh-111089: Use PyUnicode_AsUTF8() in getargs.c (#111620)" This reverts commit cde1071. * Revert "gh-111089: PyUnicode_AsUTF8() now raises on embedded NUL (#111091)" This reverts commit d731579. * Revert "gh-111089: Add PyUnicode_AsUTF8() to the limited C API (#111121)" This reverts commit d8f32be. * Revert "gh-111089: Use PyUnicode_AsUTF8() in sqlite3 (#111122)" This reverts commit 37e4e20.

vstinner · 2023-11-07T22:36:38Z

There seems to be multiple disagreements on multiple things around this issue:

Disagreement if truncating a string on embedded null character is a bug, the expected behavior, or it's just ok we don't care.
Disagreement if truncating a string can lead to security issue or not -- see issue Embedded null characters can lead to bugs or even security vulnerabilities #111656 discussion.
Disagreement if PyUnicode_AsUTF8() should reject or not embedded null characters.
Disagreement if a new function rejecting embedded null characters is needed.
Disagreement if anything should be done, since C developers must be aware of security, must be aware of issues with embedded null characters, truncated is not the only security concern so why caring about this one anyway, etc.
Disagreement on who take decisions: @encukou is now asking the C API Working Group to take the decision, but the working group doesn't exist yet.

When I made these changes, they were approved by my peers and I thought that we were all on the same page on embedded null characters, since Python is already rejecting them in many places such as PyUnicode_AsWideCharString() for many years. But well, then read the discussions for details about disagreements.

I reverted all changes (except of PR #111106 which is unrelated) in PR #111833.

I close the issue for now.

gpshead · 2023-11-08T01:20:37Z

For reference I did attempt a test run of our huge "everything" style codebase at work over the weekend with AsUTF8 and AsUTF8AndSize(p, NULL) both set to return an error upon an embedded '\0' character. That test attempt didn't get far (auto-cancelled before it could run more due to the # of failures). Too many things were failing to even build because several Python tools used at build time triggered the error. https://github.com/eliben/pyelftools/ and https://github.com/fonttools/fonttools/ at least, I didn't try to construct a list and am not going to test further.

That various things depend on the current (and now restored, thank you!) behavior, regardless of what any of us think of that, was what I wanted it to show. It did. :)

vstinner · 2023-11-08T08:52:50Z

because several Python tools used at build time triggered the error. https://github.com/eliben/pyelftools/ and https://github.com/fonttools/fonttools/ at least

Any idea what is their use case for embedded null characters? How are such strings build? By decoding bytes from UTF-8?

vstinner · 2023-11-08T08:54:07Z

Thanks for that feedback/test, it's very useful 👍

* Revert "pythongh-111089: Use PyUnicode_AsUTF8() in Argument Clinic (python#111585)" This reverts commit d9b606b. * Revert "pythongh-111089: Use PyUnicode_AsUTF8() in getargs.c (python#111620)" This reverts commit cde1071. * Revert "pythongh-111089: PyUnicode_AsUTF8() now raises on embedded NUL (python#111091)" This reverts commit d731579. * Revert "pythongh-111089: Add PyUnicode_AsUTF8() to the limited C API (python#111121)" This reverts commit d8f32be. * Revert "pythongh-111089: Use PyUnicode_AsUTF8() in sqlite3 (python#111122)" This reverts commit 37e4e20.

encukou · 2023-11-08T12:50:28Z

Thank you for the revert. Sorry that the issues were only discovered after merge, and after such heated discussion. Maybe there's some process improvement to be made.

I filed this issue for the C-API-WG-to-be: capi-workgroup/api-evolution#39
Feel free to re-frame the issue or propose other solutions.

…n#111091) * PyUnicode_AsUTF8() now raises an exception if the string contains embedded null characters. * Update related C API tests (test_capi.test_unicode). * type_new_set_doc() uses PyUnicode_AsUTF8AndSize() to silently truncate doc containing null bytes. Co-authored-by: Serhiy Storchaka <[email protected]>

…111121) Add PyUnicode_AsUTF8() function to the limited C API. multiprocessing posixshmem now uses PyUnicode_AsUTF8() instead of PyUnicode_AsUTF8AndSize(): the extension is built with the limited C API. The function now raises an exception if the filename contains an embedded null character instead of truncating silently the filename.

…#111106) On error, PyUnicode_AsUTF8AndSize() now sets the size argument to -1, to avoid undefined value.

PyUnicode_AsUTF8() now raises an exception if the string contains embedded null characters.

…1585) Replace PyUnicode_AsUTF8AndSize() with PyUnicode_AsUTF8() to remove the explicit check for embedded null characters. The change avoids to have to include explicitly <string.h> to get the strlen() function when using a recent version of the limited C API.

Replace PyUnicode_AsUTF8AndSize() with PyUnicode_AsUTF8() to remove the explicit check for embedded null characters.