Reorder requirements file decoding #12795

Open
wants to merge 1 commit into
base: main
matthewhughes934
Contributor

@matthewhughes934 matthewhughes934 commented Jun 25, 2024

This changes the decoding process to be more in line with what was
previously documented. The new process is outlined in the updated docs.

The auto_decode function was removed and all decoding logic moved to
the pip._internal.req.req_file module because:

  • This function was only ever used to decode requirements files
  • It was never really a generic 'util' function, it was always tied to
    the idiosyncrasies of decoding requirements files.
  • The module lived under _internal so I felt comfortable removing it

A warning was added for when we do fall back to the locale-defined
encoding, to encourage users to move to an explicit encoding definition
via a coding style comment.
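
As a rough illustration of what an explicit encoding definition looks like, here is a minimal sketch of detecting a PEP 263 style coding comment. The regex is modelled on the one given in PEP 263; `CODING_RE` and `declared_encoding` are hypothetical names for illustration, not pip's actual implementation.

```python
import re
from typing import Optional

# Hypothetical pattern, modelled on the PEP 263 declaration syntax.
CODING_RE = re.compile(rb"^[ \t\f]*#.*?coding[:=][ \t]*([-\w.]+)")


def declared_encoding(raw: bytes) -> Optional[str]:
    """Return the encoding named by a coding comment on the first two lines."""
    for line in raw.splitlines()[:2]:
        match = CODING_RE.match(line)
        if match:
            return match.group(1).decode("ascii")
    return None


print(declared_encoding(b"# -*- coding: latin-1 -*-\nrequests\n"))
print(declared_encoding(b"requests\n"))
```

A requirements file that starts with a line such as `# -*- coding: latin-1 -*-` would not need the locale fallback at all under this scheme.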

This fixes two existing bugs. Firstly, when:

  • a requirements file is encoded as UTF-8, and
  • some bytes in the file are incompatible with the system locale

Previously, assuming no BOM or PEP-263 style comment, we would default
to the encoding from the system locale, and decoding would then fail
(see issue #12771).
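
To make the first bug concrete, here is a small sketch that reproduces the kind of failure reported in the linked issue. It assumes a GBK system locale, as in that report; `decodes_with` is a hypothetical helper and the file contents are made up.

```python
# A UTF-8 encoded requirements file with a non-ASCII character in a comment.
raw = "# \u4e2d\nrequests\n".encode("utf-8")


def decodes_with(data: bytes, encoding: str) -> bool:
    """Report whether data decodes cleanly under the given encoding."""
    try:
        data.decode(encoding)
    except UnicodeDecodeError:
        return False
    return True


print(decodes_with(raw, "utf-8"))  # the encoding the file was written in
print(decodes_with(raw, "gbk"))    # stand-in for the system locale encoding
```

The file is valid UTF-8, but the trailing byte of the three-byte character followed by the newline is not a valid GBK sequence, so decoding with the locale encoding raises `UnicodeDecodeError`.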

Secondly, when decoding a file starting with a UTF-32 little endian byte
order mark (BOM). Previously this would always fail:
codecs.BOM_UTF32_LE is codecs.BOM_UTF16_LE followed by two null
bytes, and because of the ordering of the list of BOMs the UTF-16
case would be run first and match the file prefix, so we would
incorrectly deduce that the file was UTF-16 little endian encoded. I
can't imagine this is a popular encoding for a requirements file.
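
The ordering problem can be sketched as follows. The `BOMS` list and `detect_bom` function are hypothetical names used for illustration and are not pip's actual code; the `codecs` constants are from the standard library.

```python
import codecs
from typing import List, Optional, Tuple

# UTF-32 BOMs are checked before UTF-16 ones because BOM_UTF32_LE starts
# with the same two bytes as BOM_UTF16_LE.
BOMS: List[Tuple[bytes, str]] = [
    (codecs.BOM_UTF8, "utf-8"),
    (codecs.BOM_UTF32_BE, "utf-32-be"),
    (codecs.BOM_UTF32_LE, "utf-32-le"),
    (codecs.BOM_UTF16_BE, "utf-16-be"),
    (codecs.BOM_UTF16_LE, "utf-16-le"),
]


def detect_bom(raw: bytes, boms: List[Tuple[bytes, str]] = BOMS) -> Optional[str]:
    """Return the encoding of the first BOM in boms that prefixes raw."""
    for bom, encoding in boms:
        if raw.startswith(bom):
            return encoding
    return None


raw = codecs.BOM_UTF32_LE + "requests\n".encode("utf-32-le")

# With UTF-16 LE checked first, the UTF-32 LE file is misidentified.
buggy_order = [(codecs.BOM_UTF16_LE, "utf-16-le"), (codecs.BOM_UTF32_LE, "utf-32-le")]

print(detect_bom(raw))
print(detect_bom(raw, buggy_order))
```

Checking the longer BOM first is one way to avoid the prefix collision; any ordering that tests UTF-32 LE before UTF-16 LE would do.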

Fixes: #12771

@matthewhughes934 matthewhughes934 force-pushed the handle-request-file-decode-failures branch from a3f1cac to aa0f744 Compare June 25, 2024 17:39
@matthewhughes934 matthewhughes934 force-pushed the handle-request-file-decode-failures branch from aa0f744 to 7df3500 Compare June 25, 2024 17:48
@matthewhughes934 matthewhughes934 marked this pull request as ready for review June 25, 2024 18:04
Contributor Author

Ah, this probably needs a proper news entry

Member

Yes, please.

Contributor Author

Yes, please.

b4c3255

@ichard26 ichard26 added this to the 24.3 milestone Jul 16, 2024
Comment on lines 561 to 567
warnings.warn(
    f"unable to decode data with {exc.encoding}, falling back to {fallback_encoding}",  # noqa: E501
    UnicodeWarning,
    stacklevel=2,
)
content = raw_content.decode(fallback_encoding)
Member

It would be ideal to include filename or filepath of the requirements file.

Contributor Author

It would be ideal to include filename or filepath of the requirements file.

Agreed, cae26c0

Member

Yes, please.

Comment on lines 561 to 619
warnings.warn(
    f"unable to decode data with {exc.encoding}, falling back to {fallback_encoding}",  # noqa: E501
    UnicodeWarning,
    stacklevel=2,
)
Member

The problem with using warnings.warn is that its presentation format is inappropriately technical. logger.warning should be used instead.

Contributor Author

The problem with using warnings.warn is that its presentation format is inappropriately technical. logger.warning should be used instead.

I think I just went with this because I knew UnicodeWarning was a thing, happy to go with logging cae26c0
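
For reference, a minimal sketch of the logging-based variant discussed here, with the file name included in the message. `decode_with_fallback` and its parameters are hypothetical names for illustration; the actual change is in the referenced commit.

```python
import logging

logger = logging.getLogger(__name__)


def decode_with_fallback(raw: bytes, filename: str, encoding: str, fallback: str) -> str:
    """Decode raw with encoding, logging a warning and using fallback on failure."""
    try:
        return raw.decode(encoding)
    except UnicodeDecodeError:
        # logger.warning goes through the application's log formatting,
        # whereas warnings.warn prints a source location and category.
        logger.warning(
            "unable to decode %s with %s, falling back to %s",
            filename,
            encoding,
            fallback,
        )
        return raw.decode(fallback)
```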

@matthewhughes934 matthewhughes934 force-pushed the handle-request-file-decode-failures branch from 7df3500 to b4c3255 Compare August 15, 2024 17:49
    exc.encoding,
    fallback_encoding,
)
content = raw_content.decode(fallback_encoding)
Member

It may be a good idea to use errors="backslashreplace" here. Most of the time, the offending bytes would just be a part of a comment anyway and would not make a difference.

Contributor Author

@matthewhughes934 matthewhughes934 Aug 31, 2024

It may be a good idea to use errors="backslashreplace" here. Most of the time, the offending bytes would just be a part of a comment anyway and would not make a difference.

I've been hesitating with this a bit. Specifically, I'm wondering if this could be abused for nefarious purposes, where the contents of the file you 'see' (not well defined, since this is the case where the data won't fully decode) aren't the same contents that pip will process. Though I'm having a hard time finding a vulnerable use-case (something like injecting an extra element or adding a . to a domain name in a requirement URL).
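
To show what the `backslashreplace` error handler does in each situation (the keyword argument to `bytes.decode` is `errors`), here is a small sketch. The file contents and the `lenient_decode` helper are made up for illustration.

```python
def lenient_decode(raw: bytes, encoding: str = "utf-8") -> str:
    """Decode raw, replacing undecodable bytes with backslash escapes."""
    return raw.decode(encoding, errors="backslashreplace")


# Offending byte inside a comment: the requirement itself is unaffected.
in_comment = lenient_decode(b"requests  # caf\xe9\n")

# Offending byte inside a requirement URL: the text pip would process
# now contains a literal escape sequence instead of the original byte.
in_url = lenient_decode(b"pkg @ https://example\xe9.com/pkg.whl\n")

print(repr(in_comment))
print(repr(in_url))
```

The second case is the one the concern above is about: the decoded line differs from anything a viewer using another encoding would display.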

Member

sbidoul commented Oct 13, 2024

Hmm, since the documentation says it's utf-8 unless there is a PEP-263 style comment, shouldn't we rather decode as utf8 if there is no such comment, and if that fails, fall back to the current locale.getpreferredencoding(False) or sys.getdefaultencoding() with a deprecation warning recommending to add the encoding comment?

That way we have a (more or less) non-breaking path to compliance with the docs?

Also, I'd put all that in auto_decode, with a docstring comment that the function is meant for requirements.txt decoding as per the docs.
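
The order proposed in this comment could be sketched roughly as below. This is an illustration only, not pip's implementation: `decode_requirements`, `CODING_RE`, and the `fallback` parameter (added here so the locale can be overridden in tests) are hypothetical, and BOM handling is left out for brevity.

```python
import locale
import re
import sys
import warnings
from typing import Optional

# Hypothetical pattern, modelled on the PEP 263 declaration syntax.
CODING_RE = re.compile(rb"^[ \t\f]*#.*?coding[:=][ \t]*([-\w.]+)")


def decode_requirements(raw: bytes, fallback: Optional[str] = None) -> str:
    """Decode per coding comment, then UTF-8, then the locale encoding."""
    # 1. An explicit coding comment on the first two lines wins.
    for line in raw.splitlines()[:2]:
        match = CODING_RE.match(line)
        if match:
            return raw.decode(match.group(1).decode("ascii"))
    # 2. Otherwise assume UTF-8, as documented.
    try:
        return raw.decode("utf-8")
    except UnicodeDecodeError:
        # 3. Last resort: the locale encoding, with a deprecation warning.
        fallback = (
            fallback
            or locale.getpreferredencoding(False)
            or sys.getdefaultencoding()
        )
        warnings.warn(
            f"could not decode as utf-8, falling back to {fallback}; "
            "add a coding comment to declare the encoding explicitly",
            DeprecationWarning,
            stacklevel=2,
        )
        return raw.decode(fallback)
```

Files that are already valid UTF-8 or carry a coding comment never reach the locale step, which is what makes the path to docs compliance mostly non-breaking.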

@matthewhughes934
Contributor Author

Hmm, since the documentation says it's utf-8 unless there is a PEP-263 style comment, shouldn't we rather decode as utf8 if there is no such comment, and if that fails, fall back to the current locale.getpreferredencoding(False) or sys.getdefaultencoding() with a deprecation warning recommending to add the encoding comment?

That way we have a (more or less) non-breaking path to compliance with the docs?

Also, I'd put all that in auto_decode, with a docstring comment that the function is meant for requirements.txt decoding as per the docs.

This sounds reasonable, though I think this change would need to be made in auto_decode rather than where I made it. But now I am wondering if auto_decode needs to live where it does: it's only ever used for decoding requirements files, so maybe it can just live in req_file.

@sbidoul sbidoul removed this from the 24.3 milestone Oct 20, 2024
Member

sbidoul commented Oct 20, 2024

Sounds good. I've removed this from the 24.3 milestone. Feel free to ping me when you get back to this.

@matthewhughes934 matthewhughes934 force-pushed the handle-request-file-decode-failures branch from b4c3255 to d0bf895 Compare October 22, 2024 20:10
@matthewhughes934 matthewhughes934 changed the title Handle req file decode failures on locale encoding Reorder requirements file decoding Oct 22, 2024
Contributor Author

matthewhughes934 commented Oct 22, 2024

Sounds good. I've removed this from the 24.3 milestone. Feel free to ping me when you get back to this.

👍 I've updated the change and title+description. It was basically a re-do so I just stomped my previous commits.

Per the description, I found and fixed another bug while testing this: requirements files starting with a UTF-32 LE BOM would always be decoded as UTF-16 LE.

Successfully merging this pull request may close these issues.

UnicodeDecodeError: 'gbk' codec can't decode byte 0x82 in position 548: illegal multibyte sequence
4 participants