UTF-8 coercion is slow #1068

Open
scottkleinman opened this issue Jan 28, 2022 · 0 comments

When files are uploaded and chardet detects the encoding type "ascii" but the bytes fail to decode with that encoding, the raw bytes are passed to Unicode, Dammit in order to coerce them to UTF-8:

from bs4 import UnicodeDammit

try:
    # Try to decode the string using the encoding chardet detected
    decoded_string = raw_bytes.decode(encoding_type)
except (UnicodeDecodeError, TypeError):
    # Fall back to Unicode, Dammit if the chardet encoding didn't work
    dammit = UnicodeDammit(raw_bytes)
    encoding_type = dammit.original_encoding
    decoded_string = raw_bytes.decode(encoding_type)

For a novel, this can take 15-16 seconds. The explanation is in the BeautifulSoup documentation:

Unicode, Dammit guesses correctly most of the time, but sometimes it makes mistakes. Sometimes it guesses correctly, but only after a byte-by-byte search of the document that takes a very long time.

Instead, it might be a good idea to feed Unicode, Dammit some "best guesses" as to what the encoding is. Here is what that might look like (along with some slight streamlining of the code):

from bs4 import UnicodeDammit

try:
    # Try to decode the string using the encoding we get from chardet
    decoded_string = raw_bytes.decode(encoding_type)
except (UnicodeDecodeError, TypeError):
    # Try Unicode, Dammit if chardet didn't work, providing some best guesses
    # if chardet detected "ascii"
    if encoding_type == "ascii":
        dammit = UnicodeDammit(raw_bytes, ["iso-8859-1", "iso-8859-15", "windows-1252"])
    else:
        dammit = UnicodeDammit(raw_bytes)
    decoded_string = dammit.unicode_markup

For the same novel, this gets the job done in 2-3 ms! The downside is that we potentially increase the chance that Unicode, Dammit will guess the wrong encoding, but the payoff seems worth it to me.
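
For anyone who wants to verify the difference, here is a minimal timing sketch; the file path is just a placeholder for a large, mostly-ASCII upload:

import time

from bs4 import UnicodeDammit

# Any large, mostly-ASCII file containing a few non-ASCII bytes will do.
with open("novel.txt", "rb") as f:
    raw_bytes = f.read()

start = time.perf_counter()
dammit = UnicodeDammit(raw_bytes)  # no hints: may trigger the slow byte-by-byte search
print(dammit.original_encoding, time.perf_counter() - start)

start = time.perf_counter()
dammit = UnicodeDammit(raw_bytes, ["iso-8859-1", "iso-8859-15", "windows-1252"])
print(dammit.original_encoding, time.perf_counter() - start)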

If we encounter similar bottlenecks with non-ASCII character sets, we could add similar "best guesses" for those.
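
A rough sketch of what that could look like (the BEST_GUESSES table and the coerce_to_unicode helper are hypothetical, not existing code):

from bs4 import UnicodeDammit

# Hypothetical lookup table: map the encoding chardet reports to a list of
# likely candidates to hand to Unicode, Dammit.
BEST_GUESSES = {
    "ascii": ["iso-8859-1", "iso-8859-15", "windows-1252"],
    # More entries could be added here if other character sets turn out to be slow.
}

def coerce_to_unicode(raw_bytes, encoding_type):
    """Decode raw bytes, falling back to Unicode, Dammit with best guesses."""
    try:
        return raw_bytes.decode(encoding_type)
    except (UnicodeDecodeError, TypeError):
        dammit = UnicodeDammit(raw_bytes, BEST_GUESSES.get(encoding_type, []))
        return dammit.unicode_markup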
