UTF-8 coercion is slow #1068

Open
scottkleinman opened this issue Jan 28, 2022 · 0 comments

When files are uploaded and chardet detects the encoding type "ascii" but the bytes fail to decode with that encoding, the raw bytes are passed to Unicode, Dammit in order to coerce them to UTF-8:

from bs4 import UnicodeDammit

try:
    # Try to decode the string using the encoding chardet detected
    decoded_string = raw_bytes.decode(encoding_type)
except (UnicodeDecodeError, TypeError):
    # Fall back to Unicode, Dammit if the chardet encoding didn't work
    dammit = UnicodeDammit(raw_bytes)
    encoding_type = dammit.original_encoding
    decoded_string = raw_bytes.decode(encoding_type)

For a novel, this can take 15-16 seconds. The explanation is in the BeautifulSoup documentation:

Unicode, Dammit guesses correctly most of the time, but sometimes it makes mistakes. Sometimes it guesses correctly, but only after a byte-by-byte search of the document that takes a very long time.

Instead, it might be a good idea to feed Unicode, Dammit some "best guesses" as to what the encoding is. Here is what that might look like (along with some slight streamlining of the code):

from bs4 import UnicodeDammit

try:
    # Try to decode the string using the encoding we get from chardet
    decoded_string = raw_bytes.decode(encoding_type)
except (UnicodeDecodeError, TypeError):
    # Try Unicode, Dammit if chardet didn't work, providing some best guesses
    # if chardet detected "ascii"
    if encoding_type == "ascii":
        dammit = UnicodeDammit(raw_bytes, ["iso-8859-1", "iso-8859-15", "windows-1252"])
    else:
        dammit = UnicodeDammit(raw_bytes)
    decoded_string = dammit.unicode_markup

For the same novel, this gets the job done in 2-3 ms! The downside is that we potentially increase the chance that Unicode, Dammit will guess the wrong encoding, but the payoff seems worth it to me.
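
For anyone who wants to verify the difference, here is a minimal timing sketch; the file path is just a placeholder for a large, mostly-ASCII upload:

import time

from bs4 import UnicodeDammit

# Any large, mostly-ASCII file containing a few non-ASCII bytes will do.
with open("novel.txt", "rb") as f:
    raw_bytes = f.read()

start = time.perf_counter()
dammit = UnicodeDammit(raw_bytes)  # no hints: may trigger the slow byte-by-byte search
print(dammit.original_encoding, time.perf_counter() - start)

start = time.perf_counter()
dammit = UnicodeDammit(raw_bytes, ["iso-8859-1", "iso-8859-15", "windows-1252"])
print(dammit.original_encoding, time.perf_counter() - start)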

If we encounter similar bottlenecks with non-ASCII character sets, we could add similar "best guesses" for those.
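
A rough sketch of what that could look like (the BEST_GUESSES table and the coerce_to_unicode helper are hypothetical, not existing code):

from bs4 import UnicodeDammit

# Hypothetical lookup table: map the encoding chardet reports to a list of
# likely candidates to hand to Unicode, Dammit.
BEST_GUESSES = {
    "ascii": ["iso-8859-1", "iso-8859-15", "windows-1252"],
    # More entries could be added here if other character sets turn out to be slow.
}

def coerce_to_unicode(raw_bytes, encoding_type):
    """Decode raw bytes, falling back to Unicode, Dammit with best guesses."""
    try:
        return raw_bytes.decode(encoding_type)
    except (UnicodeDecodeError, TypeError):
        dammit = UnicodeDammit(raw_bytes, BEST_GUESSES.get(encoding_type, []))
        return dammit.unicode_markup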
