When files are uploaded and chardet detects the encoding type "ascii", the raw bytes are passed to Unicode, Dammit in order to coerce them to UTF-8:
try:
    # Try to decode the string using the encoding we get
    decoded_string = raw_bytes.decode(encoding_type)
except (UnicodeDecodeError, TypeError):
    # Try Unicode, Dammit if chardet didn't work
    dammit = UnicodeDammit(raw_bytes)
    encoding_type = dammit.original_encoding
    decoded_string = raw_bytes.decode(encoding_type)
For a novel, this can take 15-16 seconds. The explanation is in the BeautifulSoup documentation:
"Unicode, Dammit guesses correctly most of the time, but sometimes it makes mistakes. Sometimes it guesses correctly, but only after a byte-by-byte search of the document that takes a very long time."
Instead, it might be a good idea to feed Unicode, Dammit some "best guesses" as to what the encoding is. Here is what that might look like (along with some slight streamlining of the code):
try:
    # Try to decode the string using the encoding we get from chardet
    decoded_string = raw_bytes.decode(encoding_type)
except (UnicodeDecodeError, TypeError):
    # Try Unicode, Dammit if chardet didn't work, providing some best guesses if chardet detected "ascii"
    if encoding_type == "ascii":
        dammit = UnicodeDammit(raw_bytes, ["iso-8859-1", "iso-8859-15", "windows-1252"])
    else:
        dammit = UnicodeDammit(raw_bytes)
    decoded_string = dammit.unicode_markup
For the same novel, this gets the job done in 2-3ms! The downside is that we potentially increase the chance that Unicode, Dammit will guess the wrong encoding, but the tradeoff seems worth it to me.
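For reference, here is a minimal sketch of how that comparison could be reproduced; the file path is a placeholder and the timing loop is not part of the actual change:

import time
from bs4 import UnicodeDammit

with open("novel.txt", "rb") as f:  # hypothetical test file
    raw_bytes = f.read()

# Time Unicode, Dammit with and without the "best guesses" list
for overrides in ([], ["iso-8859-1", "iso-8859-15", "windows-1252"]):
    start = time.perf_counter()
    dammit = UnicodeDammit(raw_bytes, overrides)
    elapsed = time.perf_counter() - start
    print(f"overrides={overrides!r}: {dammit.original_encoding} in {elapsed * 1000:.1f} ms")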
If we encounter similar bottlenecks with non-ASCII character sets, we could add similar "best guesses" for those.
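If it comes to that, one possible shape for it is a small lookup table keyed by the chardet result. This is just a sketch; the BEST_GUESSES name and helper are hypothetical, and only the "ascii" entry comes from this issue:

from bs4 import UnicodeDammit

# Hypothetical mapping from the encoding chardet reported to the encodings
# most likely to be the real one; entries for other character sets would be
# added only if/when we hit similar slowdowns with them.
BEST_GUESSES = {
    "ascii": ["iso-8859-1", "iso-8859-15", "windows-1252"],
}

def decode_with_guesses(raw_bytes, encoding_type):
    try:
        # Try the encoding chardet gave us first
        return raw_bytes.decode(encoding_type)
    except (UnicodeDecodeError, TypeError):
        # Fall back to Unicode, Dammit, seeded with any best guesses we have
        dammit = UnicodeDammit(raw_bytes, BEST_GUESSES.get(encoding_type, []))
        return dammit.unicode_markup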