UnicodeDecodeError: 'charmap' codec can't decode byte 0x9d in position 3417: character maps to <undefined> when trying to decode docs #208
Comments
Referring to this SO thread, maybe this is the solution?
When checking Line 30 of
and Referring to the
|
I modified Lines 26 and 28 of
and it runs without raising an error now =) (I am using Windows 10) |
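For context, here is a minimal Python sketch of this class of error. It assumes the cause is Windows falling back to the cp1252 code page when a file is opened without an explicit encoding; the thread does not show the exact change, so this is an illustration rather than the actual patch. The file name and sample text are hypothetical. Byte 0x9d is the last byte of a UTF-8 right double quotation mark and has no mapping in cp1252.

```python
import os
import tempfile

# Hypothetical sample text containing a right double quotation mark (U+201D),
# which UTF-8 encodes as the bytes E2 80 9D.
text = "passage with a curly quote \u201d"

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "sample.tsv")  # hypothetical file name

    # Write the file as UTF-8, as the downloaded collection is assumed to be.
    with open(path, "w", encoding="utf-8") as f:
        f.write(text)

    # Simulate the Windows default by decoding with cp1252 explicitly.
    failed = False
    error_message = ""
    try:
        with open(path, encoding="cp1252") as f:
            f.read()
    except UnicodeDecodeError as exc:
        failed = True
        error_message = str(exc)

    # Reading with an explicit UTF-8 encoding recovers the original text.
    with open(path, encoding="utf-8") as f:
        recovered = f.read()

print("cp1252 read failed:", failed)
print(error_message)
```

Passing `encoding="utf-8"` to `open()` explicitly avoids depending on the platform default.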
Thanks! I suspect it's this issue: #151. There's a branch that fixes it, but for some reason it hasn't been merged into the main branch: https://github.com/allenai/ir_datasets/tree/encoding-fixes. I'll look into merging in the changes that have been made since the branch was created, and then into pulling it into the main branch. |
It also looks like the As with #209, I recommend just letting ir_datasets do its thing automatically. Or, if you already have a file and don't want to wait for the downloads, follow the instructions provided by the system. |
Just to chime in, we've seen this same issue crop up with the |
Describe the bug
The library was unable to decode a byte into a character.
Affected dataset(s)
msmarco-passage/dev/small
To Reproduce
Steps to reproduce the behavior:
collectionandqueries.tar.gz has already been downloaded in the respective dataset folder in the ~/.ir_datasets folder
Expected behavior
Decoding completes without error.
Additional context
Screenshot: