
UnicodeDecodeError: 'charmap' codec can't decode byte 0x9d in position 3417: character maps to <undefined> when trying to decode docs #208

Open
yuenherny opened this issue Sep 2, 2022 · 6 comments
Labels
bug Something isn't working


@yuenherny

Describe the bug
The library was unable to decode a byte into a character.

Affected dataset(s)

  • msmarco-passage/dev/small

To Reproduce
Steps to reproduce the behavior:

  1. Make sure collectionandqueries.tar.gz has already been downloaded into the respective dataset folder under ~/.ir_datasets
  2. Run:
import ir_datasets
train = ir_datasets.load('msmarco-passage/dev/small')
for doc in train.docs_iter():
    doc
  3. Wait for it to run, and you will see an error:
[INFO] [starting] fixing encoding
[INFO] [finished] fixing encoding: [07:07] [3.06GB] [7.16MB/s]
                                            
---------------------------------------------------------------------------
UnicodeDecodeError                        Traceback (most recent call last)
d:\Repos\XpressAI\vecto-reranking\1 - Dataset Exploration.ipynb Cell 6 in <cell line: 1>()
----> [1](vscode-notebook-cell:/d%3A/Repos/XpressAI/vecto-reranking/1%20-%20Dataset%20Exploration.ipynb#W5sZmlsZQ%3D%3D?line=0) for doc in train.docs_iter():
      [2](vscode-notebook-cell:/d%3A/Repos/XpressAI/vecto-reranking/1%20-%20Dataset%20Exploration.ipynb#W5sZmlsZQ%3D%3D?line=1)     doc

File d:\Repos\XpressAI\vecto-reranking\venv\lib\site-packages\ir_datasets\util\__init__.py:147, in DocstoreSplitter.__next__(self)
    146 def __next__(self):
--> 147     return next(self.it)

File d:\Repos\XpressAI\vecto-reranking\venv\lib\site-packages\ir_datasets\formats\tsv.py:92, in TsvIter.__next__(self)
     91 def __next__(self):
---> 92     line = next(self.line_iter)
     93     cols = line.rstrip('\n').split('\t')
     94     num_cols = len(self.cls._fields)

File d:\Repos\XpressAI\vecto-reranking\venv\lib\site-packages\ir_datasets\formats\tsv.py:30, in FileLineIter.__next__(self)
     28         self.stream = io.TextIOWrapper(self.ctxt.enter_context(self.dlc.stream()))
     29 while self.pos < self.start:
---> 30     line = self.stream.readline()
     31     if line != '\n':
     32         self.pos += 1

File ~\AppData\Local\Programs\Python\Python310\lib\encodings\cp1252.py:23, in IncrementalDecoder.decode(self, input, final)
     22 def decode(self, input, final=False):
---> 23     return codecs.charmap_decode(input,self.errors,decoding_table)[0]

UnicodeDecodeError: 'charmap' codec can't decode byte 0x9d in position 3417: character maps to <undefined>

Expected behavior
Decoding completes without error.

Additional context
[Screenshot of the error attached.]

@yuenherny added the bug label on Sep 2, 2022
@yuenherny
Author

The symbol â€” was all over the place in collections.tsv. Maybe this is what's causing the error?
[Screenshot of collections.tsv attached.]
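That junk is the classic symptom of UTF-8 bytes being read as cp1252. A quick illustration using only the standard library (not taken from the dataset, just a demonstration of the mechanism):

# The three UTF-8 bytes of an em dash, misread as cp1252,
# turn into three junk characters (a-circumflex, euro sign, curly right quote):
print('\u2014'.encode('utf-8').decode('cp1252'))  # â€”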

@yuenherny
Author

yuenherny commented Sep 2, 2022

Referring to this SO thread, maybe this is the solution?

In Windows, the default encoding is cp1252, but that readme file is most likely encoded in UTF8.

The error message tells you that cp1252 codec is unable to decode the character with the byte 0x9D. When I browsed through the readme file, I found this character: ” (also known as: "RIGHT DOUBLE QUOTATION MARK"), which has the bytes 0xE2 0x80 0x9D, which includes the problematic byte.

From:

with open('README.txt') as file:
    long_description = file.read()

Change into:

with open('README.txt', encoding="utf8") as file:
    long_description = file.read()

This will open the file with the proper encoding.
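The mismatch is easy to verify in isolation; a quick standard-library check (purely illustrative, not ir_datasets code):

data = '\u201d'.encode('utf-8')  # RIGHT DOUBLE QUOTATION MARK as UTF-8
print(data)                      # b'\xe2\x80\x9d'
data.decode('utf-8')             # works: '\u201d'
data.decode('cp1252')            # UnicodeDecodeError: 'charmap' codec can't
                                 # decode byte 0x9d in position 2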

When checking Line 30 of ir_datasets\formats\tsv.py, I found this line:

line = self.stream.readline()

and self.stream is an io.TextIOWrapper instance.

Referring to the io docs here:

The default encoding of TextIOWrapper and open() is locale-specific (locale.getpreferredencoding(False)).

However, many developers forget to specify the encoding when opening text files encoded in UTF-8 (e.g. JSON, TOML, Markdown, etc…) since most Unix platforms use UTF-8 locale by default. This causes bugs because the locale encoding is not UTF-8 for most Windows users.
...
Accordingly, it is highly recommended that you specify the encoding explicitly when opening text files. If you want to use UTF-8, pass encoding="utf-8". To use the current locale encoding, encoding="locale" is supported in Python 3.10.
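A minimal sketch of that platform difference (the io.BytesIO object here is a hypothetical stand-in for the dataset's byte stream):

import io
import locale

# 'cp1252' on a typical Windows install; 'UTF-8' on most Unix systems:
print(locale.getpreferredencoding(False))

# Without an explicit encoding, TextIOWrapper uses the locale default above.
# Passing encoding='utf-8' makes the behaviour identical on every platform:
raw = io.BytesIO('curly quote: \u201d\n'.encode('utf-8'))
text = io.TextIOWrapper(raw, encoding='utf-8')
print(text.readline())  # curly quote: ”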

@yuenherny
Author

yuenherny commented Sep 2, 2022

I modified Lines 26 and 28 of ir_datasets\formats\tsv.py in my venv, adding encoding="utf-8" as an argument to io.TextIOWrapper:

def __next__(self):
    ...
    if self.stream is None:
        if isinstance(self.dlc, list):
            self.stream = io.TextIOWrapper(self.ctxt.enter_context(self.dlc[self.stream_idx].stream()), encoding="utf-8")
        else:
            self.stream = io.TextIOWrapper(self.ctxt.enter_context(self.dlc.stream()), encoding="utf-8")
    while self.pos < self.start:
        line = self.stream.readline()
        if line != '\n':
            self.pos += 1
    ...

and it runs without raising an error now =) (I am using Windows 10)
I would be pleased to open a PR on this, if the collaborators don't mind.

@seanmacavaney
Collaborator

Thanks! I suspect it's this issue: #151

There's a branch that fixes it, but for some reason, it hasn't been merged into the main branch: https://github.com/allenai/ir_datasets/tree/encoding-fixes

I'll look into merging in the changes that have been made since the branch was created, and then pulling it into the main branch.

@seanmacavaney
Collaborator

It also looks like the FixEncoding module was bypassed, which is why you're getting all the characters like â€”. (FixEncoding replaces them with their correct unicode versions.)
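(That repair is essentially the reverse round trip. A rough sketch of the idea, not FixEncoding's actual implementation:)

# Mojibake appears when UTF-8 bytes are decoded as cp1252; encoding the junk
# back to cp1252 and decoding as UTF-8 recovers the intended character:
junk = '\u2014'.encode('utf-8').decode('cp1252')
print(junk)                                   # three junk characters
print(junk.encode('cp1252').decode('utf-8'))  # — (em dash)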

As with #209, I recommend just letting ir_datasets do its thing automatically. Or, if you already have a file and don't want to wait for the downloads, follow the instructions provided by the system.

@davidjurgens

Just to chime in: we've seen this same issue crop up with the irds:nfcorpus/dev dataset too. @seanmacavaney, is there any update on getting the encoding-fix branch merged? I only ask because I assigned my class a homework involving this dataset, and students who use Windows are now reporting that they can't load it without errors.
