
UnicodeDecodeError: 'charmap' codec can't decode byte 0x9d in position 3417: character maps to <undefined> when trying to decode docs #208

Open
yuenherny opened this issue Sep 2, 2022 · 6 comments
Labels
bug Something isn't working


@yuenherny

Describe the bug
The library was unable to decode a byte into a character.

Affected dataset(s)

  • msmarco-passage/dev/small

To Reproduce
Steps to reproduce the behavior:

  1. Make sure collectionandqueries.tar.gz has already been downloaded into the respective dataset folder under ~/.ir_datasets
  2. Run:
import ir_datasets
train = ir_datasets.load('msmarco-passage/dev/small')
for doc in train.docs_iter():
    doc
  3. Wait for it to run, and you will see an error:
[INFO] [starting] fixing encoding
[INFO] [finished] fixing encoding: [07:07] [3.06GB] [7.16MB/s]
                                            
---------------------------------------------------------------------------
UnicodeDecodeError                        Traceback (most recent call last)
d:\Repos\XpressAI\vecto-reranking\1 - Dataset Exploration.ipynb Cell 6 in <cell line: 1>()
----> [1](vscode-notebook-cell:/d%3A/Repos/XpressAI/vecto-reranking/1%20-%20Dataset%20Exploration.ipynb#W5sZmlsZQ%3D%3D?line=0) for doc in train.docs_iter():
      [2](vscode-notebook-cell:/d%3A/Repos/XpressAI/vecto-reranking/1%20-%20Dataset%20Exploration.ipynb#W5sZmlsZQ%3D%3D?line=1)     doc

File d:\Repos\XpressAI\vecto-reranking\venv\lib\site-packages\ir_datasets\util\__init__.py:147, in DocstoreSplitter.__next__(self)
    146 def __next__(self):
--> 147     return next(self.it)

File d:\Repos\XpressAI\vecto-reranking\venv\lib\site-packages\ir_datasets\formats\tsv.py:92, in TsvIter.__next__(self)
     91 def __next__(self):
---> 92     line = next(self.line_iter)
     93     cols = line.rstrip('\n').split('\t')
     94     num_cols = len(self.cls._fields)

File d:\Repos\XpressAI\vecto-reranking\venv\lib\site-packages\ir_datasets\formats\tsv.py:30, in FileLineIter.__next__(self)
     28         self.stream = io.TextIOWrapper(self.ctxt.enter_context(self.dlc.stream()))
     29 while self.pos < self.start:
---> 30     line = self.stream.readline()
     31     if line != '\n':
     32         self.pos += 1

File ~\AppData\Local\Programs\Python\Python310\lib\encodings\cp1252.py:23, in IncrementalDecoder.decode(self, input, final)
     22 def decode(self, input, final=False):
---> 23     return codecs.charmap_decode(input,self.errors,decoding_table)[0]

UnicodeDecodeError: 'charmap' codec can't decode byte 0x9d in position 3417: character maps to <undefined>

Expected behavior
Decoding completes without error.

Additional context
[Screenshot of the error attached.]

@yuenherny added the bug label on Sep 2, 2022
@yuenherny
Author

The symbol â€” was all over the place in collections.tsv. Maybe this is what's causing the error?
[Screenshot of collections.tsv attached.]
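That junk is the classic symptom of UTF-8 bytes being read as cp1252. A quick illustration using only the standard library (not taken from the dataset, just a demonstration of the mechanism):

# The three UTF-8 bytes of an em dash, misread as cp1252,
# turn into three junk characters (a-circumflex, euro sign, curly right quote):
print('\u2014'.encode('utf-8').decode('cp1252'))  # â€”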

@yuenherny
Author

yuenherny commented Sep 2, 2022

Referring to this SO thread, maybe this is the solution?

In Windows, the default encoding is cp1252, but that readme file is most likely encoded in UTF8.

The error message tells you that cp1252 codec is unable to decode the character with the byte 0x9D. When I browsed through the readme file, I found this character: ” (also known as: "RIGHT DOUBLE QUOTATION MARK"), which has the bytes 0xE2 0x80 0x9D, which includes the problematic byte.

From:

with open('README.txt') as file:
    long_description = file.read()

Change into:

with open('README.txt', encoding="utf8") as file:
    long_description = file.read()

This will open the file with the proper encoding.
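The mismatch is easy to verify in isolation; a quick standard-library check (purely illustrative, not ir_datasets code):

data = '\u201d'.encode('utf-8')  # RIGHT DOUBLE QUOTATION MARK as UTF-8
print(data)                      # b'\xe2\x80\x9d'
data.decode('utf-8')             # works: '\u201d'
data.decode('cp1252')            # UnicodeDecodeError: 'charmap' codec can't
                                 # decode byte 0x9d in position 2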

When checking Line 30 of ir_datasets\formats\tsv.py, I found this line:

line = self.stream.readline()

and self.stream is an io.TextIOWrapper instance.

Referring to the io docs here:

The default encoding of TextIOWrapper and open() is locale-specific (locale.getpreferredencoding(False)).

However, many developers forget to specify the encoding when opening text files encoded in UTF-8 (e.g. JSON, TOML, Markdown, etc…) since most Unix platforms use UTF-8 locale by default. This causes bugs because the locale encoding is not UTF-8 for most Windows users.
...
Accordingly, it is highly recommended that you specify the encoding explicitly when opening text files. If you want to use UTF-8, pass encoding="utf-8". To use the current locale encoding, encoding="locale" is supported in Python 3.10.
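A minimal sketch of that platform difference (the io.BytesIO object here is a hypothetical stand-in for the dataset's byte stream):

import io
import locale

# 'cp1252' on a typical Windows install; 'UTF-8' on most Unix systems:
print(locale.getpreferredencoding(False))

# Without an explicit encoding, TextIOWrapper uses the locale default above.
# Passing encoding='utf-8' makes the behaviour identical on every platform:
raw = io.BytesIO('curly quote: \u201d\n'.encode('utf-8'))
text = io.TextIOWrapper(raw, encoding='utf-8')
print(text.readline())  # curly quote: ”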

@yuenherny
Author

yuenherny commented Sep 2, 2022

I modified Lines 26 and 28 of ir_datasets\formats\tsv.py in my venv, adding encoding="utf-8" as an argument to io.TextIOWrapper:

def __next__(self):
    ...
    if self.stream is None:
        if isinstance(self.dlc, list):
            self.stream = io.TextIOWrapper(self.ctxt.enter_context(self.dlc[self.stream_idx].stream()), encoding="utf-8")
        else:
            self.stream = io.TextIOWrapper(self.ctxt.enter_context(self.dlc.stream()), encoding="utf-8")
    while self.pos < self.start:
        line = self.stream.readline()
        if line != '\n':
            self.pos += 1
    ...

and it runs without raising an error now =) (I am using Windows 10)
I would be pleased to open a PR on this, if the collaborators don't mind.

@seanmacavaney
Collaborator

Thanks! I suspect it's this issue: #151

There's a branch that fixes it, but for some reason, it hasn't been merged into the main branch: https://github.com/allenai/ir_datasets/tree/encoding-fixes

I'll look into merging in the changes that have been made since the branch was created, and then pulling it into the main branch.

@seanmacavaney
Collaborator

It also looks like the FixEncoding module was bypassed, which is why you're getting all the characters like â€”. (FixEncoding replaces them with their correct unicode versions.)
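(That repair is essentially the reverse round trip. A rough sketch of the idea, not FixEncoding's actual implementation:)

# Mojibake appears when UTF-8 bytes are decoded as cp1252; encoding the junk
# back to cp1252 and decoding as UTF-8 recovers the intended character:
junk = '\u2014'.encode('utf-8').decode('cp1252')
print(junk)                                   # three junk characters
print(junk.encode('cp1252').decode('utf-8'))  # — (em dash)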

As with #209, I recommend just letting ir_datasets do its thing automatically. Or, if you already have a file and don't want to wait for the downloads, follow the instructions provided by the system.

@davidjurgens

Just to chime in: we've seen this same issue crop up with the irds:nfcorpus/dev dataset too. @seanmacavaney, is there any update on getting the encoding-fix branch merged? I only ask because I assigned my class a homework involving this dataset, and students who use Windows are now reporting that they can't load it without errors.
