Byte-fallback tokens are not detokenized properly #457

lecoqnicolas · 2025-01-15T16:34:44Z

Hello,

A few months ago I trained a French-Chinese model and noticed that byte-fallback was yielding strange triplets of byte-fallback (BF) tokens. More recently, while debugging both Chinese and Japanese, I looked into it and realized that these triplets are simply the usual three-byte UTF-8 encoding of CJK characters.

However, escaped bytes are usually written with a \u or \x prefix, not <0x as in the SentencePiece byte-fallback tokens. So a) I tried tweaking the shared_vocabulary in the affected packages, which made no difference at encoding time, and b) I remembered an earlier commit that touched the decoder: 543c50e

It turns out that if I revert to the previous version, out-of-vocabulary characters are detokenized correctly.
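To make the symptom concrete, here is a minimal pure-Python sketch of the difference between joining pieces naively and reassembling the byte-fallback pieces into UTF-8 characters. It is not the library's implementation: `decode_pieces_sketch` and the sample piece list are hypothetical, and the only assumption taken from above is that byte-fallback pieces look like `<0xNN>`.

```python
import re
from typing import List

# Assumed shape of a SentencePiece byte-fallback piece, e.g. "<0xE4>".
BYTE_PIECE = re.compile(r"<0x([0-9A-Fa-f]{2})>")


def decode_pieces_sketch(pieces: List[str]) -> str:
    """Join pieces, turning runs of <0xNN> pieces back into UTF-8 text."""
    out: List[str] = []
    pending = bytearray()  # bytes collected from consecutive <0xNN> pieces

    def flush() -> None:
        # Decode the collected bytes; invalid sequences become U+FFFD.
        if pending:
            out.append(pending.decode("utf-8", errors="replace"))
            pending.clear()

    for piece in pieces:
        match = BYTE_PIECE.fullmatch(piece)
        if match:
            pending.append(int(match.group(1), 16))
        else:
            flush()
            out.append(piece)
    flush()
    # "▁" (U+2581) is the whitespace marker, not an ASCII underscore.
    return "".join(out).replace("▁", " ")


# Hypothetical pieces: "你" (U+4F60) is E4 BD A0 in UTF-8.
pieces = ["▁Bonjour", "▁", "<0xE4>", "<0xBD>", "<0xA0>"]

naive = "".join(pieces).replace("▁", " ")
print(naive)                         # " Bonjour <0xE4><0xBD><0xA0>"
print(decode_pieces_sketch(pieces))  # " Bonjour 你"
```

The naive join leaves the three byte pieces in the output as literal text, which matches the "strange triplets" described above. Unlike a real SentencePiece decode, this sketch keeps the leading space.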

As for the underscores, I tried the following:

    def decode(self, tokens: List[str]) -> str:
        # detokenized = "".join(tokens)
        # return detokenized.replace("▁", " ")
        return self.lazy_processor().decode_pieces(tokens).replace("_", " ")

and it has not caused any problems so far.
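One detail that may be worth checking in the snippet above: the SentencePiece whitespace marker "▁" (U+2581) and the ASCII underscore "_" are different characters, so replacing "_" also rewrites underscores that genuinely belong to the text. A small illustration, using a made-up string rather than real model output:

```python
META = "\u2581"  # "▁", LOWER ONE EIGHTH BLOCK, used as the whitespace marker

# Hypothetical joined pieces containing both the marker and a real underscore.
text = "▁snake_case▁name"

marker_replaced = text.replace(META, " ")
underscore_replaced = text.replace("_", " ")

print(marker_replaced)      # " snake_case name"
print(underscore_replaced)  # "▁snake case▁name"
```

Whether the extra `.replace("_", " ")` is wanted therefore depends on whether literal underscores can occur in the translated text.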
