Describe the bug

In `classy/data/dataset/hf/classification.py#L89` we invoke `self.tokenize` (#L109), which correctly truncates the input. The issue arises due to `tuple(tok_encoding.word_to_tokens(wi)) for wi in range(len(tokens))`: when a token is not included in the input due to truncation, `word_to_tokens` returns `None`, and `tuple(None)` raises a `TypeError`. This triggers the catch condition and makes the function return `None`, which cannot be unpacked in `input_ids, token_offsets = self.tokenize(token_sample.tokens)`, resulting in another unhandled exception that finally crashes classy.

To Reproduce

In the token classification setting, input a sentence that has too many tokens (or lower the truncation limit to obtain the same effect).
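A minimal, self-contained sketch of the failure mode, using a hypothetical `FakeEncoding` stand-in for the real fast-tokenizer `BatchEncoding` (the `n_kept` name and the spans it returns are illustrative, not classy's actual code):

```python
# Stand-in for a fast-tokenizer encoding after truncation: kept words map to a
# (start, end) token span, truncated words map to None -- the same contract as
# BatchEncoding.word_to_tokens. `FakeEncoding` and `n_kept` are hypothetical.
class FakeEncoding:
    def __init__(self, n_kept):
        self.n_kept = n_kept

    def word_to_tokens(self, wi):
        if wi < self.n_kept:
            return (wi + 1, wi + 2)  # +1 accounts for a leading special token
        return None  # word wi was dropped by truncation

tokens = ["tok"] * 600       # more words than the truncation limit keeps
enc = FakeEncoding(n_kept=512)

try:
    offsets = [tuple(enc.word_to_tokens(wi)) for wi in range(len(tokens))]
except TypeError as e:
    # tuple(None) raises "'NoneType' object is not iterable" -- this is the
    # exception that the original catch clause turns into a silent `return None`.
    print(f"TypeError: {e}")
```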
Expected behaviour

I think there is a way to know how many of the original tokens were kept, and we could iterate over that count instead of `len(tokens)`; otherwise, we can simply iterate until `word_to_tokens(wi)` returns `None`. Comments?