
Increase robustness of is_split_into_words check: resolves ValueError #39

Merged · 4 commits into main on Oct 31, 2023

Conversation

tomaarsen (Owner)

Closes #38, Closes #35

Hello!

Pull Request overview

  • Increase robustness of is_split_into_words check

Details

In particular, before this PR, the tokenizer inspected only the first text among the prediction inputs to determine whether it was dealing with a string sentence, a list of string sentences, a sentence as a list of words, or a list of sentences each as a list of words. As a result, if you passed a list of sentences whose first sentence was just a single word, e.g. ['Avolon', 'Walmart - Milwaukee, WI'] or ["Unknown", "Unknown", "Unknown", "Sentence 1", "Sentence 2"], it would incorrectly treat the input as a single sentence that had already been split into words.

With this PR, the tokenizer will see them as a list of sentences.

However, do watch out for the following case: a list of sentences where each sentence is a single word. This is indistinguishable from a sentence as a list of words, and it will (perhaps unexpectedly) consider the text as the latter. I don't see a good solution for this, apart from splitting predict up into predict and predict_single or something.
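As a rough illustration (this is a sketch, not the actual SpanMarker source; the standalone helper and its exact checks are assumptions), the fix amounts to inspecting every element of a list of strings instead of only the first:

```python
from typing import List, Union

def is_split_into_words(inputs: Union[str, List[str], List[List[str]]]) -> bool:
    """Guess whether `inputs` is pre-tokenized (a sentence as a list of words,
    or a list of such sentences) rather than one or more raw string sentences."""
    if isinstance(inputs, str):
        # A single raw sentence is never pre-tokenized.
        return False
    if inputs and isinstance(inputs[0], list):
        # A list of lists of strings: a list of sentences, each split into words.
        return True
    # Old heuristic (buggy): look only at the first string, so
    # ["Avolon", "Walmart - Milwaukee, WI"] looked like one sentence as words.
    # New heuristic: only treat the list as one pre-tokenized sentence
    # if *none* of the strings contain whitespace.
    return not any(" " in text for text in inputs)

print(is_split_into_words(["Avolon", "Walmart - Milwaukee, WI"]))  # False: list of sentences
print(is_split_into_words(["Hello", "there", "world"]))            # True: one sentence as words
```

Note how the last call demonstrates the remaining ambiguity described above: a list of one-word sentences is indistinguishable from one sentence split into words, and this heuristic picks the latter.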

  • Tom Aarsen

@tomaarsen tomaarsen merged commit eede2a4 into main Oct 31, 2023
8 checks passed
@tomaarsen tomaarsen deleted the hotfix/is_split_into_words branch October 31, 2023 09:30