Increase robustness of `is_split_into_words` check: resolves ValueError #39
Closes #38, Closes #35
Hello!
Pull Request overview

- Increase the robustness of the `is_split_into_words` check

Details
In particular, before this PR, the tokenizer would use only the first text among the prediction tokens to determine which of four input shapes it was dealing with: a string sentence, a list of string sentences, a sentence as a list of words, or a list of sentences each as a list of words. However, if the first sentence happens to be a single word while you're actually passing a list of sentences, e.g. `['Avolon', 'Walmart - Milwaukee, WI']` or `["Unknown", "Unknown", "Unknown", "Sentence 1", "Sentence 2"]`, then it would consider the inputs to be a single sentence pre-split into words. With this PR, the tokenizer correctly sees them as a list of sentences.
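For illustration, here is a minimal sketch of the kind of whole-input check this PR moves towards. The function name, return labels, and exact rules are hypothetical (the real check lives inside the tokenizer), but the core idea matches: inspect every element, not just the first one.

```python
from typing import List, Union

def infer_input_shape(inputs: Union[str, List[str], List[List[str]]]) -> str:
    """Hypothetical sketch: classify the input into one of the four shapes.

    Not the actual implementation; for illustration only.
    """
    if isinstance(inputs, str):
        return "string sentence"
    if all(isinstance(item, list) for item in inputs):
        return "list of sentences, each as a list of words"
    # All elements are strings: either one sentence pre-split into words,
    # or a list of string sentences. A word cannot contain whitespace, so
    # any element containing a space must be a full sentence.
    if any(" " in item for item in inputs):
        return "list of string sentences"
    # Every element is a single word: ambiguous, so default to treating
    # the input as one pre-split sentence (see the caveat below).
    return "sentence as a list of words"
```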
However, do watch out for the following case: a list of sentences where each sentence is a single word. This is indistinguishable from a sentence given as a list of words, and the tokenizer will (perhaps unexpectedly) treat the text as the latter. I don't see a good solution for this, apart from splitting `predict` up into e.g. `predict` and `predict_single`.
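To make the ambiguity concrete, using the hypothetical `infer_input_shape` sketch from above:

```python
print(infer_input_shape(["Avolon", "Walmart - Milwaukee, WI"]))
# -> "list of string sentences" (the second element contains whitespace)

# Two one-word sentences, or one two-word sentence? No heuristic that only
# inspects the strings themselves can tell these apart; the whitespace rule
# above falls back to the pre-split interpretation.
print(infer_input_shape(["New", "York"]))
# -> "sentence as a list of words"
```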