Disallowed words, boundaries, case #41
Comments
The word tokenization issue is a bit more complicated than I said, especially if you want it to be useful for all languages. The most important thing is that the word tokenization used here must be exactly the same as the one used when generating the disallowed words list. (In addition, if different kinds of apostrophes are converted during blacklist generation, the same conversion must be applied here when extracting the sentences.) I see these possibilities:
I would choose option 2. The list of separators could be a configuration option for each language. Hyphens and apostrophes should be considered separators only when they are next to other separators. To implement this, separators should be multi-character strings, not just single characters. Tell me what you decide, and I will redo the blacklist accordingly.
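The rule for hyphens and apostrophes could be sketched as follows. This is only an illustration, not the project's code: the separator lists and the names `ALWAYS_SEPARATORS`, `CONDITIONAL_SEPARATORS`, and `tokenize` are hypothetical. It also assumes that the start and end of the text count as separators, and it checks neighbouring characters instead of using multi-character separator strings.

```rust
/// Characters that always separate words. Hypothetical default list;
/// in practice this would come from the per-language configuration.
const ALWAYS_SEPARATORS: &[char] = &[' ', '\t', '\n', ',', '.', ';', ':', '!', '?', '"', '(', ')'];

/// Characters that separate words only when they touch another separator
/// (or the start or end of the text).
const CONDITIONAL_SEPARATORS: &[char] = &['-', '\'', '\u{2019}'];

fn is_word_char(c: char) -> bool {
    !ALWAYS_SEPARATORS.contains(&c) && !CONDITIONAL_SEPARATORS.contains(&c)
}

/// Splits `text` into words, keeping hyphens and apostrophes that sit
/// between two word characters (as in "oster-monath" or "l'eau").
fn tokenize(text: &str) -> Vec<String> {
    let chars: Vec<char> = text.chars().collect();
    let mut words = Vec::new();
    let mut current = String::new();
    for (i, &c) in chars.iter().enumerate() {
        let separator = if CONDITIONAL_SEPARATORS.contains(&c) {
            // Word-internal only if both neighbours are word characters.
            let prev_is_word = i > 0 && is_word_char(chars[i - 1]);
            let next_is_word = i + 1 < chars.len() && is_word_char(chars[i + 1]);
            !(prev_is_word && next_is_word)
        } else {
            ALWAYS_SEPARATORS.contains(&c)
        };
        if separator {
            if !current.is_empty() {
                words.push(std::mem::take(&mut current));
            }
        } else {
            current.push(c);
        }
    }
    if !current.is_empty() {
        words.push(current);
    }
    words
}
```

With these lists, a hyphen inside `oster-monath` stays part of the word, while an apostrophe used as a quotation mark around a word is dropped. The same function would have to be used when generating the blacklist.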
We need to clarify the issue of word tokenization. I will make a concrete proposal. Add to the language configuration file a setting with replacements to be made (in order) before splitting sentences at whitespace. To be general enough, the replacements must be regular expressions. Apostrophe replacements can be included here. For example:
These replacements should be made exactly the same way when extracting sentences and when creating the blacklist. The words in the blacklist should be case-sensitive. Or at least make it optional:
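A minimal sketch of this proposal, under stated assumptions: the struct `LanguageRules`, its fields, and the function names are hypothetical, and the replacements here are literal strings so that the example needs only the standard library. The proposal itself asks for regular expressions, which would presumably mean a regex library instead of `str::replace`.

```rust
use std::collections::HashSet;

/// Hypothetical per-language settings; the field names are illustrative.
struct LanguageRules {
    /// (from, to) pairs applied in order before splitting at whitespace.
    /// Literal strings here; the proposal calls for regular expressions.
    replacements: Vec<(String, String)>,
    /// Whether disallowed words are compared case-sensitively.
    case_sensitive: bool,
}

/// Applies the replacements in order, splits at whitespace, and
/// lowercases the words only when matching is case-insensitive.
fn words(text: &str, rules: &LanguageRules) -> Vec<String> {
    let mut normalized = text.to_string();
    for (from, to) in &rules.replacements {
        normalized = normalized.replace(from.as_str(), to.as_str());
    }
    normalized
        .split_whitespace()
        .map(|w| if rules.case_sensitive { w.to_string() } else { w.to_lowercase() })
        .collect()
}

/// Runs the blacklist entries through the same pipeline as the sentence,
/// so both sides are tokenized identically.
fn contains_disallowed(sentence: &str, disallowed: &[&str], rules: &LanguageRules) -> bool {
    let blocked: HashSet<String> = disallowed.iter().flat_map(|w| words(w, rules)).collect();
    words(sentence, rules).iter().any(|w| blocked.contains(w))
}
```

Routing both the sentence and the blacklist through `words` is the point of the design: any change to the replacement list then affects both sides at once.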
Looking again into the word tokenization issue, I find that the problem was not
https://github.com/Common-Voice/common-voice-wiki-scraper/blob/a23abced7713c2260f78fc77252727fe719d6eca/src/checker.rs#L37
Here you split words just around white spaces. You should use word boundaries instead (in regexp: `\b`, or something equivalent like common separators). Otherwise, the word is not detected in many contexts. For example, I have the word `oster-monath` in the disallowed words file, but in a sentence it appears between quotation marks (`"oster-monath"`) or near a comma (`oster-monath,`) and it is not detected.
https://github.com/Common-Voice/common-voice-wiki-scraper/blob/a23abced7713c2260f78fc77252727fe719d6eca/src/checker.rs#L42
Case is very important for spelling in most languages. I think the disallowed words should be case-sensitive. Case-insensitive matching is sometimes used in NLP, but not in spell-checking! It could be optional for each language.
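One simple way to approximate word boundaries without a regex engine is to keep the whitespace split but strip non-alphanumeric characters from both ends of each token before the lookup. The sketch below assumes that approach and compares words case-sensitively; `find_disallowed` is a hypothetical name, not a function from `checker.rs`.

```rust
/// Returns the disallowed words found in `sentence`.
/// Tokens are split at whitespace, then punctuation is trimmed from both
/// ends, so `"oster-monath"` and `oster-monath,` both yield `oster-monath`.
/// The comparison is case-sensitive.
fn find_disallowed<'a>(sentence: &'a str, disallowed: &[&str]) -> Vec<&'a str> {
    sentence
        .split_whitespace()
        // Trim leading and trailing characters that are not letters or digits.
        .map(|token| token.trim_matches(|c: char| !c.is_alphanumeric()))
        .filter(|word| !word.is_empty() && disallowed.iter().any(|d| *d == *word))
        .collect()
}
```

Hyphens and apostrophes inside a word survive because only the ends of each token are trimmed. A capitalized `Oster-monath` would not match a lowercase entry, which is the case-sensitive behaviour requested above.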