Disallowed words, boundaries, case #41

jaumeortola · 2019-08-02T22:13:12Z

https://github.com/Common-Voice/common-voice-wiki-scraper/blob/a23abced7713c2260f78fc77252727fe719d6eca/src/checker.rs#L37

Here you split words only at whitespace. You should use word boundaries instead (\b in a regular expression, or something equivalent such as a set of common separators). Otherwise, the word goes undetected in many contexts. For example, I have the word oster-monath in the disallowed words file, but when it appears in a sentence between quotation marks ("oster-monath") or next to a comma (oster-monath, ) it is not detected.
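To make the failure concrete, here is a minimal Python sketch. It is not the project's Rust code from checker.rs, and the function names and sample sentences are made up for illustration. It compares a whitespace-only split with a lookup based on \b word boundaries.

```python
import re

# Sample entry taken from the example above; the set itself is hypothetical.
DISALLOWED = {"oster-monath"}

def hits_whitespace_split(sentence):
    """Return tokens from a whitespace-only split that are in the list."""
    return [tok for tok in sentence.split() if tok in DISALLOWED]

def hits_word_boundaries(sentence):
    """Return listed words that occur between regex word boundaries."""
    return [
        word for word in sorted(DISALLOWED)
        if re.search(r"\b" + re.escape(word) + r"\b", sentence)
    ]

quoted = 'The month was called "oster-monath" by Bede.'
comma = "The month oster-monath, according to Bede, fell in spring."

for sentence in (quoted, comma):
    # The whitespace split yields tokens such as '"oster-monath"' or
    # 'oster-monath,' which do not equal the listed word.
    print(hits_whitespace_split(sentence), hits_word_boundaries(sentence))
```

With the whitespace split, the punctuation stays attached to the token, so the exact-match lookup misses it in both sentences.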

https://github.com/Common-Voice/common-voice-wiki-scraper/blob/a23abced7713c2260f78fc77252727fe719d6eca/src/checker.rs#L42

Case is very important for spelling in most languages. I think the disallowed words should be case-sensitive. Case-insensitive matching is sometimes used in NLP, but not in spell-checking! It could be made optional for each language.

jaumeortola · 2019-08-04T10:46:33Z

The word tokenization issue is a bit more complicated than I said, especially if you want it to be useful for all languages.

The most important thing is that the word tokenization used here must be exactly the same as the one used when generating the disallowed words list. (In addition, if different kinds of apostrophes are converted during blacklist generation, the same must be done here when extracting the sentences.)

I see these possibilities:

1. Use only whitespace as a separator. The blacklist generation will be more complex, and it will require more language expertise, but it is still feasible.
2. Use whitespace plus a few other common separators: comma, semicolon, quotation marks, question marks..., but not hyphens and apostrophes.
3. Use word boundaries (\b in regular expressions). This is probably too greedy and not good for some languages.

I would choose option 2. The list of separators could be a configuration option for each language.

Hyphens and apostrophes should be considered separators only when they are next to other separators. To implement this, separators should be multicharacter strings, not just single characters.
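A rough Python sketch of option 2 with this rule, to show what "multicharacter separators" could look like. The separator set and the function name are assumptions for illustration; the real set would come from the per-language configuration.

```python
import re

# Assumed core separators: whitespace and common punctuation.
CORE = r'\s,;:!?."()'

# A separator is a run that contains at least one core separator; hyphens and
# apostrophes are absorbed into the run only when they touch a core separator.
SEPARATOR = re.compile("['-]*[" + CORE + "][" + CORE + "'-]*")

def tokenize(sentence):
    """Split at separator runs and drop empty tokens."""
    return [tok for tok in SEPARATOR.split(sentence) if tok]

print(tokenize('He wrote l\'"oster-monath", then left.'))
print(tokenize("It's a well-known word."))
```

Here the apostrophe in l'" is treated as part of the separator because it sits next to a quotation mark, while the apostrophe in It's and the hyphens in oster-monath and well-known stay inside their tokens. A hyphen or apostrophe at the very start or end of the sentence is not handled by this sketch.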

Tell me what you decide, and I will redo the blacklist accordingly.

jaumeortola · 2019-08-06T22:11:34Z

We need to clarify the issue of word tokenization. I will make a concrete proposal.

Add to the language configuration file a setting with replacements to be made (in order) before splitting sentences at white spaces. To be general enough, the replacements must be regular expressions. Apostrophe replacements can be included here. For example:

replacements = [
    ["’", "'"],
    ["[,;?!.]", " "],
    ["'[,;?!.]", " "],
    ["^['-]", ""],
]

These replacements should be made exactly the same way when extracting sentences and when creating the blacklist.
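As a sketch of how such a setting could be applied, here is a small Python version (names are hypothetical, and this is not the project's implementation). One deliberate difference from the list above: the rule for an apostrophe followed by punctuation is placed before the plain punctuation rule. In the order given in the proposal, the punctuation would already have been replaced by a space when the apostrophe rule runs, so it would never match.

```python
import re

# Hypothetical rule list mirroring the proposal, applied in order.
REPLACEMENTS = [
    ("’", "'"),           # normalize the typographic apostrophe
    (r"'[,;?!.]", " "),   # apostrophe next to punctuation acts as a separator
    (r"[,;?!.]", " "),    # plain punctuation becomes whitespace
    (r"^['-]", ""),       # drop a leading apostrophe or hyphen
]

def normalize(sentence):
    """Apply every replacement in order to the whole sentence."""
    for pattern, replacement in REPLACEMENTS:
        sentence = re.sub(pattern, replacement, sentence)
    return sentence

def tokenize(sentence):
    """Normalize first, then split at whitespace only."""
    return normalize(sentence).split()

print(tokenize("L’abril, o oster-monath, era fred."))
```

The same tokenize function would have to be used both when building the blacklist and when checking sentences.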

The words in the blacklist should be case-sensitive. Or at least make it optional:

blacklist_caseinsensitive = false
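A minimal Python sketch of what the option would control. The function name and the sample word are assumptions; only the setting name comes from the line above.

```python
def is_disallowed(token, blacklist, case_insensitive=False):
    """Look up a token, honoring a blacklist_caseinsensitive-style flag."""
    if case_insensitive:
        # Compare case-folded forms on both sides.
        return token.casefold() in {word.casefold() for word in blacklist}
    return token in blacklist

# Hypothetical entry: the lowercase spelling is wrong, the capitalized one is fine.
blacklist = {"english"}

print(is_disallowed("English", blacklist))                         # case-sensitive
print(is_disallowed("English", blacklist, case_insensitive=True))  # case-insensitive
```

With case-sensitive matching, a list can reject a misspelled lowercase form without also rejecting the correct capitalized one.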

jaumeortola · 2019-08-08T09:58:46Z

Looking again into the word tokenization issue, I find that the problem was not "oster-monath" or oster-monath, but l'"oster-monath", which is not detected if the word in the blacklist is just oster-monath.
Anyway, this kind of problem will be solved by using, as I said, exactly the same tokenization and the same replacements when creating the blacklist and when selecting sentences.
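The point about sharing the tokenization can be sketched in Python as follows. The dictionary-based generation step, the function names, and the sample sentences are all assumptions made for illustration.

```python
def whitespace_tokenize(sentence):
    """Simplest possible tokenizer: split at whitespace only."""
    return sentence.split()

def build_blacklist(corpus, dictionary, tokenize):
    """Collect every corpus token that is not in the dictionary."""
    return {
        tok for sentence in corpus for tok in tokenize(sentence)
        if tok not in dictionary
    }

def is_clean(sentence, blacklist, tokenize):
    """A sentence passes if none of its tokens is blacklisted."""
    return all(tok not in blacklist for tok in tokenize(sentence))

corpus = ['era l\'"oster-monath" antic']
dictionary = {"era", "antic"}

generated = build_blacklist(corpus, dictionary, whitespace_tokenize)
handwritten = {"oster-monath"}
candidate = 'torna l\'"oster-monath" ara'

print(is_clean(candidate, generated, whitespace_tokenize))    # rejected
print(is_clean(candidate, handwritten, whitespace_tokenize))  # slips through
```

When the blacklist is produced with the same tokenizer that checks the sentences, the entry is the full token l'"oster-monath" and the sentence is rejected; a list containing only oster-monath does not match that token.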

jaumeortola mentioned this issue Aug 2, 2019

rules & disallowed words for Catalan #42

Merged

nukeador added the enhancement New feature or request label Aug 5, 2019

jaumeortola mentioned this issue Aug 6, 2019

Using a dictionary to generate blacklists #43

Closed

MichaelKohler added discussion extract-improvements labels Jan 3, 2020
