Preprocess text widget should remove stop-words #455

PrimozGodec · 2019-09-30T14:43:43Z

0.7.3dev

3.24.0dev

Preprocess text widget should remove stop-words from the text for the Slovene language.
Stopword languages should be listed in alphabetical order.

Stop-words are not removed when the stopwords from NLTK are used. The problem is the spaces beside the stopwords.

Read some data with Import Documents and connect it to the Preprocess Text widget. Observe the output of the Preprocess Text widget.

ajdapretnar · 2019-10-01T08:07:58Z

@PrimozGodec So in essence NLTK has the wrong stopwords? 😱

PrimozGodec · 2019-10-01T09:50:42Z

They have an additional space after each stopword. This holds at least for the Slovene language. The fix for this is in #456.

PrimozGodec · 2019-10-01T10:19:41Z

I opened an issue on NLTK, nltk/nltk_data#139. In any case, I think #456 can serve as a quick fix. I think it is an OK solution since it is linear.
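
For illustration, here is a minimal sketch of the kind of linear workaround described above: strip whitespace from every stopword once, before filtering. It is not the code from #456. The function names `clean_stopwords` and `remove_stopwords` and the sample word list are hypothetical, and the list only imitates the trailing-space problem rather than being read from NLTK.

```python
def clean_stopwords(raw_words):
    """Strip surrounding whitespace from each entry and drop empty ones.

    Runs once over the list, so the cost is linear in its length.
    """
    return {w.strip() for w in raw_words if w.strip()}


def remove_stopwords(tokens, stopwords):
    """Keep only the tokens that are not in the stopword set."""
    return [t for t in tokens if t not in stopwords]


# Hypothetical sample that imitates a stopword list with trailing spaces.
raw = ["in ", "je ", "na ", "za "]
tokens = ["to", "je", "primer", "na", "mizi"]

# With the raw list nothing matches, because "je " != "je".
unfiltered = remove_stopwords(tokens, set(raw))

# After stripping, the stopwords match and are removed.
filtered = remove_stopwords(tokens, clean_stopwords(raw))
```

With the raw list the tokens pass through unchanged. With the stripped set, "je" and "na" are filtered out.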

PrimozGodec added the bug label Sep 30, 2019

PrimozGodec mentioned this issue Sep 30, 2019

[FIX] Fix stopwords filtering #456

Merged

3 tasks

ajdapretnar closed this as completed in #456 Oct 1, 2019
