Preprocess text widget should remove stop-words #455

Closed
PrimozGodec opened this issue Sep 30, 2019 · 3 comments · Fixed by #456

@PrimozGodec
Collaborator

Text version

0.7.3dev

Orange version

3.24.0dev

Expected behavior
  • Preprocess text widget should remove stop-words from the text for the Slovene language.
  • Stopwords languages should be ordered in alphabetical order.
Actual behavior

Stop-words are not removed when the stopword list from NLTK is used. The cause is whitespace next to each stopword in that list.

Steps to reproduce the behavior

Read some data with the Import Documents widget and connect it to the Preprocess Text widget. Observe the output of the Preprocess Text widget.

Additional info (worksheets, data, screenshots, ...)
@ajdapretnar
Collaborator

@PrimozGodec So in essence NLTK has the wrong stopwords? 😱

@PrimozGodec
Collaborator Author

Each stopword has an additional space after it. This holds at least for the Slovene language. The fix for this is in #456.
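
To illustrate why a trailing space defeats stop-word removal, here is a minimal sketch. It is not the actual change in #456; the sample list `raw_stopwords` and the helpers `clean_stopwords` and `remove_stopwords` are hypothetical names made up for this example, and the list only imitates entries that carry a trailing space.

```python
# Hypothetical sample imitating a stopword list whose entries
# carry a trailing space (not loaded from NLTK).
raw_stopwords = ["in ", "je ", "na ", "se "]


def clean_stopwords(words):
    """Strip surrounding whitespace from every entry and drop empty ones.

    A single pass over the list, so the cost is linear in its length.
    """
    return {w.strip() for w in words if w.strip()}


def remove_stopwords(tokens, stopwords):
    """Keep only the tokens that are not in the stopword set."""
    return [t for t in tokens if t not in stopwords]


tokens = ["danes", "je", "lepo", "na", "morju"]

# With the raw list, "je" never equals "je ", so nothing is removed.
unfiltered = remove_stopwords(tokens, set(raw_stopwords))

# After stripping, the stopwords match the tokens and are removed.
filtered = remove_stopwords(tokens, clean_stopwords(raw_stopwords))

print(unfiltered)
print(filtered)
```

Stripping once when the list is loaded, rather than on every lookup, keeps the membership test a plain set lookup.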

@PrimozGodec
Collaborator Author

I opened an issue on NLTK: nltk/nltk_data#139. In the meantime, I think #456 can serve as a quick fix. I think it is an OK solution since it is linear.
