New widget : cleansing data #6640

Closed
simonaubertbd opened this issue Nov 18, 2023 · 4 comments

@simonaubertbd

What's your use case?
More than once, the data I work with has contained null values or unwanted trailing spaces in fields; sometimes an entire field is empty, or there are completely empty rows in the middle of the data set. These cases are not numerous, but they come up so often that a widget to automate the cleanup would be great.

What's your proposed solution?
A widget with the main cleaning operations. Something like this:
image

Are there any alternative solutions?
Using several widgets to process the data in every affected workflow.

@wvdvegte

Have you checked out the widgets Impute, Unique and Preprocess?

@simonaubertbd
Author

Hello @wvdvegte. This time, yes, and also Preprocess Text. But it's not exactly the same thing, since here you can also clean strings, choose which fields the operations apply to, etc. The idea is rather to improve data quality.

Best regards,

Simon

@wvdvegte

There's also a lot you can do using the Formula widget and any Python code that fits into a one-line variable assignment, e.g. removing (leading, trailing or all) spaces, case modifications and many other things, as long as it doesn't require external libraries or multiple lines of code. For inexperienced programmers like me, AI chatbots can be used very effectively to generate such code.
Anyway, if the typical cleansing actions that you refer to turn out to be universal, it might indeed be a good idea to unify them in a new widget, or to add them to one of the preprocessing widgets.
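
To illustrate the kind of one-line expressions meant here, a minimal sketch in plain Python follows. It uses only built-in `str` methods; the variable `name` is a made-up stand-in for a text column, and how the Formula widget exposes column values is not shown here.

```python
# Hypothetical raw field value; in a one-line expression this would be a text column.
name = "  Ada   Lovelace  "

stripped = name.strip()              # remove leading and trailing spaces
left_only = name.lstrip()            # remove leading spaces only
right_only = name.rstrip()           # remove trailing spaces only
no_spaces = name.replace(" ", "")    # remove all spaces
collapsed = " ".join(name.split())   # collapse runs of whitespace into single spaces
upper = name.strip().upper()         # case modification: upper case
title = name.strip().title()         # case modification: title case
```

Each right-hand side is a single expression, so it fits the "one-line variable assignment" constraint and needs no external libraries.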

@ajdapretnar
Contributor

@simonaubertbd My first impression is that your task can be achieved with some combination of existing widgets. Admittedly, for some specifics, you would indeed need Python Script, particularly for text handling.
Case by case:

  • remove null data (Purge and Select Rows)
  • replace nulls (Impute)
  • remove unwanted characters (text preprocess)
    As for text, you can set what you are transforming with the Corpus widget.
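
For the cases that do end up in a script, the three operations above can be sketched in plain Python. This is only an illustration under stated assumptions: rows are modelled as dictionaries rather than an Orange table, and `clean_rows` and `fill_value` are hypothetical names, not part of any Orange API.

```python
def clean_rows(rows, fill_value="unknown"):
    """Strip string fields, drop fully empty rows, and fill remaining nulls."""
    cleaned = []
    for row in rows:
        stripped = {}
        for key, value in row.items():
            if isinstance(value, str):
                # Remove surrounding whitespace; a blank string counts as null.
                value = value.strip() or None
            stripped[key] = value
        # Skip rows in which every field is null.
        if all(v is None for v in stripped.values()):
            continue
        # Replace the remaining nulls with the fill value.
        cleaned.append(
            {k: fill_value if v is None else v for k, v in stripped.items()}
        )
    return cleaned


# Made-up sample data: trailing spaces, an empty row, and a missing value.
rows = [
    {"city": "Lyon  ", "country": "France"},
    {"city": None, "country": "  "},
    {"city": "Turin", "country": None},
]
result = clean_rows(rows)
```

A single fill value for every field is a simplification; the Impute widget offers per-column strategies that this sketch does not attempt to reproduce.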

If nothing else, such a widget is more text-specific than general Orange. I need to be convinced of its general applicability first. At the moment, it seems specific to your own workflow.

@janezd janezd closed this as completed Dec 8, 2023