Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merge Data merges columns I don't want merged #4077

Closed
janezd opened this issue Oct 3, 2019 · 2 comments · Fixed by #4100
Closed

Merge Data merges columns I don't want merged #4077

janezd opened this issue Oct 3, 2019 · 2 comments · Fixed by #4100
Assignees

Comments

@janezd
Copy link
Contributor

janezd commented Oct 3, 2019

This is the first nasty thing of the type I feared when removing Variable.make.

File -> kMeans -> Merge Data. Connect another File -> kMeans and then to the same merge data. In Merge data, set the key to "Row Index". On the output from Merge data I want to have heart_disease data with two additional columns with cluster labels, so I can compare clusterings.

If we decide that, no, it should keep two columns, it would also duplicate all other columns (age, max HR...).

Options:

  1. Revert removal of Variable.make.
  2. Do nothing. The user has to rename the columns with duplicated names.
  3. Check whether the columns with same names have the same data. If so, keep a single column (and show info?). If they are different, use both columns with renaming as introduced in [FIX] Merge data: Rename variables with duplicated names #4076.
  4. Check whether the columns with same names have the same data. If so, keep a single one. If not, show an error and let the user do the renaming.

My 4 cents:

  1. No. This problem is small in comparison with those caused by Variable.make. Something much worse must happen to reintroduce it
  2. No. The user needs to know why (s)he doesn't have two columns.
  3. Yes.
  4. Probably no. I see no good reason for it. Let us not annoy the user if the widget can do the job reasonably good. The user can still rename is (s)he wants more informative names if (s)he chooses to. Besides, option 3 is already almost implemented in [FIX] Merge data: Rename variables with duplicated names #4076, we just need to add checking the columns. If we go for 4, we'd discard [FIX] Merge data: Rename variables with duplicated names #4076, which would be a shame.

This problem was not caused by #4076. #4076 just didn't (and couldn't have) fixed it.

@janezd janezd added bug A bug confirmed by the core team needs discussion Core developers need to discuss the issue and removed bug A bug confirmed by the core team labels Oct 3, 2019
@janezd
Copy link
Contributor Author

janezd commented Oct 11, 2019

We go for option 3.

@janezd janezd removed the needs discussion Core developers need to discuss the issue label Oct 11, 2019
@janezd
Copy link
Contributor Author

janezd commented Oct 18, 2019

@VesnaT, I wrote some tests.

Note that with outer join (which is exactly the situation you've drawn on the paper), the tables are equivalent, so it makes sense to keep both columns. This is what the first if in var_needed does: if there are any rows from the right table that are concatenated at the bottom, all right attributes are kept. Without this, outer join could add rows that would come from the right table but contain only columns from the left table -- so all this rows would have just nans.

@janezd janezd self-assigned this Oct 23, 2019
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging a pull request may close this issue.

1 participant