Merge Data merges columns I don't want merged #4077

janezd · 2019-10-03T15:00:43Z

This is the first nasty thing of the type I feared when removing Variable.make.

Build File -> kMeans -> Merge Data, then connect another File -> kMeans to the same Merge Data. In Merge Data, set the key to "Row Index". On the output of Merge Data I want the heart_disease data with two additional columns of cluster labels, so I can compare the clusterings.

Before [FIX] Merge data: Rename variables with duplicated names #4076, I added an Edit Domain to rename the column in one of the tables to "Clusters 2".
After [FIX] Merge data: Rename variables with duplicated names #4076, I don't use Edit Domain, and I expected to get columns "Clusters (1)" and "Clusters (2)". This does not happen, however, because attributes are now matched by name and type, so Merge Data believes that both tables have the same attribute "Clusters" and doesn't duplicate it. It takes the column from the first table and ignores the one from the second.

If we decide that, no, it should keep two columns, it would also duplicate all other columns (age, max HR...).

Options:

1. ~~Revert removal of Variable.make.~~
2. Do nothing. The user has to rename the columns with duplicated names.
3. Check whether the columns with same names have the same data. If so, keep a single column (and show info?). If they are different, use both columns with renaming as introduced in [FIX] Merge data: Rename variables with duplicated names #4076.
4. Check whether the columns with same names have the same data. If so, keep a single one. If not, show an error and let the user do the renaming.
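
To make option 3 concrete, here is a minimal sketch on plain Python lists. It is not the widget's code: `same_data`, `merge_columns` and the dict-of-columns representation are hypothetical, rows are assumed to be already aligned by the key, and clashes with existing names such as "Clusters (1)" are not handled.

```python
import math


def is_nan(x):
    """True for float NaN, False for anything else."""
    return isinstance(x, float) and math.isnan(x)


def same_data(a, b):
    """True if two columns hold the same values; NaN counts as equal to NaN."""
    return len(a) == len(b) and all(
        x == y or (is_nan(x) and is_nan(y)) for x, y in zip(a, b)
    )


def merge_columns(left, right):
    """Merge two tables given as {name: column} dicts (hypothetical helper).

    Same-named columns with identical data are kept once; same-named columns
    with different data are both kept as "name (1)" and "name (2)".
    """
    merged = {}
    for name, col in left.items():
        if name in right and not same_data(col, right[name]):
            merged[f"{name} (1)"] = col
            merged[f"{name} (2)"] = right[name]
        else:
            merged[name] = col
    # Columns that exist only in the right table are simply appended.
    for name, col in right.items():
        if name not in left:
            merged[name] = col
    return merged


left = {"age": [63, 67, 41], "Clusters": [0, 1, 1]}
right = {"age": [63, 67, 41], "Clusters": [1, 0, 1]}
result = merge_columns(left, right)
print(list(result))  # ['age', 'Clusters (1)', 'Clusters (2)']
```

With this rule, age is kept once because both tables carry the same values, while the two differing cluster columns both survive under new names.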

My 4 cents:

1. No. This problem is small in comparison with those caused by Variable.make. Something much worse must happen to reintroduce it.
2. No. The user needs to know why (s)he doesn't have two columns.
3. Yes.
4. Probably no. I see no good reason for it. Let us not annoy the user if the widget can do the job reasonably well. The user can still rename the columns if (s)he wants more informative names. Besides, option 3 is already almost implemented in [FIX] Merge data: Rename variables with duplicated names #4076; we just need to add checking the columns. If we go for 4, we'd discard [FIX] Merge data: Rename variables with duplicated names #4076, which would be a shame.

This problem was not caused by #4076. #4076 just didn't (and couldn't have) fixed it.


janezd · 2019-10-11T07:43:23Z

We go for option 3.

janezd · 2019-10-18T14:53:35Z

@VesnaT, I wrote some tests.

Note that with outer join (which is exactly the situation you've drawn on the paper), the tables are equivalent, so it makes sense to keep both columns. This is what the first if in var_needed does: if any rows from the right table are concatenated at the bottom, all right attributes are kept. Without this, outer join could add rows that come from the right table but contain only columns from the left table, so all these rows would have just NaNs.
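
The decision described above could be sketched as follows. This is only an illustration of the idea, not the actual var_needed from the PR: `right_column_needed`, the dict-of-columns tables and the `row_pairs` representation (pairs of left and right row indices, with None for a missing side) are all assumptions made for the example.

```python
import math


def is_nan(x):
    """True for float NaN, False for anything else."""
    return isinstance(x, float) and math.isnan(x)


def right_column_needed(name, left, right, row_pairs):
    """Decide whether the right table's column `name` must be kept.

    `row_pairs` lists (left_row, right_row) index pairs of the merged table;
    None marks a side with no matching row.
    """
    # Outer join appended rows that exist only in the right table: keep all
    # right columns, otherwise those rows would contain nothing but NaNs.
    if any(l is None and r is not None for l, r in row_pairs):
        return True
    # No same-named column on the left, so there is nothing to deduplicate.
    if name not in left:
        return True
    # Otherwise the column is needed only if it differs on some matched row.
    for l, r in row_pairs:
        if l is None or r is None:
            continue
        x, y = left[name][l], right[name][r]
        if x != y and not (is_nan(x) and is_nan(y)):
            return True
    return False


left = {"age": [63, 67]}
right = {"age": [63, 67, 41]}
inner = [(0, 0), (1, 1)]
outer = [(0, 0), (1, 1), (None, 2)]
print(right_column_needed("age", left, right, inner))  # False
print(right_column_needed("age", left, right, outer))  # True
```

In the outer case the third row comes only from the right table, so dropping the right age column would leave that row without any age value.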

janezd added the bug and needs discussion labels and removed the bug label Oct 3, 2019

janezd mentioned this issue Oct 11, 2019

[ENH] MergeData: Don't remove duplicate columns with different data #4100

Merged


janezd removed the needs discussion label Oct 11, 2019

janezd self-assigned this Oct 23, 2019

VesnaT closed this as completed in #4100 Oct 29, 2019
