
Feature requests: when linked by ID, 1) allow cross-dataset visualization, and 2) merge datasets on ID #2529

Open
janeadams opened this issue Jan 23, 2025 · 0 comments


Describe the problem
Say I have two datasets that share an ID column, and I want to draw a 2D scatter plot of some measure as recorded in two different experiments, with one experiment on each axis. Currently, even though I have linked the datasets by ID, I cannot do this in either of the following ways:

  • I can't keep the two datasets separate and drag them both onto the same chart, because glue doesn't allow a chart to rely on more than one dataset.
  • I can't select both datasets and choose "merge", because even though they are linked by ID, they are merged by index. I know this because the chart below shows the measures of these genes having the same rank order in each experiment, which I know is not the case. I could "force" the merge to work correctly by sorting both datasets by ID ahead of time, but this is a shaky solution: it only works if exactly the same genes are in both datasets, with none present in only one of them.

Image
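
The difference between the two pairings can be reproduced outside glue with a small pandas sketch. This does not use glue's API; it only mimics index-based and ID-based merging, and the gene names and values are invented for illustration:

```python
import pandas as pd

# Invented toy data: the same three genes, stored in a different row order.
exp_a = pd.DataFrame({"my_id": ["g1", "g2", "g3"], "measure": [1.0, 2.0, 3.0]})
exp_b = pd.DataFrame({"my_id": ["g3", "g1", "g2"], "measure": [30.0, 10.0, 20.0]})

# Index-based pairing: row 0 goes with row 0, regardless of ID.
by_index = pd.concat([exp_a.add_suffix("_a"), exp_b.add_suffix("_b")], axis=1)

# ID-based pairing: rows are matched on the shared my_id column.
by_id = exp_a.merge(exp_b, on="my_id", suffixes=("_a", "_b"))

print(by_index[["my_id_a", "my_id_b"]])  # the two ID columns disagree
print(by_id)                             # each row describes a single gene
```

With index-based pairing, each plotted point combines measures from two different genes unless both datasets happen to be in the same order.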

Describe the solution you'd like:
The chart above should look like this instead:

Image

Describe alternatives you've considered:
I wrote the following code to merge all my datasets on my ID before bringing them into glue as a single dataset. This isn't a general solution, because it involves traversing a file system to find the correct files, but it could be generalized within glue using dataset selections. Note that I have adapted this code from my use case, so it is closer to pseudo-code; I haven't run this specific version.

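import os

import pandas as pd

# `folder` and `files` (the CSV file names inside data/<folder>) are assumed
# to be defined earlier, as in the original use case.
id_to_link_on = 'my_id'

dfs = []

for file in files:
    df = pd.read_csv(os.path.join('data', folder, file))
    # Suffix every measure column with the file name, but leave the ID column
    # untouched so it can still be used as the merge key.
    df.rename(columns={a: f'{a}_{file}' for a in df.columns if a != id_to_link_on},
              inplace=True)
    dfs.append(df)

merged = dfs[0]

# Merge on the shared ID rather than on row position; the default inner join
# keeps only IDs that are present in every file.
for df in dfs[1:]:
    merged = merged.merge(df, on=id_to_link_on)

merged = merged.set_index(id_to_link_on)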

This would be a broadly useful tool for anyone trying to visualize measures for the same entities across datasets.
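
As a rough sketch of what a generalized version might look like in plain pandas (not glue's API), the merge could take already-loaded datasets instead of file paths. The function name `merge_on_id` and the toy data are hypothetical, and the outer join is an assumption chosen so that IDs present in only one dataset are kept with missing values rather than dropped:

```python
from functools import reduce

import pandas as pd


def merge_on_id(frames, id_column):
    """Outer-join named DataFrames on id_column.

    `frames` maps a dataset name to its DataFrame; every column except
    the ID is suffixed with that name so measures stay distinguishable.
    """
    renamed = [
        df.rename(columns={c: f"{c}_{name}" for c in df.columns if c != id_column})
        for name, df in frames.items()
    ]
    merged = reduce(
        lambda left, right: left.merge(right, on=id_column, how="outer"),
        renamed,
    )
    return merged.set_index(id_column)


# Invented example: g1 appears only in exp1, g3 only in exp2.
exp1 = pd.DataFrame({"my_id": ["g1", "g2"], "measure": [1.0, 2.0]})
exp2 = pd.DataFrame({"my_id": ["g2", "g3"], "measure": [20.0, 30.0]})

merged = merge_on_id({"exp1": exp1, "exp2": exp2}, "my_id")
print(merged)
```

Genes missing from one experiment end up with NaN in that experiment's column, so a scatter plot could skip them without misaligning the remaining points.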
