Dataset consistency #229

Closed
ymahlich opened this issue Oct 11, 2024 · 3 comments

@ymahlich
Collaborator

While downloading the data from figshare, I noticed that the dataset files are inconsistent: some are tab-separated, some comma-separated, some compressed, and some uncompressed.

Is there a specific reason for that? I understand that everything is handled internally, so as long as I interact with the data through coderdata objects it might not matter, but maybe we want to be consistent about this?

As far as I can tell, the current "schema" mapping each data type to a file format is as follows:

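| data type | csv | csv.gz | tsv.gz |
| --- | --- | --- | --- |
| copy_number | | x | |
| drug_descriptor | | | x |
| drugs | | | x |
| experiments | | | x |
| mutations | | x | |
| proteomics | | x | |
| samples | x | | |
| transcriptomics | | x | |

@sgosline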
Member

Yes! First off, pandas can read any of these formats, so it doesn't matter. I default to csv unless the file has drug information, which requires tabs (because drug descriptors and names have commas and quotes in them). We could probably gzip the samples files for consistency.
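
For anyone reading the files outside of coderdata, the file extension alone is enough to pick the delimiter and the compression. The following is only a minimal sketch using the Python standard library; the helper names `delimiter_for` and `read_rows` are hypothetical and are not part of coderdata. With pandas, passing the same delimiter as `sep` to `read_csv` should behave equivalently, since `read_csv` infers gzip compression from a `.gz` extension by default.

```python
import csv
import gzip
from pathlib import Path


def delimiter_for(path):
    """Pick the field delimiter from the file name: tab for .tsv, comma otherwise."""
    # Hypothetical helper, not part of coderdata.
    return "\t" if ".tsv" in Path(path).suffixes else ","


def read_rows(path):
    """Read a .csv or .tsv file (optionally gzipped) into a list of dicts."""
    # Use gzip.open for .gz files and the built-in open for plain files.
    opener = gzip.open if str(path).endswith(".gz") else open
    with opener(path, "rt", newline="", encoding="utf-8") as handle:
        return list(csv.DictReader(handle, delimiter=delimiter_for(path)))
```

Because the drug files are split on tabs, a comma inside a drug name stays within a single field.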

@jjacobson95
Collaborator

I like that we can preview the samples files on figshare when not gzipped, but I don't have a strong opinion either way.

This could very easily be modified in lines 321 and 323 of build_all.py, during the validation step.
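
If we did decide to compress the samples files, the change would amount to something like the sketch below. This is not the actual code in build_all.py; `gzip_file` is a hypothetical helper and the file name in the usage note is only an example.

```python
import gzip
import shutil
from pathlib import Path


def gzip_file(path, keep_original=False):
    """Compress `path` to `path + '.gz'` and return the new path."""
    # Hypothetical helper, not taken from build_all.py.
    src = Path(path)
    dst = src.with_name(src.name + ".gz")
    # Stream the bytes so large files are never held in memory at once.
    with open(src, "rb") as fin, gzip.open(dst, "wb") as fout:
        shutil.copyfileobj(fin, fout)
    if not keep_original:
        src.unlink()
    return dst
```

Calling `gzip_file("example_samples.csv")` would leave `example_samples.csv.gz` in place of the original, at the cost of losing the figshare preview mentioned above.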

@sgosline
Member

Duplicate of #246.

@sgosline closed this as not planned on Nov 12, 2024
github-project-automation bot moved this from Backlog to Done in CoderData on Nov 12, 2024