Dataset consistency #229

ymahlich · 2024-10-11T19:01:44Z

While downloading the data from figshare, I noticed that the dataset files are inconsistent: some are tab-separated, some comma-separated, some compressed, and some uncompressed.

Is there a specific reason for that? I understand that everything is handled internally, so as long as I interact with the data directly through coderdata objects it might not matter, but maybe we want to be consistent about this?

As far as I can tell, the current "schema" mapping each data type to a file format is as follows:

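| data type | csv | csv.gz | tsv.gz |
|---|---|---|---|
| copy_number | | x | |
| drug_descriptor | | | x |
| drugs | | | x |
| experiments | | | x |
| mutations | | x | |
| proteomics | | x | |
| samples | x | | |
| transcriptomics | | x | |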

sgosline · 2024-10-11T19:59:17Z

Yes! First off, pandas can read any of these formats, so functionally it doesn't matter. I default to CSV unless the file has drug information, which requires tabs (because drug descriptors and names have commas and quotes in them). We could probably gzip the samples files for consistency.
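
As a rough illustration of why the mixed formats are harmless on the reading side, here is a minimal sketch that picks the separator from the file extension and leaves decompression to pandas. The helper names `separator_for` and `read_table` and the example file names are hypothetical and are not part of coderdata; the sketch only assumes that files end in `.csv`, `.csv.gz`, or `.tsv.gz` as in the table above.

```python
from pathlib import Path


def separator_for(filename: str) -> str:
    """Return the field separator implied by a file name (hypothetical helper)."""
    suffixes = Path(filename).suffixes  # e.g. ['.tsv', '.gz'] for 'drugs.tsv.gz'
    if ".tsv" in suffixes:
        return "\t"
    if ".csv" in suffixes:
        return ","
    raise ValueError(f"Cannot infer a separator from {filename!r}")


def read_table(filename: str):
    """Read a csv, csv.gz, or tsv.gz file into a DataFrame (hypothetical helper)."""
    # Imported here so that separator_for stays usable without pandas installed.
    import pandas as pd

    # compression defaults to 'infer', so a '.gz' suffix is decompressed automatically.
    return pd.read_csv(filename, sep=separator_for(filename))
```

With something like this, `read_table("samples.csv")` and `read_table("drugs.tsv.gz")` would go through the same code path, so the inconsistency mostly affects people who open the files by hand.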

jjacobson95 · 2024-10-14T16:15:27Z

I like that we can preview the samples files on figshare when not gzipped, but I don't have a strong opinion either way.

This could very easily be modified in lines 321, 323 of build_all.py during the validation step.
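
For reference, compressing an existing samples file takes only a few lines. The following is a sketch using the Python standard library, not the code in build_all.py (which is not shown in this thread); the function name `gzip_file` and the example path are assumptions.

```python
import gzip
import shutil
from pathlib import Path


def gzip_file(src: str) -> Path:
    """Write a gzip-compressed copy of src next to it and return the new path."""
    src_path = Path(src)
    dest = src_path.with_name(src_path.name + ".gz")  # samples.csv -> samples.csv.gz
    with open(src_path, "rb") as f_in, gzip.open(dest, "wb") as f_out:
        shutil.copyfileobj(f_in, f_out)  # stream the bytes; no need to load the file
    return dest
```

Calling `gzip_file("samples.csv")` (a hypothetical path) would leave the original in place and add `samples.csv.gz`. The tradeoff noted above still applies: the gzipped copy can no longer be previewed on figshare.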

sgosline · 2024-11-12T01:07:20Z

Duplicate of #246.

ymahlich added this to CoderData Oct 11, 2024

jjacobson95 added the 'build' update label Oct 16, 2024

sgosline moved this to Backlog in CoderData Oct 16, 2024

sgosline mentioned this issue Nov 12, 2024

Update python functions to adhere to simpler standards and pre-format data #246

Open

sgosline closed this as not planned Nov 12, 2024

github-project-automation bot moved this from Backlog to Done in CoderData Nov 12, 2024
