Dataset consistency #229
Comments
Yes! First off, pandas can read any of these formats, so it doesn't matter much. I default to CSV unless the file has drug information, which requires tabs (because drug descriptors and names have commas and quotes in them). We could probably gzip the samples files for consistency.
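To illustrate the point about pandas, here is a minimal sketch of a reader that handles all four combinations (comma or tab separated, gzipped or not). The helper name `read_table` and the file names in the usage note are hypothetical and not part of coderdata; the sketch assumes tab-separated files carry a `.tsv` suffix.

```python
from pathlib import Path

import pandas as pd


def read_table(path):
    """Read a CSV/TSV file, gzipped or not, into a DataFrame.

    Hypothetical helper, not part of coderdata. Assumes tab-separated
    files have ".tsv" somewhere in their suffixes (e.g. "x.tsv.gz").
    """
    path = Path(path)
    suffixes = [s.lower() for s in path.suffixes]
    # Pick the delimiter from the file name rather than sniffing it,
    # since drug names may themselves contain commas and quotes.
    sep = "\t" if ".tsv" in suffixes else ","
    # read_csv defaults to compression="infer", so a ".gz" suffix
    # is decompressed transparently.
    return pd.read_csv(path, sep=sep)
```

With this, `read_table("samples.csv")` and `read_table("drugs.tsv.gz")` (illustrative names) would both return a DataFrame without the caller specifying the format.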
I like that we can preview the samples files on figshare when they are not gzipped, but I don't have a strong opinion either way. This could very easily be changed in lines 321 and 323 of build_all.py, during the validation step.
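For reference, gzipping a file needs only the standard library. The following is a rough sketch of what such a step could look like; it is not the code in build_all.py, and the function name `gzip_file` is hypothetical.

```python
import gzip
import shutil
from pathlib import Path


def gzip_file(src, keep_original=True):
    """Write a gzipped copy of `src` next to it and return the new path.

    Hypothetical helper; not taken from build_all.py.
    """
    src = Path(src)
    dst = src.with_name(src.name + ".gz")
    # Stream the bytes through gzip so large files are not held in memory.
    with open(src, "rb") as f_in, gzip.open(dst, "wb") as f_out:
        shutil.copyfileobj(f_in, f_out)
    if not keep_original:
        src.unlink()
    return dst
```

Keeping the uncompressed original (the default here) would preserve the figshare preview, at the cost of storing both copies.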
Duplicate of #246.
While downloading the data from figshare, I noticed that some of the dataset files are tab-separated and some comma-separated, and that some are compressed and some are not.
Is there a specific reason for that? I understand that everything is handled internally, so as long as I interact with the data only through coderdata objects it might not matter, but maybe we want to be consistent about this?
As far as I can tell, the current "schema" for which data type uses which format is as follows: