This will take an xlsx file and generate a project_0.json
suitable for uploading to the DSS.
This will then upload the project_0.json
and a (mostly) empty links.json
which will populate a new project in
the browser.
Source a new environment and install dependencies:
virtualenv -p python3.6 v3nv && . v3nv/bin/activate && pip install -r requirements.txt
Parse the xlsx:
#!/usr/bin/env bash
# E-GEOD-81547_curated_ontologies_07_2019.xlsx
# DSS prod uuid: cddab57b-6868-4be4-806f-395ed9dd635a
python xlsx_to_project_json.py --xlsx data/test_000.xlsx
# Gary_Bader_9_16.xlsx
# DSS prod uuid: 4d6f6c96-2a83-43d8-8fe1-0f53bffd4674
python xlsx_to_project_json.py --xlsx data/test_001.xlsx
# GEOD-93593_HCA_Ontologies_July_2.xlsx
# DSS prod uuid: 2043c65a-1cf8-4828-a656-9e247d4e64f1
python xlsx_to_project_json.py --xlsx data/test_002.xlsx
# hca-metadata-spreadsheet-GSE84133_pancreas.xlsx
# DSS prod uuid: f86f1ab4-1fbb-4510-ae35-3ffd752d4dfc
python xlsx_to_project_json.py --xlsx data/test_003.xlsx
# hca-metadata-spreadsheet-GSE95459-GSE114374-colon.xlsx
# DSS prod uuid: f8aa201c-4ff1-45a4-890e-840d63459ca2
python xlsx_to_project_json.py --xlsx data/test_004.xlsx
# mf-E-GEOD-106540_spreadsheet_v9.xlsx
# DSS prod uuid: 90bd6933-40c0-48d4-8d76-778c103bf545
python xlsx_to_project_json.py --xlsx data/test_005.xlsx
Adding --upload true
will upload the data to the DSS.
Note that UUID's are now always programmatically generated from GEO accessions
and cannot be provided via the commandline.
NOTE:
Edited the following fields in "data/test_004.xlsx":
publications.publication_url -> publications.url
publications.publication_title -> publications.title
NOTE
6 ORIGINAL DATASETS (ALREADY IN THE DSS):
spreadsheets/existing/*.xlsx
are the original excel files provided that
currently exist in dss prod and we have finished examples to compare against.
71 RAW DATASETS (STATUS NOT PARSED)
The xlsx
files in spreadsheets/new
were downloaded from a spreadsheet of
spreadsheets and assumed to be (mostly) complete projects. These inputs were
provided with the labels "finished" or "full". Differences assumed are inferred
from skimming over the files. I chose to use the inputs which end in ".0.xlsx"
("finished") rather than the normal ".xlsx" extension ("full").
These are missing fields such as the "funders" section (as opposed to the 6 excel files above). Not sure of other differences yet.