This repository is a store of example data files, in a number of formats, from across the NCI Cancer Research Data Commons (CRDC) nodes. Each directory represents a single dataset downloaded from a node and contains a Jupyter Notebook documenting how it was downloaded. CCDH will use this example data to build and test the CRDC-H data model.
Our first example is based on a dataset of 560 cases that we downloaded from the GDC Public API. In a Jupyter Notebook, we describe how to load this data into Python Data Classes and then export it as YAML, JSON-LD or Turtle. This is not yet intended to be a comprehensive transform of all the retrieved GDC cases, but rather to showcase the features made available through the Python Data Classes that are part of the artifacts generated from the CRDC-H model. The JSON-LD and Turtle exports of the data are also available.
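The load-then-export pattern described above can be sketched with the standard library alone. This is only an illustration under stated assumptions: the `Case` and `Diagnosis` classes, their fields, and the `case_from_gdc` helper are hypothetical stand-ins, not the classes generated from the CRDC-H model, and the record is made up rather than taken from the dataset. Only a plain JSON export is shown; the YAML, JSON-LD and Turtle exports in the notebook rely on the generated artifacts and are not reproduced here.

```python
import json
from dataclasses import asdict, dataclass, field
from typing import List, Optional


@dataclass
class Diagnosis:
    # Hypothetical, simplified stand-in for a generated model class.
    primary_diagnosis: str
    age_at_diagnosis: Optional[int] = None


@dataclass
class Case:
    # Hypothetical, simplified stand-in for a generated model class.
    case_id: str
    primary_site: Optional[str] = None
    diagnoses: List[Diagnosis] = field(default_factory=list)


def case_from_gdc(record: dict) -> Case:
    """Map one GDC-style case record (a plain dict) onto the data classes."""
    return Case(
        case_id=record["case_id"],
        primary_site=record.get("primary_site"),
        diagnoses=[
            Diagnosis(
                primary_diagnosis=d["primary_diagnosis"],
                age_at_diagnosis=d.get("age_at_diagnosis"),
            )
            for d in record.get("diagnoses", [])
        ],
    )


# Illustrative record only; not taken from the actual dataset.
example_record = {
    "case_id": "example-case-0001",
    "primary_site": "Colon",
    "diagnoses": [{"primary_diagnosis": "Adenocarcinoma, NOS"}],
}

case = case_from_gdc(example_record)

# asdict() recursively converts nested data classes into plain dicts,
# which json.dumps() can then serialize.
exported = json.dumps(asdict(case), indent=2, sort_keys=True)
print(exported)
```

Keeping the mapping in a single function makes it easy to see which source fields are carried over and which are dropped, which matters while the transform is still deliberately incomplete.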
This example is based on v1.0-pre1 of the CRDC-H model, which is included in this repository. We will continue to update the example as the model develops, but it may be out of sync with the latest version of the model until we have the time to update it.
Many of the processes in this repository are documented in Jupyter Notebook format files, which have an .ipynb extension. These files can be viewed directly in GitHub (see CDA example for subject 09CO022 as an example). You can also view them in the Jupyter Notebook viewer (see CDA example for subject 09CO022 as an example).
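Because an .ipynb file is a JSON document, its structure can also be inspected without installing Jupyter. The following is a minimal sketch, assuming the nbformat 4 layout with a top-level `cells` list; the helper names `load_notebook` and `summarize_notebook` are illustrative, and the in-memory notebook is a made-up example rather than one of the notebooks in this repository.

```python
import json
from collections import Counter


def load_notebook(path: str) -> dict:
    """Parse an .ipynb file from disk into a plain dict."""
    with open(path, encoding="utf-8") as f:
        return json.load(f)


def summarize_notebook(nb: dict) -> Counter:
    """Count cells by type, assuming the nbformat 4 top-level 'cells' list."""
    return Counter(cell["cell_type"] for cell in nb.get("cells", []))


# A tiny in-memory notebook used for demonstration only.
demo_notebook = {
    "nbformat": 4,
    "nbformat_minor": 5,
    "metadata": {},
    "cells": [
        {"cell_type": "markdown", "metadata": {}, "source": ["# Title"]},
        {
            "cell_type": "code",
            "metadata": {},
            "source": ["print('hello')"],
            "outputs": [],
            "execution_count": None,
        },
        {"cell_type": "markdown", "metadata": {}, "source": ["Some notes."]},
    ],
}

summary = summarize_notebook(demo_notebook)
print(dict(summary))
```

To apply this to a notebook on disk, pass the result of `load_notebook(path)` to `summarize_notebook`.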
If you would like to execute this file, you will need to install Jupyter Notebook (also available on Homebrew for Mac). You can then download the .ipynb file and open it in Jupyter Notebook on your computer by running:
$ jupyter notebook cptac2-subject-09CO022/CDA\ example\ for\ subject\ 09CO022.ipynb
This repository uses Poetry for dependency management. You can therefore also install Poetry, then run:
$ poetry install
$ poetry run jupyter notebook