This is a demonstration system that uses ontologies to harmonize data on various cohorts for the International HundredK+ Cohorts Consortium (IHCC).
- a Unix system, e.g. Linux, macOS, or possibly Windows PowerShell or Cygwin (not tested)
- git
- GNU Make
- Java 8 or greater
- Python 3.6 or greater
Alternatively you can use Docker (see below).
- Clone this repository to your local machine and cd to the new directory
- Optional but recommended: run Python with venv
- Install Python requirements: python3 -m pip install -r requirements.txt (python3 may be replaced with just python if this does not work)
- Run make update and open build/index.html in your browser to see results
$ git clone https://github.com/IHCC-cohorts/data-harmonization.git
$ cd data-harmonization
$ python3 -m venv .venv
$ source .venv/bin/activate
$ pip install -r requirements.txt
$ make update
If you have docker installed, you can instead run the build steps inside an ODK Docker container:
$ git clone https://github.com/IHCC-cohorts/data-harmonization.git
$ cd data-harmonization
$ sh odk.sh make update
Note that the ODK is currently configured to consume up to 4GB of RAM; should you ever need more than that, please edit the memory configuration parameters in the odk.sh file.
- Manual Files: files that are manually updated and maintained in version control
- Generated Files in Version Control: files that are automatically rebuilt and maintained in version control
- Generated Files in Build: files that are automatically built and not maintained in version control
Manual Files
- data/member_cohorts.csv: IHCC Member Cohorts
- data/metadata.json: short names (id) and prefixes for all cohorts
Generated Files in Version Control
- data/cohort-data.json: cohort data from data/member_cohorts.csv and GECKO mappings (upper-level categories)
- mappings/index.csv: full GECKO index and OBO terms used in GECKO (to be moved to GECKO repo) from GECKO Mappings
- mappings/properties.csv: properties used in mapping GECKO->OBO terms (to be moved to GECKO repo) from GECKO Mappings
Generated Files in Build
- build/index.html: HTML table containing links to cohort build files (check this after each build to make sure all expected files were properly built)
Manual Files
- metadata/*.ttl: ontology header for the cohort containing details about the data dictionary (description, license, etc.)
Generated Files in Version Control
- templates/*.csv: ROBOT template for cohort data dictionary to build ontology file
- mappings/*.csv: ROBOT template for cohort data dictionary mappings to GECKO
Generated Files in Build
- build/*.owl: OWL ontology representation of data dictionary built from the ROBOT template
- build/*.html: table of data dictionary terms and details
- build/*-tree.html: browsable tree-view of the OWL ontology representation
- build/*-gecko.html: browsable tree-view of the data dictionary to GECKO mappings
First, clone this repository to your local machine and create a new branch. Then, follow the steps below and make a pull request with all required changes.
The "short name" will be the lowercase name used for all files generated for your cohort. Some guidelines to follow when selecting a short name are:
- select a name that is unique and easy to remember (e.g., Golestan Cohort Study = gcs)
- use an acronym (e.g., South African Population Research Infrastructure Network (SAPRIN) = saprin)
- replace spaces with dashes (e.g., Genomics England = genomics-england)
Your prefix should be your short name in uppercase (e.g., gcs = GCS), unless there is some reason to change it. You should shorten your prefix if your short name uses more than one word (e.g., genomics-england = GE). We request that you do not use underscores in your prefix. Like the short names, prefixes must be unique across all cohorts.
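The naming guidelines above can be sketched as a small check. This is only an illustration: the helper names `check_short_name` and `suggest_prefix` are hypothetical and not part of this repository, and restricting short names to lowercase letters, digits, and dashes is an assumption drawn from the examples.

```python
import re


def check_short_name(short_name: str) -> bool:
    """Return True if the short name is lowercase and uses dashes, not spaces.

    Assumption: only lowercase letters, digits, and single dashes are allowed.
    """
    return re.fullmatch(r"[a-z0-9]+(-[a-z0-9]+)*", short_name) is not None


def suggest_prefix(short_name: str) -> str:
    """Suggest a prefix following the guidelines above.

    One-word short names are uppercased; multi-word short names are
    shortened to their initials so the prefix contains no separators.
    """
    words = short_name.split("-")
    if len(words) == 1:
        return short_name.upper()
    return "".join(word[0] for word in words).upper()


print(suggest_prefix("gcs"))
print(suggest_prefix("genomics-england"))
```

The suggestion is only a starting point; you still need to confirm that the short name and prefix are not already used by another cohort in data/metadata.json.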
Cohort data dictionaries are transformed into OWL ontologies using ROBOT templates. All new cohorts must provide a tab on our ROBOT template Google spreadsheet with their data dictionary in this format. Name this tab with your chosen cohort short name. We recommend that you take a moment to briefly look over the template documentation, but you do not need to read it in-depth; we will provide all required template strings below.
The following information is required:
- ID
- Label
All IDs must be unique and must be in CURIE format. Use the prefix you chose in the previous step as the namespace. We recommend that you use 7-digit numeric IDs (e.g., GCS:0000001), but if you already have IDs for your items you may reuse those (e.g., MAELSTROM:Blood_immune_dis). Please note that IDs are case sensitive and once an ID has been created for a term, that ID is permanent.
We recommend that all labels be unique, as well, but this is not required.
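Before filling in the spreadsheet, it can help to check your IDs against the rules above. The following is a minimal sketch, not part of the build; `check_ids` is a hypothetical helper, and the regular expression is a simplified notion of a CURIE (a prefix, a colon, and a local ID without whitespace).

```python
import re
from collections import Counter

# Simplified CURIE shape: PREFIX:LOCAL_ID, with no whitespace in either part.
CURIE_PATTERN = re.compile(r"^([^\s:]+):(\S+)$")


def check_ids(ids, prefix):
    """Return a list of problems found in the given IDs (empty if none)."""
    problems = []
    for curie, count in Counter(ids).items():
        match = CURIE_PATTERN.match(curie)
        if match is None:
            problems.append(f"{curie}: not in CURIE format")
        elif match.group(1) != prefix:
            # The comparison is case sensitive, like the IDs themselves.
            problems.append(f"{curie}: expected prefix {prefix}")
        if count > 1:
            problems.append(f"{curie}: used {count} times")
    return problems


print(check_ids(["GCS:0000001", "GCS:0000001", "gcs:0000002"], "GCS"))
```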
The following information is highly recommended:
- Parent: category under which this data item falls (this must also be defined in your data dictionary)
- Definition: one-sentence description of what kind of data is collected for this item
The following information is optional:
- Comment: an editor's comment about this data dictionary item
- Answer Type: type of answer (e.g., numerical, date, etc.)
- Formula: formula to determine this data value, if calculated
- Measurement Time: time period for which this data is collected (e.g., over time?)
- Question Description: the question asked to collect this data
If you need another property to describe your data dictionary items, please let us know by opening a new issue. The properties must be added to our repository before they can be used in a template.
The first row of the spreadsheet should be human-readable column headers. The second row will be the ROBOT template strings for each column. The third row should be left empty for future validation. The data dictionary entries should start on row 4.
The basic ROBOT template strings are as follows (note that the A and C characters in the template strings are necessary for ROBOT to properly parse the contents of a column):
- ID: ID
- Label: LABEL
- Parent: C % SPLIT=|
- Definition: A definition
- Comment: A comment
- Answer Type: A answer type
- Question Description: A question description
The table may end up looking something like this:
ID | Label | Parent | Definition |
---|---|---|---|
ID | LABEL | C % SPLIT=| | A definition |
EX:0000000 | Date of Birth | Patient Data | Date of birth of patient |
Any parent used in column 3 must also be defined in the table, otherwise ROBOT will not be able to parse the row. If you have an item with more than one parent, you can separate the parents with a pipe symbol (e.g., Parent 1|Parent 2).
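The layout rules (empty third row, entries from row 4) and the parent rule can be checked on a CSV export of your tab before running the build. This is a rough sketch under assumptions: `check_template` is a hypothetical helper, the header names "Label" and "Parent" are taken from the example table, and parents are assumed to be referenced by label.

```python
import csv
import io


def check_template(rows):
    """Check a template given as a list of rows (each row a list of cells).

    Row 1 is the header, row 2 the ROBOT template strings, row 3 must be
    empty, and data dictionary entries start on row 4.
    """
    problems = []
    header = rows[0]
    if any(cell.strip() for cell in rows[2]):
        problems.append("row 3 is not empty")
    label_col = header.index("Label")
    parent_col = header.index("Parent")
    entries = rows[3:]
    labels = {row[label_col] for row in entries}
    for number, row in enumerate(entries, start=4):
        # Multiple parents are separated with a pipe symbol.
        for parent in row[parent_col].split("|"):
            parent = parent.strip()
            if parent and parent not in labels:
                problems.append(f"row {number}: parent '{parent}' is not defined")
    return problems


sheet = """ID,Label,Parent,Definition
ID,LABEL,C % SPLIT=|,A definition
,,,
EX:0000000,Date of Birth,Patient Data,Date of birth of patient
"""
print(check_template(list(csv.reader(io.StringIO(sheet)))))
```

For the example table, the check reports that "Patient Data" is used as a parent on row 4 without being defined in its own row.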
The cohort browser is driven by GECKO. In order to display results for your cohort in the browser, you must map your data dictionary items to the GECKO terms.
To start, add a tab to the master GECKO mapping sheet. This tab must be named with the cohort short name. You may need to request edit access to this sheet to proceed.
Each mapping sheet is also a ROBOT template. The first four columns and their ROBOT template strings are as follows:
- ID: ID
- Label: no ROBOT template string (labels are already defined by your first Google sheet, so ROBOT only needs the IDs)
- Mapping Type: CLASS_TYPE
- GECKO Term: C % SPLIT=|
You can add additional details starting in column 5 (e.g., comments, parents). These additional details do not need ROBOT template strings.
ID | Label | Mapping Type | GECKO Term |
---|---|---|---|
ID | | CLASS_TYPE | C % SPLIT=| |
ID of term from your data dictionary | Label of term from your data dictionary | type of mapping (this determines what kind of OWL axiom is created): subclass (close-match, more specific) or equivalent (exact) | GECKO term to map to (referred to by label) |
EX:0000000 | Date of Birth | subclass | Age/birthdate |
If a term from your data dictionary maps to more than one GECKO term, you can include multiple mappings in column 4 by separating the values with a pipe symbol (e.g., Term 1|Term 2).
Include all your data dictionary IDs and Labels to begin. You can leave columns 3 and 4 empty for these rows until you have started your mappings. All GECKO terms can be found in the index of the master mapping sheet. You can also browse a hierarchical version of GECKO.
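To make the mapping columns concrete, the sketch below expands mapping rows into one (ID, axiom, GECKO term) triple per mapped term. It only illustrates how the sheet is read and is not how ROBOT builds the ontology; `expand_mappings` is a hypothetical helper, and the axiom names are the usual OWL keywords for the two mapping types.

```python
# OWL axiom keyword for each mapping type used in column 3.
AXIOMS = {"subclass": "SubClassOf", "equivalent": "EquivalentTo"}


def expand_mappings(rows):
    """Expand (id, label, mapping_type, gecko_terms) rows into triples.

    Rows with an empty mapping type or GECKO term are skipped, since
    columns 3 and 4 may be left empty until mapping has started.
    """
    triples = []
    for term_id, _label, mapping_type, gecko_terms in rows:
        if not mapping_type or not gecko_terms:
            continue
        # Multiple GECKO terms are separated with a pipe symbol.
        for gecko_term in gecko_terms.split("|"):
            triples.append((term_id, AXIOMS[mapping_type], gecko_term.strip()))
    return triples


rows = [
    ("EX:0000000", "Date of Birth", "subclass", "Age/birthdate"),
    ("EX:0000001", "Not mapped yet", "", ""),
]
print(expand_mappings(rows))
```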
Using the following template, add an entry to data/metadata.json:
"[full cohort name]": {
"id": "[cohort short name]",
"prefix": "[cohort prefix]"
}
Please note that the [full cohort name] must match the name recorded in data/member_cohorts.csv.
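If you prefer to script this step, the sketch below adds an entry while guarding against duplicates. It assumes data/metadata.json is a single JSON object keyed by full cohort name, as the template suggests; `add_cohort` is a hypothetical helper, not a function in this repository.

```python
import json


def add_cohort(metadata, full_name, short_name, prefix):
    """Return a copy of the metadata with a new cohort entry added.

    Raises ValueError if the name, short name, or prefix is already in use,
    since short names and prefixes must be unique across all cohorts.
    """
    if full_name in metadata:
        raise ValueError(f"cohort already registered: {full_name}")
    for name, entry in metadata.items():
        if entry["id"] == short_name:
            raise ValueError(f"short name {short_name} is used by {name}")
        if entry["prefix"] == prefix:
            raise ValueError(f"prefix {prefix} is used by {name}")
    updated = dict(metadata)
    updated[full_name] = {"id": short_name, "prefix": prefix}
    return updated


existing = {"Golestan Cohort Study": {"id": "gcs", "prefix": "GCS"}}
updated = add_cohort(existing, "Genomics England", "genomics-england", "GE")
print(json.dumps(updated, indent=2))
```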
Using the following template, add an entry to src/prefixes.json:
"[cohort prefix]": "https://purl.ihccglobal.org/[cohort prefix]_"
You must also create a Turtle-format header containing your cohort's metadata, which will be included in the ontology version of your cohort data dictionary.
To do this, create a new file in the metadata directory using this name: [cohort short name].ttl. Then, paste this template in that file and replace any square brackets. Do not change anything else.
@prefix rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#> .
@prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#> .
@prefix xsd: <http://www.w3.org/2001/XMLSchema#> .
@prefix owl: <http://www.w3.org/2002/07/owl#> .
@prefix dcterms: <http://purl.org/dc/terms/>.
@prefix prov: <http://www.w3.org/ns/prov#> .
<>
rdf:type owl:Ontology ;
dcterms:title "[full cohort name]" ;
dcterms:description "[one-sentence description of your data dictionary]" ;
dcterms:license <[link to license]> ;
dcterms:rights "[text description of license]" .
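Filling in the template by hand is fine; if you would rather generate the file, the following sketch renders the same header from four values. `render_header` is a hypothetical helper, and the values passed in the example are placeholders, not real cohort metadata. Quotes and backslashes in text values are escaped so the Turtle string literals stay valid.

```python
HEADER_TEMPLATE = """@prefix rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#> .
@prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#> .
@prefix xsd: <http://www.w3.org/2001/XMLSchema#> .
@prefix owl: <http://www.w3.org/2002/07/owl#> .
@prefix dcterms: <http://purl.org/dc/terms/> .
@prefix prov: <http://www.w3.org/ns/prov#> .

<>
  rdf:type owl:Ontology ;
  dcterms:title "{title}" ;
  dcterms:description "{description}" ;
  dcterms:license <{license_url}> ;
  dcterms:rights "{rights}" .
"""


def escape_literal(text):
    """Escape backslashes and double quotes for a Turtle string literal."""
    return text.replace("\\", "\\\\").replace('"', '\\"')


def render_header(title, description, license_url, rights):
    """Fill in the bracketed parts of the header template."""
    return HEADER_TEMPLATE.format(
        title=escape_literal(title),
        description=escape_literal(description),
        license_url=license_url,
        rights=escape_literal(rights),
    )


# Placeholder values for illustration only.
print(render_header(
    "Example Cohort",
    "Data dictionary for the Example Cohort",
    "https://creativecommons.org/licenses/by/4.0/",
    "CC-BY 4.0",
))
```

Save the output as metadata/[cohort short name].ttl.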
Before updating the Makefile, make sure you have done the following:
- Selected a short name & prefix
- Created a ROBOT template tab in the ROBOT templates Google sheet containing all data dictionary items
- Created a tab for the cohort in the master GECKO mappings sheet and added data dictionary items to this sheet
- Added entries in data/metadata.json and src/prefixes.json
- Created a TTL header in the metadata folder
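The file-based items on this checklist can be verified with a short script. This is a sketch under assumptions: `preflight` is a hypothetical helper, both JSON files are assumed to be single objects (keyed by full cohort name and by prefix, as in the templates), and both files are assumed to exist. It does not check the Google sheets.

```python
import json
from pathlib import Path


def preflight(repo_root, short_name):
    """Return a list of missing pieces for the given cohort short name."""
    root = Path(repo_root)
    problems = []

    # data/metadata.json: look for an entry whose "id" is the short name.
    metadata = json.loads((root / "data" / "metadata.json").read_text())
    entry = next(
        (item for item in metadata.values() if item.get("id") == short_name),
        None,
    )
    if entry is None:
        problems.append(f"no entry with id {short_name} in data/metadata.json")
    else:
        # src/prefixes.json: the cohort prefix must be defined there.
        prefixes = json.loads((root / "src" / "prefixes.json").read_text())
        if entry["prefix"] not in prefixes:
            problems.append(f"prefix {entry['prefix']} missing from src/prefixes.json")

    # metadata/[cohort short name].ttl: the Turtle header must exist.
    if not (root / "metadata" / f"{short_name}.ttl").is_file():
        problems.append(f"missing metadata/{short_name}.ttl")
    return problems
```

Run from the repository root, `preflight(".", "gcs")` returns an empty list when everything for the gcs cohort is in place.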
To add your cohort to the build, simply add the cohort short name to the list on line 38. Next, run make update to ensure all tasks complete properly. This should generate all build files for your cohort, and add your cohort to build/index.html. Open the index in your browser and check that all the links direct to the proper pages.
Please commit the following new files (do not commit anything in the build directory):
templates/[cohort short name].csv
metadata/[cohort short name].ttl
mappings/[cohort short name].csv
Also commit all changes to the following files:
Makefile
src/prefixes.json
data/metadata.json
data/cohort-data.json
Below are suggestions for common errors seen while running the build. If you run into any of these errors, try the recommended fixes and then run make update again. make update will always pull the latest data from Google sheets. You can also run make refresh to just update the data from Google sheets, and then run the Make command for the individual task that failed (e.g., make build/[cohort short name].owl).
- MALFORMED RULE ERROR malformed rule: Make sure that row 3 of your cohort's ROBOT template is empty (not the mapping template)
- UNKNOWN ENTITY ERROR could not interpret...: Make sure you have defined your prefix in src/prefixes.json, and that all rows use this prefix in the ID column
- make: *** No rule to make target 'metadata/[cohort short name].ttl': Make sure you have created the TTL header and saved it with the correct short name
- Makefile:[line]: *** missing separator: Make sure that your newly added Makefile tasks use tabs and not spaces (the tab character identifies something as a "rule" to make the target in a Makefile)
TODO
If you run into other errors while trying to add a new cohort, please open an issue and include the full stack trace from the error.
The code in this repository can be used under the Apache 2.0 License.
The data in this repository can be used under the CC-BY 4.0 License.
The IHCC automated mapping pipeline is responsible for generating the suggested GECKO categories for each term in a data dictionary. The pipeline has two major components:
- The mapping suggestion pipeline.
- The Zooma dataset generation pipeline
The mapping suggestion pipeline can be invoked as follows:
make all_mapping_suggest
For all currently registered templates, it will:
- Given a template, generate suggested GECKO categories based on existing mappings (using the basic mapping facility from ZOOMA, and some basic Natural Language Processing).
- Run some quality control checks that notify the IHCC data admin of oddities like identical terms being mapped to different GECKO categories.
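As an illustration of the kind of quality control check described above, the sketch below finds identical terms that are mapped to more than one GECKO category. It is not the pipeline's actual implementation; `find_conflicting_mappings` is a hypothetical helper, and "Category B" is a made-up category name used only for the example.

```python
from collections import defaultdict


def find_conflicting_mappings(mappings):
    """Return terms that are mapped to more than one category.

    `mappings` is an iterable of (term, category) pairs. The result maps
    each conflicting term to a sorted list of its categories.
    """
    categories_by_term = defaultdict(set)
    for term, category in mappings:
        categories_by_term[term].add(category)
    return {
        term: sorted(categories)
        for term, categories in categories_by_term.items()
        if len(categories) > 1
    }


mappings = [
    ("Date of Birth", "Age/birthdate"),
    ("Date of Birth", "Category B"),  # hypothetical conflicting mapping
    ("Food Intake 1", "Category B"),
]
print(find_conflicting_mappings(mappings))
```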
Some notebooks and more technical documentation can be found here.
Rather than simply loading all the data dictionaries into Zooma as is, we chose to load a slightly processed version of the existing mappings. For example, a data dictionary may contain a term like FoodIntake1; while we map that term directly, we also map a slightly processed version of the term by splitting numbers, strings and camel case: Food Intake 1. Zooma can handle capitalisation, so there was no need to do any lowercasing.
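The term processing described above can be approximated with two regular expressions. This is a minimal sketch of the idea, not the pipeline's own code; `split_term` is a hypothetical name.

```python
import re


def split_term(term):
    """Split camel case and separate letters from digits with spaces."""
    # Insert a space where a lowercase letter or digit meets an uppercase letter.
    spaced = re.sub(r"(?<=[a-z0-9])(?=[A-Z])", " ", term)
    # Insert a space at every boundary between letters and digits.
    spaced = re.sub(
        r"(?<=[A-Za-z])(?=[0-9])|(?<=[0-9])(?=[A-Za-z])", " ", spaced
    )
    return spaced


print(split_term("FoodIntake1"))  # prints "Food Intake 1"
```

Capitalisation is left unchanged, in line with the note that Zooma handles it.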
The Zooma dataset generation pipeline can be invoked as follows:
make data/ihcc-mapping-suggestions-zooma.tsv
Note, however, that this will never be necessary in isolation, because the pipeline is executed as part of the default data harmonization pipeline as well:
make all