James A. Overton
Rebecca Tauber
The IEDB Molecule Finder includes all species and ancestors from the organism tree, plus proteins from their reference proteomes.
This code is a revised version of the protein tree, designed to replace the legacy protein tree code.
It generates two different products:
- protein-tree.owl.gz: proteins used in IEDB linked to their NCBITaxon organism species
- molecule-tree.owl.gz: protein tree plus non-peptide tree
- a Unix system (Linux, macOS)
- GNU Make 3.81+
- Python 3.6 with the rdflib package
- Java 8 for running ROBOT
To make all products, run:
make all
This will generate the gzipped versions of both trees, and then clean up the intermediate build files.
To create the protein and molecule trees without removing the intermediate build files, use:
make trees
The process generates the protein tree from various tabular inputs and merges it with the legacy non-peptide tree to create the molecule tree. The following dependencies must be added manually:
- dependencies/parent_protein.tsv: assignments of proteins referenced in the IEDB and their parent proteomes
  - Used to generate dependencies/parent-proteins.csv
- dependencies/source_parent.tsv: assignments of all sources to reference proteins from the reference proteomes
  - Used to generate dependencies/source-parents.csv
- dependencies/proteomes.tsv: assignments of proteome species to their proteome IDs
The other dependencies are automatically retrieved with curl (to force an update, delete the file):
- dependencies/organism-tree.owl: nodes for all taxa used by the IEDB
- dependencies/subspecies-tree.owl: organism tree plus all ranks used by the IEDB
- dependencies/non-peptide-tree.owl: non-peptide molecular entities
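The "force update by deleting" behavior follows from fetching a file only when it is absent. Here is a minimal Python sketch of that pattern, offered as an illustration rather than the project's actual Makefile logic. The `ensure_dependency` helper and the `fetch` callback are hypothetical names; the real build uses curl.

```python
from pathlib import Path
from typing import Callable


def ensure_dependency(path: Path, fetch: Callable[[Path], None]) -> bool:
    """Fetch `path` only when it is missing; return True if a fetch happened.

    `fetch` is a hypothetical callback standing in for the curl download.
    """
    if path.exists():
        # An existing file is treated as up to date and left untouched.
        return False
    path.parent.mkdir(parents=True, exist_ok=True)
    fetch(path)
    return True
```

With this pattern, deleting dependencies/organism-tree.owl (for example) causes the next call to fetch a fresh copy, while an existing file is never overwritten.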
All products here are generated in the temp directory.
- organism-proteins.ttl: all classes from organism-tree.owl as proteins
- upper.ttl: top-level structure for proteins including 'protein' and 'material entity'
- source-synonyms.ttl: synonyms as annotations from source-parents.csv
- iedb-proteins.ttl: proteins from parent-proteins.csv as subclasses of their species protein
- ncbi-classes.tsv: all NCBITaxon classes used by IEDB proteins as proteome IDs
- included-classes.tsv: table of NCBITaxon species included in organism-proteins.ttl
- missing-classes.txt: list of NCBITaxon species NOT included in organism-proteins.ttl, but used in iedb-proteins.ttl
- taxon-proteins.owl: missing subspecies (in missing-classes.txt) filtered from subspecies-tree.owl
- merged.owl: combination of taxon-proteins.owl, upper.ttl, iedb-proteins.ttl, source-synonyms.ttl, and branches.owl.gz (see below)
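The relationship between ncbi-classes.tsv, included-classes.tsv, and missing-classes.txt amounts to a set difference. Here is a minimal sketch, assuming each table has already been reduced to a collection of NCBITaxon IDs. The function name and the example IDs are illustrative only, and the actual build scripts may compute this differently.

```python
from typing import Iterable, List


def find_missing_classes(used: Iterable[str], included: Iterable[str]) -> List[str]:
    """Return taxa used by IEDB proteins that are absent from the included set."""
    missing = set(used) - set(included)
    # Sort so the resulting list is stable and easy to diff between builds.
    return sorted(missing)


# Illustrative IDs only; real inputs would come from the temp directory tables.
used = ["NCBITaxon:9606", "NCBITaxon:10090", "NCBITaxon:11320"]
included = ["NCBITaxon:9606", "NCBITaxon:10090"]
print(find_missing_classes(used, included))
```

Sorting the result is a design choice made here so that repeated builds produce the same file contents when nothing has changed.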
The full branches.owl.gz file contains all species protein branches. These are built from their UniProt reference proteomes. This process works with any species that has a reference proteome, specified by dependencies/proteomes.tsv. This file is merged into the protein tree to include all details about a species' proteome.
Each time a new parent-protein table (dependencies/parent_protein.tsv) is used to run a build, a new proteome for a species in the organism tree is fetched only if that species' proteins in the parent-protein table have changed. If they have changed, an updated reference proteome is downloaded from UniProt and used to rebuild the branch node for that species.
After the build is over, the dependencies/parent-proteins.csv file is copied to dependencies/parent-proteins-last.csv. If this file does not exist, all proteomes will be re-downloaded. When a new parent-proteins table is generated or added, it is compared to the -last version to find differences in proteins.
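The comparison against the -last file could be sketched as follows. This is a hedged illustration, not the project's code: the column names `species` and `protein` and both function names are assumptions, since the layout of parent-proteins.csv is not described here.

```python
import csv
from collections import defaultdict
from pathlib import Path
from typing import Dict, Set


def proteins_by_species(path: Path) -> Dict[str, Set[str]]:
    """Group protein IDs by species, assuming 'species' and 'protein' columns."""
    groups: Dict[str, Set[str]] = defaultdict(set)
    with path.open(newline="") as handle:
        for row in csv.DictReader(handle):
            groups[row["species"]].add(row["protein"])
    return dict(groups)


def species_to_refresh(current: Path, last: Path) -> Set[str]:
    """Return the species whose protein sets differ from the previous build."""
    now = proteins_by_species(current)
    if not last.exists():
        # No previous table: every species needs its proteome downloaded.
        return set(now)
    before = proteins_by_species(last)
    # A species is refreshed when it is new or its set of proteins changed.
    return {species for species, proteins in now.items()
            if before.get(species) != proteins}
```

Comparing sets rather than raw lines means that reordering rows in the table would not trigger a download under this sketch; only added or removed proteins would.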