Add G groups, frequency data, and update protein sequence to #90

Draft
wants to merge 42 commits into
base: master
42 commits
55b2e52
First pass: Get G groups, get full CDS sequences, import population g…
apmody Mar 12, 2021
3cba443
Reworked parsing of hla.dat file, add template strings
apmody Mar 12, 2021
64d9499
Get appropriate exons for G group when MHC allele is class I (exon 2 …
apmody Mar 18, 2021
9ad56b0
Merge branch 'master' into G_group
apmody Mar 18, 2021
7ad3293
Added code to add terms to index.tsv and error report
apmody Mar 21, 2021
fa40809
Merge branch 'master' into G_group
apmody Mar 21, 2021
f6811b8
Changed template strings
apmody Mar 22, 2021
a9188d2
Added alleles which have partial sequence
apmody Mar 22, 2021
6b868b1
Organized gen_allele_update_seq.py into functions, Modified import.txt
apmody Mar 22, 2021
5b3e511
First pass at extracting population frequency data
apmody Mar 23, 2021
6be8051
Merge branch 'master' into G_group
apmody Mar 23, 2021
e5c5f30
Added code to verify accession numbers from frequency data and build/…
apmody Mar 24, 2021
5f5e776
Second pass adding population frequency data for chains, G groups, re…
apmody Mar 25, 2021
fb31e00
Add frequency of gene_alleles
apmody Mar 25, 2021
103ccb7
Added helper function for getting G group exons, removed requirement …
apmody Mar 29, 2021
919ae1c
Fix protein sequence to get best protein sequence
apmody Mar 30, 2021
87e938a
Fixed bug in update_gene_allele_seq to add IMGT accession
apmody Mar 30, 2021
7f7f8d2
Merge branch 'master' into G_group
apmody Mar 30, 2021
928988e
Reset chain-sequence.tsv, chain.tsv, molecule.tsv, index.tsv and upda…
apmody Apr 7, 2021
de416dd
Fix bugs in updating G groups, population frequency, and gene sequences
apmody Apr 7, 2021
98a88d7
Fix bug to add more frequency data for chains
apmody Apr 8, 2021
7354cd2
Merge branch 'master' into G_group
apmody Apr 8, 2021
ce7ea5f
Added NCIT terms for MHC genes.
apmody Apr 10, 2021
9ed1dab
Fixed makefile, header on ontology templates, added definition source
apmody Apr 15, 2021
42b6f99
Moved external_ncit.tsv to external-ncit.tsv
apmody Apr 15, 2021
0b5af01
Update chain-frequencies.tsv and gene-alleles.tsv ontology tables
apmody Apr 15, 2021
f3c590a
Change table name in Makefile, added allele information
apmody Apr 27, 2021
f43c6f6
Merge branch 'master' into G_group
apmody Jun 25, 2021
e01eb6f
Drop repeated two-field entries. Add World instead of Total populatio…
apmody Jul 8, 2021
49c7dd5
Update HLA chains and and subclass locus for G group
apmody Jul 9, 2021
1845f64
Merge branch 'master' into G_group
apmody Jul 9, 2021
d8be297
Add updated index.tsv
apmody Jul 9, 2021
8cb045f
Merge branch 'master' into G_group
apmody Jul 9, 2021
18b5d37
Update Makefile for update_gene_alleles_seq.py
apmody Jul 10, 2021
40340b5
Remove external-obi.tsv from assign-ids.py. Fixed locus column for ge…
apmody Jul 14, 2021
a8c9bc3
Fixed prefixes.sql
apmody Jul 14, 2021
12c72a2
Delete external-obi.tsv
apmody Jul 14, 2021
6b0dfcb
Rename ontology/allele-information.tsv to gene-allele.tsv
apmody Jul 14, 2021
af968b2
Removed allele information stuff and put under gene-allele
apmody Jul 14, 2021
ae10487
Fixed index.tsv true value in obsolete column
apmody Jul 14, 2021
3953b42
Moved frequency-properties.tsv to properties.tsv
apmody Jul 26, 2021
43ddaa1
Add tables for mro.xlsx target
apmody Jul 26, 2021
43 changes: 35 additions & 8 deletions Makefile
@@ -48,7 +48,7 @@ LIB = lib
ROBOT := java -jar build/robot.jar
TODAY := $(shell date +%Y-%m-%d)

-tables = external core genetic-locus haplotype serotype chain molecule haplotype-molecule serotype-molecule mutant-molecule evidence chain-sequence
+tables = external core genetic-locus haplotype serotype chain molecule haplotype-molecule serotype-molecule mutant-molecule evidence chain-sequence gene-allele G-group gene-alleles properties chain-frequencies G-group-frequencies gene-allele-frequencies gene-allele
source_files = $(foreach o,$(tables),ontology/$(o).tsv)
build_files = $(foreach o,$(tables),build/$(o).tsv)
templates = $(foreach i,$(build_files),--template $(i))
@@ -99,7 +99,7 @@ load: $(COGS_SHEETS)
push: | .cogs
cogs push

-.PHONY: destroy
+.PHONY: destroy
destroy: | .cogs
cogs delete -f

@@ -177,7 +177,7 @@ build/mutant-molecule.tsv: src/scripts/synonyms.py build/mutant-molecule-fixed.t
python3 $^ > $@

# Represent tables in Excel
-mro.xlsx: src/scripts/tsv2xlsx.py index.tsv iedb/iedb.tsv ontology/genetic-locus.tsv ontology/haplotype.tsv ontology/serotype.tsv ontology/chain.tsv ontology/chain-sequence.tsv ontology/molecule.tsv ontology/haplotype-molecule.tsv ontology/serotype-molecule.tsv ontology/mutant-molecule.tsv ontology/core.tsv ontology/external.tsv iedb/iedb-manual.tsv ontology/evidence.tsv ontology/rejected.tsv
+mro.xlsx: src/scripts/tsv2xlsx.py index.tsv iedb/iedb.tsv ontology/genetic-locus.tsv ontology/haplotype.tsv ontology/serotype.tsv ontology/chain.tsv ontology/chain-sequence.tsv ontology/molecule.tsv ontology/haplotype-molecule.tsv ontology/serotype-molecule.tsv ontology/mutant-molecule.tsv ontology/core.tsv ontology/external.tsv iedb/iedb-manual.tsv ontology/evidence.tsv ontology/rejected.tsv ontology/G-group.tsv ontology/gene-allele.tsv ontology/gene-alleles.tsv ontology/properties.tsv ontology/chain-frequencies.tsv ontology/G-group-frequencies.tsv ontology/gene-allele-frequencies.tsv
python3 $< $@ $(wordlist 2,100,$^)

update-tsv: update-tsv-files sort build/whitespace.tsv
@@ -201,6 +201,8 @@ sort:
build/whitespace.tsv: src/scripts/validation/detect_whitespace.py index.tsv iedb/iedb.tsv iedb/iedb-manual.tsv $(source_files)
python3 $^ $@

build/HLA-%-frequency.xlsx: | build
curl -o $@ -L "https://s3.eu-central-1.amazonaws.com/ihiw.website.data/CIWD-3.0/HLA-$*_PrimaryData-IHWS-20200320.xlsx"
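
The pattern rule above builds one download URL per locus: the stem matched by `%` in the target name is substituted for `$*` in the URL. As a rough sketch of that expansion (no download is performed), the loop below prints the target and URL for each locus listed in the `add-frequency-data` prerequisites. The helper name `frequency_url` is made up for illustration and is not part of the PR.

```shell
# Hypothetical helper: $1 plays the role of the pattern stem ($* in the Makefile rule).
frequency_url() {
  printf 'https://s3.eu-central-1.amazonaws.com/ihiw.website.data/CIWD-3.0/HLA-%s_PrimaryData-IHWS-20200320.xlsx\n' "$1"
}

# Loci taken from the add-frequency-data prerequisites in this diff.
for locus in A B C DRB1 DRB3 DRB4 DRB5 DQB1 DPB1; do
  printf 'build/HLA-%s-frequency.xlsx <- %s\n' "$locus" "$(frequency_url "$locus")"
done
```

Because the rule is a pattern rule, adding another locus should only require listing another `build/HLA-<locus>-frequency.xlsx` prerequisite, assuming the upstream file follows the same naming scheme.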

### Sequences

@@ -212,6 +214,15 @@ build/hla.fasta: | build
build/mhc.fasta: | build
curl -L -o $@ ftp://ftp.ebi.ac.uk/pub/databases/ipd/mhc/MHC_prot.fasta

build/hla.dat: | build
curl -o $@ -L https://github.com/ANHIG/IMGTHLA/raw/Latest/hla.dat

build/hla1.dat: | build
curl -o $@ -L https://raw.githubusercontent.com/ANHIG/IMGTHLA/3310/hla.dat

build/hla_nom_g.txt: | build
curl -o $@ -L https://github.com/ANHIG/IMGTHLA/raw/Latest/wmda/hla_nom_g.txt

# update-seqs will only write seqs to terms without seqs
.PHONY: update-seqs
update-seqs: src/scripts/update_seqs.py ontology/chain-sequence.tsv build/hla.fasta build/mhc.fasta
@@ -228,6 +239,17 @@ build/hla_prot.fasta: | build
build/AlleleList.txt: | build
curl -o $@ -L https://raw.githubusercontent.com/ANHIG/IMGTHLA/Latest/Allelelist.txt

src/dbfetch.py:
curl -o $@ -L https://raw.githubusercontent.com/ebi-wp/webservice-clients/master/python/dbfetch.py

.PHONY: update-G-groups
update-G-groups: build/hla.dat build/hla_nom_g.txt ontology/chain-sequence.tsv
python3 src/scripts/alleles/update_gene_allele_seq.py -u

.PHONY: add-frequency-data
add-frequency-data: src/dbfetch.py build/hla1.dat ontology/G-group.tsv ontology/gene-alleles.tsv build/report-g-grp.json build/HLA-A-frequency.xlsx build/HLA-B-frequency.xlsx build/HLA-C-frequency.xlsx build/HLA-DRB1-frequency.xlsx build/HLA-DRB3-frequency.xlsx build/HLA-DRB4-frequency.xlsx build/HLA-DRB5-frequency.xlsx build/HLA-DQB1-frequency.xlsx build/HLA-DPB1-frequency.xlsx
python3 src/scripts/alleles/update_gene_allele_seq.py -f

.PHONY: update-alleles
update-alleles: src/scripts/alleles/update_human_alleles.py ontology/chain-sequence.tsv ontology/chain.tsv ontology/molecule.tsv ontology/genetic-locus.tsv index.tsv build/hla_prot.fasta build/AlleleList.txt
python3 $^
@@ -250,12 +272,12 @@ update-sla-alleles: src/scripts/alleles/update_sla_alleles.py ontology/chain-seq


### OWL Files

mro.owl: build/mro-import.owl index.tsv $(build_files) ontology/metadata.ttl | build/robot.jar
$(ROBOT) template \
--input $< \
--prefix "MRO: $(OBO)/MRO_" \
--prefix "REO: $(OBO)/REO_" \
--prefix "NCIT: $(OBO)/NCIT_" \
--template index.tsv \
$(templates) \
--merge-before \
@@ -269,12 +291,13 @@ mro.owl: build/mro-import.owl index.tsv $(build_files) ontology/metadata.ttl | b
--annotation-file ontology/metadata.ttl \
--output $@

-build/mro-import.owl: build/eco-import.ttl build/iao-import.ttl build/obi-import.ttl build/ro-import.ttl ontology/import.txt | build/robot.jar
+build/mro-import.owl: build/eco-import.ttl build/iao-import.ttl build/obi-import.ttl build/ro-import.ttl build/hancestro-import.ttl ontology/import.txt | build/robot.jar
$(ROBOT) merge \
--input build/eco-import.ttl \
--input build/obi-import.ttl \
--input build/ro-import.ttl \
--input build/iao-import.ttl \
--input build/hancestro-import.ttl \
extract \
--method MIREOT \
--upper-term "GO:0008150" \
@@ -283,14 +306,19 @@ build/mro-import.owl: build/eco-import.ttl build/iao-import.ttl build/obi-import
--upper-term "ECO:0000000" \
--upper-term "BFO:0000040" \
--upper-term "PR:000000001" \
---lower-terms $(word 5,$^) \
+--lower-terms $(word 6,$^) \
--output $@
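
The `--lower-terms` index moves from 5 to 6 because `build/hancestro-import.ttl` is inserted into the prerequisite list ahead of `ontology/import.txt`, and `$(word n,$^)` counts prerequisites from 1 (order-only prerequisites after `|` are not part of `$^`). The sketch below imitates that lookup in plain shell to show which file each index selects. `nth_word` is a hypothetical helper, not something defined in the repository.

```shell
# Hypothetical stand-in for make's 1-indexed $(word n,list).
nth_word() {
  n=$1
  shift                # drop n itself
  shift $((n - 1))     # skip the first n-1 words
  printf '%s\n' "$1"
}

# Prerequisite lists copied from the old and new rule in this diff.
old_prereqs="build/eco-import.ttl build/iao-import.ttl build/obi-import.ttl build/ro-import.ttl ontology/import.txt"
new_prereqs="build/eco-import.ttl build/iao-import.ttl build/obi-import.ttl build/ro-import.ttl build/hancestro-import.ttl ontology/import.txt"

# Unquoted on purpose so the shell splits each list into words.
echo "old word 5: $(nth_word 5 $old_prereqs)"
echo "new word 5: $(nth_word 5 $new_prereqs)"
echo "new word 6: $(nth_word 6 $new_prereqs)"
```

A positional index like this has to be updated every time a prerequisite is added before `ontology/import.txt`. Referring to the file by name would avoid that, though whether to change it is a call for the maintainers.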

# fetch ontology dependencies
$(LIB)/%:
mkdir -p $(LIB)
cd $(LIB) && curl -LO "$(OBO)/$*"

$(LIB)/ro.owl: build/robot.jar
mkdir -p $(LIB)
cd $(LIB) && curl -LO "$(OBO)/ro.owl"
$(ROBOT) merge --input $@ --output $@

UC = $(shell echo '$1' | tr '[:lower:]' '[:upper:]')

# OBI IAO:0000115 has mulitples so get the definiton from here
@@ -301,7 +329,7 @@ build/%.txt: ontology/import.txt | build
# RO:0000056 isn't in RO?
# we could also just add this to index.tsv
build/obi.txt: ontology/import.txt | build
-sed '/^ECO/d' $< | sed '/^RO/d' | sed '/^IAO/d' > $@
+sed '/^ECO/d' $< | sed '/^RO/d' | sed '/^IAO/d' | sed '/^HANCESTRO/d' | sed '/^GSSO/d' | sed '/^NCIT/d' | sed '/^IDO/d' > $@
echo "RO:0000056" >> $@
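
The chained `sed` calls delete every line of `ontology/import.txt` that starts with one of the listed prefixes, so only the remaining terms (plus the appended `RO:0000056`) end up in `build/obi.txt`. A single extended-regex expression should behave the same way. This is a minimal sketch: the sample lines are invented placeholders, since the real contents of `import.txt` are not shown in this diff, and `filter_obi_terms` is a hypothetical name.

```shell
# Hypothetical helper: one sed invocation in place of the seven chained ones.
filter_obi_terms() {
  # Delete any line beginning with one of the excluded prefixes.
  sed -E '/^(ECO|RO|IAO|HANCESTRO|GSSO|NCIT|IDO)/d'
}

# Invented sample input; IDs and labels are placeholders, not real import.txt rows.
sample='OBI:0000001 placeholder kept
ECO:0000001 placeholder dropped
RO:0000001 placeholder dropped
IAO:0000001 placeholder dropped
HANCESTRO:0001 placeholder dropped
GSSO:000001 placeholder dropped
NCIT:C00001 placeholder dropped
IDO:0000001 placeholder dropped'

printf '%s\n' "$sample" | filter_obi_terms
```

Both forms match on a bare prefix without the trailing colon, so a future prefix that merely begins with `RO` or `IDO` would also be filtered out. Anchoring on `RO:` and so on would be stricter.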

build/%.db: src/queries/prefixes.sql $(LIB)/%.owl | build/rdftab
@@ -439,7 +467,6 @@ prepare: update-seqs
prepare: update-iedb
prepare:
pip install -r requirements.txt

.PHONY: clean
clean:
rm -rf mro.owl