Add G domain (peptide binding region) sequence to HLA chain terms. #109

apmody · 2021-07-22T20:20:31Z

Add G domain sequence for HLA molecules.

beckyjackson

With a few minor fixes, this builds (see specific comments). I think a better table name might be chain-g-domain, but that's not really necessary to change. This data could actually go in chain-sequence instead of a new table, but if we're only doing HLA G domains, then it might be unnecessary.

The ID for the new property you created will need to be updated, since there have been updates to MRO since this PR was made.

Finally, running make update-G-domain logs some warnings from biopython. These might be worth investigating:

/Users/jackson/github/MRO/_venv/lib/python3.8/site-packages/Bio/Seq.py:2334: BiopythonWarning: Partial codon, len(sequence) not a multiple of three. Explicitly trim the sequence or add trailing N before translation. This may become an error in future.
  warnings.warn(
/Users/jackson/github/MRO/_venv/lib/python3.8/site-packages/Bio/GenBank/Scanner.py:681: BiopythonParserWarning: EMBL sequence line missing coordinates
  warnings.warn(
['HLA00938',
 SeqRecord(seq=Seq('X'), id='HLA00022.1', name='HLA00022', description='HLA-A*02:17:01, Human MHC Class I sequence Sequence has now been shown to be in error and is identical to A*02:17:02:01 (August 2018)', dbxrefs=[]),
 SeqRecord(seq=Seq('X'), id='HLA02459.1', name='HLA02459', description='HLA-C*07:37:01:01, Human MHC Class I sequence Sequence has now been shown to be in error and is identical to C*07:37:01:02 (October 2020)', dbxrefs=[]),
 SeqRecord(seq=Seq('X'), id='HLA00867.1', name='HLA00867', description='HLA-DRB1*15:02:01:01, Human MHC Class II sequence Sequence has now been shown to be in error and is identical to DRB1*15:02:01:02 (March 2021)', dbxrefs=[]),
 SeqRecord(seq=Seq('X'), id='HLA24174.1', name='HLA24174', description='HLA-DQA1*02:06, Human MHC Class II sequence Sequence has now been shown to be in error and is identical to DQA1*02:01:01:01 (September 2020)', dbxrefs=[]),
 SeqRecord(seq=Seq('X'), id='HLA00624.1', name='HLA00624', description='HLA-DQB1*02:03:01, Human MHC Class II sequence Sequence extended and renamed DQB1*02:180 (November 2020)', dbxrefs=[]),
 SeqRecord(seq=Seq('X'), id='HLA09735.1', name='HLA09735', description='HLA-DQB1*06:92:01, Human MHC Class II sequence Sequence extended and renamed DQB1*06:385 (November 2020)', dbxrefs=[]),
 SeqRecord(seq=Seq('X'), id='HLA00508.1', name='HLA00508', description='HLA-DPA1*02:02:01, Human MHC Class II sequence Sequence has now been shown to be in error and is identical to DPA1*02:07:01:01 (March 2017)', dbxrefs=[])]
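
The first warning above suggests trimming the sequence explicitly before translation. A minimal sketch of that step in plain Python, assuming the coding sequence is available as a string. `trim_to_codons` is a hypothetical helper, not something taken from the PR's script, and whether dropping the trailing bases is appropriate depends on why the sequence is partial in the first place:

```python
def trim_to_codons(nucleotides: str) -> str:
    """Drop trailing bases so the length is a multiple of three."""
    remainder = len(nucleotides) % 3
    # When remainder is 0 the slice covers the whole string unchanged.
    return nucleotides[: len(nucleotides) - remainder]
```

The trimmed string could then be wrapped in a Biopython `Seq` and translated without triggering the partial-codon warning.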

beckyjackson · 2021-07-27T18:50:09Z

src/scripts/alleles/G_domain.py

+with open("ontology/G-domain-sequence.tsv", "w") as fh:
+    writer = csv.DictWriter(fh, fieldnames = G_domains[0].keys(), delimiter = "\t")
+    writer.writeheader()
+    fh.write("LABEL\tA minimal G domain sequence")


This needs a \n at the end of this line, otherwise your first row ends up on the same line as the ROBOT template strings.

This should also match the label in ontology/properties.tsv, which is "minimal HLA G domain sequence".
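
One way to avoid the missing newline altogether is to write the ROBOT template row through the same `csv.DictWriter`, which appends the line terminator itself. This is a sketch under assumptions, not the PR's code: the column names follow the label mentioned above, `write_template` is a hypothetical function, and the example data row is made up for illustration.

```python
import csv
import io

FIELDS = ["Label", "minimal HLA G domain sequence"]
# Second row of a ROBOT template: one template string per column.
TEMPLATE_ROW = {
    "Label": "LABEL",
    "minimal HLA G domain sequence": "A minimal HLA G domain sequence",
}


def write_template(fh, rows):
    """Write header, ROBOT template row, then data rows, one per line."""
    writer = csv.DictWriter(
        fh, fieldnames=FIELDS, delimiter="\t", lineterminator="\n"
    )
    writer.writeheader()
    writer.writerow(TEMPLATE_ROW)  # terminator is added by the writer
    writer.writerows(rows)


# Hypothetical data row, written to an in-memory buffer for demonstration.
buf = io.StringIO()
write_template(
    buf,
    [{"Label": "example chain", "minimal HLA G domain sequence": "SHSMRY"}],
)
```

Because every row goes through the writer, the template strings and the first data row cannot end up on the same line.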

beckyjackson · 2021-07-27T18:50:26Z

Makefile

@@ -232,6 +235,11 @@ build/AlleleList.txt: | build
 update-alleles: src/scripts/alleles/update_human_alleles.py ontology/chain-sequence.tsv ontology/chain.tsv ontology/molecule.tsv ontology/genetic-locus.tsv index.tsv build/hla_prot.fasta build/AlleleList.txt
 	python3 $^

+.PHONY: update-G-domiain


Should be update-G-domain, also remove the blank line after this please

rvita · 2021-07-27T19:20:34Z

i thought this info was going in chain-sequence as discussed in the ontology mtg


-- Randi Vita, M.D. Lead Ontology and Quality Manager Immune Epitope Database and Analysis Project La Jolla Institute for Allergy & Immunology 9420 Athena Circle La Jolla, Ca 92037 ***@***.*** www.immuneepitope.org 858-752-6912

jamesaoverton · 2021-08-03T19:32:26Z

#112 has been merged. @apmody please update this PR from master and address all the requested changes.

beckyjackson

Please add the G domain sequences to the chain-sequence sheet instead of a new sheet, as Randi requested. Your script should update this sheet as well.
Do we need a properties sheet? If we put the "minimal HLA G domain sequence" in the index, it will have its label, but it just won't have a definition. It's one less sheet to keep track of and I would prefer not to add more sheets, but I'd like to get @jamesaoverton's opinion. EDIT: Please remove the properties sheet.
The make update-G-domain process now hangs for me. I can't kill it with ctrl+C either (removing pdb.set_trace() resolves this - pdb should be removed from this anyway):

python3 src/scripts/alleles/G_domain.py
> /Users/jackson/github/MRO/src/scripts/alleles/G_domain.py(161)<module>()
-> with open("biopython.log", "r") as log_file:
(Pdb) 
(Pdb) --KeyboardInterrupt--

Running make update-G-domain completely changes the G-domain-sequence sheet, so you can't tell what has actually changed when running git diff. The diff should only show changes, not line reordering, line terminator changes, whitespace, etc... I would expect running this to not result in any changes right now since it should be up-to-date. I left a comment in the script that should resolve the line terminator issue, but I still see reordering of the lines when I run it with this added.
I can't tell if the other warnings from my original review were resolved because there's too much going on in the log file. I think a lot of this logging is unnecessary.

beckyjackson · 2021-08-05T14:08:39Z

src/scripts/alleles/G_domain.py

+    "DRB2"
+}
+
+logging.basicConfig(filename = 'biopython.log', filemode = 'w', level = logging.DEBUG )


This log file should probably go in the build directory. This will log everything from both Biopython and this script, so biopython.log is not the most informative name.

Is there a reason you're using a log file instead of logging to the console? Looking at the log file, there is a LOT of stuff and it's hard to tell what is relevant. For example, I see a lot of:

WARNING:py.warnings:/Users/jackson/opt/anaconda3/lib/python3.8/site-packages/Bio/GenBank/Scanner.py:681: BiopythonParserWarning: EMBL sequence line missing coordinates warnings.warn(

There are also almost 50,000 lines in the log file.

apmody

Sorry, I should have mentioned this before. This warning occurs while parsing the build/hla.dat file, specifically when an entry has no nucleotide sequence, like here. I spent some time trying to find a way around it, but I couldn't, because the problem is in how Biopython parses the file. The rest of the log lines are only there to help with debugging (they show which entry in build/hla.dat is causing problems). I am pushing an update to fix this.

beckyjackson · 2021-08-05T14:16:12Z

src/scripts/alleles/G_domain.py

+            excluded_sequence.append(entry)
+    logging.info(entry.name + " ending")
+
+import pdb; pdb.set_trace()


Please remove pdb

beckyjackson · 2021-08-05T14:26:25Z

src/scripts/alleles/G_domain.py

+
+import csv
+with open("ontology/G-domain-sequence.tsv", "w") as fh:
+    writer = csv.DictWriter(fh, fieldnames = G_domains[0].keys(), delimiter = "\t")


Please include lineterminator="\n" in your writer to prevent spurious diffs.

beckyjackson · 2021-08-05T14:26:58Z

src/scripts/alleles/G_domain.py

+                    else:
+                        raise Exception("Warning other than BiopythonParserWarning: ", log_line)
+
+import csv


Imports should be included at the top of the file.

jamesaoverton · 2021-08-05T14:28:34Z

We have other MRO-specific annotation properties in https://github.com/IEDB/MRO/blob/master/ontology/core.tsv. I don't think we need a new file for properties.

beckyjackson

This is still completely changing ontology/chain-sequence.tsv when I run make update-G-domain. It looks like any lines without a G domain sequence are missing a trailing tab, which gets added when I run the script. Let's make sure that this file is clean before merging this.
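
The trailing-tab effect described above can be reproduced with a plain read-then-write round trip through the `csv` module. The sketch below uses made-up column names and data, not the real contents of ontology/chain-sequence.tsv: a row that is one field short is read back with `None` for the missing column, and `DictWriter` writes that as an empty field, which adds the tab.

```python
import csv
import io

# Hypothetical input: the data row is one field short (no trailing tab).
original = (
    "Label\tSequence\tminimal HLA G domain sequence\n"
    "example chain\tMAVM\n"
)

reader = csv.DictReader(io.StringIO(original), delimiter="\t")
rows = list(reader)  # the missing column is filled with None

out = io.StringIO()
writer = csv.DictWriter(
    out, fieldnames=reader.fieldnames, delimiter="\t", lineterminator="\n"
)
writer.writeheader()
writer.writerows(rows)  # None is written as an empty field

round_tripped = out.getvalue()
```

Here `round_tripped` differs from `original` only by the added tab, so committing the file once in its normalized form should keep later runs from producing that diff.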

When I look at the log file now, I see many errors, for example:

ERROR:root:HLA04761, SHSMRYFFTSVSRPGRGEPRFIAVGYVDDTQFVRFDSDAASQKMEPRAPWIEQEGPEYWDQETRNMKAHSQTDRANLGTLRGYYNQSEDGSHTIQIMYGCDVGPDGRFLRGYRQDAYDGKDYIALNEDLRS*TAADMAAQITKRKWEAVHAAEQRRVYLEGRCVDGLRRYLENGKETLQRT, SHSMRYFFTSVSRPGRGEPRFIAVGYVDDTQFVRFDSDAASQKMEPRAPWIEQEGPEYWDQETRNMKAHSQTDRANLGTLRGYYNQSEDGSHTIQIMYGCDVGPDGRFLRGYRQDAYDGKDYIALNEDLRSX not matched

I see this comes from your script - are these true errors or should they just be warnings?

Otherwise, I just requested two small changes in the script.

beckyjackson · 2021-08-09T20:24:01Z

src/scripts/alleles/G_domain.py

+
+#logging.basicConfig(filename = 'build/biopython.log', filemode = 'w', level = logging.DEBUG )
+#logging.basicConfig(filename = 'biopython.log', filemode = 'w', level = logging.INFO )
+logging.basicConfig(filename = 'biopython.log', filemode = 'w', level = logging.WARNING )


This still needs to go in the build/ directory. Further down, you open the log build/biopython.log, which causes a FileNotFoundError.
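
A mismatch like this is easier to avoid when the log path is computed once and reused for both writing and reading. A minimal sketch, assuming Python 3.8 or later (for `force=True`). `configure_logging` and the file name `G_domain.log` are hypothetical choices, not names from the PR:

```python
import logging
from pathlib import Path


def configure_logging(build_dir: Path, name: str = "G_domain.log") -> Path:
    """Send WARNING and above to a log file inside build_dir; return its path."""
    build_dir.mkdir(parents=True, exist_ok=True)
    log_path = build_dir / name
    logging.basicConfig(
        filename=log_path,
        filemode="w",
        level=logging.WARNING,
        force=True,  # replace any handlers configured earlier
    )
    return log_path
```

The script would call this once near the top, for example with `Path("build")`, and pass the returned path to whatever code later opens the log for reading.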

beckyjackson · 2021-08-09T20:30:18Z

src/scripts/alleles/G_domain.py

+    updated = []
+    reader = csv.DictReader(chain_sequence, delimiter = "\t")
+    robot_string = next(reader)
+    robot_string["minimal HLA G domain sequence"] = "A minimal HLA G domain sequence"


This can be removed, since it's part of the ROBOT template strings now.

beckyjackson

The code runs fine and I believe it does what's expected. I did not review the additions to the chain-sequence sheet though. I don't want to approve this until we know that the content is correct.

jamesaoverton · 2021-08-12T19:05:46Z

Ok, thanks @beckyjackson. I'll review next.

jamesaoverton · 2021-09-30T20:31:32Z

@apmody I'd appreciate it if you can update this branch from master and resolve the merge conflicts.

beckyjackson

I don't think the code has changed since I last reviewed this, but I re-ran it to double check. The chain-sequence file looks good and no spurious diffs were introduced.

I'm still seeing a ton of warnings:

WARNING:py.warnings:/Users/jackson/opt/anaconda3/lib/python3.8/site-packages/Bio/GenBank/Scanner.py:681: BiopythonParserWarning: EMBL sequence line missing coordinates
  warnings.warn(

I know Biopython can be noisy. I'm not sure if this is an important warning or not (I haven't looked into it), but if it's not, can we suppress it? I guess this request isn't a big deal.
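
If the warning turns out to be harmless, the standard `warnings` module can silence just that message while leaving others visible. This is a sketch rather than the script's actual approach. `call_quietly` is a hypothetical wrapper, and the filter matches on the message text shown in the log above:

```python
import warnings

NOISY_MESSAGE = "EMBL sequence line missing coordinates"


def call_quietly(func, *args, **kwargs):
    """Run func with the known-noisy warning suppressed; others still surface."""
    with warnings.catch_warnings():
        # The message argument is a regex matched against the start of the text.
        warnings.filterwarnings("ignore", message=NOISY_MESSAGE)
        return func(*args, **kwargs)
```

The whole parsing loop would need to run inside the wrapped call, since a lazy parser emits its warnings while it is being iterated, not when it is created.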

jamesaoverton

I'm seeing just 1000 out of 24,000+ lines that should be in chain-sequence.tsv. There are no HLAs, and so the 'minimal HLA G domain sequence' column is always empty. That can't be right.
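
A problem like this could be caught before review with a small check on the generated table. A minimal sketch, assuming a tab-separated file with a header row. `column_coverage` is a hypothetical helper, and any threshold to compare against is left to the caller:

```python
import csv
import io


def column_coverage(tsv_text: str, column: str):
    """Return (rows after the header, rows with a non-empty value in column)."""
    reader = csv.DictReader(io.StringIO(tsv_text), delimiter="\t")
    total = filled = 0
    for row in reader:
        total += 1
        # Short rows yield None for missing columns; treat that as empty.
        if (row.get(column) or "").strip():
            filled += 1
    return total, filled
```

Note that the ROBOT template row is counted like any other row, so the totals are one higher than the number of terms.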

apmody added 6 commits July 20, 2021 13:20

First pass extracting G_domain

fe83cb8

Correction of splicing in exon 1. Write to ontology template.

920f9bf

Add ontology templates.

da9faa9

Update Makefile with new templates and rules.

c4bab34

Add ROBOT template string and term to index.tsv

bb3a53e

Add Biopython to requirements.txt

41b2f5c

apmody requested review from rvita and jamesaoverton July 22, 2021 20:33

beckyjackson requested changes Jul 27, 2021


apmody added 2 commits July 29, 2021 22:20

Fixed Makefile spelling mistake and minimal HLA G domain

bb5851c

Added Excluded Genes and logging messages

0ea37fb

apmody added 8 commits August 4, 2021 10:26

Refined logging of Biopython warnings

bdce291

Merge branch 'master' into G_domain

e5fb23d

Parse log file

c579246

Update index.tsv and properties.tsv with term G domain term

e347cd5

Add G domain to chain-sequence.tsv, first pass

e97e83b

Add G domain to chain-sequence.tsv, second pass

22e5acc

Add G domain sequence to chain-sequence.tsv

a1c80c5

Removed exclusion of null alleles

7e083e1

jamesaoverton requested a review from beckyjackson August 5, 2021 13:43

beckyjackson requested changes Aug 5, 2021


beckyjackson reviewed Aug 5, 2021


apmody added 4 commits August 5, 2021 09:41

Remove G-domain-sequence.tsv

d17f958

Removed properties.tsv

a40500f

Changed logging level, moved import csv to top of file.

d9771a4

Removed properties.tsv from Makefile

ce5a9c1

Double import for csv file

4410ae2

apmody requested a review from beckyjackson August 5, 2021 17:56

beckyjackson requested changes Aug 9, 2021


Script changes and trailing tab endings fix

0cfc6fe

apmody requested a review from beckyjackson August 11, 2021 22:11

beckyjackson reviewed Aug 12, 2021


apmody added 3 commits October 7, 2021 11:44

Merge branch 'master' into G_domain

6839760

Reupdated G domain in ontology/chain-sequence

616f956

Added comments to G domain script.

40bccc5

jamesaoverton requested a review from beckyjackson October 14, 2021 18:13

beckyjackson reviewed Oct 15, 2021


jamesaoverton requested changes Oct 26, 2021

