Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add G domain (peptide biding region) sequence to HLA chain terms. #109

Open
wants to merge 25 commits into
base: master
Choose a base branch
from

Conversation

apmody
Copy link

@apmody apmody commented Jul 22, 2021

Add G domain sequence for HLA molecules.

@apmody apmody requested review from rvita and jamesaoverton July 22, 2021 20:33
Copy link
Collaborator

@beckyjackson beckyjackson left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

With a few minor fixes, this builds (see specific comments). I think a better table name might be chain-g-domain, but that's not really necessary to change. This data could actually go in chain-sequence instead of a new table, but if we're only doing HLA G domains, then it might be unnecessary.

The ID for the new property you created will need to be updated since there's been updates to MRO since this PR was made.

Finally, running make update-G-domain logs some warnings from biopython. These might be worth investigating:

/Users/jackson/github/MRO/_venv/lib/python3.8/site-packages/Bio/Seq.py:2334: BiopythonWarning: Partial codon, len(sequence) not a multiple of three. Explicitly trim the sequence or add trailing N before translation. This may become an error in future.
  warnings.warn(
/Users/jackson/github/MRO/_venv/lib/python3.8/site-packages/Bio/GenBank/Scanner.py:681: BiopythonParserWarning: EMBL sequence line missing coordinates
  warnings.warn(
['HLA00938',
 SeqRecord(seq=Seq('X'), id='HLA00022.1', name='HLA00022', description='HLA-A*02:17:01, Human MHC Class I sequence Sequence has now been shown to be in error and is identical to A*02:17:02:01 (August 2018)', dbxrefs=[]),
 SeqRecord(seq=Seq('X'), id='HLA02459.1', name='HLA02459', description='HLA-C*07:37:01:01, Human MHC Class I sequence Sequence has now been shown to be in error and is identical to C*07:37:01:02 (October 2020)', dbxrefs=[]),
 SeqRecord(seq=Seq('X'), id='HLA00867.1', name='HLA00867', description='HLA-DRB1*15:02:01:01, Human MHC Class II sequence Sequence has now been shown to be in error and is identical to DRB1*15:02:01:02 (March 2021)', dbxrefs=[]),
 SeqRecord(seq=Seq('X'), id='HLA24174.1', name='HLA24174', description='HLA-DQA1*02:06, Human MHC Class II sequence Sequence has now been shown to be in error and is identical to DQA1*02:01:01:01 (September 2020)', dbxrefs=[]),
 SeqRecord(seq=Seq('X'), id='HLA00624.1', name='HLA00624', description='HLA-DQB1*02:03:01, Human MHC Class II sequence Sequence extended and renamed DQB1*02:180 (November 2020)', dbxrefs=[]),
 SeqRecord(seq=Seq('X'), id='HLA09735.1', name='HLA09735', description='HLA-DQB1*06:92:01, Human MHC Class II sequence Sequence extended and renamed DQB1*06:385 (November 2020)', dbxrefs=[]),
 SeqRecord(seq=Seq('X'), id='HLA00508.1', name='HLA00508', description='HLA-DPA1*02:02:01, Human MHC Class II sequence Sequence has now been shown to be in error and is identical to DPA1*02:07:01:01 (March 2017)', dbxrefs=[])]

with open("ontology/G-domain-sequence.tsv", "w") as fh:
writer = csv.DictWriter(fh, fieldnames = G_domains[0].keys(), delimiter = "\t")
writer.writeheader()
fh.write("LABEL\tA minimal G domain sequence")
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This needs a \n at the end of this line, otherwise your first row ends up on the same line as the ROBOT template strings.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This should also match the label in ontology/properties.tsv, which is "minimal HLA G domain sequence".

Makefile Outdated
@@ -232,6 +235,11 @@ build/AlleleList.txt: | build
update-alleles: src/scripts/alleles/update_human_alleles.py ontology/chain-sequence.tsv ontology/chain.tsv ontology/molecule.tsv ontology/genetic-locus.tsv index.tsv build/hla_prot.fasta build/AlleleList.txt
python3 $^

.PHONY: update-G-domiain
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should be update-G-domain, also remove the blank line after this please

@rvita
Copy link
Collaborator

rvita commented Jul 27, 2021 via email

@jamesaoverton
Copy link
Collaborator

#112 has been merged. @apmody please update this PR from master and address all the requested changes.

Copy link
Collaborator

@beckyjackson beckyjackson left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

  1. Please add the G domain sequences to the chain-sequence sheet instead of a new sheet, as Randi requested. Your script should update this sheet as well.
  2. Do we need a properties sheet? If we put the "minimal HLA G domain sequence" in the index, it will have its label, but it just won't have a definition. It's one less sheet to keep track of and I would prefer not to add more sheets, but I'd like to get @jamesaoverton's opinion. EDIT: Please remove the properties sheet.
  3. The make update-G-domain process now hangs for me. I can't kill it with ctrl+C either (removing pdb.set_trace() resolves this - pdb should be removed from this anyway):
python3 src/scripts/alleles/G_domain.py
> /Users/jackson/github/MRO/src/scripts/alleles/G_domain.py(161)<module>()
-> with open("biopython.log", "r") as log_file:
(Pdb) 
(Pdb) --KeyboardInterrupt--
  1. Running make update-G-domain completely changes the G-domain-sequence sheet, so you can't tell what has actually changed when running git diff. The diff should only show changes, not line reordering, line terminator changes, whitespace, etc... I would expect running this to not result in any changes right now since it should be up-to-date. I left a comment in the script that should resolve the line terminator issue, but I still see reordering of the lines when I run it with this added.
  2. I can't tell if the other warnings from my original review were resolved because there's too much going on in the log file. I think a lot of this logging is unnecessary.

"DRB2"
}

logging.basicConfig(filename = 'biopython.log', filemode = 'w', level = logging.DEBUG )
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This log file should probably go in the build directory. This will log everything from both Biopython and this script, so biopython.log is not the most informative name.

Is there a reason you're using a log file instead of logging to the console? Looking at the log file, there is a LOT of stuff and it's hard to tell what is relevant. For example, I see a lot of:

WARNING:py.warnings:/Users/jackson/opt/anaconda3/lib/python3.8/site-packages/Bio/GenBank/Scanner.py:681: BiopythonParserWarning: EMBL sequence line missing coordinates
  warnings.warn(

There's also almost 50,000 lines in the log file.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sorry, I should have mentioned this before. This warning, when I am parsing the build/hla.dat file specifically happens when there is no nucleotide sequence like here. I spent some time trying to find a way around this, but I couldn't because the problem is with how Biopython parses the file. The rest of the lines are just there to help in debugging (specifically which entry in build/hla.dat is causing problems). I am pushing an update to fix this.

excluded_sequence.append(entry)
logging.info(entry.name + " ending")

import pdb; pdb.set_trace()
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please remove pdb


import csv
with open("ontology/G-domain-sequence.tsv", "w") as fh:
writer = csv.DictWriter(fh, fieldnames = G_domains[0].keys(), delimiter = "\t")
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please include lineterminator="\n" in your writer to prevent spurious diffs.

else:
raise Exception("Warning other than BiopythonParserWarning: ", log_line)

import csv
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Imports should be included at the top of the file.

@jamesaoverton
Copy link
Collaborator

We have other MRO-specific annotation properties in https://github.com/IEDB/MRO/blob/master/ontology/core.tsv. I don't think we need a new file for properties.

@apmody apmody requested a review from beckyjackson August 5, 2021 17:56
Copy link
Collaborator

@beckyjackson beckyjackson left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is still completely changing ontology/chain-sequence.tsv when I run make update-G-domain. It looks like any lines without a G domain sequence are missing a trailing tab, which gets added when I run the script. Let's make sure that this file is clean before merging this.

When I look at the log file now, I see many errors, for example:

ERROR:root:HLA04761, SHSMRYFFTSVSRPGRGEPRFIAVGYVDDTQFVRFDSDAASQKMEPRAPWIEQEGPEYWDQETRNMKAHSQTDRANLGTLRGYYNQSEDGSHTIQIMYGCDVGPDGRFLRGYRQDAYDGKDYIALNEDLRS*TAADMAAQITKRKWEAVHAAEQRRVYLEGRCVDGLRRYLENGKETLQRT, SHSMRYFFTSVSRPGRGEPRFIAVGYVDDTQFVRFDSDAASQKMEPRAPWIEQEGPEYWDQETRNMKAHSQTDRANLGTLRGYYNQSEDGSHTIQIMYGCDVGPDGRFLRGYRQDAYDGKDYIALNEDLRSX not matched

I see this comes from your script - are these true errors or should they just be warnings?

Otherwise, I just requested two small changes in the script.


#logging.basicConfig(filename = 'build/biopython.log', filemode = 'w', level = logging.DEBUG )
#logging.basicConfig(filename = 'biopython.log', filemode = 'w', level = logging.INFO )
logging.basicConfig(filename = 'biopython.log', filemode = 'w', level = logging.WARNING )
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This still needs to go in the build/ directory. Further down, you open the log build/biopython.log, which causes a FileNotFoundError.

updated = []
reader = csv.DictReader(chain_sequence, delimiter = "\t")
robot_string = next(reader)
robot_string["minimal HLA G domain sequence"] = "A minimal HLA G domain sequence"
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This can be removed, since it's part of the ROBOT template strings now.

@apmody apmody requested a review from beckyjackson August 11, 2021 22:11
Copy link
Collaborator

@beckyjackson beckyjackson left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The code runs fine and I believe it does what's expected. I did not review the additions to the chain-sequence sheet though. I don't want to approve this until we know that the content is correct.

@jamesaoverton
Copy link
Collaborator

Ok, thanks @beckyjackson. I'll review next.

@jamesaoverton
Copy link
Collaborator

@apmody I'd appreciate it if you can update this branch from master and resolve the merge conflicts.

Copy link
Collaborator

@beckyjackson beckyjackson left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think the code has changed since I last reviewed this, but I re-ran it to double check. The chain-sequence file looks good and no spurious diffs were introduced.

I'm still seeing a ton of warnings:

WARNING:py.warnings:/Users/jackson/opt/anaconda3/lib/python3.8/site-packages/Bio/GenBank/Scanner.py:681: BiopythonParserWarning: EMBL sequence line missing coordinates
  warnings.warn(

I know Biopython can be noisy. I'm not sure if this is an important warning or not (I haven't looked into it), but if it's not, can we suppress it? I guess this request isn't a big deal.

Copy link
Collaborator

@jamesaoverton jamesaoverton left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm seeing just 1000 out of 24,000+ lines that should be in chain-sequence.tsv. There are no HLAs, and so the 'minimal HLA G domain sequence' column is always empty. That can't be right.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants