-
Notifications
You must be signed in to change notification settings - Fork 6
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Add G domain (peptide biding region) sequence to HLA chain terms. #109
base: master
Are you sure you want to change the base?
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
With a few minor fixes, this builds (see specific comments). I think a better table name might be chain-g-domain
, but that's not really necessary to change. This data could actually go in chain-sequence
instead of a new table, but if we're only doing HLA G domains, then it might be unnecessary.
The ID for the new property you created will need to be updated since there's been updates to MRO since this PR was made.
Finally, running make update-G-domain
logs some warnings from biopython. These might be worth investigating:
/Users/jackson/github/MRO/_venv/lib/python3.8/site-packages/Bio/Seq.py:2334: BiopythonWarning: Partial codon, len(sequence) not a multiple of three. Explicitly trim the sequence or add trailing N before translation. This may become an error in future.
warnings.warn(
/Users/jackson/github/MRO/_venv/lib/python3.8/site-packages/Bio/GenBank/Scanner.py:681: BiopythonParserWarning: EMBL sequence line missing coordinates
warnings.warn(
['HLA00938',
SeqRecord(seq=Seq('X'), id='HLA00022.1', name='HLA00022', description='HLA-A*02:17:01, Human MHC Class I sequence Sequence has now been shown to be in error and is identical to A*02:17:02:01 (August 2018)', dbxrefs=[]),
SeqRecord(seq=Seq('X'), id='HLA02459.1', name='HLA02459', description='HLA-C*07:37:01:01, Human MHC Class I sequence Sequence has now been shown to be in error and is identical to C*07:37:01:02 (October 2020)', dbxrefs=[]),
SeqRecord(seq=Seq('X'), id='HLA00867.1', name='HLA00867', description='HLA-DRB1*15:02:01:01, Human MHC Class II sequence Sequence has now been shown to be in error and is identical to DRB1*15:02:01:02 (March 2021)', dbxrefs=[]),
SeqRecord(seq=Seq('X'), id='HLA24174.1', name='HLA24174', description='HLA-DQA1*02:06, Human MHC Class II sequence Sequence has now been shown to be in error and is identical to DQA1*02:01:01:01 (September 2020)', dbxrefs=[]),
SeqRecord(seq=Seq('X'), id='HLA00624.1', name='HLA00624', description='HLA-DQB1*02:03:01, Human MHC Class II sequence Sequence extended and renamed DQB1*02:180 (November 2020)', dbxrefs=[]),
SeqRecord(seq=Seq('X'), id='HLA09735.1', name='HLA09735', description='HLA-DQB1*06:92:01, Human MHC Class II sequence Sequence extended and renamed DQB1*06:385 (November 2020)', dbxrefs=[]),
SeqRecord(seq=Seq('X'), id='HLA00508.1', name='HLA00508', description='HLA-DPA1*02:02:01, Human MHC Class II sequence Sequence has now been shown to be in error and is identical to DPA1*02:07:01:01 (March 2017)', dbxrefs=[])]
src/scripts/alleles/G_domain.py
Outdated
with open("ontology/G-domain-sequence.tsv", "w") as fh: | ||
writer = csv.DictWriter(fh, fieldnames = G_domains[0].keys(), delimiter = "\t") | ||
writer.writeheader() | ||
fh.write("LABEL\tA minimal G domain sequence") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This needs a \n
at the end of this line, otherwise your first row ends up on the same line as the ROBOT template strings.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This should also match the label in ontology/properties.tsv
, which is "minimal HLA G domain sequence".
Makefile
Outdated
@@ -232,6 +235,11 @@ build/AlleleList.txt: | build | |||
update-alleles: src/scripts/alleles/update_human_alleles.py ontology/chain-sequence.tsv ontology/chain.tsv ontology/molecule.tsv ontology/genetic-locus.tsv index.tsv build/hla_prot.fasta build/AlleleList.txt | |||
python3 $^ | |||
|
|||
.PHONY: update-G-domiain |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should be update-G-domain
, also remove the blank line after this please
i thought this info was going in chain-sequence as discussed in the
ontology mtg
…On Tue, Jul 27, 2021 at 11:56 AM Becky Jackson ***@***.***> wrote:
@beckyjackson requested changes on this pull request.
With a few minor fixes, this builds (see specific comments). I think a better table name might be chain-g-domain, but that's not really necessary to change.
This data could actually go in chain-sequence instead of a new table, but if we're only doing HLA G domains, then it might be unnecessary.
________________________________
In src/scripts/alleles/G_domain.py:
> + else:
+ G_domain = G_domain
+ G_domains.append({"Label": data[entry.name], "minimal G domain sequence" : G_domain})
+ except BiopythonWarning:
+ print("BiopythonWarning")
+ except AttributeError:
+ if str(entry.seq) == 'X':
+ excluded_sequence.append(entry)
+
+print(excluded_sequence)
+
+import csv
+with open("ontology/G-domain-sequence.tsv", "w") as fh:
+ writer = csv.DictWriter(fh, fieldnames = G_domains[0].keys(), delimiter = "\t")
+ writer.writeheader()
+ fh.write("LABEL\tA minimal G domain sequence")
This needs a \n at the end of this line, otherwise your first row ends up on the same line as the ROBOT template strings.
________________________________
In Makefile:
> @@ -232,6 +235,11 @@ build/AlleleList.txt: | build
update-alleles: src/scripts/alleles/update_human_alleles.py ontology/chain-sequence.tsv ontology/chain.tsv ontology/molecule.tsv ontology/genetic-locus.tsv index.tsv build/hla_prot.fasta build/AlleleList.txt
python3 $^
+.PHONY: update-G-domiain
Should be update-G-domain, also remove the blank line after this please
________________________________
In src/scripts/alleles/G_domain.py:
> + else:
+ G_domain = G_domain
+ G_domains.append({"Label": data[entry.name], "minimal G domain sequence" : G_domain})
+ except BiopythonWarning:
+ print("BiopythonWarning")
+ except AttributeError:
+ if str(entry.seq) == 'X':
+ excluded_sequence.append(entry)
+
+print(excluded_sequence)
+
+import csv
+with open("ontology/G-domain-sequence.tsv", "w") as fh:
+ writer = csv.DictWriter(fh, fieldnames = G_domains[0].keys(), delimiter = "\t")
+ writer.writeheader()
+ fh.write("LABEL\tA minimal G domain sequence")
This should also match the label in ontology/properties.tsv, which is "minimal HLA G domain sequence".
—
You are receiving this because your review was requested.
Reply to this email directly, view it on GitHub, or unsubscribe.
--
Randi Vita, M.D.
Lead Ontology and Quality Manager
Immune Epitope Database and Analysis Project
La Jolla Institute for Allergy & Immunology
9420 Athena Circle
La Jolla, Ca 92037
***@***.***
www.immuneepitope.org
858-752-6912
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
- Please add the G domain sequences to the
chain-sequence
sheet instead of a new sheet, as Randi requested. Your script should update this sheet as well. Do we need aEDIT: Please remove theproperties
sheet? If we put the "minimal HLA G domain sequence" in the index, it will have its label, but it just won't have a definition. It's one less sheet to keep track of and I would prefer not to add more sheets, but I'd like to get @jamesaoverton's opinion.properties
sheet.- The
make update-G-domain
process now hangs for me. I can't kill it with ctrl+C either (removingpdb.set_trace()
resolves this -pdb
should be removed from this anyway):
python3 src/scripts/alleles/G_domain.py
> /Users/jackson/github/MRO/src/scripts/alleles/G_domain.py(161)<module>()
-> with open("biopython.log", "r") as log_file:
(Pdb)
(Pdb) --KeyboardInterrupt--
- Running
make update-G-domain
completely changes theG-domain-sequence
sheet, so you can't tell what has actually changed when runninggit diff
. The diff should only show changes, not line reordering, line terminator changes, whitespace, etc... I would expect running this to not result in any changes right now since it should be up-to-date. I left a comment in the script that should resolve the line terminator issue, but I still see reordering of the lines when I run it with this added. - I can't tell if the other warnings from my original review were resolved because there's too much going on in the log file. I think a lot of this logging is unnecessary.
src/scripts/alleles/G_domain.py
Outdated
"DRB2" | ||
} | ||
|
||
logging.basicConfig(filename = 'biopython.log', filemode = 'w', level = logging.DEBUG ) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This log file should probably go in the build
directory. This will log everything from both Biopython and this script, so biopython.log
is not the most informative name.
Is there a reason you're using a log file instead of logging to the console? Looking at the log file, there is a LOT of stuff and it's hard to tell what is relevant. For example, I see a lot of:
WARNING:py.warnings:/Users/jackson/opt/anaconda3/lib/python3.8/site-packages/Bio/GenBank/Scanner.py:681: BiopythonParserWarning: EMBL sequence line missing coordinates
warnings.warn(
There's also almost 50,000 lines in the log file.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sorry, I should have mentioned this before. This warning, when I am parsing the build/hla.dat file specifically happens when there is no nucleotide sequence like here. I spent some time trying to find a way around this, but I couldn't because the problem is with how Biopython parses the file. The rest of the lines are just there to help in debugging (specifically which entry in build/hla.dat is causing problems). I am pushing an update to fix this.
src/scripts/alleles/G_domain.py
Outdated
excluded_sequence.append(entry) | ||
logging.info(entry.name + " ending") | ||
|
||
import pdb; pdb.set_trace() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Please remove pdb
src/scripts/alleles/G_domain.py
Outdated
|
||
import csv | ||
with open("ontology/G-domain-sequence.tsv", "w") as fh: | ||
writer = csv.DictWriter(fh, fieldnames = G_domains[0].keys(), delimiter = "\t") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Please include lineterminator="\n"
in your writer to prevent spurious diffs.
src/scripts/alleles/G_domain.py
Outdated
else: | ||
raise Exception("Warning other than BiopythonParserWarning: ", log_line) | ||
|
||
import csv |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Imports should be included at the top of the file.
We have other MRO-specific annotation properties in https://github.com/IEDB/MRO/blob/master/ontology/core.tsv. I don't think we need a new file for properties. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is still completely changing ontology/chain-sequence.tsv
when I run make update-G-domain
. It looks like any lines without a G domain sequence are missing a trailing tab, which gets added when I run the script. Let's make sure that this file is clean before merging this.
When I look at the log file now, I see many errors, for example:
ERROR:root:HLA04761, SHSMRYFFTSVSRPGRGEPRFIAVGYVDDTQFVRFDSDAASQKMEPRAPWIEQEGPEYWDQETRNMKAHSQTDRANLGTLRGYYNQSEDGSHTIQIMYGCDVGPDGRFLRGYRQDAYDGKDYIALNEDLRS*TAADMAAQITKRKWEAVHAAEQRRVYLEGRCVDGLRRYLENGKETLQRT, SHSMRYFFTSVSRPGRGEPRFIAVGYVDDTQFVRFDSDAASQKMEPRAPWIEQEGPEYWDQETRNMKAHSQTDRANLGTLRGYYNQSEDGSHTIQIMYGCDVGPDGRFLRGYRQDAYDGKDYIALNEDLRSX not matched
I see this comes from your script - are these true errors or should they just be warnings?
Otherwise, I just requested two small changes in the script.
src/scripts/alleles/G_domain.py
Outdated
|
||
#logging.basicConfig(filename = 'build/biopython.log', filemode = 'w', level = logging.DEBUG ) | ||
#logging.basicConfig(filename = 'biopython.log', filemode = 'w', level = logging.INFO ) | ||
logging.basicConfig(filename = 'biopython.log', filemode = 'w', level = logging.WARNING ) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This still needs to go in the build/
directory. Further down, you open the log build/biopython.log
, which causes a FileNotFoundError
.
src/scripts/alleles/G_domain.py
Outdated
updated = [] | ||
reader = csv.DictReader(chain_sequence, delimiter = "\t") | ||
robot_string = next(reader) | ||
robot_string["minimal HLA G domain sequence"] = "A minimal HLA G domain sequence" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This can be removed, since it's part of the ROBOT template strings now.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The code runs fine and I believe it does what's expected. I did not review the additions to the chain-sequence
sheet though. I don't want to approve this until we know that the content is correct.
Ok, thanks @beckyjackson. I'll review next. |
@apmody I'd appreciate it if you can update this branch from |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't think the code has changed since I last reviewed this, but I re-ran it to double check. The chain-sequence file looks good and no spurious diffs were introduced.
I'm still seeing a ton of warnings:
WARNING:py.warnings:/Users/jackson/opt/anaconda3/lib/python3.8/site-packages/Bio/GenBank/Scanner.py:681: BiopythonParserWarning: EMBL sequence line missing coordinates
warnings.warn(
I know Biopython can be noisy. I'm not sure if this is an important warning or not (I haven't looked into it), but if it's not, can we suppress it? I guess this request isn't a big deal.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm seeing just 1000 out of 24,000+ lines that should be in chain-sequence.tsv
. There are no HLAs, and so the 'minimal HLA G domain sequence' column is always empty. That can't be right.
Add G domain sequence for HLA molecules.