-
Notifications
You must be signed in to change notification settings - Fork 6
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Update scripts #121
base: master
Are you sure you want to change the base?
Update scripts #121
Conversation
When I run this, I see 11771 new terms added to MRO. Is this expected? The script also prints a lot of things. It's printing out some dictionaries that I don't think are useful to log (maybe for debugging?), but I also get this message multiple times:
|
The 11k sequences seems right but only because I am not restricting to 2 fields for the HLA alleles (meaning there are probably duplicates HLA-A01:01 -> HLA-A01:01:01 HLA-A*01:01:02). These are technically distinct alleles but often times these mutations they refer to have no effect on binding and they are effectively the same. We went back and forth too much and I don't know what's best. We settled on 2 fields when I was working on HLA, but when the tools team needed the terms, they needed the full fields. That can be altered on Line 79. For the print statements, you can definitely remove them as they were for debugging. I was in a bit of a rush to make this PR before I left so my apologies :^( that message about HLA-DR also seems to be a bit off, it just suggests there's something weird going on between the JSON datafile and what the script finds in the database. I can debug after work if you'd like. |
is this the g domains? if so, I'd like to see a list of those added, as
well as the full seqs for comparison.
…On Fri, Oct 15, 2021 at 9:23 AM Becky Jackson ***@***.***> wrote:
When I run this, I see 11771 new terms added to MRO. Is this expected?
The script also prints a lot of things. It's printing out some
dictionaries that I don't think are useful to log, but I also get this
message multiple times:
HLA-DRA not available in locus-data.json, please add it.
—
You are receiving this because you are subscribed to this thread.
Reply to this email directly, view it on GitHub
<#121 (comment)>, or
unsubscribe
<https://github.com/notifications/unsubscribe-auth/ADKDY42SRUBP5G7FYXMDNVDUHBIR5ANCNFSM5E3PKUIQ>
.
Triage notifications on the go with GitHub Mobile for iOS
<https://apps.apple.com/app/apple-store/id1477376905?ct=notification-email&mt=8&pt=524675>
or Android
<https://play.google.com/store/apps/details?id=com.github.android&referrer=utm_campaign%3Dnotification-email%26utm_medium%3Demail%26utm_source%3Dgithub>.
--
Randi Vita, M.D.
Lead Ontology and Quality Manager
Immune Epitope Database and Analysis Project
La Jolla Institute for Allergy & Immunology
9420 Athena Circle
La Jolla, Ca 92037
***@***.***
www.immuneepitope.org
858-752-6912
|
@rvita these are the full sequences. Maybe Apurva can combine his G domain work with mine to clear up the multiple fields issue, but this still doesn't address the tools team's needs. |
The only comments I have for this are related to things that we have gone back and forth on:
This script currently:
Both of these can be altered very easily if this isn't desired behavior
Line 79 for number of fields
allele = (":").join(allele.split(":")[:4])
Line 133 can be copied and changed to remove sequences with suffixes i.e. BoLA-1*01:01N
missing_alleles = {x for x in missing_alleles if x[-1] != 'N'}