Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Update scripts #121

Open
wants to merge 2 commits into
base: master
Choose a base branch
from
Open

Update scripts #121

wants to merge 2 commits into from

Conversation

acrinklaw
Copy link

The only comments I have for this are related to things that we have gone back and forth on:

This script currently:

  • Accepts alleles > 2 fields long
  • Takes alleles with suffixes

Both of these can be altered very easily if this isn't desired behavior
Line 79 for number of fields
allele = (":").join(allele.split(":")[:4])

Line 133 can be copied and changed to remove sequences with suffixes i.e. BoLA-1*01:01N
missing_alleles = {x for x in missing_alleles if x[-1] != 'N'}

@beckyjackson
Copy link
Collaborator

beckyjackson commented Oct 15, 2021

When I run this, I see 11771 new terms added to MRO. Is this expected?

The script also prints a lot of things. It's printing out some dictionaries that I don't think are useful to log (maybe for debugging?), but I also get this message multiple times:

HLA-DRA not available in locus-data.json, please add it.

@acrinklaw
Copy link
Author

When I run this, I see 11771 new terms added to MRO. Is this expected?

The script also prints a lot of things. It's printing out some dictionaries that I don't think are useful to log (maybe for debugging?), but I also get this message multiple times:

HLA-DRA not available in locus-data.json, please add it.

The 11k sequences seems right but only because I am not restricting to 2 fields for the HLA alleles (meaning there are probably duplicates HLA-A01:01 -> HLA-A01:01:01 HLA-A*01:01:02). These are technically distinct alleles but often times these mutations they refer to have no effect on binding and they are effectively the same. We went back and forth too much and I don't know what's best. We settled on 2 fields when I was working on HLA, but when the tools team needed the terms, they needed the full fields. That can be altered on Line 79.

For the print statements, you can definitely remove them as they were for debugging. I was in a bit of a rush to make this PR before I left so my apologies :^( that message about HLA-DR also seems to be a bit off, it just suggests there's something weird going on between the JSON datafile and what the script finds in the database. I can debug after work if you'd like.

@rvita
Copy link
Collaborator

rvita commented Oct 15, 2021 via email

@acrinklaw
Copy link
Author

@rvita these are the full sequences. Maybe Apurva can combine his G domain work with mine to clear up the multiple fields issue, but this still doesn't address the tools team's needs.

@jamesaoverton jamesaoverton self-requested a review November 29, 2021 14:02
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants