HGNC Values are Missing on Some Transcripts #217

Open
akeeeshi opened this issue Mar 19, 2019 · 7 comments
Labels
bug Something isn't working

Comments

@akeeeshi

Our team recently noticed that the hgnc field is empty for a small subset of transcripts in UTA. The entry below compares the record for a BRAF transcript with the record for an MFSD11 transcript.

image

Investigating further, our team found roughly 36 more transcripts with this issue using a SQL query filtering on hgnc = ''.

image
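For anyone wanting to reproduce the check, here is a minimal sketch of such a query. It runs against an in-memory SQLite stand-in rather than the real PostgreSQL database. The two-column `transcript` table, the `ac` column name, and the toy accessions are assumptions for illustration; only the `hgnc` column and the blank-string condition come from this report. The sketch also matches NULL, in case a future release stores missing symbols that way.

```python
import sqlite3

# Toy stand-in for UTA's transcript table (assumption: the real table is in
# PostgreSQL and has more columns). All rows below are invented.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE transcript (ac TEXT PRIMARY KEY, hgnc TEXT)")
conn.executemany(
    "INSERT INTO transcript (ac, hgnc) VALUES (?, ?)",
    [("TX_0001.1", "BRAF"), ("TX_0002.1", ""), ("TX_0003.1", None)],
)


def blank_hgnc_transcripts(conn):
    """Return accessions whose hgnc is an empty string or NULL."""
    rows = conn.execute(
        "SELECT ac FROM transcript WHERE hgnc = '' OR hgnc IS NULL ORDER BY ac"
    )
    return [ac for (ac,) in rows]


print(blank_hgnc_transcripts(conn))
```

Note that `hgnc = ''` alone would not match NULL values, which is why the sketch tests both conditions.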

What are your thoughts on where this discrepancy comes from? RefSeq appears to be up to date in associating this transcript with the MFSD11 gene. I am trying to work out whether the source of the UTA data needs to be fixed or whether this is an issue with how the UTA database is built.

As a note, we also checked older versions of UTA, and the issue appears to occur there as well.

@reece
Member

reece commented Mar 20, 2019

I'm surprised by your finding. Based on the query below, this appears to be an historical artifact that affected a small number of transcripts loaded before Aug 2016.

I have historical "txinfo" files that contain the data that were actually loaded and the gene is indeed blank in some cases. (BTW, I wish this were NULL and not blank, but that's an aside.)

The txinfo files are made by merging several files from NCBI, one of which supplies the gene symbols. Unfortunately, NCBI's file releases were not coordinated back then, so snapshots of those files were sometimes inconsistent. For example, one version of a transcript might be in one file, and a different version in another. So, my best guess is that these particular transcripts did not have transcript-gene associations at the time they were loaded. The data loader handles cases where the transcript exists and the associated gene symbol changes, so I think the cases you found were first created when the gene symbol didn't exist, and they persist because the transcript was not reloaded before it became deprecated. Furthermore, because it didn't happen at all between 2016 and 2018, I think something about the process was fixed. This is all a guess at the mechanism.
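The guessed mechanism can be illustrated with a small sketch. This is not the actual UTA loader: `build_txinfo`, `load`, the dict-based "database", and all accessions and symbols are hypothetical names invented here to show how an uncoordinated snapshot could leave a blank symbol behind.

```python
def build_txinfo(transcripts, tx_to_gene):
    """Attach a gene symbol to each transcript; blank if no association yet."""
    return {ac: tx_to_gene.get(ac, "") for ac in transcripts}


def load(db, txinfo):
    """Hypothetical loader: insert new transcripts, overwrite changed symbols."""
    for ac, symbol in txinfo.items():
        db[ac] = symbol


db = {}

# Snapshot 1: the transcript file already lists TX_B.1, but the gene file
# does not yet contain its transcript-gene association.
load(db, build_txinfo(["TX_A.1", "TX_B.1"], {"TX_A.1": "GENE1"}))

# Snapshot 2: TX_B.1 has been superseded by TX_B.2, so TX_B.1 is never
# reloaded and its blank symbol persists.
load(db, build_txinfo(["TX_A.1", "TX_B.2"], {"TX_A.1": "GENE1", "TX_B.2": "GENE2"}))

print(db)
```

In this toy run the deprecated `TX_B.1` keeps its blank symbol while its successor gets one, which matches the pattern described above.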

I am actively thinking about completely rewriting UTA to streamline loading so that it's easier to keep UTA up-to-date.

We could update transcripts to add symbols where missing. Would that be helpful?
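A rough sketch of what such a backfill might look like, again using an in-memory SQLite stand-in with an assumed `ac` column and invented rows. The `current_symbols` mapping is hypothetical; in practice it would have to come from a current transcript-to-gene source. Only rows whose hgnc is blank are touched, so existing symbols are never overwritten.

```python
import sqlite3

# Toy stand-in for the transcript table; all rows are invented.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE transcript (ac TEXT PRIMARY KEY, hgnc TEXT)")
conn.executemany(
    "INSERT INTO transcript (ac, hgnc) VALUES (?, ?)",
    [("TX_A.1", "GENE1"), ("TX_B.1", ""), ("TX_C.1", "")],
)

# Hypothetical up-to-date transcript-to-symbol mapping.
current_symbols = {"TX_A.1": "OTHER", "TX_B.1": "GENE2"}


def backfill_hgnc(conn, symbols):
    """Fill in blank hgnc values only; return the number of rows changed."""
    changed = 0
    for ac, symbol in symbols.items():
        cur = conn.execute(
            "UPDATE transcript SET hgnc = ? WHERE ac = ? AND hgnc = ''",
            (symbol, ac),
        )
        changed += cur.rowcount
    conn.commit()
    return changed


n_updated = backfill_hgnc(conn, current_symbols)
print(n_updated)
```

Transcripts with no entry in the mapping (here `TX_C.1`) stay blank and would need separate review.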

Thanks,
Reece

anonymous@uta/uta=> select min(added),max(added),count(distinct added) n_dates, count(*) as n_transcripts, hgnc = '' as hgnc_is_blank from uta_20180821.transcript group by 5;
┌────────────────────────────┬────────────────────────────┬─────────┬───────────────┬───────────────┐
│            min             │            max             │ n_dates │ n_transcripts │ hgnc_is_blank │
├────────────────────────────┼────────────────────────────┼─────────┼───────────────┼───────────────┤
│ 2014-02-11 00:00:18.453854 │ 2018-08-22 08:52:41.710398 │      22 │        249873 │ f             │
│ 2014-02-11 00:00:18.453854 │ 2016-08-26 16:50:08.119785 │       4 │            36 │ t             │
└────────────────────────────┴────────────────────────────┴─────────┴───────────────┴───────────────┘

@akeeeshi
Author

Updating the transcripts to add the symbols would be awesome! Would this be part of a new UTA release coming down the road?

As a side note, please let us know if there is any way we can help if you choose to rewrite UTA (QA, testing, etc.). This project has been immensely valuable to our organization, so we want to contribute back in whatever way is most helpful.

@reece
Member

reece commented Mar 22, 2019

Definitely part of a new release and carried into future releases.

Thank you for the offer. I will take you up on PRs and offers to help eventually.


github-actions bot commented Dec 6, 2023

This issue is stale because it has been open 30 days with no activity. Remove stale label or comment or this will be closed in 7 days.

@github-actions github-actions bot added the stale Issue is stale and subject to automatic closing label Dec 6, 2023

This issue was closed because it has been stalled for 7 days with no activity.

@github-actions github-actions bot closed this as not planned Won't fix, can't repro, duplicate, stale Dec 13, 2023
@reece reece reopened this Feb 19, 2024
@reece reece added resurrected and removed stale Issue is stale and subject to automatic closing labels Feb 19, 2024
@reece
Member

reece commented Feb 19, 2024

This issue was closed by stalebot. It has been reopened to give more time for community review. See biocommons coding guidelines for stale issue and pull request policies. This resurrection is expected to be a one-time event.


This issue is stale because it has been open 90 days with no activity. Remove stale label or comment or this will be closed in 7 days.

@github-actions github-actions bot added the stale Issue is stale and subject to automatic closing label May 20, 2024
@jsstevenson jsstevenson added bug Something isn't working and removed stale Issue is stale and subject to automatic closing resurrected labels May 20, 2024