RefCDS discrepancy when loading human GRCh38? #32
Comments
Hello,

Thank you for your interest in dNdScv. I suspect the problem is that your chromosome names are "1", "2", ... whereas the GRCh38 convention in Ensembl uses the chr prefix: "chr1", "chr2". As a result, dndscv does not find any overlaps. I will add a warning in the code to flag this, and I will consider converting chromosome names automatically in the future, although I have been reluctant to do so because it could cause problems with user-defined names for other species. For now, just try:

Also, please note that you do not need to restrict your input table of mutations to coding mutations. The dndscv function does that at the start using its own set of transcripts. It is better to feed all mutations, coding and noncoding, to dndscv and let the function do the filtering for you. This avoids accidentally introducing biases through different transcript definitions.

I hope this helps!
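The renaming suggested above can be sketched in base R as follows. This is a minimal illustration, not the maintainer's exact command: the toy table, its values, and the column names (`sampleID`, `chr`, `pos`, `ref`, `mut`) are assumptions made for the example.

```r
# Toy mutation table (hypothetical values and column names, for illustration only)
mutations <- data.frame(
  sampleID = c("S1", "S1", "S2"),
  chr      = c("1", "12", "chrX"),
  pos      = c(1000L, 2000L, 3000L),
  ref      = c("A", "C", "G"),
  mut      = c("T", "G", "A"),
  stringsAsFactors = FALSE
)

# Add the "chr" prefix only where it is missing, so names that are
# already prefixed are left untouched
needs_prefix <- !grepl("^chr", mutations$chr)
mutations$chr[needs_prefix] <- paste0("chr", mutations$chr[needs_prefix])

print(mutations$chr)
```

Checking for an existing prefix first makes the step safe to run twice on the same table.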
I have now added a more informative error message based on previous feedback from users: "Zero coding substitutions found in this dataset. Unable to run dndscv. Common causes for this error are inputting only indels or using chromosome names different to those in the reference database (e.g. chr1 vs 1)"
By the way, you can see the chromosome names used by RefCDS using:
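One way to list those names is sketched below. It assumes that each element of the RefCDS list carries a `chr` field; `RefCDS_toy` is a hypothetical stand-in so that the snippet runs without the .rda file, and with the real data the same `sapply` call would be applied to `RefCDS` after `load()`.

```r
# Hypothetical stand-in for the RefCDS list; the gene names and
# chromosomes are placeholders, not real annotation
RefCDS_toy <- list(
  list(gene_name = "GENE_A", chr = "chr1"),
  list(gene_name = "GENE_B", chr = "chr12"),
  list(gene_name = "GENE_C", chr = "chr1")
)

# Collect the chromosome name of every entry and keep the distinct values
chr_names <- unique(sapply(RefCDS_toy, function(x) x$chr))
print(chr_names)
```

Comparing this vector with `unique(mutations$chr)` shows quickly whether the two naming schemes agree.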
I'm trying to run dndscv with Human GRCh38 as the reference. I've filtered my data to only include exonic SNVs and formatted it like this:
...with all columns being "character" class, except for pos which is class "integer".
I've downloaded the file you provided, RefCDS_human_GRCh38.p12.rda, and attempted to run the following command:
library(dndscv)
dndsout = dndscv(v2, refdb = "~/Downloads/RefCDS_human_GRCh38.p12.rda", cv = NULL)
...which returns an error:
After reading this Biostars post and this issue, I decided to manually check for overlap between my mutations and the GRCh38 reference file you provided. This led me to discover a strange discrepancy. With an empty environment, I ran:
load("~/Downloads/RefCDS_human_GRCh38.p12.rda")
...which created two objects named "gr_genes" and "RefCDS". When I examine the RefCDS object in RStudio (first clicking on it in the upper-right "Environment" pane, then clicking on the 4th element), it appears to show the gene A2ML1 with coordinates from GRCh37. Specifically, the "intervals_cds" vector contains the following values:
However, when I query the same gene using the console it returns coordinates like I would expect:
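A console lookup of this kind might look like the sketch below. It assumes each RefCDS element has `gene_name`, `chr`, and `intervals_cds` fields; `RefCDS_toy`, the gene names, and all coordinates are hypothetical placeholders, not real GRCh37 or GRCh38 values.

```r
# Hypothetical stand-in for RefCDS; intervals_cds holds one row per
# CDS segment with columns (start, end)
RefCDS_toy <- list(
  list(gene_name = "GENE_A", chr = "chr1",
       intervals_cds = matrix(c(100L, 150L, 300L, 360L), ncol = 2, byrow = TRUE)),
  list(gene_name = "GENE_B", chr = "chr12",
       intervals_cds = matrix(c(500L, 590L), ncol = 2, byrow = TRUE))
)

# Find the entry by gene name instead of relying on its position in the list
gene_names <- sapply(RefCDS_toy, function(x) x$gene_name)
idx <- which(gene_names == "GENE_B")
stopifnot(length(idx) == 1)

entry <- RefCDS_toy[[idx]]
print(entry$chr)
print(entry$intervals_cds)
```

Selecting by name makes it explicit which entry is being inspected, whatever element number a viewer happens to display.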
I'm wondering if this discrepancy could somehow be causing the error message I'm getting. My dataset definitely contains non-synonymous A2ML1 mutations; the following command demonstrates this using GRCh38 coordinates from the previous command:
Although only the bottom row of this output is a non-synonymous SNV, it might be worth mentioning that roughly half the dataset is non-synonymous SNVs.
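A manual overlap check like the one described can be sketched in base R. The data below are toy values (hypothetical chromosome, intervals, and positions), and `count_hits` is a helper made up for this example, not a dndscv function. The sketch also shows why a position match alone is not enough when the chromosome names differ.

```r
# Hypothetical CDS intervals for one gene: one row per segment, columns (start, end)
cds_chr <- "chr12"
cds <- matrix(c(500L, 590L, 700L, 760L), ncol = 2, byrow = TRUE)

# Toy mutations using unprefixed chromosome names
muts <- data.frame(
  chr = c("12", "12", "12"),
  pos = c(510L, 650L, 705L),
  stringsAsFactors = FALSE
)

# Count mutations whose chromosome name matches and whose position
# falls inside any CDS segment (illustrative helper)
count_hits <- function(muts, cds_chr, cds) {
  hit <- vapply(seq_len(nrow(muts)), function(i) {
    muts$chr[i] == cds_chr &&
      any(muts$pos[i] >= cds[, 1] & muts$pos[i] <= cds[, 2])
  }, logical(1))
  sum(hit)
}

n_before <- count_hits(muts, cds_chr, cds)   # names "12" never equal "chr12"

muts$chr <- paste0("chr", muts$chr)          # harmonise the naming scheme
n_after <- count_hits(muts, cds_chr, cds)

print(c(before = n_before, after = n_after))
```

In this toy case no mutation is counted until the names are harmonised, even though two of the three positions lie inside the intervals.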
Here's what I'm working with: