Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

sex determination #182

Open
AnjaliC4 opened this issue Apr 10, 2021 · 8 comments
Open

sex determination #182

AnjaliC4 opened this issue Apr 10, 2021 · 8 comments

Comments

@AnjaliC4
Copy link

AnjaliC4 commented Apr 10, 2021

Hi,
Thanks for this very cool tool!

Could you please suggest a way to determine sex of every sample in vcfR?
One approach I know is heterozygosity rate on chrX. However, not sure how to calclaute that in vcfR- would appreciate your advise.

In addition, could you please tell me the behavior of extract.gt(element="PL", as.numeric="TRUE"). I get numeric values but not sure how they go from e.g. 0:20:30 to one number.
Thanks a lot!

@knausb
Copy link
Owner

knausb commented Apr 11, 2021 via email

@AnjaliC4
Copy link
Author

Hi Brian,

I am working with humans. I am trying to figure out a way to determine chrX heterozygosity using "PL" which as you guessed, is absolutely correct, is the "Phred scaled likelihood." Yes, it does produce a numeric matrix, however, PL values are usually in for exmaple 0:20:40 format, corresponding to 0/0:0/1:1/1 genotypes. I am unsure how "extract.gt(element="PL", as.numeric="TRUE")" converts 0:20:40 to a number? Is it the lowest number that gets extracted as that represents the genotype likelihood?

Thank you
Anjali

@knausb
Copy link
Owner

knausb commented Apr 13, 2021

Hi AnjaliC4,

I think you want something like the following.

library(vcfR)
#> 
#>    *****       ***   vcfR   ***       *****
#>    This is vcfR 1.12.0 
#>      browseVignettes('vcfR') # Documentation
#>      citation('vcfR') # Citation
#>    *****       *****      *****       *****

data("vcfR_example")
vcf
#> ***** Object of Class vcfR *****
#> 18 samples
#> 1 CHROMs
#> 2,533 variants
#> Object size: 3.2 Mb
#> 8.497 percent missing data
#> *****        *****         *****
vcf@gt[1:3, 1:4]
#>      FORMAT           BL2009P4_us23               DDR7602                  
#> [1,] "GT:AD:DP:GQ:PL" "0|0:62,0:62:99:0,190,2835" "0|0:12,0:12:39:0,39,585"
#> [2,] "GT:AD:DP:GQ:PL" "1|0:5,5:10:99:111,0,114"   NA                       
#> [3,] "GT:AD:DP:GQ:PL" NA                          NA                       
#>      IN2009T1_us22              
#> [1,] "0|0:37,0:37:99:0,114,1709"
#> [2,] "0|1:2,1:3:16:16,0,48"     
#> [3,] "0|0:2,0:2:6:0,6,51"
pl <- extract.gt(vcf, element = "PL")
pl[1:4, 1:6]
#>                      BL2009P4_us23 DDR7602    IN2009T1_us22 LBUS5      NL07434 
#> Supercontig_1.50_2   "0,190,2835"  "0,39,585" "0,114,1709"  "0,39,585" NA      
#> Supercontig_1.50_246 "111,0,114"   NA         "16,0,48"     NA         NA      
#> Supercontig_1.50_549 NA            NA         "0,6,51"      NA         NA      
#> Supercontig_1.50_668 "0,3,44"      NA         "25,3,0"      NA         "0,3,28"
#>                      P10127    
#> Supercontig_1.50_2   "0,24,360"
#> Supercontig_1.50_246 "0,3,25"  
#> Supercontig_1.50_549 "99,12,0" 
#> Supercontig_1.50_668 "0,6,54"

pl1 <- masplit(pl, record = 1, sort = 0)
pl1[1:3, 1:4]
#>                      BL2009P4_us23 DDR7602 IN2009T1_us22 LBUS5
#> Supercontig_1.50_2               0       0             0     0
#> Supercontig_1.50_246           111      NA            16    NA
#> Supercontig_1.50_549            NA      NA             0    NA

Created on 2021-04-13 by the reprex package (v1.0.0)

This should give you the first PL element as a numeric.

But is the question you're trying to address is if chromX is heterozygous? Or are you trying to assess the quality of the genotypes? Because these are scaled, "0" is always the lowest. Just curious.

@AnjaliC4
Copy link
Author

Hi Brian,

Thanks for the example! I am trying to address if chrX is heterozygous. I believe in that case, I should extract the second element using mapsplit? Since, PL values are usually in 0/0:0/1:1/1 format. Please let me know what you think.
Thank you.

@knausb
Copy link
Owner

knausb commented Apr 14, 2021

Hi AnjaliC4,

PL is a comma delimited list of integers, one for each potential genotype. You could use this information to explore the confidence of the genotype call. Only one genotype is called per sample and per variant. So I feel your last statement was incorrect. If you just want the genotypes you could use the following.

gt <- extract.gt(vcf, element = "GT")

@AnjaliC4
Copy link
Author

Hi Brian,
Thanks for the clarification. You are right. I am wondering in that case, how could I assess heterozygosity for a sample? Since GT may not tell anything about the quality of the genotype call. I also work with very sparse data. Is it possible to extract SNPs above a threshold for PL, to get genotypes called with high confidence? I could use only those above the threshold for estimating heterozygosity?

@knausb
Copy link
Owner

knausb commented Apr 14, 2021

Hi AnjaliC4,

It appears to me that you have two questions here. First, is X heterozygous. Second, What is the quality of the genotypes.

If you ask your variant caller to call diploid variants on X you would expect some level of heterozygosity in a female because there are two copies of X. In males X is haploid so we would expect it to be homozygous. However, in genomic data we deal with lots of data so we would expect false positives. One way to manage this is to count the number of heterozygous positions on each chromosome and divide by it's length, giving us the rate of heterozygosity. I would expect that the autosomes would have a similar rate of heterozygosity. A female would have a rate of heterozygosity that is similar to the autosomes while a male X should have a decreased level of heterozygosity.

The "quality" of a genotype is much more difficult to estimate. I like to use depth and you can see how I use this at the below link.

https://knausb.github.io/vcfR_documentation/quick_intro.html

You seem to be interested in PL. One way of doing this would be to compare the best likelihood to the second best. A large difference would indicate confidence while a small difference would suggest ambiguity. With "phred scaled likelihoods" the "best' likelihood is always zero. So you can just use the value of the second lowest value. It has been my experience that repetitive regions that are poorly assembled have high coverage, which can lead to good likelihoods. My interpretation is that maximum likelihood does not understand copy number variation. So you'll have to make a judgement call here.

Good luck!
Brian

@AnjaliC4
Copy link
Author

Thank you very much, Brian for your opinions and advice.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants