-
Notifications
You must be signed in to change notification settings - Fork 55
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
vcfR2genlight failed to convert polyploid of mixed ploidy vcf (obtained by GATK) #158
Comments
@knausb I can not find your email address. Hence, I attached my test data(unpublished) here. Once you download it I will delete it. |
Hi @lihe-s , apologies for the slow response! R 4.0.0 was released, dplyr 1.0.0 is getting ready for release. And there's this COVID-19 thing. And I just accepted a new job. So things have been a bit busy. Here's what I see. library(vcfR)
#>
#> ***** *** vcfR *** *****
#> This is vcfR 1.11.0
#> browseVignettes('vcfR') # Documentation
#> citation('vcfR') # Citation
#> ***** ***** ***** *****
vcf <- read.vcfR("~/Desktop/sandbox/lihe/di-tetra-hex-test.vcf.gz")
#> Scanning file to determine attributes.
#> File attributes:
#> meta lines: 67
#> header_line: 68
#> variant count: 26851
#> column count: 15
#> Meta line 67 read in.
#> All meta lines processed.
#> gt matrix initialized.
#> Character matrix gt created.
#> Character matrix gt rows: 26851
#> Character matrix gt cols: 15
#> skip: 0
#> nrows: 26851
#> row_num: 0
#> Processed variant 1000Processed variant 2000Processed variant 3000Processed variant 4000Processed variant 5000Processed variant 6000Processed variant 7000Processed variant 8000Processed variant 9000Processed variant 10000Processed variant 11000Processed variant 12000Processed variant 13000Processed variant 14000Processed variant 15000Processed variant 16000Processed variant 17000Processed variant 18000Processed variant 19000Processed variant 20000Processed variant 21000Processed variant 22000Processed variant 23000Processed variant 24000Processed variant 25000Processed variant 26000Processed variant: 26851
#> All variants processed
vcf
#> ***** Object of Class vcfR *****
#> 6 samples
#> 1 CHROMs
#> 26,851 variants
#> Object size: 16.9 Mb
#> 0 percent missing data
#> ***** ***** *****
head(vcf)
#> [1] "***** Object of class 'vcfR' *****"
#> [1] "***** Meta section *****"
#> [1] "##fileformat=VCFv4.2"
#> [1] "##ALT=<ID=NON_REF,Description=\"Represents any possible alternative a [Truncated]"
#> [1] "##FILTER=<ID=LowQual,Description=\"Low quality\">"
#> [1] "##FILTER=<ID=PASS,Description=\"All filters passed\">"
#> [1] "##FILTER=<ID=my_filters,Description=\"QD < 2.0 || FS > 60.0 || MQ < 4 [Truncated]"
#> [1] "##FORMAT=<ID=AD,Number=R,Type=Integer,Description=\"Allelic depths fo [Truncated]"
#> [1] "First 6 rows."
#> [1]
#> [1] "***** Fixed section *****"
#> CHROM POS ID REF ALT QUAL FILTER
#> [1,] "chr01" "1339" NA "G" "A" "479.04" "PASS"
#> [2,] "chr01" "1405" NA "G" "A" "25686.77" "PASS"
#> [3,] "chr01" "1450" NA "C" "T" "738.45" "PASS"
#> [4,] "chr01" "3565" NA "G" "A" "9565.07" "PASS"
#> [5,] "chr01" "3610" NA "G" "T" "16297.58" "PASS"
#> [6,] "chr01" "3658" NA "A" "G" "2039.43" "PASS"
#> [1]
#> [1] "***** Genotype section *****"
#> FORMAT
#> [1,] "GT:AD:DP:GQ:PL"
#> [2,] "GT:AD:DP:GQ:PGT:PID:PL"
#> [3,] "GT:AD:DP:GQ:PL"
#> [4,] "GT:AD:DP:GQ:PGT:PID:PL"
#> [5,] "GT:AD:DP:GQ:PGT:PID:PL"
#> [6,] "GT:AD:DP:GQ:PGT:PID:PL"
#> BSJ_GL19_42.clean
#> [1,] "0/0/0/0/0/0:13,0:13:10:0,10,23,39,62,101,543"
#> [2,] "./././././.:24,0:24:.:.:.:0,0,0,0,0,0,0"
#> [3,] "0/0/0/0/0/0:22,0:22:17:0,17,37,63,100,163,945"
#> [4,] "0/0/0/0/0/0:9,0:9:0:.:.:0,0,0,0,0,0,224"
#> [5,] "0/0/0/0/0/1:6,1:7:3:.:.:32,0,3,9,18,35,257"
#> [6,] "0/0/0/0/0/0:13,0:13:0:.:.:0,0,0,0,7,38,398"
#> BSJ_GL19_46.clean
#> [1,] "0/0:3,0:3:9:0,9,113"
#> [2,] "1/1:0,6:6:18:.:.:270,18,0"
#> [3,] "0/0:14,0:14:42:0,42,532"
#> [4,] "1/1:0,5:5:15:1|1:3565_G_A:225,15,0"
#> [5,] "1/1:0,4:4:12:.:.:180,12,0"
#> [6,] "0/0:2,0:2:6:0|1:3658_A_G:0,6,90"
#> BSJ_GL19_47.clean
#> [1,] "0/0:5,0:5:15:0,15,197"
#> [2,] "1/1:0,8:8:24:.:.:340,24,0"
#> [3,] "0/0:16,0:16:48:0,48,588"
#> [4,] "1/1:0,5:5:15:1|1:3565_G_A:225,15,0"
#> [5,] "1/1:0,7:7:21:1|1:3600_T_G:312,21,0"
#> [6,] "0/0:9,0:9:0:.:.:0,0,270"
#> BSJ_GL19_48.clean
#> [1,] "0/0/1/1:8,13:21:4:508,23,0,4,339"
#> [2,] "0/0/0/1:22,8:30:16:.:.:283,0,16,72,958"
#> [3,] "0/0/0/0:40,0:40:44:0,44,105,211,1575"
#> [4,] "0/0/0/0:40,0:40:13:.:.:0,13,78,193,1474"
#> [5,] "0/0/0/0:36,0:36:0:.:.:0,0,0,0,927"
#> [6,] "0/0/0/0:37,0:37:0:.:.:0,0,27,128,1246"
#> BSJ_GL19_49.cleancb
#> [1,] "0/0:4,0:4:12:0,12,167"
#> [2,] "./.:8,0:8:.:.:.:0,0,0"
#> [3,] "0/0:6,0:6:12:0,12,180"
#> [4,] "0/0:9,0:9:27:.:.:0,27,338"
#> [5,] "1/1:0,8:8:24:1|1:3600_T_G:360,24,0"
#> [6,] "0/0:7,0:7:15:.:.:0,15,225"
#> [1] "First 6 columns only."
#> [1]
#> [1] "Unique GT formats:"
#> [1] "GT:AD:DP:GQ:PL" "GT:AD:DP:GQ:PGT:PID:PL"
#> [3] "GT:AD:DP:PGT:PID:PL" "GT:AD:DP:PL"
#> [1]
# vcfR2genlight uses extract.gt(), so let's see if it works.
gt <- extract.gt(vcf)
gt[1:4, ]
#> BSJ_GL19_42.clean BSJ_GL19_46.clean BSJ_GL19_47.clean
#> chr01_1339 "0/0/0/0/0/0" "0/0" "0/0"
#> chr01_1405 NA "1/1" "1/1"
#> chr01_1450 "0/0/0/0/0/0" "0/0" "0/0"
#> chr01_3565 "0/0/0/0/0/0" "1/1" "1/1"
#> BSJ_GL19_48.clean BSJ_GL19_49.cleancb BSJ_GL19_50.cleancb
#> chr01_1339 "0/0/1/1" "0/0" NA
#> chr01_1405 "0/0/0/1" NA "1/1"
#> chr01_1450 "0/0/0/0" "0/0" "0/0"
#> chr01_3565 "0/0/0/0" "0/0" "0/0"
my_gl <- vcfR2genlight(vcf)
#> Warning in vcfR2genlight(vcf): Found 2028 loci with more than two alleles.
#> Objects of class genlight only support loci with two alleles.
#> 2028 loci will be omitted from the genlight object.
#> Loading required namespace: adegenet
#> Registered S3 method overwritten by 'spdep':
#> method from
#> plot.mst ape
#> Warning in initialize(value, ...): NAs introduced by coercion
#> Warning in initialize(value, ...): NAs introduced by coercion
my_gl
#> /// GENLIGHT OBJECT /////////
#>
#> // 6 genotypes, 24,823 binary SNPs, size: 2.1 Mb
#> 51032 (34.26 %) missing data
#>
#> // Basic content
#> @gen: list of 6 SNPbin
#>
#> // Optional content
#> @ind.names: 6 individual labels
#> @loc.names: 24823 locus labels
#> @chromosome: factor storing chromosomes of the SNPs
#> @position: integer storing positions of the SNPs
#> @other: a list containing: elements without names
my_gl@gen
#> [[1]]
#> /// SNPBIN OBJECT /////////
#> 24,823 SNPs coded as bits, size: 101.4 Kb
#> Ploidy: NA
#> 24823 (100 %) missing data
#>
#> [[2]]
#> /// SNPBIN OBJECT /////////
#> 24,823 SNPs coded as bits, size: 8.8 Kb
#> Ploidy: 2
#> 327 (1.32 %) missing data
#>
#> [[3]]
#> /// SNPBIN OBJECT /////////
#> 24,823 SNPs coded as bits, size: 9 Kb
#> Ploidy: 2
#> 396 (1.6 %) missing data
#>
#> [[4]]
#> /// SNPBIN OBJECT /////////
#> 24,823 SNPs coded as bits, size: 101.4 Kb
#> Ploidy: NA
#> 24823 (100 %) missing data
#>
#> [[5]]
#> /// SNPBIN OBJECT /////////
#> 24,823 SNPs coded as bits, size: 8.8 Kb
#> Ploidy: 2
#> 333 (1.34 %) missing data
#>
#> [[6]]
#> /// SNPBIN OBJECT /////////
#> 24,823 SNPs coded as bits, size: 8.8 Kb
#> Ploidy: 2
#> 330 (1.33 %) missing data
as.matrix(my_gl)[1:4, 1:5]
#> chr01_1339 chr01_1405 chr01_1450 chr01_3565 chr01_3610
#> BSJ_GL19_42.clean NA NA NA NA NA
#> BSJ_GL19_46.clean 0 2 0 2 2
#> BSJ_GL19_47.clean 0 2 0 2 2
#> BSJ_GL19_48.clean NA NA NA NA NA
sessionInfo()
#> R version 3.6.3 (2020-02-29)
#> Platform: x86_64-pc-linux-gnu (64-bit)
#> Running under: Ubuntu 16.04.6 LTS
#>
#> Matrix products: default
#> BLAS: /usr/lib/libblas/libblas.so.3.6.0
#> LAPACK: /usr/lib/lapack/liblapack.so.3.6.0
#>
#> locale:
#> [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C
#> [3] LC_TIME=en_US.UTF-8 LC_COLLATE=en_US.UTF-8
#> [5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8
#> [7] LC_PAPER=en_US.UTF-8 LC_NAME=C
#> [9] LC_ADDRESS=C LC_TELEPHONE=C
#> [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
#>
#> attached base packages:
#> [1] stats graphics grDevices utils datasets methods base
#>
#> other attached packages:
#> [1] vcfR_1.11.0
#>
#> loaded via a namespace (and not attached):
#> [1] viridisLite_0.3.0 splines_3.6.3 gtools_3.8.1
#> [4] shiny_1.4.0 memuse_4.0-0 expm_0.999-4
#> [7] sp_1.3-1 highr_0.8 yaml_2.2.0
#> [10] LearnBayes_2.15.1 pillar_1.4.3 lattice_0.20-40
#> [13] glue_1.3.2 digest_0.6.25 promises_1.1.0
#> [16] colorspace_1.4-1 htmltools_0.4.0 httpuv_1.5.2
#> [19] Matrix_1.2-18 plyr_1.8.4 pkgconfig_2.0.3
#> [22] gmodels_2.18.1 purrr_0.3.3 xtable_1.8-4
#> [25] scales_1.0.0 gdata_2.18.0 later_1.0.0
#> [28] tibble_2.1.3 mgcv_1.8-31 ggplot2_3.2.1
#> [31] lazyeval_0.2.2 magrittr_1.5 crayon_1.3.4
#> [34] mime_0.7 deldir_0.1-23 evaluate_0.14
#> [37] nlme_3.1-144 MASS_7.3-51.5 class_7.3-16
#> [40] vegan_2.5-5 tools_3.6.3 lifecycle_0.2.0
#> [43] stringr_1.4.0 munsell_0.5.0 cluster_2.1.0
#> [46] ade4_1.7-13 compiler_3.6.3 e1071_1.7-2
#> [49] rlang_0.4.5.9000 classInt_0.4-2 units_0.6-5
#> [52] grid_3.6.3 igraph_1.2.4.1 rmarkdown_2.0
#> [55] boot_1.3-23 gtable_0.3.0 DBI_1.0.0
#> [58] reshape2_1.4.3 R6_2.4.1 knitr_1.23
#> [61] dplyr_0.8.99.9002 pinfsc50_1.1.0 fastmap_1.0.1
#> [64] seqinr_3.6-1 adegenet_2.1.1 spdep_1.1-3
#> [67] KernSmooth_2.23-16 permute_0.9-5 ape_5.3
#> [70] stringi_1.4.3 parallel_3.6.3 Rcpp_1.0.1
#> [73] vctrs_0.2.99.9010 sf_0.8-0 spData_0.3.2
#> [76] tidyselect_1.0.0 xfun_0.8 coda_0.19-2 Created on 2020-05-10 by the reprex package (v0.3.0) I'm not saying that this is the behaviour you're looking for. But I am saying that I do not feel that I'm reproducing your reported behaviour. Usually we ask folks to upgrade to the latest versions of all software. The release of R 4.0.0 (2020-04-24) may come with some "growing pains", so I wouldn't recomend that. But I think you should check that you have similar versions to what I'm running with the hope that this will reproduce the behaviour I'm seeing on my end. And I'm going to add that I'm preparing a new release of vcfR to address the release of dplyr 1.0.0. So everything is changing, apologies. Thanks! |
@knausb Congratulations on your new job. Thanks a lot for your detailed issue description. So the new version of vcfR will compatible my situation? |
Hi @lihe-s , I don't think you understand. I have failed to reproduce the behavior that you have reported. This means I can't address your issue. I suspect the issue is that you are using an outdated version of adegenet. In the above output I show that I'm using While my results look closer to success than your reported behavior, I don't think we have a solution. The results from |
@knausb Thanks for your reply. I have to say your results are similar to mine. It is true that "variant chr01_1339 has the genotype "0/0/0/0/0/0" and variant chr01_1405 has the genotype NA". If you check |
Hi @lihe-s , I'm afraid I now see that I was the one who did not understand. I reviewed the code for |
@knausb hope this make the issue clear
R version 3.6.2 (2019-12-12) on HPC clusters
require(vcfR)
vcf <- read.vcfR("di-tetra-hex-test.vcf.gz", verbose = FALSE)
#conversion
x <- vcfR2genlight(vcf)
Warning messages:
1: In vcfR2genlight(vcf) : Found 2028 loci with more than two alleles.
Objects of class genlight only support loci with two alleles.
2028 loci will be omitted from the genlight object.
2: In initialize(value, ...) : NAs introduced by coercion
3: In initialize(value, ...) : NAs introduced by coercion
#ploidy level, the BSJ_1 should be hexaploid(6), BSJ_4 should be tetraploid(4)
ploidy(x)
BSJ_1 BSJ_2 BSJ_3 BSJ_4 BSJ_5 BSJ_6
NA 2 2 NA 2 2
#part of genotype data
vcf@gt[1,]
FORMAT BSJ_1 BSJ_2 BSJ_3 BSJ_4 BSJ_5 BSJ_6
"GT:AD:DP:GQ:PL" "0/0/0/0/0/0:13,0:13:10:0,10,23,39,62,101,543" "0/0:3,0:3:9:0,9,113" "0/0:5,0:5:15:0,15,197" "0/0/1/1:8,13:21:4:508,23,0,4,339" "0/0:4,0:4:12:0,12,167" "./.:0,0:0:.:0,0,0"
#after conversion
t(as.matrix(x))[c(1:50), 1:3]
BSJ_1 BSJ_2 BSJ_3
chr01_1339 NA 0 0
chr01_1405 NA 2 2
chr01_1450 NA 0 0
chr01_3565 NA 2 2
chr01_3610 NA 2 2
chr01_3658 NA 0 0
chr01_3691 NA 0 0
chr01_3709 NA 0 0
chr01_3748 NA 0 0
chr01_3760 NA 0 0
chr01_3913 NA NA NA
chr01_4054 NA NA NA
chr01_4153 NA NA NA
chr01_4189 NA NA NA
chr01_4195 NA NA NA
chr01_4198 NA NA NA
It looks like the vcfR2genlight can not convert polyploid samples. This is not a rare issue #128 #117 (also a recent paper https://doi.org/10.1038/s41559-019-0807-4). Please consider to help me figure it out. Thanks.
Sorry for the irregular format.
My real world test data will send to you.
The text was updated successfully, but these errors were encountered: