Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Handling indels #45

Open
PMuchina opened this issue Nov 4, 2024 · 4 comments
Open

Handling indels #45

PMuchina opened this issue Nov 4, 2024 · 4 comments

Comments

@PMuchina
Copy link

PMuchina commented Nov 4, 2024

How does QUILT handle indels? It seems indels are skipped when building the binary reference.

[2024-10-31 22:41:27] Program start
[2024-10-31 22:41:27] Begin converting reference haplotypes
[2024-10-31 22:41:27] Using reference information from:/usr/home/ironbank/researchdata/qgg/zexi/BSF/Peter/Chr1/QUILTv1.0.5/Chr1_reference_panel.bcf
[2024-10-31 22:41:27] Using strategy of first imputing using common SNPs and then using all SNPs, with allele frequency threshold:0.001
[2024-10-31 22:41:27] Begin get sites and haplotypes from reference vcf
[2024-10-31 23:07:55] End get sites and haplotypes from reference vcf
[2024-10-31 23:07:55] There were 59759 skipped variants when processing the reference VCF (not bi-allelic, not a SNP, not unique position
[2024-10-31 23:07:56] There are 0 common and 0 rare (0 total) variants in the left buffer region -499999 <= position < 1
[2024-10-31 23:07:56] There are 428384 common and 301 rare (428685 total) variants in the central region 1 <= position <= 4999987
[2024-10-31 23:07:56] There are 33651 common and 2 rare (33653 total) variants in the right buffer region 4999987 < position <= 5499987
[2024-10-31 23:07:56] There are 0 regions out of 14438 below minimum recombination rate, setting them to minimum rate
[2024-10-31 23:07:56] There are 0 regions out of 14438 above maximum recombination rate, setting them to maximum rate
[2024-10-31 23:08:06] Using nMaxDH = 114
[2024-10-31 23:08:11] Build mspbwt indices
[2024-10-31 23:08:13] Done building mspbwt indices
[2024-10-31 23:08:13] Save converted reference haplotypes
[2024-10-31 23:08:14] Done converting reference haplotypes

@PMuchina
Copy link
Author

PMuchina commented Nov 4, 2024

The above is for QUILT2_prepare_reference.R.
So I tried QUILT_prepare_reference.R, and I got the same. It seems STITCH only handles biallelic SNPs

[2024-11-04 23:57:59] Program start
[2024-11-04 23:57:59] Begin converting reference haplotypes
[2024-11-04 23:57:59] Using reference information from:Chr1.hap.gz and new_legend.gz
[2024-11-04 23:58:35] Begin get haplotypes from reference
[2024-11-04 23:58:35] Load and validate reference legend header
[2024-11-04 23:58:35] Load and validate reference legend
Error in validate_pos(decoy_legend, chr = chr, stop_file_name = paste0("The reference legend file ", :
The reference legend file new_legend.gz ref column entry TTCCTCCATTTTCGTCATCATCAAATCACAGGCTATACACGCT in row 19 contains is not one or A, C, G or T. STITCH is only supported for bi-allelic SNPs
Calls: QUILT_prepare_reference ... get_haplotypes_from_reference -> validate_reference_legend -> validate_pos
Execution halted

@Zilong-Li
Copy link
Collaborator

Thanks @PMuchina,

Unfortunately, QUILT does not support imputing Indels at the moment. To my knowledge, Indels or SVs imputation is always challenging.

@sounkou-bioinfo
Copy link

Hi all,
Is the challenge due to the high error rate in the reference panels for indels (this is less of the case for new variant calling pipelines) , variants representation or the algorithms requires 1bp regions ?
As i understand, multi-allelic sites are also not imputed, maybe for similar reasons ?

@rwdavies
Copy link
Owner

I'd say a key reason is just historical / lack of time. I wrote much of STITCH (which includes the code to read in and process bams at variant sites) back in ~2012-2014 or so, when indels were less of a priority, and much less accurate. Adding bi-allelic indels to the current code base isn't conceptually hard, but would require time (say 1-2 weeks full time). I'm in industry now and don't have that kind of time. Zilong could do it but it would need to be weighed against his other responsibilities and projects. Multi-allelic variants (indels or SNPs) are slightly harder though still doable (especially in QUILT) though for STITCH they'd be a bit trickier (I think)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

4 participants