option for no smith-waterman in the case of only looking at snps and speed issues with many 0 depth loci #28
Comments
Hi Haynes, thanks for the suggestions. The main reason I went with the full SW alignment is that, in addition to indels, STAR introduces all sorts of weird alignments, as you saw in your attempts to get good variant calls. I was also inspired by the way LongRanger works to assign barcodes to haplotypes. A pure pileup version would definitely be faster and could be a good option. There is already CIGAR parsing to check whether an alignment is an N-op over the region of interest; this could be extended. I suspect that your runtime concerns are BAM-fetch related, but also slightly more complicated: the reason I filter the N-op issue mentioned above is that STAR likes to make crazy huge splices that span hundreds of kb, and BAM fetch will still pull these in. It may be possible to use the pileup API to bypass this?
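The N-op check described here could look roughly like the sketch below. This is an illustrative pure-Python version, not vartrix's actual (Rust) code; the CIGAR is represented as a hypothetical list of `(op, length)` tuples with op letters following the SAM spec.

```python
# Decide whether a read's alignment over a region of interest consists only
# of a refskip (N-op), i.e. the read splices across the region without
# aligning any base to it. Ops that consume reference: M, D, N, =, X.
REF_CONSUMING = {"M", "D", "N", "=", "X"}

def n_op_only_over_region(read_start, cigar, region_start, region_end):
    """True if the read overlaps [region_start, region_end) on the
    reference but every overlapping base falls inside an N (refskip) op."""
    ref_pos = read_start
    overlaps = False
    for op, length in cigar:
        if op not in REF_CONSUMING:
            continue  # insertions/soft clips do not advance the reference
        op_start, op_end = ref_pos, ref_pos + length
        ref_pos = op_end
        # portion of this op that falls inside the region of interest
        lo, hi = max(op_start, region_start), min(op_end, region_end)
        if lo < hi:
            overlaps = True
            if op != "N":
                return False  # an aligned base (or deletion) in the region
    return overlaps

# A STAR-style read: 50M, a 200 kb intron, then 50M.
cigar = [("M", 50), ("N", 200_000), ("M", 50)]
print(n_op_only_over_region(1000, cigar, 50_000, 50_001))  # True: spliced over
print(n_op_only_over_region(1000, cigar, 1010, 1011))      # False: aligned here
```

A read like this would still be returned by a BAM fetch on the 50 kb locus, which is exactly the cost being discussed.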
My issue probably has the same cause: vartrix and cellSNP take too long (demuxlet runs 10x faster, even though allele counting is only part of its pipeline). They also seem to get stuck on the Y chromosome (while the data contains only X), so it is probably connected to this issue.
@wheaton5 could you send me the VCF you're using? I'll do some profiling to understand why it's slow. You can send it to me via redstone: https://support.10xgenomics.com/docs/transfers
@pmarks I have them on my google drive and you should be able to wget them with the following commands
for GRCh38 and
for hg19
@wheaton5 what aligner created the BAM file that you're using? I profiled the VCF you sent me against a 10k PBMC BAM file from Cell Ranger (so STAR alignments), and I get this:
So my guess is that it's taking a bunch of time loading the crazy STAR records with massive refskip blocks that don't have any actual bases aligned to the target region. In your VCF there's a SNP every 300 bp, so we're probably reloading those long records many times. We could test this idea by filtering out those records (I think they're a small fraction of the total, right?). So I think our options are: Any thoughts?
I am also testing against a STAR-aligned BAM, specifically the one here
I think that is a good plan, but how do we do this efficiently? It seems like you would need to make a new BAM file. What I do is filter the VCF to loci with > min_depth coverage (default 8) using samtools depth + awk + bedtools (there might be faster ways, but this is reasonable and low memory). I would love it if 10x moved to hisat2 for cellranger, but I realize that is unlikely; its alignments are much more conducive to variant calling.
Another issue related to this: when vartrix does realignment across a locus that hisat2 or STAR has called as a splice site, it can make big mistakes, because I assume vartrix's Smith-Waterman is not splice-aware. For this I would like either a --ignore_spliced_reads, a --no_realignment, or an --ignore_reads_in_splice_regions option, where the last one would be okay evaluating a read at a location for which that read is not currently CIGAR N, but would ignore that read if its current alignment contains an N at the base of interest.
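One way the proposed --ignore_reads_in_splice_regions behavior could work (the flag is Haynes's suggestion, not an existing vartrix option) is a per-position lookup: find which CIGAR op covers the variant position and skip the read only when that op is a refskip. A minimal sketch, using the same illustrative `(op, length)` CIGAR representation:

```python
# Skip a read at a SNP only if its current alignment is an N (refskip)
# at that exact reference position; reads aligned (M/=/X) there are kept.
REF_CONSUMING = {"M", "D", "N", "=", "X"}

def op_at_position(read_start, cigar, pos):
    """Return the CIGAR op covering reference position pos, or None if
    the read does not span that position."""
    ref_pos = read_start
    for op, length in cigar:
        if op not in REF_CONSUMING:
            continue
        if ref_pos <= pos < ref_pos + length:
            return op
        ref_pos += length
    return None

def ignore_read_at_snp(read_start, cigar, snp_pos):
    return op_at_position(read_start, cigar, snp_pos) == "N"

cigar = [("M", 30), ("N", 5000), ("M", 30)]
print(ignore_read_at_snp(100, cigar, 2000))  # True: read is spliced here
print(ignore_read_at_snp(100, cigar, 110))   # False: base is aligned here
```

The same read can then legitimately contribute evidence at one SNP while being ignored at another, which is the distinction the third flag draws.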
Hey Ian,
This is just a thought. I realize that the Smith-Waterman against both haplotypes helps sort out alignment edge effects with indels, but does it do anything for SNPs? I realize that a nearby indel can still cause problems, but I don't know whether vartrix uses nearby variants or not; and in any case, if I feed it a VCF with only SNPs, it won't know about nearby indels. I wonder how much faster a purely pileup-based matching strategy would be as a non-default option?
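The pileup-only SNP counting being proposed can be sketched in a few lines. This is an illustrative stand-alone version (reads here are hypothetical `(ref_start, cigar, seq)` tuples, not vartrix's data structures): walk the CIGAR to find the query base aligned at the SNP, then tally it against the ref and alt alleles with no realignment at all.

```python
def base_at(ref_start, cigar, seq, pos):
    """Query base aligned at reference pos, or None (deletion/refskip/off-read)."""
    ref_pos, query_pos = ref_start, 0
    for op, length in cigar:
        if op in ("M", "=", "X"):          # consumes reference and query
            if ref_pos <= pos < ref_pos + length:
                return seq[query_pos + (pos - ref_pos)]
            ref_pos += length
            query_pos += length
        elif op in ("D", "N"):             # consumes reference only
            if ref_pos <= pos < ref_pos + length:
                return None
            ref_pos += length
        elif op in ("I", "S"):             # consumes query only
            query_pos += length
    return None

def pileup_counts(reads, pos, ref_allele, alt_allele):
    counts = {"ref": 0, "alt": 0, "other": 0}
    for ref_start, cigar, seq in reads:
        base = base_at(ref_start, cigar, seq, pos)
        if base is None:
            continue  # read does not cover the SNP with an aligned base
        if base == ref_allele:
            counts["ref"] += 1
        elif base == alt_allele:
            counts["alt"] += 1
        else:
            counts["other"] += 1
    return counts

reads = [
    (100, [("M", 10)], "AAAAATAAAA"),                        # T at ref pos 105
    (100, [("M", 10)], "AAAAACAAAA"),                        # C at ref pos 105
    (90,  [("M", 5), ("N", 100), ("M", 5)], "AAAAAAAAAA"),   # spliced over 105
]
print(pileup_counts(reads, 105, "C", "T"))  # {'ref': 1, 'alt': 1, 'other': 0}
```

As the thread notes, this trades away robustness to indel edge effects for speed, which is why it would only make sense as a non-default option for SNP-only VCFs.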
Also, I was running vartrix with 1000 Genomes variants filtered to >= 2% allele frequency, and it was taking quite a while (I think I killed it after 6 hours on 8 cores). So what I do now is use a combination of samtools depth, awk, and bedtools to filter to regions with sufficient coverage (for my purposes, coverage >= 8 or so), which drops me from tens of millions of variants to ~120k, and then it runs in a very reasonable amount of time. This works for me, but what do you think the issue is here? I guess BAM fetches take time even when they retrieve zero or only a few reads. Even if you just run samtools depth plus some filtering under the hood, it might be nice to include this as an option (--min-coverage).
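The samtools depth + awk + bedtools prefilter described above can be sketched in plain Python (the function and threshold here are illustrative, not part of vartrix): read samtools-depth-style lines (chrom, 1-based position, depth), keep positions at or above a minimum coverage, and merge adjacent kept positions into BED intervals (0-based, half-open) that can subset the VCF before running vartrix.

```python
def depth_filter_to_bed(depth_lines, min_depth=8):
    """Collapse per-base depth records into BED intervals where
    depth >= min_depth. Input lines look like 'chr1\\t100\\t12'."""
    intervals = []
    for line in depth_lines:
        chrom, pos, depth = line.split("\t")
        pos, depth = int(pos), int(depth)
        if depth < min_depth:
            continue
        start = pos - 1  # samtools depth is 1-based; BED is 0-based
        if intervals and intervals[-1][0] == chrom and intervals[-1][2] == start:
            intervals[-1][2] = pos  # contiguous: extend the open interval
        else:
            intervals.append([chrom, start, pos])
    return [tuple(iv) for iv in intervals]

depth = [
    "chr1\t100\t12",
    "chr1\t101\t9",
    "chr1\t102\t3",   # below threshold: breaks the interval
    "chr1\t103\t15",
]
print(depth_filter_to_bed(depth))  # [('chr1', 99, 101), ('chr1', 102, 103)]
```

This is the same merge that bedtools performs in the pipeline above; building it in would let a --min-coverage option skip the zero-depth loci that make each BAM fetch pure overhead.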
Anyway, just some potential improvement suggestions.
Best,
Haynes