Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Remove MAX AF >= 1 filter from bcftools SNV and InDel filters in all workflows #1180

Closed
mathiasbio opened this issue Jun 16, 2023 · 9 comments
Assignees
Labels
Feature New feature
Milestone

Comments

@mathiasbio
Copy link
Collaborator

mathiasbio commented Jun 16, 2023

Need

Background to this feature can be found here: #1166

In short the need is to remove this filter bcftools filter--include FORMAT/AF[0] < 1 --soft-filter balsamic_af_one --mode + which occurs in all workflows for filtering of SNVs and InDels, used here: in sentieon_quality_filter.rule

This should be removed because it's a risk, especially in very pure and somewhat lower-coverage tumor-samples, that a true somatic variant may reach 1 AF and be filtered out.

Suggested approach

The suggested approach is simply to remove it from the rule. However, we want to make sure first that the filter is not terribly important for filtering out false positives so that we can if necessary make a more sophisticated solution.

Considered alternatives

Were there alternative approaches which have been rejected?

Requests/suggestions/bugs solved by the feature

Can be closed when

Link the issues needed to be closed for this to be implemented

Blockers

Anything preventing this from happening?

@mathiasbio mathiasbio added the Feature New feature label Jun 16, 2023
@github-project-automation github-project-automation bot moved this to Todo in BALSAMIC Jun 16, 2023
@mathiasbio mathiasbio added this to the TBD milestone Jun 16, 2023
@mathiasbio mathiasbio self-assigned this Sep 8, 2023
@mathiasbio
Copy link
Collaborator Author

mathiasbio commented Sep 8, 2023

On raw VCFs generated from vardict and TNscope with the most recent version of balsamic 12.0.2 I have run bcftools the same way we do in the TGA and WGS workflow.

For TGA:
singularity exec varcall_py3.sifbcftools filter --include 'INFO/AF < 1' --soft-filter 'balsamic_af_one' --mode '+' -o ${1}_balsamic_af_one.vcf $1

For WGS:
singularity exec varcall_py3.sif bcftools filter --include 'FORMAT/AF[0] < 1' --soft-filter 'balsamic_af_one' --mode '+' -o ${1}_balsamic_af_one.vcf $1

analysis_type filename total variants balsamic_af_one percent balsamic_af_one
WES-T SNV.somatic.case.vardict.vcf.gz 1416117 9435 0,67%
WES-T SNV.somatic.case.vardict.vcf.gz 125365 22618 18,04%
WES-T SNV.somatic.case.vardict.vcf.gz 193787 19590 10,11%
TGA-T SNV.somatic.case.vardict.vcf.gz 39088 103 0,26%
TGA-T SNV.somatic.case.vardict.vcf.gz 50362 94 0,19%
TGA-T SNV.somatic.case.vardict.vcf.gz 38119 109 0,29%
TGA-T SNV.somatic.case.vardict.vcf.gz 31539 146 0,46%
TGA-TN SNV.somatic.case.vardict.vcf.gz 185420 112 0,06%
WGS-T SNV.somatic.case.tnscope.vcf.gz 5751184 1735967 30,18%
WGS-T SNV.somatic.case.tnscope.vcf.gz 5701450 1673594 29,35%
WGS-TN SNV.somatic.case.tnscope.vcf.gz 186947 161 0,09%
WGS-TN SNV.somatic.case.tnscope.vcf.gz 209264 189 0,09%
WGS-TN SNV.somatic.case.tnscope.vcf.gz 170847 175 0,10%
WGS-TN SNV.somatic.case.tnscope.vcf.gz 180961 165 0,09%

Conclusion from this testing is that it seems that the main purpose of this filter has been to remove germline-variants from tumor-only analyses with low to moderately low coverage (WGS and WES). For tumor only TGA the coverage is probably so high that purely by chance it is unlikely to have exactly 100% of let's say 1000X reads in a position to have the same base.

It would be nice to see what happens with these variants with AF = 1 downtream in the filtering in the tumor only analyses to see if they are filtered out by the population and LoqusDB databases. I see two scenarios here:

  1. The majority of these variants in tumor only analyses are filtered out in downstream filters, at which point I would recommend removing this filter for all types of analyses! Tumor only and tumor + normal.
  2. There's a lot of AF=1 variants left despite the downstream filters in the tumor only analyses, at which point we might need to consider implementing some additional filters before removing this in tumor only, but I still really don't like keeping this filter...but maybe it's necessary...but at least for the TN analyses this can be removed!

In the original issue: #1166 I wrote that if the original idea with the filter was to remove variants with an illegal AF above 1 then the filter could be changed to --exclude 'FORMAT/AF[0] > 1.0' --soft-filter balsamic_af_one --mode +

I also tested changing the original filter to this, and confirmed that no variants with an AF of 1 was filtered out, I modified manually a VCF to have an AF above 1, at which point it was filtered out by this filter. So if we want to keep the filter for the purpose of filtering out variants with an illegal AF changing the filter in this way would work, however, I did not see any real variant in either Vardict or TNscope with an AF larger than 1. So I think the question should be focused on either saving or keep removing the variants with AF = 1.

@mathiasbio
Copy link
Collaborator Author

I'd also like to note that in the last NIQAS3 the clinically relevant variant was a SNV with roughly 0.996 AF, which means it was very close to being filtered out with this filter.

@mathiasbio mathiasbio moved this from Todo to Planned in BALSAMIC Sep 12, 2023
@vwirta
Copy link

vwirta commented Sep 14, 2023

I'd also like to note that in the last NIQAS3 the clinically relevant variant was a SNV with roughly 0.996 AF, which means it was very close to being filtered out with this filter.

Thanks @mathiasbio for great summary.
I agree in that this filter is something we really need to address. We should have a solid reasoning for why we include a filter.

Do you know anything about the NIQAS3 sample? For example, what was the tumour cell fraction in the sample? The reason I am asking is that I have really hard time to understand in what biological context a somatic variant could have VAF=1. It could of course be a measuring artefact (random sampling bias), but this should be very rare. A high VAF could also be explained by the presence of a CNA in the T sample for the region of the variant, but in this case the VAF should approach 1 (but never reach it).

@mathiasbio
Copy link
Collaborator Author

@vwirta I don't remember exactly! But I remember something about overlapping CNVs in the region, I could ask Fulya. I also think it's very rare that a somatic variant would have an AF of 1, but Kalle brought my attention to this problem to begin with based on some ILC (https://github.com/Clinical-Genomics/External-comparison/issues/22 I think) where we missed some 100% VAF. Probably happens very rarely, so we don't need to get too anxious, but it would be nice to be sure : )

@vwirta
Copy link

vwirta commented Sep 17, 2023

I've asked Fulya regarding the NIQAS3 case.
Good point with the ILC. This was the one organised by Gustav Roussey, and I think it would be good to follow up with them as well regarding the missed variants. @AnnaLeinfelt Do you have the final report from the ILC? I can't find it in my email box.

@AnnaLeinfelt
Copy link

I've asked Fulya regarding the NIQAS3 case. Good point with the ILC. This was the one organised by Gustav Roussey, and I think it would be good to follow up with them as well regarding the missed variants. @AnnaLeinfelt Do you have the final report from the ILC? I can't find it in my email box.

No, I don't. I was in contact earlier and then the results was about to be submitted. I'll contact them again. This slipped my mind.

@mathiasbio
Copy link
Collaborator Author

mathiasbio commented Jan 3, 2024

Also wrote this in the PR: #1338

sequencing type T/TN case # PASS variants release 13 # PASS variants this PR # additional PASS variants % extra PASS variants
WGS T civilsole 33051 33051 0 0
WGS T firstviper 54072 54072 0 0
WGS TN fleetjay 7189 7190 1 0,01%
TGA T setamoeba 2415 2416 1 0,04%
TGA TN unitedbeagle 1573 1573 0 0
TGA UMI T (TNscope) equalbug 158 158 0 0
TGA UMI T (VarDict) equalbug 70 70 0 0
TGA UMI TN (TNscope) uphippo 124 124 0 0
TGA UMI TN (VarDict) uphippo 105 105 0 0

Note that likely many variants would have been added in the T-only WGS cases if at the same time the T-only WGS specific filter bcftools filter --threads {threads} --include 'FORMAT/ALT_F1R2 > {params.strand_reads[0]} && (FORMAT/ALT_F1R2 > 0 && FORMAT/ALT_F2R1 > {params.strand_reads[0]} && FORMAT/REF_F1R2 > {params.strand_reads[0]} && FORMAT/REF_F2R1 > {params.strand_reads[0]})' --soft-filter '{params.strand_reads[1]}' --mode '+' was also removed. But as this filter also requires that some reads support a reference variant, then removing the MAX_AF 1 filter in this analysis type has no real effect.

I think we can conclude with these stats that it is safe to remove the MAX AF filter, and we can postpone extending this fix to the WGS Tumor only analysis for later.

@vwirta
Copy link

vwirta commented Jan 3, 2024

Also wrote this in the PR: #1338

sequencing type T/TN case # PASS variants release 13 # PASS variants this PR # additional PASS variants % extra PASS variants
WGS T civilsole 33051 33051 0 0
WGS T firstviper 54072 54072 0 0
WGS TN fleetjay 7189 7190 1 0,01%
TGA T setamoeba 2415 2416 1 0,04%
TGA TN unitedbeagle 1573 1573 0 0
TGA UMI T (TNscope) equalbug 158 158 0 0
TGA UMI T (VarDict) equalbug 70 70 0 0
TGA UMI TN (TNscope) uphippo 124 124 0 0
TGA UMI TN (VarDict) uphippo 105 105 0 0
Note that likely many variants would have been added in the T-only WGS cases if at the same time the T-only WGS specific filter bcftools filter --threads {threads} --include 'FORMAT/ALT_F1R2 > {params.strand_reads[0]} && (FORMAT/ALT_F1R2 > 0 && FORMAT/ALT_F2R1 > {params.strand_reads[0]} && FORMAT/REF_F1R2 > {params.strand_reads[0]} && FORMAT/REF_F2R1 > {params.strand_reads[0]})' --soft-filter '{params.strand_reads[1]}' --mode '+' was also removed. But as this filter also requires that some reads support a reference variant, then removing the MAX_AF 1 filter in this analysis type has no real effect.

I think we can conclude with these stats that it is safe to remove the MAX AF filter, and we can postpone extending this fix to the WGS Tumor only analysis for later.

Thanks @mathiasbio for a great summary again!
I agree with your conclusion.

@mathiasbio
Copy link
Collaborator Author

merged into #1320 🥳
closing issue!

@github-project-automation github-project-automation bot moved this from In Progress to Completed in BALSAMIC Jan 4, 2024
@mathiasbio mathiasbio modified the milestones: Release 14, Release 13 Jan 4, 2024
@mathiasbio mathiasbio changed the title Remove MAX AF >= 1 filter from bcftools SNV and InDel filters in all workflows Remove MAX AF >= 1 filter from bcftools SNV and InDel filters in TN workflows Jan 9, 2024
@mathiasbio mathiasbio changed the title Remove MAX AF >= 1 filter from bcftools SNV and InDel filters in TN workflows Remove MAX AF >= 1 filter from bcftools SNV and InDel filters in all workflows Jan 9, 2024
@ivadym ivadym moved this from Completed to Archived in BALSAMIC Jan 12, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Feature New feature
Projects
Status: Archived
Development

No branches or pull requests

4 participants