-
Notifications
You must be signed in to change notification settings - Fork 13
/
filtering_data.Rmd
163 lines (94 loc) · 5.67 KB
/
filtering_data.Rmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
---
title: "Filtering data"
output:
html_document
---
The output of variant calling pipelines result in files containing called variants.
Many of these pipelines recommend further filtering of this data to ensure that only high quality variants are included in downstream analyses.
In this vignette I discuss strategies to focus on the high quality fraction of these variants.
## Data input
As in other vignettes, we begin by loading the example data.
Here we'll use this data to create a chromR object directly after inputting the data.
```{r}
library(vcfR)
vcf_file <- system.file("extdata", "pinf_sc50.vcf.gz", package = "pinfsc50")
dna_file <- system.file("extdata", "pinf_sc50.fasta", package = "pinfsc50")
gff_file <- system.file("extdata", "pinf_sc50.gff", package = "pinfsc50")
vcf <- read.vcfR(vcf_file, verbose = FALSE)
dna <- ape::read.dna(dna_file, format = "fasta")
gff <- read.table(gff_file, sep="\t", quote="")
chrom <- create.chromR(name="Supercontig", vcf=vcf, seq=dna, ann=gff, verbose=FALSE)
chrom <- proc.chromR(chrom, verbose = TRUE)
```
## Using a mask
Instead of removing variants, vcfR implements a mask.
This allows changes to be easily undone and it preserves the geometry of the data matrix.
Preserving the geometry of the matrix also allows for multiple manipulations to be made easily.
When we've settled on a set of manipulations which we like we can then use this mask to subset the data.
The function masker() censors variants (i.e., sets the mask) based on several thresholds.
A number of summaries regarding variant quality are reported in VCF files.
These summaries may differ in content depending on what software was used to create them, as well as what options may have been used.
For example, a quality metric may range from 0 to 20 or 0 to 100 depending on choices made by the developer (i.e., are they raw probabilities or phred converted, etc.).
Because of this, the default values for masker() are not likely to be very useful.
The user will need to explore their data to identify useful thresholds.
We can use the head() function to see what sort of information is contained in our data.
```{r}
head(chrom)
```
The masker() function uses QUAL, sequence depth and mapping quality to try to filter out low quality variants.
The parameter QUAL is a part of the VCF file definition.
Because of this, it is not documented in the meta region.
This parameter should always be present.
However, it may be set to missing ('.' in the file, NA as read in by vcfR) or populated with a constant, both of which render this metric as unuseful.
The parameters DP and MQ are in the INFO column an are not part of the VCF definition.
The parameter DP is not a required field, so it is defined in the meta region.
We can remind ourselves how these parameters are defined by querying the meta region.
```{r, tidy=TRUE}
strwrap(grep("ID=MQ,", chrom@vcf@meta, value=T))
strwrap(grep("ID=DP,", chrom@vcf@meta, value=T))
```
We see that 'DP' has different definitions in the INFO column than in the genotype region.
The discrepancy appears to be due to some variants which have a high number of raw reads but only a fraction of these have been determined to be of high quality.
Because masker() uses the DP column in the var.info slot to judge depth, we'll change it to include only the high quality depth.
We can now use the plot function to visualize this data.
```{r, fig.height=7, fig.width=7}
plot(chrom)
```
Our summary of DP appears continuous.
However, we do have variants which appear to have unusually low coverage which we may try to filter out.
We also see that DP is very long tailed in that it has a number of variants with exceptionally high coverage.
These typically come from repetetive regions that have reads which map to multiple locations in the genome.
Mapping quality is largely of one value (60).
But there are some lower values which we may want to remove.
Quality is notable in that it is fairly clinal and may not be useful for filtering.
We can also use the chromoqc function to see how the data are distributed along the chromosome.
```{r, fig.height=7, fig.width=7}
chromoqc(chrom, dp.alpha = 22)
```
Now that we have our first peek at the data, we can try to parameterize the masker function to isolate variants we feel to be of high quality.
```{r, fig.height=7, fig.width=7}
chrom <- masker(chrom, min_QUAL=0, min_DP=350, max_DP=650, min_MQ=59.5, max_MQ=60.5)
chrom <- proc.chromR(chrom, verbose = FALSE)
plot(chrom)
```
Note that we need to run proc.chrom() again.
This will update the variant count per window.
The distribution of our read depths has now become narrower and excludes low coverage variants, it is also fairly symmetric.
The distribution of mapping quality has also become more narrow.
Note that while in this example we were able to rapidly pick relatively good thresholds.
This was the result of some trial and error.
The user should expect this process to take a few attempts.
We can use chromoqc() to see how applying this mask affects our data.
```{r, fig.height=7, fig.width=7}
chromoqc(chrom, dp.alpha = 22)
```
The chromo plot shows us that the variants of low quality populated the 3' end of the supercontig.
Filtering has removed all of the variants from this region.
Use of stringent filtering may be used to build a conservative set of variants.
This conservative set may be used for analysis or it may be used as a training set for subsequent rounds of variant discovery.
## Output
Once satisfactory filtering has been completed you'll want to save this high quality set of variants.
This can be done by writing these variants to a new VCF file with the function write.vcf().
```{r, eval=FALSE}
write.vcf(chrom, file="good_variants.vcf.gz", mask=TRUE)
```