-
Notifications
You must be signed in to change notification settings - Fork 13
/
quality_filtering.Rmd
247 lines (160 loc) · 6.12 KB
/
quality_filtering.Rmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
---
title: "Quality filtering"
output:
html_document:
toc: true
toc_float: true
---
```{r setup, include=FALSE}
knitr::opts_chunk$set(echo = TRUE)
knitr::opts_chunk$set(fig.align = 'center')
knitr::opts_chunk$set(fig.width = 12)
```
Once a sequencing run has been completed the sequence reads are typically aligned to a reference genome and variants are called.
These variants are typically stored in a VCF format file.
THe R package vcfR was designed to import and export this data into the R environment.
Once in the R environment, vcfR provides tools to manipulate the data to help you understand its content.
Common issues in variant data may be that certain samples may be of low quality or certain vriants may be of questionable quality.
Here we'll provide examples of how vcfR can help accomplish these tasks.
## Data import
Our first step will be to import the VCF data into R.
The result will be that our VCF data will be stored in a 'vcfR' object.
Once we have read it in we can validate our data is what we expect by using the `show` method.
The `show` method is available for many types of objects and is implemented when the object's name is entered at the console with no other information.
THe `show` method typically provides summary information about the object and what it contains.
```{r, results='hide'}
library(vcfR)
vcf <- read.vcfR('TASSEL_GBS0077.vcf.gz')
class(vcf)
```
```{r}
vcf
```
The results from the `show` method informs us that our vcfR object contains 61 samples and 69,296 variants.
This is what we expected so we can be confident that our data was imported correctly.
The output also informs us that there is just over 37% missing data in our file.
In our experience this is typical for GBS data sets.
## Extract depth (DP)
The vcfR function `extract.gt()` is used to extract matrices of data from teh GT portion of VCF data.
the funtion `extract.gt()` provides a link between VCF data and R.
Much of R is designed to operate on matrices of data and once `extract.gt()` provides this matrix the universe of R becomes available.
Note that we use the 'as.numeric=TRUE' option here.
We should only use this option when we are certain that we have numeric data.
If you use it on non-numeric data R will do its best to do something, which is not likely to be what you expect.
We can use the `queryMETA()` function remind us what this element is.
```{r}
queryMETA(vcf, element = 'FORMAT.+DP')
vcf@gt[1:4,1:4]
dp <- extract.gt(vcf, element = "DP", as.numeric=TRUE)
dp[1:4,1:3]
```
## Missing data
```{r}
sum(is.na(dp[,1]))
# apply(dp, MARGIN = 2, function(x){sum(is.na(x))})
```
```{r}
myMiss <- apply(dp, MARGIN = 2, function(x){sum(is.na(x))})
myMiss <- myMiss/nrow(vcf)
library(RColorBrewer)
palette(brewer.pal(n=12, name = 'Set3'))
par(mar = c(12,4,4,2))
barplot(myMiss, las = 2, col = 1:12)
title(ylab = "Missingness")
par(mar = c(5,4,4,2))
```
## Sequence depth
```{r}
boxplot(dp, col=2:8, las=3)
```
```{r}
library(reshape2)
library(ggplot2)
library(cowplot)
# Melt our matrix into a long form data.frame.
dpf <- melt(dp, varnames=c('Index', 'Sample'), value.name = 'Depth', na.rm=TRUE)
dpf <- dpf[ dpf$Depth > 0,]
# Create a row designator.
# You may want to adjust this
#samps_per_row <- 20
samps_per_row <- 16
myRows <- ceiling(length(levels(dpf$Sample))/samps_per_row)
myList <- vector(mode = "list", length = myRows)
for(i in 1:myRows){
myIndex <- c(i*samps_per_row - samps_per_row + 1):c(i*samps_per_row)
myIndex <- myIndex[myIndex <= length(levels(dpf$Sample))]
myLevels <- levels(dpf$Sample)[myIndex]
myRegex <- paste(myLevels, collapse = "$|^")
myRegex <- paste("^", myRegex, "$", sep = "")
myList[[i]] <- dpf[grep(myRegex, dpf$Sample),]
myList[[i]]$Sample <- factor(myList[[i]]$Sample)
}
# Create the plot.
myPlots <- vector(mode = "list", length = myRows)
for(i in 1:myRows){
myPlots[[i]] <- ggplot(myList[[i]], aes(x=Sample, y=Depth)) +
geom_violin(fill="#90EE90", adjust=1.0, scale = "count", trim=TRUE)
myPlots[[i]] <- myPlots[[i]] + theme_bw()
myPlots[[i]] <- myPlots[[i]] + theme(axis.title.x = element_blank(),
axis.text.x = element_text(angle = 60, hjust = 1))
myPlots[[i]] <- myPlots[[i]] + scale_y_continuous(trans=scales::log2_trans(),
breaks=c(1, 10, 100, 800),
minor_breaks=c(1:10, 2:10*10, 2:8*100))
myPlots[[i]] <- myPlots[[i]] + theme( panel.grid.major.y=element_line(color = "#A9A9A9", size=0.6) )
myPlots[[i]] <- myPlots[[i]] + theme( panel.grid.minor.y=element_line(color = "#C0C0C0", size=0.2) )
}
```
```{r, fig.height=20}
# Plot the plot.
plot_grid(plotlist = myPlots, nrow = myRows)
```
Once we have extracted the quantile information we can use apply and quantile to build intervals for what we may consider acceptable coverage.
```{r}
quants <- apply(dp, MARGIN=2, quantile, probs=c(0.1, 0.8), na.rm=TRUE)
#quants <- apply(dp, MARGIN=2, quantile, probs=c(0.34, 0.68), na.rm=TRUE)
quants[,1:17]
```
We can now use these thresholds to censor data outside this threshold.
```{r}
dp2 <- sweep(dp, MARGIN=2, FUN = "-", quants[1,])
dp[dp2 < 0] <- NA
dp2 <- sweep(dp, MARGIN=2, FUN = "-", quants[2,])
dp[dp2 > 0] <- NA
dp[dp < 4] <- NA
```
Update the vcfR object with the censored data.
```{r}
vcf@gt[,-1][ is.na(dp) == TRUE ] <- NA
```
We'll want to see how this has affected the missingness of our vcfR object.
```{r}
vcf
```
We'll want to mitigate variants with a high degree of missingness.
```{r}
dp <- extract.gt(vcf, element = "DP", as.numeric=TRUE)
miss <- apply(dp, MARGIN=1, function(x){sum(is.na(x))})
miss <- miss/ncol(dp)
```
Plot a histogram.
```{r, fig.align='center'}
hist(miss, col=8, breaks=seq(0,1,by=0.02))
```
Omit variants with a high degree of missingness.
```{r}
#vcf <- vcf[miss < 0.05,]
vcf <- vcf[miss < 0.1,]
vcf
```
```{r, fig.align='center', fig.width=12, fig.height=12}
dp <- extract.gt(vcf, element = "DP", as.numeric=TRUE)
heatmap.bp(dp, rlabels = FALSE)
#heatmap.bp(dp[1:1000,], rlabels = FALSE)
```
```{r, fig.align='center', fig.width=12}
boxplot(dp, col=2:8, las=3)
```
## Output to file
```{r, eval=FALSE}
write.vcf(tas, 'TASSEL_GBS0077_dp_filtered.vcf.gz')
```