---
title: "Ranking data"
output:
  html_document
---
In the vignette 'Filtering data' we used thresholds in an attempt to isolate the high quality fraction of variants from a VCF file.
Here we assign ranks to variants within windows.
This information, used alone or in conjunction with thresholds, may be an effective strategy for identifying high quality variants.
## Data
As in other vignettes, we begin by loading the example data.
```{r}
library(vcfR)
vcf_file <- system.file("extdata", "pinf_sc50.vcf.gz", package = "pinfsc50")
dna_file <- system.file("extdata", "pinf_sc50.fasta", package = "pinfsc50")
gff_file <- system.file("extdata", "pinf_sc50.gff", package = "pinfsc50")
vcf <- read.vcfR(vcf_file, verbose = FALSE)
dna <- ape::read.dna(dna_file, format = "fasta")
gff <- read.table(gff_file, sep="\t", quote = "")
chrom <- create.chromR(name="Supercontig", vcf=vcf, seq=dna, ann=gff, verbose=FALSE)
#chrom <- masker(chrom, min_DP = 900, max_DP = 1500)
chrom <- proc.chromR(chrom, verbose = TRUE)
```
## Creating scores to rank
Before we can rank our variants, we need some criterion to help us determine whether a variant is of high or low quality.
Once we have this score we can select the variant with the highest score from each window.
Before creating our vector of scores, let's remind ourselves of what data we have.
```{r}
head(chrom)
```
Let's use the genotype quality (GQ) and sequence depth (DP) from the VCF genotype information.
We can isolate matrices of genotype quality and sequence depth with the extract.gt() function.
```{r}
gq <- extract.gt(chrom, element="GQ", as.numeric=TRUE)
dp <- extract.gt(chrom, element="DP", as.numeric=TRUE)
```
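If you'd like to confirm what extract.gt() returned, an optional quick check verifies that both objects share the same dimensions, with variants in rows and samples in columns:
```{r}
# Optional check: both matrices should share dimensions (variants x samples).
dim(gq)
dim(dp)
```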
We can visualize these data with box and whisker plots.
```{r, fig.align='center', fig.width=7, fig.height=5}
#hist(gq[,1])
par( mar = c(8,4,4,2) )
boxplot(gq, las=2, col=2:5, main="Genotype Quality (GQ)")
dp2 <- dp
dp2[ dp2 == 0 ] <- NA
boxplot(dp2, las=2, col=2:5, main="Sequence Depth (DP)", log="y")
abline(h=10^c(0:4), lty=3, col="#808080")
par( mar = c(5,4,4,2) )
```
The values for genotype quality appear to range from 0 to 100, with among-sample variability.
For example, sample P13626 consists of variants whose qualities are predominantly near 100, while sample P1362 consists of variants with qualities mostly just below 20.
Comparison of the plots suggests a correlation between sequence depth (DP) and genotype quality (GQ): samples with variants of high sequence depth also have variants of high genotype quality.
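If you'd like to quantify this impression, one optional check (not part of the ranking itself) is to correlate each sample's median depth with its median genotype quality:
```{r}
# A positive coefficient supports the pattern seen in the boxplots.
cor(apply(dp, MARGIN = 2, median, na.rm = TRUE),
    apply(gq, MARGIN = 2, median, na.rm = TRUE),
    method = "spearman")
```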
Unlike genotype quality, we don't necessarily want to maximize sequence depth.
Low depth variants may be obviously poor choices, but excessive coverage may represent variants from repetitive regions of the genome.
What we really want to optimize on is proximity to the mean depth, or some other measure of central tendency.
This will require a little mathematical gymnastics.
If we subtract from each library its median (or another measure of central tendency), the data become centered around zero.
Taking the absolute value then makes the data range from zero upward, with zero being our optimal value (the measure of central tendency).
The ranking we're about to perform looks for a maximum, not a minimum, so if we multiply by negative one the data will range from some negative value up to zero, with zero being optimal.
We now have a measure of depth where the greatest value is the optimal value.
```{r}
# Center each sample (column) on its median depth, then fold and negate
# so that the median scores highest (zero) and extremes score lower.
mids <- apply(dp, MARGIN=2, median, na.rm=TRUE)
dp2 <- sweep(dp, MARGIN=2, mids, FUN="-")
dp2 <- abs(dp2)
dp2 <- -1 * dp2
```
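To see what this transformation does, consider a toy vector of made-up depths (for illustration only, not from the data):
```{r}
x <- c(2, 8, 10, 12, 40)  # made-up depths
-abs(x - median(x))       # zero at the median, more negative farther away
```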
```{r, fig.align='center', fig.width=7, fig.height=5}
par( mar = c(8,4,4,2) )
boxplot(dp2, las=2, col=2:5, main="Centered sequence depth (DP)")
par( mar = c(5,4,4,2) )
```
Before we combine these data, there is one more issue to address.
In their current state, the range of sequence depth is much greater than that of genotype quality.
This means the data are effectively weighted: if we simply add them together, sequence depth will have a greater impact on the final metric than genotype quality will.
If we are happy with that, we can proceed.
If we would like to equalize each metric's contribution to our final measure of quality, we'll want to normalize the data.
The genotype quality data are fairly straightforward: if we divide each library by 100 (the theoretical maximum), they will scale from 0 to 1 instead of 0 to 100.
For the sequence depth, we can add the absolute value of each library's minimum to that library, which makes all of the data non-negative.
Dividing by the same value then scales the data from 0 to 1.
```{r}
# Scale genotype quality to [0, 1] using its theoretical maximum of 100.
gq2 <- gq/100
range(gq2, na.rm=TRUE)
# Shift each library so its minimum is zero, then divide by the same
# value so depth scores also fall on [0, 1].
amins <- abs(apply(dp2, MARGIN=2, min, na.rm = TRUE))
dp2 <- sweep(dp2, MARGIN=2, STATS = amins, FUN="+")
dp2 <- sweep(dp2, MARGIN=2, STATS = amins, FUN="/")
range(dp2, na.rm=TRUE)
```
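Continuing the toy example from above, the same shift-and-divide rescales the transformed depths onto the interval from 0 to 1, with 1 at the median:
```{r}
x2 <- -abs(x - median(x))           # from the toy example above
(x2 + abs(min(x2))) / abs(min(x2))  # 1 at the median, 0 at the farthest depth
```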
We now have two metrics on comparable scales.
We can add them together and sum across samples to produce a single score for each variant.
```{r, fig.align='center'}
scores <- dp2 + gq2
scores <- rowSums(scores, na.rm = TRUE)
```
Check their distribution with a histogram.
```{r, fig.align='center'}
hist(scores, col=4)
```
Once we have scores in hand we can use them to rank our variants.
```{r}
chrom <- rank.variants.chromR(chrom, scores)
head(chrom@var.info)
```
This creates a vector of window numbers and a rank for each variant within its window, and adds them to the var.info slot of the chromR object.
We can take a look at them by calling this slot directly.
```{r}
head(chrom@var.info[,c('POS', 'MQ', 'DP', 'window_number', 'rank')])
```
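To get a feel for the window structure, an optional summary tabulates how many variants received each within-window rank:
```{r}
# Optional: counts of variants at each rank across all windows.
table(chrom@var.info$rank)
```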
We can use this information to create our mask.
```{r}
chrom@var.info$mask[chrom@var.info$rank > 1] <- FALSE
```
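As an optional sanity check, we can count how many variants remain unmasked, i.e., the rank-one variant from each window:
```{r}
# Optional: number of variants retained by the rank filter.
sum(chrom@var.info$mask, na.rm = TRUE)
```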
And plot.
```{r, fig.height=7, fig.width=7}
chromoqc(chrom, dp.alpha='66')
```
This looks pretty good, but we still have variants with rather high or low depth.
We can combine the use of masker(), which we explored in the vignette 'Filtering data', with our ranks.
We'll first call masker(), which will reset our mask, and then censor this mask based on rank.
```{r}
chrom <- masker( chrom, min_QUAL=0, min_DP=350, max_DP=650, min_MQ=59.5, max_MQ=60.5 )
chrom@var.info$mask[ chrom@var.info$rank > 1 ] <- FALSE
```
Then replot.
```{r, fig.height=7, fig.width=7}
chromoqc(chrom, dp.alpha='66')
```
## Conclusion
This provides another tool to help filter variant files down to their highest quality fraction.
In a previous vignette we used the function masker() to filter the data.
Here we've created a composite score, which we'd like to maximize, and ranked the variants within windows based on this score.
A strength of this method is that by using windows we're able to space our variants evenly across a chromosome.
Choosing the best, or several best, variants per window does not guarantee high quality variants, however: if all of the variants in a window are of low quality, then the best of these may still be of poor quality, so some additional processing may be necessary.
With these tools it is hoped that an individual can rapidly explore their data and determine a method to extract the highest quality variants, so that downstream analyses will be of the highest quality possible.
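If you'd like to carry the result forward, one possible next step is to export only the unmasked variants with vcfR's write.vcf(), which can honor the chromR mask (the file name below is illustrative):
```{r, eval=FALSE}
# Write only the unmasked (rank-one, depth-filtered) variants to file.
write.vcf(chrom, file = "pinf_sc50_filtered.vcf.gz", mask = TRUE)
```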