---
title: "Ranking data"
output:
  html_document
---
In the vignette 'Filtering data' we used thresholds in an attempt to isolate the high quality fraction of variants from a VCF file.
Here we assign ranks to variants within windows.
This information, used alone or in conjunction with thresholds, may be an effective strategy for identifying high quality variants.
## Data
As in other vignettes, we begin by loading the example data.
```{r}
library(vcfR)
vcf_file <- system.file("extdata", "pinf_sc50.vcf.gz", package = "pinfsc50")
dna_file <- system.file("extdata", "pinf_sc50.fasta", package = "pinfsc50")
gff_file <- system.file("extdata", "pinf_sc50.gff", package = "pinfsc50")
vcf <- read.vcfR(vcf_file, verbose = FALSE)
dna <- ape::read.dna(dna_file, format = "fasta")
gff <- read.table(gff_file, sep="\t", quote = "")
chrom <- create.chromR(name="Supercontig", vcf=vcf, seq=dna, ann=gff, verbose=FALSE)
#chrom <- masker(chrom, min_DP = 900, max_DP = 1500)
chrom <- proc.chromR(chrom, verbose = TRUE)
```
## Creating scores to rank
Before we can rank our variants, we need some criterion to help us determine whether a variant is of high or low quality.
Once we have this score we can select the variant with the highest score from each window.
Before creating our vector of scores, let's remind ourselves of what data we have.
```{r}
head(chrom)
```
Let's use the genotype quality (GQ) and sequence depth (DP) from the VCF genotype information.
We can isolate matrices of genotype quality and sequence depth with the extract.gt() function.
```{r}
gq <- extract.gt(chrom, element="GQ", as.numeric=TRUE)
dp <- extract.gt(chrom, element="DP", as.numeric=TRUE)
```
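If you'd like to confirm what extract.gt() returned, an optional quick check verifies that both objects share the same dimensions, with variants in rows and samples in columns:
```{r}
# Optional check: both matrices should share dimensions (variants x samples).
dim(gq)
dim(dp)
```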
We can visualize these data with box and whisker plots.
```{r, fig.align='center', fig.width=7, fig.height=5}
#hist(gq[,1])
par( mar = c(8,4,4,2) )
boxplot(gq, las=2, col=2:5, main="Genotype Quality (GQ)")
dp2 <- dp
dp2[ dp2 == 0 ] <- NA
boxplot(dp2, las=2, col=2:5, main="Sequence Depth (DP)", log="y")
abline(h=10^c(0:4), lty=3, col="#808080")
par( mar = c(5,4,4,2) )
```
The values for genotype quality appear to range from 0 to 100, with among-sample variability.
For example, sample P13626 consists of variants whose qualities are predominantly near 100, while sample P1362 consists of variants with qualities mostly just below 20.
Comparison of the plots suggests a correlation between sequence depth (DP) and genotype quality (GQ): samples with variants of high sequence depth also have variants of high genotype quality.
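If you'd like to quantify this impression, one optional check (not part of the ranking itself) is to correlate each sample's median depth with its median genotype quality:
```{r}
# A positive coefficient supports the pattern seen in the boxplots.
cor(apply(dp, MARGIN = 2, median, na.rm = TRUE),
    apply(gq, MARGIN = 2, median, na.rm = TRUE),
    method = "spearman")
```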
Unlike genotype quality, we don't necessarily want to maximize sequence depth.
Low depth variants may be obviously poor choices, but excessive coverage may represent variants from repetitive regions of the genome.
What we really want to optimize on is proximity to the mean depth, or some other measure of central tendency.
This will require a little mathematical gymnastics.
If we subtract from each library its median (or another measure of central tendency), the data become centered around zero.
Taking the absolute value then makes the data range from zero upward, with zero being our optimal value (the measure of central tendency).
The ranking we're about to perform looks for a maximum, not a minimum, so if we multiply by negative one the data will range from some negative value up to zero, with zero being optimal.
We now have a measure of depth where the greatest value is the optimal value.
```{r}
# Center each sample (column) on its median depth, then fold and negate
# so that the median scores highest (zero) and extremes score lower.
mids <- apply(dp, MARGIN=2, median, na.rm=TRUE)
dp2 <- sweep(dp, MARGIN=2, mids, FUN="-")
dp2 <- abs(dp2)
dp2 <- -1 * dp2
```
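To see what this transformation does, consider a toy vector of made-up depths (for illustration only, not from the data):
```{r}
x <- c(2, 8, 10, 12, 40)  # made-up depths
-abs(x - median(x))       # zero at the median, more negative farther away
```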
```{r, fig.align='center', fig.width=7, fig.height=5}
par( mar = c(8,4,4,2) )
boxplot(dp2, las=2, col=2:5, main="Centered sequence depth (DP)")
par( mar = c(5,4,4,2) )
```
Before we combine these data, there is one more issue to address.
In their current state, the range of sequence depth is much greater than that of genotype quality.
This means the data are effectively weighted: if we simply add them together, sequence depth will have a greater impact on the final metric than genotype quality will.
If we are happy with that, we can proceed.
If we would like to equalize each metric's contribution to our final measure of quality, we'll want to normalize the data.
The genotype quality data are fairly straightforward: if we divide each library by 100 (the theoretical maximum), they will scale from 0 to 1 instead of 0 to 100.
For the sequence depth, we can add the absolute value of each library's minimum to that library, which makes all of the data non-negative.
Dividing by the same value then scales the data from 0 to 1.
```{r}
# Scale genotype quality to [0, 1] using its theoretical maximum of 100.
gq2 <- gq/100
range(gq2, na.rm=TRUE)
# Shift each library so its minimum is zero, then divide by the same
# value so depth scores also fall on [0, 1].
amins <- abs(apply(dp2, MARGIN=2, min, na.rm = TRUE))
dp2 <- sweep(dp2, MARGIN=2, STATS = amins, FUN="+")
dp2 <- sweep(dp2, MARGIN=2, STATS = amins, FUN="/")
range(dp2, na.rm=TRUE)
```
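Continuing the toy example from above, the same shift-and-divide rescales the transformed depths onto the interval from 0 to 1, with 1 at the median:
```{r}
x2 <- -abs(x - median(x))           # from the toy example above
(x2 + abs(min(x2))) / abs(min(x2))  # 1 at the median, 0 at the farthest depth
```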
We now have two metrics on comparable scales.
We can add them together and sum across samples to produce a single score for each variant.
```{r, fig.align='center'}
scores <- dp2 + gq2
scores <- rowSums(scores, na.rm = TRUE)
```
Check their distribution with a histogram.
```{r, fig.align='center'}
hist(scores, col=4)
```
Once we have scores in hand we can use them to rank our variants.
```{r}
chrom <- rank.variants.chromR(chrom, scores)
head(chrom@var.info)
```
This creates a vector of window numbers and a rank for each variant within its window, and adds them to the var.info slot of the chromR object.
We can take a look at them by calling this slot directly.
```{r}
head(chrom@var.info[,c('POS', 'MQ', 'DP', 'window_number', 'rank')])
```
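To get a feel for the window structure, an optional summary tabulates how many variants received each within-window rank:
```{r}
# Optional: counts of variants at each rank across all windows.
table(chrom@var.info$rank)
```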
We can use this information to create our mask.
```{r}
chrom@var.info$mask[chrom@var.info$rank > 1] <- FALSE
```
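As an optional sanity check, we can count how many variants remain unmasked, i.e., the rank-one variant from each window:
```{r}
# Optional: number of variants retained by the rank filter.
sum(chrom@var.info$mask, na.rm = TRUE)
```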
And plot.
```{r, fig.height=7, fig.width=7}
chromoqc(chrom, dp.alpha='66')
```
This looks pretty good, but we still have variants with rather high or low depth.
We can combine the use of masker(), which we explored in the vignette 'Filtering data', with our ranks.
We'll first call masker(), which will reset our mask, and then censor this mask based on rank.
```{r}
chrom <- masker( chrom, min_QUAL=0, min_DP=350, max_DP=650, min_MQ=59.5, max_MQ=60.5 )
chrom@var.info$mask[ chrom@var.info$rank > 1 ] <- FALSE
```
Then replot.
```{r, fig.height=7, fig.width=7}
chromoqc(chrom, dp.alpha='66')
```
## Conclusion
This provides another tool to help filter variant files down to their highest quality fraction.
In a previous vignette we used the function masker() to filter the data.
Here we've created a composite score, which we'd like to maximize, and ranked the variants within windows based on this score.
A strength of this method is that by using windows we're able to space our variants evenly across a chromosome.
Choosing the best, or several best, variants per window does not guarantee high quality variants, however: if all of the variants in a window are of low quality, then the best of these may still be of poor quality, so some additional processing may be necessary.
With these tools it is hoped that an individual can rapidly explore their data and determine a method to extract the highest quality variants, so that downstream analyses will be of the highest quality possible.
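If you'd like to carry the result forward, one possible next step is to export only the unmasked variants with vcfR's write.vcf(), which can honor the chromR mask (the file name below is illustrative):
```{r, eval=FALSE}
# Write only the unmasked (rank-one, depth-filtered) variants to file.
write.vcf(chrom, file = "pinf_sc50_filtered.vcf.gz", mask = TRUE)
```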