-
Notifications
You must be signed in to change notification settings - Fork 13
/
vcf_data.Rmd
211 lines (140 loc) · 8.5 KB
/
vcf_data.Rmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
---
title: "VCF data"
output:
html_document:
toc: true
toc_float: true
---
Most variant calling pipelines result in files containing variant information.
The [variant call format (VCF)](http://samtools.github.io/hts-specs/ "VCF format at hts-specs") is a popular format for this data.
Variant callers typically attempt to agressively call variants with the perspective that a downstream quality control step will remove low quality variants.
A first step in working with this data is to understand their contents.
## Three sections
A VCF file can be thought of as having three sections: a **meta region**, a **fix region** and a **gt region**.
The meta region is at the top of the file.
The information in the meta region defines the abbreviations used elsewhere in the file.
It may also document software used to create the file as well as parameters used by this software.
Below the meta region, the data are tabular.
The first eight columns of this table contain information about each variant.
This data may be common over all variants, such as its chromosomal position, or a summary over all samples, such as quality metrics.
These data are fixed, or the same, over all samples.
The fix region is required in a VCF file, subsequent columns are optional but are common in my experience.
Beginning at column ten is a column for every sample.
The values in these columns are information for each sample and each variant.
The organization of each cell containing a genotype and associated information is specified in column nine.
The location of these three regions within a file can be represented by the cartoon below.
```{r, fig.cap="Cartoon representation of VCF file organization", echo=FALSE, fig.height=4, fig.width=4, fig.align='center', }
par(mar=c(0.1,0.1,0.1,0.1))
plot(c(0,5), c(0,5), type="n", frame.plot=FALSE, axes=FALSE, xlab="", ylab="")
rect(xleft=0, ybottom=4, xright=3, ytop=5)
rect(xleft=0, ybottom=0, xright=2, ytop=4)
rect(xleft=2, ybottom=0, xright=5, ytop=4)
text(1.5, 4.7, "Meta information", cex=1)
text(1.5, 4.4, "(@meta)", cex=1)
text(1.0, 2.5, "Fixed information", cex=1)
text(1.0, 2.2, "(@fix)", cex=1)
text(3.5, 2.5, "Genotype information", cex=1)
text(3.5, 2.2, "(@gt)", cex=1)
par(mar=c(5,4,4,2))
```
The VCF file specification is flexible.
This means that there are slots for certain types of data, but any particular software which creates a VCF file does not necessarily use them all.
Similarly, authors have the opportunity to include new forms of data, forms which may not have been foreseen by the authors of the VCF specification.
The result is that all VCF files do not contain the same information.
For this example, we'll use example data provided with vcfR.
```{r}
library(vcfR)
data(vcfR_example)
vcf
```
The function `library()` loads libraries, in this case the package vcfR.
The function `data()` loads datasets that were included with R and its packages.
Our usage of `data()` loads the objects 'gff', 'dna' and 'vcf' from the 'vcfR_example' dataset.
Here we're only interested in the object 'vcf' which contains example VCF data.
When we call the object name with no function it invokes the 'show' method which prints some summary information to the console.
## The meta region
The meta region contains information about the file, its creation, as well as information to interpret abbreviations used elsewhere in the file.
Each line of the meta region begins with a double pound sign ('##').
The example which comes with vcfR is shown below.
(Only the first 10 lines are shown for brevity.)
```{r, echo=TRUE, tidy=TRUE}
strwrap(vcf@meta[1:7])
```
The first line contains the version of the VCF format used in the file.
This line is required.
The second line specifies the software which created the VCF file.
This is not required, so not all VCF files include it.
When they do, the file becomes self documenting.
Note that the alignment software is not included here because it was used upstream of the VCF file's creation (aligners typically create \*.SAM or \*.BAM format files).
Because the file can only include information about the software that created it, the entire pipeline does not get documented.
Some VCF files may contain a line for every chromosome (or supercontig or contig depending on your genome), so they may become rather long.
Here, the remaining lines contain INFO and FORMAT specifications which define abbreviations used in the fix and gt portions of the file.
The meta region may include long lines that may not be easy to view.
In vcfR we've created a function to help prcess this data.
```{r}
queryMETA(vcf)
```
When the function `queryMETA()` is called with only a vcfR object as a parameter, it attempts to summarize the meta information.
Not all of the information is returned.
For example, 'contig' elements are not returned.
This is an attempt to summarize information that may be most useful for comprehension of the file's contents.
```{r}
queryMETA(vcf, element = 'DP')
```
When an element parameter is included, only the information about that element is returned.
In this example the element 'DP' is returned.
We see that this acronym is defined as both a 'FORMAT' and 'INFO' acronym.
We can narrow down our query by including more information in the element parameter.
```{r}
queryMETA(vcf, element = 'FORMAT=<ID=DP')
```
Here we've isolated the definition of 'DP' as a 'FORMAT' element.
Note that the function `queryMETA()` includes the parameter `nice` which by default is TRUE and attempts to present the data in a nicely formatted manner.
However, our query is performed on the actual information in the 'meta' region.
It is therefore sometimes appropriate to set `nice = FALSE` so that we can see the raw data.
In the above example the angled bracket ('<') is omitted from the `nice = TRUE` representation but is essential to distinguishing the 'FORMAT' element from the 'INFO' element.
## The fix region
The fix region contains information for each variant which is sometimes summarized over all samples.
The first eight columns of the fixed region and are titled CHROM, POS, ID, REF, ALT, QUAL, FILTER and INFO.
This is per variant information which is 'fixed', or the same, over all samples.
The first two columns indicate the location of the variant by chromosome and position within that chromosome.
Here, the ID field has not been used, so it consists of missing data (NA).
The REF and ALT columns indicate the reference and alternate allelic states.
When multiple alternate allelic states are present they are delimited with commas.
The QUAL column attempts to summarize the quality of each variant over all samples.
The FILTER field is not used here but could contain information on whether a variant has passed some form of quality assessment.
```{r, echo=TRUE}
head(getFIX(vcf))
```
The eigth column, titled INFO, is a semicolon delimited list of information.
It can be rather long and cumbersome.
The function `getFIX()` will suppress this column by default.
Each abbreviation in the INFO column should be defined in the meta section.
We can validate this by querying the meta portion, as we did in the 'meta' section above.
## The gt region
The gt (genotype) region contains information about each variant for each sample.
The values for each variant and each sample are colon delimited.
Multiple types of data for each genotype may be stored in this manner.
The format of the data is specified by the FORMAT column (column nine).
Here we see that we have information for GT, AD, DP, GQ and PL.
The definition of these acronyms can be referenced by querying the the meta region, as demonstrated previously.
Every variant does not necessarily have the same information (e.g., SNPs and indels may be handled differently), so the rows are best treated independently.
Different variant callers may include different information in this region.
```{r, echo=TRUE, tidy=TRUE}
vcf@gt[1:6, 1:4]
```
## vcfR
Using the R package vcfR, we can read VCF format files into memory using the function `read.vcfR()`.
Once in memory we can use the `head()` method to summarize the information in the three VCF regions.
```{r, eval = FALSE}
vcf <- read.vcfR("myVCFdata.vcf.gz")
```
```{r}
head(vcf)
```
After we have made any manipulations of the file we can save it as a VCF file with the function `write.vcf()`.
```{r, eval = FALSE}
write.vcf(vcf, "myVCFdata_filtered.vcf.gz")
```
We now have a summary of our VCF file which we can use to help understand what forms of information are contained within it.
This information can be further explored with plotting functions and used to filter the VCF file for high quality variants.