-
Notifications
You must be signed in to change notification settings - Fork 17
/
circulating_variants.Rmd
487 lines (368 loc) · 33.2 KB
/
circulating_variants.Rmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
330
331
332
333
334
335
336
337
338
339
340
341
342
343
344
345
346
347
348
349
350
351
352
353
354
355
356
357
358
359
360
361
362
363
364
365
366
367
368
369
370
371
372
373
374
375
376
377
378
379
380
381
382
383
384
385
386
387
388
389
390
391
392
393
394
395
396
397
398
399
400
401
402
403
404
405
406
407
408
409
410
411
412
413
414
415
416
417
418
419
420
421
422
423
424
425
426
427
428
429
430
431
432
433
434
435
436
437
438
439
440
441
442
443
444
445
446
447
448
449
450
451
452
453
454
455
456
457
458
459
460
461
462
463
464
465
466
467
468
469
470
471
472
473
474
475
476
477
478
479
480
481
482
483
484
485
486
487
---
title: "Circulating SARS-CoV-2 RBD variants"
author: "Tyler Starr"
date: "5/12/2020"
output:
github_document:
toc: true
html_preview: false
editor_options:
chunk_output_type: inline
---
This notebook analyzes RBD variants that have been sampled in isolates within the current SARS-CoV-2 pandemic.
## Setup
```{r setup, message=FALSE, warning=FALSE, error=FALSE}
require("knitr")
knitr::opts_chunk$set(echo = T)
knitr::opts_chunk$set(dev.args = list(png = list(type = "cairo")))
#list of packages to install/load
packages = c("yaml","data.table","tidyverse","gridExtra","bio3d","seqinr")
#install any packages not already installed
installed_packages <- packages %in% rownames(installed.packages())
if(any(installed_packages == F)){
install.packages(packages[!installed_packages])
}
#load packages
invisible(lapply(packages, library, character.only=T))
#read in config file
config <- read_yaml("config.yaml")
#read in file giving concordance between RBD numbering and SARS-CoV-2 Spike numbering
RBD_sites <- data.table(read.csv(file="data/RBD_sites.csv",stringsAsFactors=F))
#make output directory
if(!file.exists(config$circulating_variants_dir)){
dir.create(file.path(config$circulating_variants_dir))
}
```
Session info for reproducing environment:
```{r print_sessionInfo}
sessionInfo()
```
Read in tables of variant effects on binding and expression for single mutations to the SARS-CoV-2 RBD.
```{r read_data}
mutants <- data.table(read.csv(file=config$single_mut_effects_file,stringsAsFactors = F))
#rename mutants site indices to prevent shared names with RBD_sites, simplifying some downstream calculations that cross-index these tables
setnames(mutants, "site_RBD", "RBD_site");setnames(mutants, "site_SARS2", "SARS2_site")
```
## Analyzing amino acid diversity in GISAID Spike sequences
We constructed an alignment of all Spike sequences available on GISAID as of 27 May, 2020. On the EpiCoV page, under downloads, one of the pre-made options is a fasta of all Spike sequences isolated thus far, which is updated each day. I have downloaded this file, unzipped, replaced spaces in fasta headers with underscores, and aligned sequences. We load in this alignment using the `read.fasta` function of the `bio3d` package, and trim the alignment to RBD residues. We remove sequecnes from non-human isolates (e.g. bat, pangolin, "environment", mink, cat, TIGER) and sequences with gap `-` characters, and then iterate through the alignment and save any observed mutations. We then filter mutations based on rationale below, and add counts of filtered observations for each mutation as an 'nobs' colum in our overall mutants data table.
We filter out any mutations that were *only* observed on sequences with large numbers of missing `X` characters -- from my initial pass, I saw some singleton amino acid variants which would require >1 nt change, and these were only found in a single sequence with many X amino acid characters (the first half of the sequence was well determined, but the second half was all X's, with the annotated "differences" being within short stretches between Xs with determiined amino acids), which made me realize I needed to be careful not only of sequences rich in gap "-" characters, but also ambiguous "X" characters. However, I didn't want to remove all sequences with undetermined characters off the bat, because another pattern I saw is that for isolates bearing the N439K mutation, >10 are well determined across the entire RBD, but ~80 have many X characters (in part of the RBD that is *not* near the N439K sequence call). So, my preference would be to believe a mutation observed in an X-rich sequence *if the variant in the sequence is observed in at least one variant that does not contain an X*, but not believe mutations that are *only* observed in X-rich sequences. (I noticed this issue with N439K, but this is not the only mutation like this which is observed on 0X sequences at least once but other times on sequences with X characters.) That is the filtering I therefore do below. This is basically the limits of my genomic dataset sleuthing. Is there anything else we should be doing to assess validity of observed mutants, particularly e.g. could N439K simply be a biased sequencing error that emerges in these Scotland sequencing samples? Would love ideas or help in better filtering amino acid variants to retain.
```{r gisaid_spike_alignment}
alignment <- bio3d::read.fasta(file="data/alignments/Spike_GISAID/spike_GISAID_aligned.fasta", rm.dup=T)
#remove non-human samples
keep <- grep("Human",alignment$id); alignment$ali <- alignment$ali[keep,]; alignment$id <- alignment$id[keep]
#remove columns that are gaps in first reference sequence
alignment$ali <- alignment$ali[,alignment$ali[1,]!="-"]
alignment_RBD <- alignment; alignment_RBD$ali <- alignment$ali[,RBD_sites$site_SARS2]
#check that the first sequence entry matches our reference RBD sequence
stopifnot(sum(!(alignment_RBD$ali[1,] == RBD_sites[,amino_acid_SARS2]))==0)
#remove sequences have gaps, as the amino acid calls may be generally unreliable
remove <- c()
for(i in 1:nrow(alignment_RBD$ali)){
if(sum(alignment_RBD$ali[i,]=="-") > 0){remove <- c(remove,i)}
}
alignment_RBD$ali <- alignment_RBD$ali[-remove,];alignment_RBD$id <- alignment_RBD$id[-remove]
#output all mutation differences from the WT/reference RBD sequence
#I do this by iterating over rows and columns of the alignment matrix which is STUPID but effective
variants_vec <- c()
isolates_vec <- c()
for(j in 1:ncol(alignment_RBD$ali)){
#print(i)
for(i in 1:nrow(alignment_RBD$ali)){
if(alignment_RBD$ali[i,j] != alignment_RBD$ali[1,j] & !(alignment_RBD$ali[i,j] %in% c("X","-"))){
variants_vec <- c(variants_vec, paste(alignment_RBD$ali[1,j],j,alignment_RBD$ali[i,j],sep=""))
isolates_vec <- c(isolates_vec, alignment_RBD$id[i])
}
}
}
#remove any mutations that are *only* observed in X-rich sequences of dubious quality (keep counts in X-rich sequences if they were observed in at least one higher quality isolate)
#make a data frame that gives each observed mutation, the isolate it was observed in, and the number of X characters in that sequence. Also, parse the header to give the country/geographic division of the sample
variants <- data.frame(isolate=isolates_vec,mutation=variants_vec)
for(i in 1:nrow(variants)){
variants$number_X[i] <- sum(alignment_RBD$ali[which(alignment_RBD$id == variants[i,"isolate"]),]=="X")
variants$geography[i] <- strsplit(as.character(variants$isolate[i]),split="/")[[1]][2]
}
#filter the sequence set for mutations observed in at least one X=0 background
variants_filtered <- data.frame(mutation=unique(variants[variants$number_X==0,"mutation"])) #only keep variants observed in at least one sequence with 0 X
for(i in 1:nrow(variants_filtered)){
variants_filtered$n_obs[i] <- sum(variants$mutation == variants_filtered$mutation[i]) #but keep counts for any sequence with observed filtered muts
variants_filtered$n_geography[i] <- length(unique(variants[variants$mutation == variants_filtered$mutation[i],"geography"]))
variants_filtered$list_geography[i] <- list(list(unique(variants[variants$mutation == variants_filtered$mutation[i],"geography"])))
}
#add count to mutants df
mutants[,nobs:=0]
mutants[,ngeo:=0]
mutants[,geo_list:=as.list(NA)]
for(i in 1:nrow(mutants)){
if(mutants$mutation_RBD[i] %in% variants_filtered$mutation){
mutants$nobs[i] <- variants_filtered[variants_filtered$mutation==mutants$mutation_RBD[i],"n_obs"]
mutants$ngeo[i] <- variants_filtered[variants_filtered$mutation==mutants$mutation_RBD[i],"n_geography"]
mutants$geo_list[i] <- variants_filtered[variants_filtered$mutation==mutants$mutation_RBD[i],"list_geography"]
}
}
```
We see `r sum(mutants$nobs)` amino acid polymorphisims within the `r nrow(alignment_RBD$ali)` sequences uploaded in GISAID, which represents `r sum(mutants$nobs>0)` of our `r nrow(mutants[mutant!=wildtype & mutant!="*",])` measured missense mutants. In the table below, we can see that many of these mutations are observed only one or a few times, so there may still be unaccounted for sequencinig artifacts, which we tried to account for at least minimally with some filtering above.
```{r table_circulating_variants_nobs}
kable(table(mutants[mutant!=wildtype & mutant!="*",nobs]),col.names=c("mutation count","frequency"))
```
We plot each mutations experimental phenotype versus the number of times it is observed in the circulating Spike alignment, for binding (top) and expression (bottom), with the righthand plots simply zooming in on the region surrounding zero for better visualization. We can see that some of the mutations that are observed just one or a couple of times are highly deleterious, and anything sampled more than a handful of times exhibits ~neutral or perhaps small positive binding and/or expression effects.
```{r scatter_circulating_variants_nobs, fig.width=8,fig.height=8.5,fig.align="center", dpi=500,dev="png",echo=FALSE}
par(mfrow=c(2,2))
plot(mutants[nobs>0,nobs],mutants[nobs>0,bind_avg],pch=16,cex=1.2,col="#00000050",xlab="Number of observations among GISAID Spikes",ylab="delta log10(Ka,app)");abline(h=0,lty=2)
plot(mutants[nobs>0,nobs],mutants[nobs>0,bind_avg],pch=16,cex=1.2,col="#00000050",xlab="Number of observations among GISAID Spikes",ylab="delta log10(Ka,app)",ylim=c(-0.5,0.1));abline(h=0,lty=2)
plot(mutants[nobs>0,nobs],mutants[nobs>0,expr_avg],pch=16,cex=1.2,col="#00000050",xlab="Number of observations among GISAID Spikes",ylab="delta expression meanF");abline(h=0,lty=2)
plot(mutants[nobs>0,nobs],mutants[nobs>0,expr_avg],pch=16,cex=1.2,col="#00000050",xlab="Number of observations among GISAID Spikes",ylab="delta expression meanF",ylim=c(-1,1));abline(h=0,lty=2)
invisible(dev.print(pdf, paste(config$circulating_variants_dir,"/phenotype_v_GISAID-nobs.pdf",sep="")))
```
We also make plots showing mutational effects versus the number of geographic regions in which a mutation has been observed.
```{r scatter_circulating_variants_ngeography, fig.width=8,fig.height=8.5,fig.align="center", dpi=500,dev="png",echo=FALSE}
par(mfrow=c(2,2))
plot(mutants[nobs>0,ngeo],mutants[nobs>0,bind_avg],pch=16,cex=1.2,col="#00000050",xlab="Number of geographical observations",ylab="delta log10(Ka,app)");abline(h=0,lty=2)
plot(mutants[nobs>0,ngeo],mutants[nobs>0,bind_avg],pch=16,cex=1.2,col="#00000050",xlab="Number of geographical observations",ylab="delta log10(Ka,app)",ylim=c(-0.5,0.1));abline(h=0,lty=2)
plot(mutants[nobs>0,ngeo],mutants[nobs>0,expr_avg],pch=16,cex=1.2,col="#00000050",xlab="Number of geographical observations",ylab="delta expression meanF");abline(h=0,lty=2)
plot(mutants[nobs>0,ngeo],mutants[nobs>0,expr_avg],pch=16,cex=1.2,col="#00000050",xlab="Number of geographical observations",ylab="delta expression meanF",ylim=c(-1,1));abline(h=0,lty=2)
invisible(dev.print(pdf, paste(config$circulating_variants_dir,"/phenotype_v_GISAID-nobs.pdf",sep="")))
```
Here are tables giving mutations that were seen >20 times, and those seen any number of times with measured binding effects >0.05. We are currently slotted to validate the effect of N439K in both yeast display and pseudovirus/mammalian experimental assays, and V367F, T478I, and V483A in yeast display. (S477N just came online as being prevalent with the newest GISAID set of sequences we used, after we had started cloning for validations.)
```{r table_most_common_variants, echo=F}
kable(mutants[nobs>20,.(mutation,expr_lib1,expr_lib2,expr_avg,bind_lib1,bind_lib2,bind_avg,nobs,ngeo)],
col.names=c("Mutation","expr, lib1","expr, lib2","expression effect","bind, lib1","bind, lib2", "binding effect","number of GISAID sequences", "number locations"))
```
```{r table_highest_binding_variants, echo=F}
kable(mutants[nobs>0 & bind_avg>0.05,.(mutation,expr_lib1,expr_lib2,expr_avg,bind_lib1,bind_lib2,bind_avg,nobs,ngeo)],
col.names=c("Mutation","expr, lib1","expr, lib2","expression effect","bind, lib1","bind, lib2", "binding effect","number of GISAID sequences", "number locations"))
```
Let's visualize the positions with interesting circulating variants in our *favorite* exploratory heatmaps! Below, we output maps first for positions with circulating variants observed >20 times (left two maps), and second for those positions with circulating variants of at least >0.05 effect on binding (right two maps). These maps show the SARS-CoV-2 wildtype state with an "x" indicator, the SARS-CoV-1 state with an "o", and "^" marks any amino acid variants observed at least one time in the GISAID sequences.
```{r heatmap_circulating_variants, fig.width=8,fig.height=4,fig.align="center", dpi=500,dev="png",echo=FALSE}
#order mutant as a factor for grouping by rough biochemical grouping
mutants$mutant <- factor(mutants$mutant, levels=c("*","C","P","G","V","M","L","I","A","F","W","Y","T","S","N","Q","E","D","H","K","R"))
#add character vector indicating wildtype to use as plotting symbols for wt
mutants[,wildtype_indicator := ""]
mutants[mutant==wildtype,wildtype_indicator := "x"]
#indicator for wildtype SARS-CoV-1 state
mutants[,SARS1_indicator := ""]
for(i in 1:nrow(mutants)){
SARS1aa <- RBD_sites[site_SARS2==mutants$SARS2_site[i],amino_acid_SARS1]
if(!is.na(SARS1aa) & mutants$mutant[i] == SARS1aa){mutants$SARS1_indicator[i] <- "o"}
}
#add indicator "^" for observed variant amino acids
mutants[,variant_indicator:=""]
mutants[nobs>0,variant_indicator:="^"]
#make smaller data frame of positions with "high" frequency variants
muts_temp <- mutants[SARS2_site %in% mutants[nobs>20,SARS2_site],]; muts_temp$SARS2_site <- as.factor(muts_temp$SARS2_site)
p1 <- ggplot(muts_temp,aes(SARS2_site,mutant))+geom_tile(aes(fill=bind_avg))+
scale_fill_gradientn(colours=c("#A94E35","#F48365","#FFFFFF","#7378B9","#383C6C"),limits=c(-5,0.5),values=c(0,2.5/5.5,5/5.5,5.25/5.5,5.5/5.5),na.value="gray")+
scale_x_discrete(expand=c(0,0))+
labs(x="RBD site",y="mutant")+theme_classic(base_size=9)+
coord_equal()+theme(axis.text.x = element_text(angle=45,hjust=1))+
geom_text(aes(label=wildtype_indicator),size=2,color="gray10")+
geom_text(aes(label=variant_indicator),size=2,color="gray10")+
geom_text(aes(label=SARS1_indicator),size=2,color="gray10")
p2 <- ggplot(muts_temp,aes(SARS2_site,mutant))+geom_tile(aes(fill=expr_avg))+
scale_fill_gradientn(colours=c("#A94E35","#A94E35","#F48365","#FFFFFF","#7378B9","#383C6C"),limits=c(-5,1),values=c(0,1/6,3/6,5/6,5.5/6,6/6),na.value="gray")+
scale_x_discrete(expand=c(0,0))+
labs(x="RBD site",y="mutant")+theme_classic(base_size=9)+
coord_equal()+theme(axis.text.x = element_text(angle=45,hjust=1))+
geom_text(aes(label=wildtype_indicator),size=2,color="gray10")+
geom_text(aes(label=variant_indicator),size=2,color="gray10")+
geom_text(aes(label=SARS1_indicator),size=2,color="gray10")
#make smaller data frame of positions with "high" frequency variants
muts_temp <- mutants[SARS2_site %in% mutants[nobs>0 & bind_avg > 0.05,SARS2_site],]; muts_temp$SARS2_site <- as.factor(muts_temp$SARS2_site)
p3 <- ggplot(muts_temp,aes(SARS2_site,mutant))+geom_tile(aes(fill=bind_avg))+
scale_fill_gradientn(colours=c("#A94E35","#F48365","#FFFFFF","#7378B9","#383C6C"),limits=c(-5,0.5),values=c(0,2.5/5.5,5/5.5,5.25/5.5,5.5/5.5),na.value="gray")+
scale_x_discrete(expand=c(0,0))+
labs(x="RBD site",y="mutant")+theme_classic(base_size=9)+
coord_equal()+theme(axis.text.x = element_text(angle=45,hjust=1))+
geom_text(aes(label=wildtype_indicator),size=2,color="gray10")+
geom_text(aes(label=variant_indicator),size=2,color="gray10")+
geom_text(aes(label=SARS1_indicator),size=2,color="gray10")
p4 <- ggplot(muts_temp,aes(SARS2_site,mutant))+geom_tile(aes(fill=expr_avg))+
scale_fill_gradientn(colours=c("#A94E35","#A94E35","#F48365","#FFFFFF","#7378B9","#383C6C"),limits=c(-5,1),values=c(0,1/6,3/6,5/6,5.5/6,6/6),na.value="gray")+
scale_x_discrete(expand=c(0,0))+
labs(x="RBD site",y="mutant")+theme_classic(base_size=9)+
coord_equal()+theme(axis.text.x = element_text(angle=45,hjust=1))+
geom_text(aes(label=wildtype_indicator),size=2,color="gray10")+
geom_text(aes(label=variant_indicator),size=2,color="gray10")+
geom_text(aes(label=SARS1_indicator),size=2,color="gray10")
grid.arrange(p1,p2,p3,p4,ncol=4,widths=c(1,1,1.15,1.15))
invisible(dev.print(pdf, paste(config$circulating_variants_dir,"/heatmaps_circulating_variants.pdf",sep="")))
```
## Strength of selection among circulating variants
To characterize the effect of selection, we can compare the distribution of functional effects of mutations at different n_obs cutoffs, compared to the "raw" distribution of functional effects -- in this case, we should look at the DFE of only those amino acid mutations that can be introduced with single nucleotide mutations given the SARS-CoV-2 reference nt sequence.
First, we add a column to our mutants data frame indicating whether a mutation is accessible by single nucleotide mutations.
```{r single_codon_muts}
#define a function that takes a character of three nucleotides (a codon), and outputs all amino acids that can be accessed via single-nt mutation of that codon
get.codon.muts <- function(codon){
nt <- c("a","c","g","t")
codon_split <- strsplit(codon,split="")[[1]]
codon_muts <- vector()
for(i in nt[nt!=codon_split[1]]){
codon_muts <- c(codon_muts,seqinr::translate(c(i,codon_split[2:3])))
}
for(i in nt[nt!=codon_split[2]]){
codon_muts <- c(codon_muts,seqinr::translate(c(codon_split[1],i,codon_split[3])))
}
for(i in nt[nt!=codon_split[3]]){
codon_muts <- c(codon_muts,seqinr::translate(c(codon_split[1:2],i)))
}
return(codon_muts)
}
mutants[,SARS2_codon:=RBD_sites[site_SARS2==SARS2_site,codon_SARS2],by=mutation]
mutants[,singlemut := mutant %in% get.codon.muts(SARS2_codon),by=mutation]
```
Are any of our observed GISAID mutations >1nt changes? In the current alignment, no!
```{r table_GISAID_multimuts}
kable(mutants[singlemut==F & nobs>0,.(mutation_RBD,mutation,expr_lib1,expr_lib2,expr_avg,bind_lib1,bind_lib2,bind_avg,nobs,SARS2_codon)])
```
Below is a heatmap of binding effects for all mutations accessible in single nucleotide changes from the SARS-CoV-2 WT reference sequence, with others grayed out. We can see that several affinity-enhancing mutations (including at least two that we are currently planning to validiate), are accessible with single-nt mutations. Therefore, affinity-enhancing mutations, if selectively beneficial, would be readily accessible via mutation. However, there are probably some positions where beneficial mutations are possible, but none are available from single-nt mutations which we could dig into if interesting.
```{r heatmap_binding_single_muts, fig.width=12,fig.height=6,fig.align="center", dpi=500,dev="png",echo=FALSE}
muts_temp <- copy(mutants)
muts_temp[singlemut==F,bind_avg:=NA]
muts_temp[singlemut==F,expr_avg:=NA]
p1 <- ggplot(muts_temp[mutant!="*" & SARS2_site %in% seq(331,431),],aes(SARS2_site,mutant))+geom_tile(aes(fill=bind_avg))+
scale_fill_gradientn(colours=c("#A94E35","#F48365","#FFFFFF","#7378B9","#383C6C"),limits=c(-5,0.5),values=c(0,2.5/5.5,5/5.5,5.25/5.5,5.5/5.5),na.value="gray")+
scale_x_continuous(expand=c(0,0),breaks=c(331,seq(340,430,by=5)))+
labs(x="RBD site",y="mutant")+theme_classic(base_size=9)+
coord_equal()+
geom_text(aes(label=wildtype_indicator),size=2,color="gray10")
p2 <- ggplot(muts_temp[mutant!="*" & SARS2_site %in% seq(432,531),],aes(SARS2_site,mutant))+geom_tile(aes(fill=bind_avg))+
scale_fill_gradientn(colours=c("#A94E35","#F48365","#FFFFFF","#7378B9","#383C6C"),limits=c(-5,0.5),values=c(0,2.5/5.5,5/5.5,5.25/5.5,5.5/5.5),na.value="gray")+
scale_x_continuous(expand=c(0,0),breaks=c(432,seq(440,530,by=5)))+
labs(x="RBD site",y="mutant")+theme_classic(base_size=9)+
coord_equal()+
geom_text(aes(label=wildtype_indicator),size=2,color="gray10")
grid.arrange(p1,p2,ncol=1)
invisible(dev.print(pdf, paste(config$circulating_variants_dir,"/single-nt-mut_heatmap.pdf",sep="")))
```
As has been seen in other studies, the genetic code is structured in a conservative way such that neighboring codons exhibit similiar biochemical properties, which causes single-nt amino acid changes to have less deleterious effects than multiple-nt-mutant codon changes. We see this trend in our data as well, further illustrating why we should compare circulating variants to the single-nt-mutant amino acid effects as the "raw" distribution of functional effects. (Median mutational effect of single-nt mutants = `r median(mutants[mutant!=wildtype & singlemut==T,bind_avg],na.rm=T)`; median mutational effect of all amino acid muts = `r median(mutants[mutant!=wildtype,bind_avg],na.rm=T)`; P-value `r round(wilcox.test(mutants[mutant!=wildtype & singlemut==T,bind_avg],mutants[mutant!=wildtype,bind_avg])$p.value,digits=5)`, Wilcoxon rank-sum test.)
To illustrate how selection acts on circulating variants, let's compare the functional effects of mutations, binned by number of observations in GISAID. The violin plots below show the distribution of functional effects on binding (left) and expression (right), comparing single-nt amino acid mutations with 0 observed counts in GISAID versus increasingly stringent GISAID count cutoffs. We can see a bias among circulating mutants for both binding and expression effects that are visually by eye higher than expected by random mutation alone. This suggests that purifying selection is removing deleterious RBD mutations that affect traits correlated with our measured binding and expression phenotypes. We may also see evidence that there is not strong positive selection for enhanced ACE2-binding affinity, as there are single mutations that can cause larger affinity increases than are actually seen in the GISAID set. We can evaluate these two conclusions further below with permutation tests.
```{r circulating_variant_DFEs, fig.width=8,fig.height=4,fig.align="center", dpi=500,dev="png",echo=FALSE}
#define factor for different nobs cutoffs, and collate into long data table for plotting
muts_temp <- mutants[singlemut==TRUE & wildtype!=mutant & mutant!="*", ]
muts_temp[,nobs_indicator:=as.factor("all single nt muts")]
muts_temp_add0 <- mutants[nobs>=1 & singlemut==TRUE & wildtype!=mutant & mutant!="*", ]
muts_temp_add0[,nobs_indicator:=as.factor(">=1")]
muts_temp_add1 <- mutants[nobs>=2 & singlemut==TRUE & wildtype!=mutant & mutant!="*", ]
muts_temp_add1[,nobs_indicator:=as.factor(">=2")]
muts_temp_add5 <- mutants[nobs>=6 & singlemut==TRUE & wildtype!=mutant & mutant!="*", ]
muts_temp_add5[,nobs_indicator:=as.factor(">=6")]
muts_temp <- rbind(muts_temp,muts_temp_add0,muts_temp_add1,muts_temp_add5)
set.seed(198)
p1 <- ggplot(muts_temp[!is.na(bind_avg),],aes(x=nobs_indicator,y=bind_avg))+
geom_boxplot(outlier.shape=16, width=0.4, outlier.alpha=0.5)+
#geom_jitter(width=0.2, alpha=0.1, shape=16)+
xlab("mutation count in GISAID sequences")+ylab("mutation effect on binding")+
theme_classic()
p2 <- ggplot(muts_temp[!is.na(nobs_indicator) & !is.na(expr_avg),],aes(x=nobs_indicator,y=expr_avg))+
geom_boxplot(outlier.shape=16, width=0.4, outlier.alpha=0.5)+
#geom_jitter(width=0.2, alpha=0.1, shape=16)+
xlab("mutation count in GISAID sequences")+ylab("mutation effect on expression")+
theme_classic()
grid.arrange(p1,p2,ncol=2)
invisible(dev.print(pdf, paste(config$circulating_variants_dir,"/distribution-binding_v_nobs-GISAID.pdf",sep="")))
```
Let's use permutation tests to evaluate whether the shift in median mutational effects among GISAID sequences of different minimum frequency cutoffs is significant relative to the overall distribution of single-mutant effects. We draw samples without replacement from the raw distribution of mutational effects (single-nt mutants only), drawing the same number of random mutations as in our GISAID sets. Within each random sample, we determine the median mutational effect, as well as the highest mutational effect and the fraction of mutations that have binding efects >0. We then compare our actual values to these random samples to evaluate the biases in mutational effects among mutations observed in the GISAID sequences.
The plots below show the distribution of a statistic in our 1e6 sampled sets, and the dashed line shows the actual value in our set of mutants observed in GISAID sequences some number of time -- the first set of plots for mutations observed 1 or more times in GISAID sequences, the second set of plots for mutations observed more than 1 time, and the third set of plots for mutations observed more than 5 times. We evaluate a P-value for each comparison by determining the fraction of subsampled replicates with a value equal to or greater than the actual value. We can see that the median effect of mutations is significantly higher in the GISAID mutation sets than random samples, indicating that purifying selection shapes the mutations that are sampled in GISAID. We do not see strong evidence for positive selection for affinity-enhancing mutations among circulating variants -- a large fraction of randomly sub-sampled sets of mutations sample at least one mutation with a higher affinity-enhancing effect than any seen in our actual dataset, and the fraction of mutations with binding effect >0, though significantly lower in sub-samples, is not strongly diverged, and is conflated by the purifying selection which wiill naturally increase this metric among the observed set of mutants.
```{r permute_samples_0, fig.width=12,fig.height=4,fig.align="center", dpi=500,dev="png",echo=TRUE}
set.seed(18041)
n_rep <- 1000000
median_0 <- vector(length=n_rep)
max_0 <- vector(length=n_rep)
frac_pos_0 <- vector(length=n_rep)
for(i in 1:n_rep){
sample <- sample(mutants[singlemut==T & wildtype!=mutant & mutant!="stop" & !is.na(bind_avg),bind_avg],nrow(mutants[nobs>0,]))
median_0[i] <- median(sample)
max_0[i] <- max(sample)
frac_pos_0[i] <- sum(sample>0)/length(sample)
}
par(mfrow=c(1,3))
hist(median_0,xlab="median mutational effect on binding",col="gray50",main=paste(">0 GISAID observations\nP-value",sum(median_0>=median(mutants[nobs>0,bind_avg]))/length(median_0)));abline(v=median(mutants[nobs>0,bind_avg]),lty=2)
hist(max_0,xlab="max mutational effect on binding",col="gray50",main=paste(">0 GISAID observations\nP-value",sum(max_0>=max(mutants[nobs>0,bind_avg]))/length(max_0)));abline(v=max(mutants[nobs>0,bind_avg]),lty=2)
hist(frac_pos_0,xlab="fraction muts with positive effects on binding",col="gray50",main=paste(">0 GISAID observations\nP-value",sum(frac_pos_0>=sum(mutants[nobs>0,bind_avg]>0)/nrow(mutants[nobs>0,]))/length(frac_pos_0)));abline(v=sum(mutants[nobs>0,bind_avg]>0)/nrow(mutants[nobs>0,]),lty=2)
invisible(dev.print(pdf, paste(config$circulating_variants_dir,"/permutation_greater-than-zero-GISAID.pdf",sep="")))
```
```{r permute_samples_1, fig.width=12,fig.height=4,fig.align="center", dpi=500,dev="png",echo=TRUE}
n_rep <- 1000000
median_1 <- vector(length=n_rep)
max_1 <- vector(length=n_rep)
frac_pos_1 <- vector(length=n_rep)
for(i in 1:n_rep){
sample <- sample(mutants[singlemut==T & wildtype!=mutant & mutant!="stop" & !is.na(bind_avg),bind_avg],nrow(mutants[nobs>1,]))
median_1[i] <- median(sample)
max_1[i] <- max(sample)
frac_pos_1[i] <- sum(sample>0)/length(sample)
}
par(mfrow=c(1,3))
hist(median_1,xlab="median mutational effect on binding",col="gray50",main=paste(">1 GISAID observations\nP-value",sum(median_1>=median(mutants[nobs>1,bind_avg]))/length(median_1)));abline(v=median(mutants[nobs>1,bind_avg]),lty=2)
hist(max_1,xlab="max mutational effect on binding",col="gray50",main=paste(">1 GISAID observations\nP-value",sum(max_1>=max(mutants[nobs>1,bind_avg]))/length(max_1)));abline(v=max(mutants[nobs>1,bind_avg]),lty=2)
hist(frac_pos_1,xlab="fraction muts with positive effects on binding",col="gray50",main=paste(">1 GISAID observations\nP-value",sum(frac_pos_1>=sum(mutants[nobs>1,bind_avg]>0)/nrow(mutants[nobs>1,]))/length(frac_pos_1)));abline(v=sum(mutants[nobs>1,bind_avg]>0)/nrow(mutants[nobs>1,]),lty=2)
invisible(dev.print(pdf, paste(config$circulating_variants_dir,"/permutation_greater-than-one-GISAID.pdf",sep="")))
```
```{r permute_samples_5, fig.width=10,fig.height=4,fig.align="center", dpi=500,dev="png",echo=TRUE}
n_rep <- 1000000
median_5 <- vector(length=n_rep)
max_5 <- vector(length=n_rep)
frac_pos_5 <- vector(length=n_rep)
for(i in 1:n_rep){
sample <- sample(mutants[singlemut==T & wildtype!=mutant & mutant!="stop" & !is.na(bind_avg),bind_avg],nrow(mutants[nobs>5,]))
median_5[i] <- median(sample)
max_5[i] <- max(sample)
frac_pos_5[i] <- sum(sample>0)/length(sample)
}
par(mfrow=c(1,3))
hist(median_5,xlab="median mutational effect on binding",col="gray50",main=paste(">5 GISAID observations\nP-value",sum(median_5>=median(mutants[nobs>5,bind_avg]))/length(median_5)));abline(v=median(mutants[nobs>5,bind_avg]),lty=2)
hist(max_5,xlab="max mutational effect on binding",col="gray50",main=paste(">5 GISAID observations\nP-value",sum(max_5>=max(mutants[nobs>5,bind_avg]))/length(max_5)));abline(v=max(mutants[nobs>5,bind_avg]),lty=2)
hist(frac_pos_5,xlab="fraction muts with positive effects on binding",col="gray50",main=paste(">5 GISAID observations\nP-value",sum(frac_pos_5>=sum(mutants[nobs>5,bind_avg]>0)/nrow(mutants[nobs>5,]))/length(frac_pos_5)));abline(v=sum(mutants[nobs>5,bind_avg]>0)/nrow(mutants[nobs>5,]),lty=2)
invisible(dev.print(pdf, paste(config$circulating_variants_dir,"/permutation_greater-than-five-GISAID.pdf",sep="")))
```
```{r permute_samples_0_expr, fig.width=12,fig.height=4,fig.align="center", dpi=500,dev="png",echo=TRUE}
set.seed(9248)
n_rep <- 1000000
median_0_expr <- vector(length=n_rep)
max_0_expr <- vector(length=n_rep)
frac_pos_0_expr <- vector(length=n_rep)
for(i in 1:n_rep){
sample <- sample(mutants[singlemut==T & wildtype!=mutant & mutant!="stop" & !is.na(expr_avg),expr_avg],nrow(mutants[nobs>0,]))
median_0_expr[i] <- median(sample)
max_0_expr[i] <- max(sample)
frac_pos_0_expr[i] <- sum(sample>0)/length(sample)
}
par(mfrow=c(1,3))
hist(median_0_expr,xlab="median mutational effect on expression",col="gray50",main=paste(">0 GISAID observations\nP-value",sum(median_0_expr>=median(mutants[nobs>0,expr_avg]))/length(median_0_expr)));abline(v=median(mutants[nobs>0,expr_avg]),lty=2)
hist(max_0_expr,xlab="max mutational effect on expression",col="gray50",main=paste(">0 GISAID observations\nP-value",sum(max_0_expr>=max(mutants[nobs>0,expr_avg]))/length(max_0_expr)));abline(v=max(mutants[nobs>0,expr_avg]),lty=2)
hist(frac_pos_0_expr,xlab="fraction muts with positive effects on expression",col="gray50",main=paste(">0 GISAID observations\nP-value",sum(frac_pos_0_expr>=sum(mutants[nobs>0,expr_avg]>0)/nrow(mutants[nobs>0,]))/length(frac_pos_0_expr)));abline(v=sum(mutants[nobs>0,expr_avg]>0)/nrow(mutants[nobs>0,]),lty=2)
invisible(dev.print(pdf, paste(config$circulating_variants_dir,"/permutation_greater-than-zero-GISAID_expr.pdf",sep="")))
```
```{r permute_samples_1_expr, fig.width=12,fig.height=4,fig.align="center", dpi=500,dev="png",echo=TRUE}
n_rep <- 1000000
median_1_expr <- vector(length=n_rep)
max_1_expr <- vector(length=n_rep)
frac_pos_1_expr <- vector(length=n_rep)
for(i in 1:n_rep){
sample <- sample(mutants[singlemut==T & wildtype!=mutant & mutant!="stop" & !is.na(expr_avg),expr_avg],nrow(mutants[nobs>1,]))
median_1_expr[i] <- median(sample)
max_1_expr[i] <- max(sample)
frac_pos_1_expr[i] <- sum(sample>0)/length(sample)
}
par(mfrow=c(1,3))
hist(median_1_expr,xlab="median mutational effect on expression",col="gray50",main=paste(">1 GISAID observations\nP-value",sum(median_1_expr>=median(mutants[nobs>1,expr_avg]))/length(median_1_expr)));abline(v=median(mutants[nobs>1,expr_avg]),lty=2)
hist(max_1_expr,xlab="max mutational effect on expression",col="gray50",main=paste(">1 GISAID observations\nP-value",sum(max_1_expr>=max(mutants[nobs>1,expr_avg]))/length(max_1_expr)));abline(v=max(mutants[nobs>1,expr_avg]),lty=2)
hist(frac_pos_1_expr,xlab="fraction muts with positive effects on expression",col="gray50",main=paste(">1 GISAID observations\nP-value",sum(frac_pos_1_expr>=sum(mutants[nobs>1,expr_avg]>0)/nrow(mutants[nobs>1,]))/length(frac_pos_1_expr)));abline(v=sum(mutants[nobs>1,expr_avg]>0)/nrow(mutants[nobs>1,]),lty=2)
invisible(dev.print(pdf, paste(config$circulating_variants_dir,"/permutation_greater-than-one-GISAID_expr.pdf",sep="")))
```
```{r permute_samples_5_expr, fig.width=10,fig.height=4,fig.align="center", dpi=500,dev="png",echo=TRUE}
n_rep <- 1000000
median_5_expr <- vector(length=n_rep)
max_5_expr <- vector(length=n_rep)
frac_pos_5_expr <- vector(length=n_rep)
for(i in 1:n_rep){
sample <- sample(mutants[singlemut==T & wildtype!=mutant & mutant!="stop" & !is.na(expr_avg),expr_avg],nrow(mutants[nobs>5,]))
median_5_expr[i] <- median(sample)
max_5_expr[i] <- max(sample)
frac_pos_5_expr[i] <- sum(sample>0)/length(sample)
}
par(mfrow=c(1,3))
hist(median_5_expr,xlab="median mutational effect on expression",col="gray50",main=paste(">5 GISAID observations\nP-value",sum(median_5_expr>=median(mutants[nobs>5,expr_avg]))/length(median_5_expr)));abline(v=median(mutants[nobs>5,expr_avg]),lty=2)
hist(max_5_expr,xlab="max mutational effect on expression",col="gray50",main=paste(">5 GISAID observations\nP-value",sum(max_5_expr>=max(mutants[nobs>5,expr_avg]))/length(max_5_expr)));abline(v=max(mutants[nobs>5,expr_avg]),lty=2)
hist(frac_pos_5_expr,xlab="fraction muts with positive effects on expression",col="gray50",main=paste(">5 GISAID observations\nP-value",sum(frac_pos_5_expr>=sum(mutants[nobs>5,expr_avg]>0)/nrow(mutants[nobs>5,]))/length(frac_pos_5_expr)));abline(v=sum(mutants[nobs>5,expr_avg]>0)/nrow(mutants[nobs>5,]),lty=2)
invisible(dev.print(pdf, paste(config$circulating_variants_dir,"/permutation_greater-than-five-GISAID_expr.pdf",sep="")))
```