circulating_variants.Rmd

---
title: "Circulating SARS-CoV-2 RBD variants"
author: "Tyler Starr"
date: "5/12/2020"
output:
  github_document:
    toc: true
    html_preview: false
editor_options: 
  chunk_output_type: inline

---
  
This notebook analyzes RBD variants that have been sampled in isolates within the current SARS-CoV-2 pandemic.

## Setup

```{r setup, message=FALSE, warning=FALSE, error=FALSE}
require("knitr")
knitr::opts_chunk$set(echo = T)
knitr::opts_chunk$set(dev.args = list(png = list(type = "cairo")))

#list of packages to install/load
packages = c("yaml","data.table","tidyverse","gridExtra","bio3d","seqinr")
#install any packages not already installed
installed_packages <- packages %in% rownames(installed.packages())
if(any(installed_packages == F)){
  install.packages(packages[!installed_packages])
}
#load packages
invisible(lapply(packages, library, character.only=T))

#read in config file
config <- read_yaml("config.yaml")

#read in file giving concordance between RBD numbering and SARS-CoV-2 Spike numbering
RBD_sites <- data.table(read.csv(file="data/RBD_sites.csv",stringsAsFactors=F))

#make output directory
if(!file.exists(config$circulating_variants_dir)){
  dir.create(file.path(config$circulating_variants_dir))
}
```
Session info for reproducing environment:
```{r print_sessionInfo}
sessionInfo()
```

Read in tables of variant effects on binding and expression for single mutations to the SARS-CoV-2 RBD.

```{r read_data}

mutants <- data.table(read.csv(file=config$single_mut_effects_file,stringsAsFactors = F))

#rename mutants site indices to prevent shared names with RBD_sites, simplifying some downstream calculations that cross-index these tables
setnames(mutants, "site_RBD", "RBD_site");setnames(mutants, "site_SARS2", "SARS2_site")

```

## Analyzing amino acid diversity in GISAID Spike sequences

We constructed an alignment of all Spike sequences available on GISAID as of 27 May, 2020. On the EpiCoV page, under downloads, one of the pre-made options is a fasta of all Spike sequences isolated thus far, which is updated each day. I have downloaded this file, unzipped, replaced spaces in fasta headers with underscores, and aligned sequences. We load in this alignment using the `read.fasta` function of the `bio3d` package, and trim the alignment to RBD residues. We remove sequecnes from non-human isolates (e.g. bat, pangolin, "environment", mink, cat, TIGER) and sequences with gap `-` characters, and then iterate through the alignment and save any observed mutations. We then filter mutations based on rationale below, and add counts of filtered observations for each mutation as an 'nobs' colum in our overall mutants data table.

We filter out any mutations that were *only* observed on sequences with large numbers of missing `X` characters -- from my initial pass, I saw some singleton amino acid variants which would require >1 nt change, and these were only found in a single sequence with many X amino acid characters (the first half of the sequence was well determined, but the second half was all X's, with the annotated "differences" being within short stretches between Xs with determiined amino acids), which made me realize I needed to be careful not only of sequences rich in gap "-" characters, but also ambiguous "X" characters. However, I didn't want to remove all sequences with undetermined characters off the bat, because another pattern I saw is that for isolates bearing the N439K mutation, >10 are well determined across the entire RBD, but ~80 have many X characters (in part of the RBD that is *not* near the N439K sequence call). So, my preference would be to believe a mutation observed in an X-rich sequence *if the variant in the sequence is observed in at least one variant that does not contain an X*, but not believe mutations that are *only* observed in X-rich sequences. (I noticed this issue with N439K, but this is not the only mutation like this which is observed on 0X sequences at least once but other times on sequences with X characters.) That is the filtering I therefore do below. This is basically the limits of my genomic dataset sleuthing. Is there anything else we should be doing to assess validity of observed mutants, particularly e.g. could N439K simply be a biased sequencing error that emerges in these Scotland sequencing samples? Would love ideas or help in better filtering amino acid variants to retain.


```{r gisaid_spike_alignment}
alignment <- bio3d::read.fasta(file="data/alignments/Spike_GISAID/spike_GISAID_aligned.fasta", rm.dup=T)

#remove non-human samples
keep <- grep("Human",alignment$id);  alignment$ali <- alignment$ali[keep,]; alignment$id <- alignment$id[keep]

#remove columns that are gaps in first reference sequence
alignment$ali <- alignment$ali[,alignment$ali[1,]!="-"]

alignment_RBD <- alignment; alignment_RBD$ali <- alignment$ali[,RBD_sites$site_SARS2]

#check that the first sequence entry matches our reference RBD sequence
stopifnot(sum(!(alignment_RBD$ali[1,] == RBD_sites[,amino_acid_SARS2]))==0)

#remove sequences have gaps, as the amino acid calls may be generally unreliable
remove <- c()
for(i in 1:nrow(alignment_RBD$ali)){
  if(sum(alignment_RBD$ali[i,]=="-") > 0){remove <- c(remove,i)}
}

alignment_RBD$ali <- alignment_RBD$ali[-remove,];alignment_RBD$id <- alignment_RBD$id[-remove]

#output all mutation differences from the WT/reference RBD sequence
#I do this by iterating over rows and columns of the alignment matrix which is STUPID but effective
variants_vec <- c()
isolates_vec <- c()
for(j in 1:ncol(alignment_RBD$ali)){
  #print(i)
  for(i in 1:nrow(alignment_RBD$ali)){
    if(alignment_RBD$ali[i,j] != alignment_RBD$ali[1,j] & !(alignment_RBD$ali[i,j] %in% c("X","-"))){
      variants_vec <- c(variants_vec, paste(alignment_RBD$ali[1,j],j,alignment_RBD$ali[i,j],sep=""))
      isolates_vec <- c(isolates_vec, alignment_RBD$id[i])
    }
  }
}

#remove any mutations that are *only* observed in X-rich sequences of dubious quality (keep counts in X-rich sequences if they were observed in at least one higher quality isolate)
#make a data frame that gives each observed mutation, the isolate it was observed in, and the number of X characters in that sequence. Also, parse the header to give the country/geographic division of the sample
variants <- data.frame(isolate=isolates_vec,mutation=variants_vec)
for(i in 1:nrow(variants)){
  variants$number_X[i] <- sum(alignment_RBD$ali[which(alignment_RBD$id == variants[i,"isolate"]),]=="X")
  variants$geography[i] <- strsplit(as.character(variants$isolate[i]),split="/")[[1]][2]
}
#filter the sequence set for mutations observed in at least one X=0 background
variants_filtered <- data.frame(mutation=unique(variants[variants$number_X==0,"mutation"])) #only keep variants observed in at least one sequence with 0 X
for(i in 1:nrow(variants_filtered)){
  variants_filtered$n_obs[i] <- sum(variants$mutation == variants_filtered$mutation[i]) #but keep counts for any sequence with observed filtered muts
  variants_filtered$n_geography[i] <- length(unique(variants[variants$mutation == variants_filtered$mutation[i],"geography"]))
  variants_filtered$list_geography[i] <- list(list(unique(variants[variants$mutation == variants_filtered$mutation[i],"geography"])))
}

#add count to mutants df
mutants[,nobs:=0]
mutants[,ngeo:=0]
mutants[,geo_list:=as.list(NA)]
for(i in 1:nrow(mutants)){
  if(mutants$mutation_RBD[i] %in% variants_filtered$mutation){
    mutants$nobs[i] <- variants_filtered[variants_filtered$mutation==mutants$mutation_RBD[i],"n_obs"]
    mutants$ngeo[i] <- variants_filtered[variants_filtered$mutation==mutants$mutation_RBD[i],"n_geography"]
    mutants$geo_list[i] <- variants_filtered[variants_filtered$mutation==mutants$mutation_RBD[i],"list_geography"]
  }
}


```

We see `r sum(mutants$nobs)` amino acid polymorphisims within the `r nrow(alignment_RBD$ali)` sequences uploaded in GISAID, which represents `r sum(mutants$nobs>0)` of our `r nrow(mutants[mutant!=wildtype & mutant!="*",])` measured missense mutants. In the table below, we can see that many of these mutations are observed only one or a few times, so there may still be unaccounted for sequencinig artifacts, which we tried to account for at least minimally with some filtering above.

```{r table_circulating_variants_nobs}
kable(table(mutants[mutant!=wildtype & mutant!="*",nobs]),col.names=c("mutation count","frequency"))
```

We plot each mutations experimental phenotype versus the number of times it is observed in the circulating Spike alignment, for binding (top) and expression (bottom), with the righthand plots simply zooming in on the region surrounding zero for better visualization. We can see that some of the mutations that are observed just one or a couple of times are highly deleterious, and anything sampled more than a handful of times exhibits ~neutral or perhaps small positive binding and/or expression effects.

```{r scatter_circulating_variants_nobs, fig.width=8,fig.height=8.5,fig.align="center", dpi=500,dev="png",echo=FALSE}
par(mfrow=c(2,2))
plot(mutants[nobs>0,nobs],mutants[nobs>0,bind_avg],pch=16,cex=1.2,col="#00000050",xlab="Number of observations among GISAID Spikes",ylab="delta log10(Ka,app)");abline(h=0,lty=2)

plot(mutants[nobs>0,nobs],mutants[nobs>0,bind_avg],pch=16,cex=1.2,col="#00000050",xlab="Number of observations among GISAID Spikes",ylab="delta log10(Ka,app)",ylim=c(-0.5,0.1));abline(h=0,lty=2)

plot(mutants[nobs>0,nobs],mutants[nobs>0,expr_avg],pch=16,cex=1.2,col="#00000050",xlab="Number of observations among GISAID Spikes",ylab="delta expression meanF");abline(h=0,lty=2)

plot(mutants[nobs>0,nobs],mutants[nobs>0,expr_avg],pch=16,cex=1.2,col="#00000050",xlab="Number of observations among GISAID Spikes",ylab="delta expression meanF",ylim=c(-1,1));abline(h=0,lty=2)

invisible(dev.print(pdf, paste(config$circulating_variants_dir,"/phenotype_v_GISAID-nobs.pdf",sep="")))
```

We also make plots showing mutational effects versus the number of geographic regions in which a mutation has been observed.

```{r scatter_circulating_variants_ngeography, fig.width=8,fig.height=8.5,fig.align="center", dpi=500,dev="png",echo=FALSE}
par(mfrow=c(2,2))
plot(mutants[nobs>0,ngeo],mutants[nobs>0,bind_avg],pch=16,cex=1.2,col="#00000050",xlab="Number of geographical observations",ylab="delta log10(Ka,app)");abline(h=0,lty=2)

plot(mutants[nobs>0,ngeo],mutants[nobs>0,bind_avg],pch=16,cex=1.2,col="#00000050",xlab="Number of geographical observations",ylab="delta log10(Ka,app)",ylim=c(-0.5,0.1));abline(h=0,lty=2)

plot(mutants[nobs>0,ngeo],mutants[nobs>0,expr_avg],pch=16,cex=1.2,col="#00000050",xlab="Number of geographical observations",ylab="delta expression meanF");abline(h=0,lty=2)

plot(mutants[nobs>0,ngeo],mutants[nobs>0,expr_avg],pch=16,cex=1.2,col="#00000050",xlab="Number of geographical observations",ylab="delta expression meanF",ylim=c(-1,1));abline(h=0,lty=2)

invisible(dev.print(pdf, paste(config$circulating_variants_dir,"/phenotype_v_GISAID-nobs.pdf",sep="")))
```

Here are tables giving mutations that were seen >20 times, and those seen any number of times with measured binding effects >0.05. We are currently slotted to validate the effect of N439K in both yeast display and pseudovirus/mammalian experimental assays, and V367F, T478I, and V483A in yeast display. (S477N just came online as being prevalent with the newest GISAID set of sequences we used, after we had started cloning for validations.)

```{r table_most_common_variants, echo=F}
kable(mutants[nobs>20,.(mutation,expr_lib1,expr_lib2,expr_avg,bind_lib1,bind_lib2,bind_avg,nobs,ngeo)],
      col.names=c("Mutation","expr, lib1","expr, lib2","expression effect","bind, lib1","bind, lib2", "binding effect","number of GISAID sequences", "number locations"))
```

```{r table_highest_binding_variants, echo=F}
kable(mutants[nobs>0 & bind_avg>0.05,.(mutation,expr_lib1,expr_lib2,expr_avg,bind_lib1,bind_lib2,bind_avg,nobs,ngeo)],
      col.names=c("Mutation","expr, lib1","expr, lib2","expression effect","bind, lib1","bind, lib2", "binding effect","number of GISAID sequences", "number locations"))
```

Let's visualize the positions with interesting circulating variants in our *favorite* exploratory heatmaps! Below, we output maps first for positions with circulating variants observed >20 times (left two maps), and second for those positions with circulating variants of at least >0.05 effect on binding (right two maps). These maps show the SARS-CoV-2 wildtype state with an "x" indicator, the SARS-CoV-1 state with an "o", and "^" marks any amino acid variants observed at least one time in the GISAID sequences.

```{r heatmap_circulating_variants, fig.width=8,fig.height=4,fig.align="center", dpi=500,dev="png",echo=FALSE}
#order mutant as a factor for grouping by rough biochemical grouping
mutants$mutant <- factor(mutants$mutant, levels=c("*","C","P","G","V","M","L","I","A","F","W","Y","T","S","N","Q","E","D","H","K","R"))
#add character vector indicating wildtype to use as plotting symbols for wt
mutants[,wildtype_indicator := ""]
mutants[mutant==wildtype,wildtype_indicator := "x"]
#indicator for wildtype SARS-CoV-1 state
mutants[,SARS1_indicator := ""]
for(i in 1:nrow(mutants)){
  SARS1aa <- RBD_sites[site_SARS2==mutants$SARS2_site[i],amino_acid_SARS1]
  if(!is.na(SARS1aa) & mutants$mutant[i] == SARS1aa){mutants$SARS1_indicator[i] <- "o"}
}
#add indicator "^" for observed variant amino acids
mutants[,variant_indicator:=""]
mutants[nobs>0,variant_indicator:="^"]


#make smaller data frame of positions with "high" frequency variants
muts_temp <- mutants[SARS2_site %in% mutants[nobs>20,SARS2_site],]; muts_temp$SARS2_site <- as.factor(muts_temp$SARS2_site)

p1 <- ggplot(muts_temp,aes(SARS2_site,mutant))+geom_tile(aes(fill=bind_avg))+
  scale_fill_gradientn(colours=c("#A94E35","#F48365","#FFFFFF","#7378B9","#383C6C"),limits=c(-5,0.5),values=c(0,2.5/5.5,5/5.5,5.25/5.5,5.5/5.5),na.value="gray")+
  scale_x_discrete(expand=c(0,0))+
  labs(x="RBD site",y="mutant")+theme_classic(base_size=9)+
  coord_equal()+theme(axis.text.x = element_text(angle=45,hjust=1))+
  geom_text(aes(label=wildtype_indicator),size=2,color="gray10")+
  geom_text(aes(label=variant_indicator),size=2,color="gray10")+
  geom_text(aes(label=SARS1_indicator),size=2,color="gray10")

p2 <- ggplot(muts_temp,aes(SARS2_site,mutant))+geom_tile(aes(fill=expr_avg))+
  scale_fill_gradientn(colours=c("#A94E35","#A94E35","#F48365","#FFFFFF","#7378B9","#383C6C"),limits=c(-5,1),values=c(0,1/6,3/6,5/6,5.5/6,6/6),na.value="gray")+
  scale_x_discrete(expand=c(0,0))+
  labs(x="RBD site",y="mutant")+theme_classic(base_size=9)+
  coord_equal()+theme(axis.text.x = element_text(angle=45,hjust=1))+
  geom_text(aes(label=wildtype_indicator),size=2,color="gray10")+
  geom_text(aes(label=variant_indicator),size=2,color="gray10")+
  geom_text(aes(label=SARS1_indicator),size=2,color="gray10")

#make smaller data frame of positions with "high" frequency variants
muts_temp <- mutants[SARS2_site %in% mutants[nobs>0 & bind_avg > 0.05,SARS2_site],]; muts_temp$SARS2_site <- as.factor(muts_temp$SARS2_site)

p3 <- ggplot(muts_temp,aes(SARS2_site,mutant))+geom_tile(aes(fill=bind_avg))+
  scale_fill_gradientn(colours=c("#A94E35","#F48365","#FFFFFF","#7378B9","#383C6C"),limits=c(-5,0.5),values=c(0,2.5/5.5,5/5.5,5.25/5.5,5.5/5.5),na.value="gray")+
  scale_x_discrete(expand=c(0,0))+
  labs(x="RBD site",y="mutant")+theme_classic(base_size=9)+
  coord_equal()+theme(axis.text.x = element_text(angle=45,hjust=1))+
  geom_text(aes(label=wildtype_indicator),size=2,color="gray10")+
  geom_text(aes(label=variant_indicator),size=2,color="gray10")+
  geom_text(aes(label=SARS1_indicator),size=2,color="gray10")

p4 <- ggplot(muts_temp,aes(SARS2_site,mutant))+geom_tile(aes(fill=expr_avg))+
  scale_fill_gradientn(colours=c("#A94E35","#A94E35","#F48365","#FFFFFF","#7378B9","#383C6C"),limits=c(-5,1),values=c(0,1/6,3/6,5/6,5.5/6,6/6),na.value="gray")+
  scale_x_discrete(expand=c(0,0))+
  labs(x="RBD site",y="mutant")+theme_classic(base_size=9)+
  coord_equal()+theme(axis.text.x = element_text(angle=45,hjust=1))+
  geom_text(aes(label=wildtype_indicator),size=2,color="gray10")+
  geom_text(aes(label=variant_indicator),size=2,color="gray10")+
  geom_text(aes(label=SARS1_indicator),size=2,color="gray10")


grid.arrange(p1,p2,p3,p4,ncol=4,widths=c(1,1,1.15,1.15))

invisible(dev.print(pdf, paste(config$circulating_variants_dir,"/heatmaps_circulating_variants.pdf",sep="")))
```

## Strength of selection among circulating variants

To characterize the effect of selection, we can compare the distribution of functional effects of mutations at different n_obs cutoffs, compared to the "raw" distribution of functional effects -- in this case, we should look at the DFE of only those amino acid mutations that can be introduced with single nucleotide mutations given the SARS-CoV-2 reference nt sequence.

First, we add a column to our mutants data frame indicating whether a mutation is accessible by single nucleotide mutations.

```{r single_codon_muts}
#define a function that takes a character of three nucleotides (a codon), and outputs all amino acids that can be accessed via single-nt mutation of that codon
get.codon.muts <- function(codon){
  nt <- c("a","c","g","t")
  codon_split <- strsplit(codon,split="")[[1]]
  codon_muts <- vector()
  for(i in nt[nt!=codon_split[1]]){
    codon_muts <- c(codon_muts,seqinr::translate(c(i,codon_split[2:3])))
  }
  for(i in nt[nt!=codon_split[2]]){
    codon_muts <- c(codon_muts,seqinr::translate(c(codon_split[1],i,codon_split[3])))
  }
  for(i in nt[nt!=codon_split[3]]){
    codon_muts <- c(codon_muts,seqinr::translate(c(codon_split[1:2],i)))
  }
  return(codon_muts)
}

mutants[,SARS2_codon:=RBD_sites[site_SARS2==SARS2_site,codon_SARS2],by=mutation]
mutants[,singlemut := mutant %in% get.codon.muts(SARS2_codon),by=mutation]
```

Are any of our observed GISAID mutations >1nt changes? In the current alignment, no!

```{r table_GISAID_multimuts}
kable(mutants[singlemut==F & nobs>0,.(mutation_RBD,mutation,expr_lib1,expr_lib2,expr_avg,bind_lib1,bind_lib2,bind_avg,nobs,SARS2_codon)])
```


Below is a heatmap of binding effects for all mutations accessible in single nucleotide changes from the SARS-CoV-2 WT reference sequence, with others grayed out. We can see that several affinity-enhancing mutations (including at least two that we are currently planning to validiate), are accessible with single-nt mutations. Therefore, affinity-enhancing mutations, if selectively beneficial, would be readily accessible via mutation. However, there are probably some positions where beneficial mutations are possible, but none are available from single-nt mutations which we could dig into if interesting.

```{r heatmap_binding_single_muts, fig.width=12,fig.height=6,fig.align="center", dpi=500,dev="png",echo=FALSE}
muts_temp <- copy(mutants)
muts_temp[singlemut==F,bind_avg:=NA]
muts_temp[singlemut==F,expr_avg:=NA]

p1 <- ggplot(muts_temp[mutant!="*" & SARS2_site %in% seq(331,431),],aes(SARS2_site,mutant))+geom_tile(aes(fill=bind_avg))+
  scale_fill_gradientn(colours=c("#A94E35","#F48365","#FFFFFF","#7378B9","#383C6C"),limits=c(-5,0.5),values=c(0,2.5/5.5,5/5.5,5.25/5.5,5.5/5.5),na.value="gray")+
  scale_x_continuous(expand=c(0,0),breaks=c(331,seq(340,430,by=5)))+
  labs(x="RBD site",y="mutant")+theme_classic(base_size=9)+
  coord_equal()+
  geom_text(aes(label=wildtype_indicator),size=2,color="gray10")

p2 <- ggplot(muts_temp[mutant!="*" & SARS2_site %in% seq(432,531),],aes(SARS2_site,mutant))+geom_tile(aes(fill=bind_avg))+
  scale_fill_gradientn(colours=c("#A94E35","#F48365","#FFFFFF","#7378B9","#383C6C"),limits=c(-5,0.5),values=c(0,2.5/5.5,5/5.5,5.25/5.5,5.5/5.5),na.value="gray")+
  scale_x_continuous(expand=c(0,0),breaks=c(432,seq(440,530,by=5)))+
  labs(x="RBD site",y="mutant")+theme_classic(base_size=9)+
  coord_equal()+
  geom_text(aes(label=wildtype_indicator),size=2,color="gray10")

grid.arrange(p1,p2,ncol=1)

invisible(dev.print(pdf, paste(config$circulating_variants_dir,"/single-nt-mut_heatmap.pdf",sep="")))
```

As has been seen in other studies, the genetic code is structured in a conservative way such that neighboring codons exhibit similiar biochemical properties, which causes single-nt amino acid changes to have less deleterious effects than multiple-nt-mutant codon changes. We see this trend in our data as well, further illustrating why we should compare circulating variants to the single-nt-mutant amino acid effects as the "raw" distribution of functional effects. (Median mutational effect of single-nt mutants = `r median(mutants[mutant!=wildtype & singlemut==T,bind_avg],na.rm=T)`; median mutational effect of all amino acid muts = `r median(mutants[mutant!=wildtype,bind_avg],na.rm=T)`; P-value `r round(wilcox.test(mutants[mutant!=wildtype & singlemut==T,bind_avg],mutants[mutant!=wildtype,bind_avg])$p.value,digits=5)`, Wilcoxon rank-sum test.)

To illustrate how selection acts on circulating variants, let's compare the functional effects of mutations, binned by number of observations in GISAID. The violin plots below show the distribution of functional effects on binding (left) and expression (right), comparing single-nt amino acid mutations with 0 observed counts in GISAID versus increasingly stringent GISAID count cutoffs. We can see a bias among circulating mutants for both binding and expression effects that are visually by eye higher than expected by random mutation alone. This suggests that purifying selection is removing deleterious RBD mutations that affect traits correlated with our measured binding and expression phenotypes. We may also see evidence that there is not strong positive selection for enhanced ACE2-binding affinity, as there are single mutations that can cause larger affinity increases than are actually seen in the GISAID set. We can evaluate these two conclusions further below with permutation tests.

```{r circulating_variant_DFEs, fig.width=8,fig.height=4,fig.align="center", dpi=500,dev="png",echo=FALSE}
#define factor for different nobs cutoffs, and collate into long data table for plotting
muts_temp <- mutants[singlemut==TRUE & wildtype!=mutant & mutant!="*", ]
muts_temp[,nobs_indicator:=as.factor("all single nt muts")]
muts_temp_add0 <- mutants[nobs>=1 & singlemut==TRUE & wildtype!=mutant & mutant!="*", ]
muts_temp_add0[,nobs_indicator:=as.factor(">=1")]
muts_temp_add1 <- mutants[nobs>=2 & singlemut==TRUE & wildtype!=mutant & mutant!="*", ]
muts_temp_add1[,nobs_indicator:=as.factor(">=2")]
muts_temp_add5 <- mutants[nobs>=6 & singlemut==TRUE & wildtype!=mutant & mutant!="*", ]
muts_temp_add5[,nobs_indicator:=as.factor(">=6")]

muts_temp <- rbind(muts_temp,muts_temp_add0,muts_temp_add1,muts_temp_add5)

set.seed(198)
p1 <- ggplot(muts_temp[!is.na(bind_avg),],aes(x=nobs_indicator,y=bind_avg))+
  geom_boxplot(outlier.shape=16, width=0.4, outlier.alpha=0.5)+
  #geom_jitter(width=0.2, alpha=0.1, shape=16)+
  xlab("mutation count in GISAID sequences")+ylab("mutation effect on binding")+
  theme_classic()

p2 <- ggplot(muts_temp[!is.na(nobs_indicator) & !is.na(expr_avg),],aes(x=nobs_indicator,y=expr_avg))+
  geom_boxplot(outlier.shape=16, width=0.4, outlier.alpha=0.5)+
  #geom_jitter(width=0.2, alpha=0.1, shape=16)+
  xlab("mutation count in GISAID sequences")+ylab("mutation effect on expression")+
  theme_classic()

grid.arrange(p1,p2,ncol=2)

invisible(dev.print(pdf, paste(config$circulating_variants_dir,"/distribution-binding_v_nobs-GISAID.pdf",sep="")))
```

Let's use permutation tests to evaluate whether the shift in median mutational effects among GISAID sequences of different minimum frequency cutoffs is significant relative to the overall distribution of single-mutant effects. We draw samples without replacement from the raw distribution of mutational effects (single-nt mutants only), drawing the same number of random mutations as in our GISAID sets. Within each random sample, we determine the median mutational effect, as well as the highest mutational effect and the fraction of mutations that have binding efects >0. We then compare our actual values to these random samples to evaluate the biases in mutational effects among mutations observed in the GISAID sequences.

The plots below show the distribution of a statistic in our 1e6 sampled sets, and the dashed line shows the actual value in our set of mutants observed in GISAID sequences some number of time -- the first set of plots for mutations observed 1 or more times in GISAID sequences, the second set of plots for mutations observed more than 1 time, and the third set of plots for mutations observed more than 5 times. We evaluate a P-value for each comparison by determining the fraction of subsampled replicates with a value equal to or greater than the actual value. We can see that the median effect of mutations is significantly higher in the GISAID mutation sets than random samples, indicating that purifying selection shapes the mutations that are sampled in GISAID. We do not see strong evidence for positive selection for affinity-enhancing mutations among circulating variants -- a large fraction of randomly sub-sampled sets of mutations sample at least one mutation with a higher affinity-enhancing effect than any seen in our actual dataset, and the fraction of mutations with binding effect >0, though significantly lower in sub-samples, is not strongly diverged, and is conflated by the purifying selection which wiill naturally increase this metric among the observed set of mutants. 

```{r permute_samples_0, fig.width=12,fig.height=4,fig.align="center", dpi=500,dev="png",echo=TRUE}
set.seed(18041)
n_rep <- 1000000
median_0 <- vector(length=n_rep)
max_0 <- vector(length=n_rep)
frac_pos_0 <- vector(length=n_rep)

for(i in 1:n_rep){
  sample <- sample(mutants[singlemut==T & wildtype!=mutant & mutant!="stop" & !is.na(bind_avg),bind_avg],nrow(mutants[nobs>0,]))
  median_0[i] <- median(sample)
  max_0[i] <- max(sample)
  frac_pos_0[i] <- sum(sample>0)/length(sample)
}

par(mfrow=c(1,3))
hist(median_0,xlab="median mutational effect on binding",col="gray50",main=paste(">0 GISAID observations\nP-value",sum(median_0>=median(mutants[nobs>0,bind_avg]))/length(median_0)));abline(v=median(mutants[nobs>0,bind_avg]),lty=2)

hist(max_0,xlab="max mutational effect on binding",col="gray50",main=paste(">0 GISAID observations\nP-value",sum(max_0>=max(mutants[nobs>0,bind_avg]))/length(max_0)));abline(v=max(mutants[nobs>0,bind_avg]),lty=2)

hist(frac_pos_0,xlab="fraction muts with positive effects on binding",col="gray50",main=paste(">0 GISAID observations\nP-value",sum(frac_pos_0>=sum(mutants[nobs>0,bind_avg]>0)/nrow(mutants[nobs>0,]))/length(frac_pos_0)));abline(v=sum(mutants[nobs>0,bind_avg]>0)/nrow(mutants[nobs>0,]),lty=2)

invisible(dev.print(pdf, paste(config$circulating_variants_dir,"/permutation_greater-than-zero-GISAID.pdf",sep="")))
```

```{r permute_samples_1, fig.width=12,fig.height=4,fig.align="center", dpi=500,dev="png",echo=TRUE}
n_rep <- 1000000
median_1 <- vector(length=n_rep)
max_1 <- vector(length=n_rep)
frac_pos_1 <- vector(length=n_rep)

for(i in 1:n_rep){
  sample <- sample(mutants[singlemut==T & wildtype!=mutant & mutant!="stop" & !is.na(bind_avg),bind_avg],nrow(mutants[nobs>1,]))
  median_1[i] <- median(sample)
  max_1[i] <- max(sample)
  frac_pos_1[i] <- sum(sample>0)/length(sample)
}

par(mfrow=c(1,3))
hist(median_1,xlab="median mutational effect on binding",col="gray50",main=paste(">1 GISAID observations\nP-value",sum(median_1>=median(mutants[nobs>1,bind_avg]))/length(median_1)));abline(v=median(mutants[nobs>1,bind_avg]),lty=2)

hist(max_1,xlab="max mutational effect on binding",col="gray50",main=paste(">1 GISAID observations\nP-value",sum(max_1>=max(mutants[nobs>1,bind_avg]))/length(max_1)));abline(v=max(mutants[nobs>1,bind_avg]),lty=2)

hist(frac_pos_1,xlab="fraction muts with positive effects on binding",col="gray50",main=paste(">1 GISAID observations\nP-value",sum(frac_pos_1>=sum(mutants[nobs>1,bind_avg]>0)/nrow(mutants[nobs>1,]))/length(frac_pos_1)));abline(v=sum(mutants[nobs>1,bind_avg]>0)/nrow(mutants[nobs>1,]),lty=2)

invisible(dev.print(pdf, paste(config$circulating_variants_dir,"/permutation_greater-than-one-GISAID.pdf",sep="")))
```

```{r permute_samples_5, fig.width=10,fig.height=4,fig.align="center", dpi=500,dev="png",echo=TRUE}
n_rep <- 1000000
median_5 <- vector(length=n_rep)
max_5 <- vector(length=n_rep)
frac_pos_5 <- vector(length=n_rep)

for(i in 1:n_rep){
  sample <- sample(mutants[singlemut==T & wildtype!=mutant & mutant!="stop" & !is.na(bind_avg),bind_avg],nrow(mutants[nobs>5,]))
  median_5[i] <- median(sample)
  max_5[i] <- max(sample)
  frac_pos_5[i] <- sum(sample>0)/length(sample)
}

par(mfrow=c(1,3))
hist(median_5,xlab="median mutational effect on binding",col="gray50",main=paste(">5 GISAID observations\nP-value",sum(median_5>=median(mutants[nobs>5,bind_avg]))/length(median_5)));abline(v=median(mutants[nobs>5,bind_avg]),lty=2)

hist(max_5,xlab="max mutational effect on binding",col="gray50",main=paste(">5 GISAID observations\nP-value",sum(max_5>=max(mutants[nobs>5,bind_avg]))/length(max_5)));abline(v=max(mutants[nobs>5,bind_avg]),lty=2)

hist(frac_pos_5,xlab="fraction muts with positive effects on binding",col="gray50",main=paste(">5 GISAID observations\nP-value",sum(frac_pos_5>=sum(mutants[nobs>5,bind_avg]>0)/nrow(mutants[nobs>5,]))/length(frac_pos_5)));abline(v=sum(mutants[nobs>5,bind_avg]>0)/nrow(mutants[nobs>5,]),lty=2)

invisible(dev.print(pdf, paste(config$circulating_variants_dir,"/permutation_greater-than-five-GISAID.pdf",sep="")))
```

```{r permute_samples_0_expr, fig.width=12,fig.height=4,fig.align="center", dpi=500,dev="png",echo=TRUE}
set.seed(9248)
n_rep <- 1000000
median_0_expr <- vector(length=n_rep)
max_0_expr <- vector(length=n_rep)
frac_pos_0_expr <- vector(length=n_rep)

for(i in 1:n_rep){
  sample <- sample(mutants[singlemut==T & wildtype!=mutant & mutant!="stop" & !is.na(expr_avg),expr_avg],nrow(mutants[nobs>0,]))
  median_0_expr[i] <- median(sample)
  max_0_expr[i] <- max(sample)
  frac_pos_0_expr[i] <- sum(sample>0)/length(sample)
}

par(mfrow=c(1,3))
hist(median_0_expr,xlab="median mutational effect on expression",col="gray50",main=paste(">0 GISAID observations\nP-value",sum(median_0_expr>=median(mutants[nobs>0,expr_avg]))/length(median_0_expr)));abline(v=median(mutants[nobs>0,expr_avg]),lty=2)

hist(max_0_expr,xlab="max mutational effect on expression",col="gray50",main=paste(">0 GISAID observations\nP-value",sum(max_0_expr>=max(mutants[nobs>0,expr_avg]))/length(max_0_expr)));abline(v=max(mutants[nobs>0,expr_avg]),lty=2)

hist(frac_pos_0_expr,xlab="fraction muts with positive effects on expression",col="gray50",main=paste(">0 GISAID observations\nP-value",sum(frac_pos_0_expr>=sum(mutants[nobs>0,expr_avg]>0)/nrow(mutants[nobs>0,]))/length(frac_pos_0_expr)));abline(v=sum(mutants[nobs>0,expr_avg]>0)/nrow(mutants[nobs>0,]),lty=2)

invisible(dev.print(pdf, paste(config$circulating_variants_dir,"/permutation_greater-than-zero-GISAID_expr.pdf",sep="")))
```

```{r permute_samples_1_expr, fig.width=12,fig.height=4,fig.align="center", dpi=500,dev="png",echo=TRUE}
n_rep <- 1000000
median_1_expr <- vector(length=n_rep)
max_1_expr <- vector(length=n_rep)
frac_pos_1_expr <- vector(length=n_rep)

for(i in 1:n_rep){
  sample <- sample(mutants[singlemut==T & wildtype!=mutant & mutant!="stop" & !is.na(expr_avg),expr_avg],nrow(mutants[nobs>1,]))
  median_1_expr[i] <- median(sample)
  max_1_expr[i] <- max(sample)
  frac_pos_1_expr[i] <- sum(sample>0)/length(sample)
}

par(mfrow=c(1,3))
hist(median_1_expr,xlab="median mutational effect on expression",col="gray50",main=paste(">1 GISAID observations\nP-value",sum(median_1_expr>=median(mutants[nobs>1,expr_avg]))/length(median_1_expr)));abline(v=median(mutants[nobs>1,expr_avg]),lty=2)

hist(max_1_expr,xlab="max mutational effect on expression",col="gray50",main=paste(">1 GISAID observations\nP-value",sum(max_1_expr>=max(mutants[nobs>1,expr_avg]))/length(max_1_expr)));abline(v=max(mutants[nobs>1,expr_avg]),lty=2)

hist(frac_pos_1_expr,xlab="fraction muts with positive effects on expression",col="gray50",main=paste(">1 GISAID observations\nP-value",sum(frac_pos_1_expr>=sum(mutants[nobs>1,expr_avg]>0)/nrow(mutants[nobs>1,]))/length(frac_pos_1_expr)));abline(v=sum(mutants[nobs>1,expr_avg]>0)/nrow(mutants[nobs>1,]),lty=2)

invisible(dev.print(pdf, paste(config$circulating_variants_dir,"/permutation_greater-than-one-GISAID_expr.pdf",sep="")))
```

```{r permute_samples_5_expr, fig.width=10,fig.height=4,fig.align="center", dpi=500,dev="png",echo=TRUE}
n_rep <- 1000000
median_5_expr <- vector(length=n_rep)
max_5_expr <- vector(length=n_rep)
frac_pos_5_expr <- vector(length=n_rep)

for(i in 1:n_rep){
  sample <- sample(mutants[singlemut==T & wildtype!=mutant & mutant!="stop" & !is.na(expr_avg),expr_avg],nrow(mutants[nobs>5,]))
  median_5_expr[i] <- median(sample)
  max_5_expr[i] <- max(sample)
  frac_pos_5_expr[i] <- sum(sample>0)/length(sample)
}

par(mfrow=c(1,3))
hist(median_5_expr,xlab="median mutational effect on expression",col="gray50",main=paste(">5 GISAID observations\nP-value",sum(median_5_expr>=median(mutants[nobs>5,expr_avg]))/length(median_5_expr)));abline(v=median(mutants[nobs>5,expr_avg]),lty=2)

hist(max_5_expr,xlab="max mutational effect on expression",col="gray50",main=paste(">5 GISAID observations\nP-value",sum(max_5_expr>=max(mutants[nobs>5,expr_avg]))/length(max_5_expr)));abline(v=max(mutants[nobs>5,expr_avg]),lty=2)

hist(frac_pos_5_expr,xlab="fraction muts with positive effects on expression",col="gray50",main=paste(">5 GISAID observations\nP-value",sum(frac_pos_5_expr>=sum(mutants[nobs>5,expr_avg]>0)/nrow(mutants[nobs>5,]))/length(frac_pos_5_expr)));abline(v=sum(mutants[nobs>5,expr_avg]>0)/nrow(mutants[nobs>5,]),lty=2)

invisible(dev.print(pdf, paste(config$circulating_variants_dir,"/permutation_greater-than-five-GISAID_expr.pdf",sep="")))
```