structure_function.Rmd

---
title: "Structure function analysis of mutational effects"
author: "Tyler Starr"
date: "5/11/2020"
output:
  github_document:
    toc: true
    html_preview: false
editor_options: 
  chunk_output_type: inline

---
  
This notebook analyzes our single mutant effects on binding and expression in light of structural features of the RBD.

```{r setup, message=FALSE, warning=FALSE, error=FALSE}
require("knitr")
knitr::opts_chunk$set(echo = T)
knitr::opts_chunk$set(dev.args = list(png = list(type = "cairo")))

#list of packages to install/load
packages = c("yaml","data.table","tidyverse","bio3d","gridExtra","egg","ggridges")
#install any packages not already installed
installed_packages <- packages %in% rownames(installed.packages())
if(any(installed_packages == F)){
  install.packages(packages[!installed_packages])
}
#load packages
invisible(lapply(packages, library, character.only=T))

#read in config file
config <- read_yaml("config.yaml")

#read in file giving concordance between RBD numbering and SARS-CoV-2 Spike numbering
RBD_sites <- data.table(read.csv(file="data/RBD_sites.csv",stringsAsFactors=F))

#make output directories
if(!file.exists(config$structure_function_dir)){
  dir.create(file.path(config$structure_function_dir))
}

```

Session info for reproducing environment:
```{r print_sessionInfo}
sessionInfo()
```

## Setup

Read in tables of variant effects on binding and expression, for single mutations to the SARS-CoV-2 RBD and for a panel of homolog RBDs from the sarbecovirus clade.

```{r read_data}
homologs <- data.table(read.csv(file=config$homolog_effects_file,stringsAsFactors = F))
mutants <- data.table(read.csv(file=config$single_mut_effects_file,stringsAsFactors = F))

#rename mutants site indices to prevent shared names with RBD_sites, simplifying some downstream calculations that cross-index these tables
setnames(mutants, "site_RBD", "RBD_site");setnames(mutants, "site_SARS2", "SARS2_site")

#add color column to homologs, by clade
cbPalette <- c("#999999", "#E69F00", "#56B4E9", "#009E73", "#F0E442", "#0072B2", "#D55E00", "#CC79A7")
homologs$clade_color <- as.character(NA); homologs[clade=="Clade 1",clade_color := cbPalette[4]]; homologs[clade=="Clade 2",clade_color := cbPalette[2]]; homologs[clade=="Clade 3",clade_color := cbPalette[8]]; homologs[clade=="SARS-CoV-2",clade_color := cbPalette[6]]
#add plotting character to homologs, by clade
homologs$clade_pch <- as.numeric(NA); homologs[clade=="Clade 1",clade_pch := 15]; homologs[clade=="Clade 2",clade_pch := 17]; homologs[clade=="Clade 3",clade_pch := 18]; homologs[clade=="SARS-CoV-2",clade_pch  := 16]
```

## General structural constraints on RBD affinity and stability

Let's investigate how the RBD structure influences mutational effects on expression and binding. First, we compute the *mean* effect of mutations on binding and expression for each RBD site, as well as the best (max) and worst (min) mutational effects on these two measurements (excluding nonsense and synonymous mutants). To me, the mean captures overall constraint on a position, but min and max can add some extra context -- in particular the max effect (the "best" mutation that can be introduced) can distinguish between two positions with a strong negative *average* effect of mutations, though one position might tolerate *some* amino acid mutations without defect, while another might not tolerate *any* of the possible 19 amino acids. Particularly when thinking about antibody epitopes, this max parameter might be a useful indicator of extreme constraint on some positions (since "escape" from an antibody only requires that at least one amino acid mutation is tolerated, not the average of all 19).

```{r mean_max_mut_effect_per_site}
RBD_sites[,mean_bind := mean(mutants[SARS2_site==site_SARS2 & wildtype != mutant & mutant != "*",bind_avg],na.rm=T),by=site_SARS2]
RBD_sites[,max_bind := max(mutants[SARS2_site==site_SARS2 & wildtype != mutant & mutant != "*",bind_avg],na.rm=T),by=site_SARS2]
RBD_sites[,min_bind := min(mutants[SARS2_site==site_SARS2 & wildtype != mutant & mutant != "*",bind_avg],na.rm=T),by=site_SARS2]

RBD_sites[,mean_expr := mean(mutants[SARS2_site==site_SARS2 & wildtype != mutant & mutant != "*",expr_avg],na.rm=T),by=site_SARS2]
RBD_sites[,max_expr := max(mutants[SARS2_site==site_SARS2 & wildtype != mutant & mutant != "*",expr_avg],na.rm=T),by=site_SARS2]
RBD_sites[,min_expr := min(mutants[SARS2_site==site_SARS2 & wildtype != mutant & mutant != "*",expr_avg],na.rm=T),by=site_SARS2]
```

First, let's see how mutational effects on binding and expression correlate at the level of individual mutations and at the site-level mean effects of mutation. We can see below that for many mutations and sites, mutational effects on expression and binding are related, indicating that stability is a generic constraint on mutational effects on ACE2-binding. However, there are a handful of positions that deviate from this pattern, being tolerant to mutation with respect to expression despite being quite sensitive to mutation with respect to binding. Below, we output the sites of these mutations, which we can visualize on the ACE2-bound RBD structure using `dms-view`, linked [here for sites](https://dms-view.github.io/?pdb-url=https%3A%2F%2Fraw.githubusercontent.com%2Fdms-view%2FSARS-CoV-2%2Fmaster%2Fdata%2FSpike%2FBloomLab2020%2F6m0j.pdb&markdown-url=https%3A%2F%2Fraw.githubusercontent.com%2Fdms-view%2FSARS-CoV-2%2Fmaster%2Fdata%2FSpike%2FBloomLab2020%2FBloomLab_rbd.md&data-url=https%3A%2F%2Fraw.githubusercontent.com%2Fdms-view%2FSARS-CoV-2%2Fmaster%2Fdata%2FSpike%2FBloomLab2020%2Fresults%2FBloomLab2020_rbd.csv&condition=natural+frequencies&site_metric=site_entropy&mutation_metric=mut_frequency&selected_sites=447%2C449%2C456%2C473%2C476%2C487%2C489%2C496%2C500%2C502%2C505) and [here for mutations](https://dms-view.github.io/?pdb-url=https%3A%2F%2Fraw.githubusercontent.com%2Fdms-view%2FSARS-CoV-2%2Fmaster%2Fdata%2FSpike%2FBloomLab2020%2F6m0j.pdb&markdown-url=https%3A%2F%2Fraw.githubusercontent.com%2Fdms-view%2FSARS-CoV-2%2Fmaster%2Fdata%2FSpike%2FBloomLab2020%2FBloomLab_rbd.md&data-url=https%3A%2F%2Fraw.githubusercontent.com%2Fdms-view%2FSARS-CoV-2%2Fmaster%2Fdata%2FSpike%2FBloomLab2020%2Fresults%2FBloomLab2020_rbd.csv&condition=natural+frequencies&site_metric=site_entropy&mutation_metric=mut_frequency&selected_sites=443%2C455%2C456%2C475%2C487%2C489%2C496%2C498%2C500%2C501%2C502%2C505) that fall in this binding-specific defective class. We can see that these sites exhibiting binding-specific mutational sensitivity are at the ACE2-contact interface, or in the case of one mutation (S443N), perhaps second shell posititions that are still ACE2-proximal. This is consistent with these positions having binding constraints independent of stability because of their direct interaction with ACE2.

```{r correlation_mut_expression_binding, fig.width=8,fig.height=4.25,fig.align="center", dpi=300,dev="png"}
par(mfrow=c(1,2))
plot(RBD_sites$mean_expr,RBD_sites$mean_bind,pch=16,col="#00000080",xlab="mean mutational effect on expression",ylab="mean mutational effect on binding",main="binding versus expression effects,\naverage per site")

plot(mutants$expr_avg,mutants$bind_avg,pch=16,col="#00000080",xlab="mutational effect on expression",ylab="mutational effect on binding",main="binding versus expression effects,\nper mutant")

invisible(dev.print(pdf, paste(config$structure_function_dir,"/correlation_expression_v_binding.pdf",sep="")))

#output sites and mutations with seemingly binding-specific detrimental effects
RBD_sites[mean_expr > -1 & mean_bind < -1,site_SARS2]
mutants[expr_avg > -0.5 & bind_avg < -2,mutation]
```
The relative solvent accessibility (RSA) of an amino acid residue is known to be a dominant factor influencing its tolerance to mutation. Let's see how RSA of a position is related to its mutational sensitivity for binding. We use RSA in two different structural contexts -- the free RBD structure, and the RBD structure when complexed with ACE2. We can see that mutational sensitivity of a position with respect to binding is better described by RSA in the ACE2-bound RBD complex. This explains our observation above -- some positions are mutationally sensitive, because they are buried in the isolated RBD, and so mutations destabilize the core fold and thereby hamper binding, while others are mutationally sensitive not because they are buried in the core fold and are sensitive to destabilizing mutations, but because they are buried at the ACE2 interface. These two factors combine to explain our overall observed patterns of mutational tolerance.

```{r mean_mut_effect_versus_RSA, fig.width=9,fig.height=3.1,fig.align="center", dpi=300,dev="png", echo=F}
par(mfrow=c(1,3))
plot(RBD_sites$RSA_unbound,RBD_sites$mean_bind,pch=16,col="gray80",xlab="Residue relative solvent accessibility, unbound",ylab="Mean effect of mutation on binding",main="Unbound RBD",xlim=c(0,1))

plot(RBD_sites$RSA_bound,RBD_sites$mean_bind,pch=16,col="#00000080",xlab="Residue relative solvent accessibility, ACE2-bound",ylab="Mean effect of mutation on binding",main="ACE2-bound RBD")


plot(RBD_sites$RSA_unbound,RBD_sites$mean_bind,pch=16,col="gray80",xlab="Residue relative solvent accessibility",ylab="Mean effect of mutation on binding",main="overlaid")
points(RBD_sites$RSA_bound,RBD_sites$mean_bind,pch=16, col="gray20")

#save pdf
invisible(dev.print(pdf, paste(config$structure_function_dir,"/mean-bind_v_RSA.pdf",sep=""), useDingbats=F))

```

Next, we want to investigate how mutational tolerance differs between the core alpha+beta RBD fold versus the 'Receptor Binding Motif' (RBM) loops. In particular, I want to test the hypothesis that the core RBD is more constrained with regards to mutational effects on expression, while the RBM is constrained with regards to ACE2-binding. We plot distributions of mutational effects within the core RBD versus in the RBM, for binding and expression phenotypes. As we hypothesized, mutations in the RBM tend to have a more detrimental effect on direct binding affinity (P-value `r wilcox.test(mutants[SARS2_site %in% RBD_sites[RBM==T,site_SARS2] & mutant!="*" & mutant!=wildtype,bind_avg],mutants[!(SARS2_site %in% RBD_sites[RBM==T,site_SARS2]) & mutant!="*" & mutant!=wildtype,bind_avg])$p.value`, median mutation effect `r median(mutants[SARS2_site %in% RBD_sites[RBM==T,site_SARS2] & mutant!="*" & mutant!=wildtype,bind_avg],na.rm=T)` in the RBM, `r median(mutants[!(SARS2_site %in% RBD_sites[RBM==T,site_SARS2]) & mutant!="*" & mutant!=wildtype,bind_avg],na.rm=T)` in the core RBD). In contrast, mutations in the core RBD tend to have a more detrimental effect on expression (~stability) (P-value `r wilcox.test(mutants[SARS2_site %in% RBD_sites[RBM==T,site_SARS2] & mutant!="*" & mutant!=wildtype,expr_avg],mutants[!(SARS2_site %in% RBD_sites[RBM==T,site_SARS2]) & mutant!="*" & mutant!=wildtype,expr_avg])$p.value`, median mutation effect `r median(mutants[SARS2_site %in% RBD_sites[RBM==T,site_SARS2] & mutant!="*" & mutant!=wildtype,expr_avg],na.rm=T)` in the RBM, `r median(mutants[!(SARS2_site %in% RBD_sites[RBM==T,site_SARS2]) & mutant!="*" & mutant!=wildtype,expr_avg],na.rm=T)` in the core RBD)

```{r RBM_versus_core_fold_mutations, fig.width=8,fig.height=4,fig.align="center", dpi=500,dev="png",echo=FALSE, message=FALSE, warning=FALSE}
mutants[,RBM := FALSE]
mutants[SARS2_site %in% RBD_sites[RBM==T,site_SARS2],RBM := TRUE]

p1 <- ggplot(mutants[mutant!="*" & mutant!=wildtype,], aes(x=RBM,y=bind_avg))+
  geom_violin(trim=F)+coord_flip()+
  stat_summary(fun.y=median,geom="point",size=1)+
  ylab("mutational effect on binding")+xlab("Site in RBM?")

p2 <- ggplot(mutants[mutant!="*" & mutant!=wildtype,], aes(x=RBM,y=expr_avg))+
  geom_violin(trim=F)+coord_flip()+
  stat_summary(fun.y=median,geom="point",size=1)+
  ylab("mutational effect on expression")+xlab("Site in RBM?")

grid.arrange(p1,p2,ncol=2)
```

To further visualize site-wise mutational sensitivity on the 3D structure, let's output `.pdb` files for the ACE2-bound RBD structure in which we replace the B factor column with metrics of interest for each site: 1) the mean effect of mutation on binding, 2) the mean effect of mutation on expression, 3) the max effect of any mutation on binding, and 4) the max effect of any mutation on expression. In PyMol, we can then visualize spheres at each Calpha, colored by spectrum from low (red, mean effect equal to or less than -2) to high (white, mean effect equal to or greater than 0) for each metric by manually executing the following commands in a PyMol session in which one of the output `pdb` files is opened:
```
hide all; show cartoon
color warmpink, chain A; color gray60, chain E
set sphere_scale, 0.6
create RBD_CA, chain E and name ca
hide cartoon, RBD_CA; show spheres, RBD_CA
spectrum b, red white, RBD_CA, minimum=-2, maximum=0
```

To create a surface representation of the RBD colored by mutational tolerance, execute the following commands in a PyMol session with one of these output `pdb` files loaded:
```
create RBD, chain E
hide all; show cartoon, chain A; color warmpink, chain A
show surface, RBD; spectrum b, red white, RBD, minimum=-2, maximum=0
```

Sets of commands to create more elaborate structural alignments and views are given in the `~/structural_views/` subdirectory, e.g. the `surface_constraint_commands.txt` series of commands which, when executed from a PyMol session hosted within that subdirectory loads in various RBD structures bound to ACE2, mAbs, and full Spike, and colors the RBD surface by mutational constraint.

```{r output_pdbs}
pdb <- read.pdb(file="data/structures/ACE2-bound/6m0j.pdb")

#color by mean effect on binding
pdb_mean_bind <- pdb
pdb_mean_bind$atom$b <- NA
for(i in 1:nrow(pdb_mean_bind$atom)){
  res <- pdb_mean_bind$atom$resno[i]
  chain <- pdb_mean_bind$atom$chain[i]
  mean_bind <- RBD_sites[site_SARS2==res & chain_6M0J==chain & !is.na(chain_6M0J),mean_bind]
  if(length(mean_bind)>0){pdb_mean_bind$atom$b[i] <- mean_bind}else{pdb_mean_bind$atom$b[i] <- 0}
}
#save pdb
write.pdb(pdb=pdb_mean_bind,file=paste(config$structure_function_dir,"/6m0j_b-factor-mean-bind.pdb",sep=""), b = pdb_mean_bind$atom$b)

#color by max effect on binding
pdb_max_bind <- pdb
pdb_max_bind$atom$b <- NA
for(i in 1:nrow(pdb_max_bind$atom)){
  res <- pdb_max_bind$atom$resno[i]
  chain <- pdb_max_bind$atom$chain[i]
  max_bind <- RBD_sites[site_SARS2==res & chain_6M0J==chain & !is.na(chain_6M0J),max_bind]
  if(length(max_bind)>0){pdb_max_bind$atom$b[i] <- max_bind}else{pdb_max_bind$atom$b[i] <- 0}
}
#save pdb
write.pdb(pdb=pdb_max_bind,file=paste(config$structure_function_dir,"/6m0j_b-factor-max-bind.pdb",sep=""), b = pdb_max_bind$atom$b)

#color by mean effect on expression
pdb_mean_expr <- pdb
pdb_mean_expr$atom$b <- NA
for(i in 1:nrow(pdb_mean_expr$atom)){
  res <- pdb_mean_expr$atom$resno[i]
  chain <- pdb_mean_expr$atom$chain[i]
  mean_expr <- RBD_sites[site_SARS2==res & chain_6M0J==chain & !is.na(chain_6M0J),mean_expr]
  if(length(mean_expr)>0){pdb_mean_expr$atom$b[i] <- mean_expr}else{pdb_mean_expr$atom$b[i] <- 0}
}
#save pdb
write.pdb(pdb=pdb_mean_expr,file=paste(config$structure_function_dir,"/6m0j_b-factor-mean-expr.pdb",sep=""), b = pdb_mean_expr$atom$b)

#color by max effect on expression
pdb_max_expr <- pdb
pdb_max_expr$atom$b <- NA
for(i in 1:nrow(pdb_max_expr$atom)){
  res <- pdb_max_expr$atom$resno[i]
  chain <- pdb_max_expr$atom$chain[i]
  max_expr <- RBD_sites[site_SARS2==res & chain_6M0J==chain & !is.na(chain_6M0J),max_expr]
  if(length(max_expr)>0){pdb_max_expr$atom$b[i] <- max_expr}else{pdb_max_expr$atom$b[i] <- 0}
}
#save pdb
write.pdb(pdb=pdb_max_expr,file=paste(config$structure_function_dir,"/6m0j_b-factor-max-expr.pdb",sep=""), b = pdb_max_expr$atom$b)


```

## Distribution of functional effects of mutation

Let's look at the distribution of single-mutant effects on our two phenotypes, and compare the fraction of mutations that are within the window defined by known functional RBD homologs for these two phenotypes.

For the binding plot on the left, the intermediate blue point on the x-scale is RaTG13, which *can* promote huACE2-mediated cell entry in an in vitro cellular infection assay (though less efficiently than SARS-CoV-2) according to [Shang et al. 2020](https://www.nature.com/articles/s41586-020-2179-y/figures/3), though whether this is sufficient to enable efficient viral replication in more complex models is uncertain. For the cluster of homologs near 0, the farthest-left point is LYRa11, which according to [Letko et al. 2020](https://www.nature.com/articles/s41564-020-0688-y/figures/1) can also promote huACE2-mediated cellular entry, though less efficiently than SARS-CoV-1 and other bat CoV isolates such as WIV1/16 (identical RBDs). Therefore, these two points define a window of affinites that can at least support in vitro cellular infection -- but in reality, the window of possible "neutrality" with regards to actual human infectivity is perhaps better set by the remaining four points with delta log<sub>10</sub>(*K*<sub>A,app</sub>) values ~ 0 -- these four points are the RBDs from SARS-CoV-1, WIV1/16, SARS-CoV-2, and GD-Pangolin RBDs, in that rank-order. Taken together, we identify `r nrow(mutants[wildtype != mutant & !is.na(bind_avg) & bind_avg >= homologs[homolog=="SARS-CoV-1",bind_avg],])` single mutants (`r round(nrow(mutants[wildtype != mutant & !is.na(bind_avg) & bind_avg >= homologs[homolog=="SARS-CoV-1",bind_avg],])/nrow(mutants[wildtype != mutant & !is.na(bind_avg),])*100,digits=2)`%) whose affinity effects are within a neutral window that potentially enables huACE2-mediated infectivity (SARS-CoV-1 cutoff), and `r nrow(mutants[wildtype != mutant & !is.na(bind_avg) & bind_avg >= homologs[homolog=="RaTG13",bind_avg],])` single mutants (`r round(nrow(mutants[wildtype != mutant & !is.na(bind_avg) & bind_avg >= homologs[homolog=="RaTG13",bind_avg],])/nrow(mutants[wildtype != mutant & !is.na(bind_avg),])*100,digits=2)`%) whose affinity is potentially sufficient to enable huACE2-mediated in vitro cellular infectivity (RaTG13 cutoff). Taken together, this suggests a quite large sequence space of RBD diversity that is consistent with huACE2 binding and entry. (Though, of course, our isolated RBD-only affinity measurements may have more complex constraints in the context of full-Spike trimer.) 

For expression, our range set by homologs may be improperly aligned to the SARS-CoV-2 range -- since the SARS-CoV-2 wildtype sequences went through the mutagenesis protocol, they could have acquired mutations outside the sequenced region that impact expression. This is evident in the fact that some very small fraction of wildtype barcodes are non-expressing. We removed these non-expressing wildtype barcodes for calculating mean wildtype expression compared to the homologs, which did not go through the mutagenesis scheme. So, it is unclear whether this should still impact the relationship between our SARS-CoV-2 wt expression and the homologs. However, we cannot account for this effect in our library mutants, where we cannot easily determine which barcodes are nonexpressing due to outside mutatioins versus the amino acid variant. Therefore, these artefactually low expression barcodes are still present in the SARS-CoV-2 mutant variants, which might supress the scale by 0.1 or 0.2 log-MFI units relative to the homologs. I think this is all ok since we're not reading as far into these expression effects, but it does complicate direct comparison of the mutant DFE versus homologs. Perhaps we could shift the entire DFE by the same log-MFI value by which the wildtype value shifted when we eliminated the artefactual low-expression measurements? But I don't like this either, because only a fraction of the mutant genotypes should be affected by this. Our expression values all come from the global epistasis model, which at least in theory should be able to partially able to deal with these outliers while not suffering huge decreases in the mean estimate of mutational effects on expression, because the Cauchy likelihood model does allow rare errant outliers without large loss in likelihoood/shifting of the mean parameter. Anyway, SARS-CoV-2 is therefore our lowest ascribed expression homolog, and we calculate that `r nrow(mutants[wildtype != mutant & mutant != "*" & !is.na(expr_avg) & expr_avg >= homologs[homolog=="SARS-CoV-2",expr_avg],])` single mutants (`r round(nrow(mutants[wildtype != mutant & mutant != "*" & !is.na(expr_avg) & expr_avg >= homologs[homolog=="SARS-CoV-2",expr_avg],])/nrow(mutants[wildtype != mutant & mutant != "*" & !is.na(expr_avg),])*100,digits=2)`%) of mutants have expression as high or higher than SARS-CoV-2.

```{r DFE_bind, fig.width=8.5,fig.height=4.25,fig.align="center", dpi=300,dev="png",echo=FALSE}
par(mfrow=c(1,2))
plot(density(mutants[wildtype != mutant,bind_avg],na.rm=T,adjust=0.5),main="SARS-CoV-2 mutant log10Ka,app",xlab="delta log10(KA,app)"); polygon(density(mutants[wildtype != mutant,bind_avg],na.rm=T,adjust=0.5),col="gray50",border="black")
set.seed(100);points(homologs[,bind_avg],sample(seq(1.4,1.6,by=0.01),nrow(homologs)),col=homologs[,clade_color],pch=homologs[,clade_pch],cex=1.3)
plot(density(mutants[wildtype != mutant & mutant != "*",expr_avg],na.rm=T,adjust=0.5),main="SARS-CoV-2 mutant expression",xlab="delta expression"); polygon(density(mutants[wildtype != mutant & mutant != "*",expr_avg],na.rm=T,adjust=0.5),col="gray50",border="black")
set.seed(100);points(homologs[,expr_avg],sample(seq(0.62,0.68,by=0.005),nrow(homologs)),col=homologs[,clade_color],pch=homologs[,clade_pch],cex=1.3)

invisible(dev.print(pdf, paste(config$structure_function_dir,"/DFEs.pdf",sep="")))
```

To have plots like above but compared to distribution of wt/syn, which are collapsed to a single zero point estimate in these plots, we need to go back to the barcode-level measurements themselves. We do that below, generating "joyplots".

```{r DFE_ridgeplots, fig.width=10,fig.height=4,fig.align="center", dpi=300,dev="png",echo=FALSE,message=F,warning=F}
#read in bc-level measurements of binding and expression for SARS-CoV-2 background mutants
bc_bind <- data.table(read.csv(file=config$global_epistasis_binding_file,stringsAsFactors = F))
bc_expr <- data.table(read.csv(file=config$global_epistasis_expr_file,stringsAsFactors = F))

p1 <- ggplot(bc_bind[!is.na(delta_log10Ka),],aes(x=delta_log10Ka,y=as.factor(variant_class),group=as.factor(variant_class),height= ..ndensity..))+
  geom_density_ridges(stat="density")+theme_classic()+theme(legend.position="none")+
  geom_segment(data=homologs[homolog!="SARS-CoV-2",],aes(x=bind_avg,y=0,xend=bind_avg,yend=0.7),color=homologs[homolog!="SARS-CoV-2",clade_color],inherit.aes=FALSE)

p2 <- ggplot(bc_expr[!is.na(delta_ML_meanF),],aes(x=delta_ML_meanF,y=as.factor(variant_class),group=as.factor(variant_class),height= ..ndensity..))+
  geom_density_ridges(stat="density")+theme_classic()+theme(legend.position="none")+
  geom_segment(data=homologs[homolog!="SARS-CoV-2",],aes(x=expr_avg,y=0,xend=expr_avg,yend=0.7),color=homologs[homolog!="SARS-CoV-2",clade_color],inherit.aes=FALSE)

grid.arrange(p1,p2,ncol=2)

invisible(dev.print(pdf, paste(config$structure_function_dir,"/DFEs_by_barcode.pdf",sep="")))
```

## Exploratory heatmaps

Next, let's make heatmaps of per-amino acid mutational effects on binding and expression. We first make these heatmaps for all mutations at all sites, colored by delta log<sub>10</sub>(*K*<sub>A,app</sub>). Above each set of binding values are heatmaps showing relative solvent accessibility (RSA) in the bound and unbound RBD states, as well as a binary indicator for ACE2 contact residues. The "x" in the heatmap marks the wildtype SARS-CoV-2 state, and "o" marks the SARS-CoV-1 state at positions where they differ.

(I am more of a, get the figure 90% of the way there via the code, and just do the final alignment in Illustrator -- so, I would imagine "squashing" together the three related heatmaps a bit more to one another to make the folding over in sequence from first three rows to second three rows more apparent, get rid of the redundant scales, etc. But I will wait until we have polished figures and analyses before spending time on that since it's manual!)
  
```{r heatmap_binding_all_sites, fig.width=12,fig.height=8,fig.align="center", dpi=500,dev="png",echo=FALSE}
#order mutant as a factor for grouping by rough biochemical grouping
mutants$mutant <- factor(mutants$mutant, levels=c("*","C","P","G","V","M","L","I","A","F","W","Y","T","S","N","Q","E","D","H","K","R"))
#add character vector indicating wildtype to use as plotting symbols for wt
mutants[,wildtype_indicator := ""]
mutants[mutant==wildtype,wildtype_indicator := "x"]
#add character vector indicating SARS-CoV-1 wildtype state, if different than SARS-CoV-2
mutants[,SARS1_indicator := ""]
for(i in 1:nrow(mutants)){
  SARS1aa <- RBD_sites[site_SARS2==mutants$SARS2_site[i],amino_acid_SARS1]
  if(!is.na(SARS1aa) & mutants$mutant[i] == SARS1aa & mutants$wildtype_indicator[i]==""){mutants$SARS1_indicator[i] <- "o"}
}

#make data frame for by-site RSA values
site_RSA_long <- data.table::melt(RBD_sites[,.(site_SARS2,RSA_unbound,RSA_bound)],id.vars=c("site_SARS2"),measure.vars=c("RSA_unbound","RSA_bound"),variable.name="structure",value.name="RSA");site_RSA_long[structure=="RSA_unbound",structure:="unbound"];site_RSA_long[structure=="RSA_bound",structure:="bound"]

#make variable for factor by ACE2 contact for by-site contact T/F values
RBD_sites[,ACE2_contact := as.factor(0)];RBD_sites[SARS2_ACE2_contact==T,ACE2_contact:=as.factor(1)];RBD_sites[SARS1_ACE2_contact==T,ACE2_contact:=as.factor(1)]

p1 <- ggplot(RBD_sites[site_SARS2 %in% seq(331,431),],aes(site_SARS2,y=1))+geom_tile(aes(fill=ACE2_contact))+
  scale_fill_manual(values=c("0"="white","1"="black"))+
  scale_x_continuous(expand=c(0,0),breaks=c(331,seq(335,430,by=5)))+
  labs(x="",y="ACE2\ncontact")+theme_classic(base_size=9)+
  coord_equal()+
  theme(axis.ticks.y = element_blank(),axis.text.y=element_blank(),axis.text.x=element_blank(),legend.position="none")

p2 <- ggplot(site_RSA_long[site_SARS2 %in% seq(331,431),],aes(site_SARS2,structure))+geom_tile(aes(fill=RSA))+
  scale_fill_gradientn(colors=c("#6a0dad","#ffffff"),limits=c(0,1),values=c(0,1),na.value="gray")+
  scale_x_continuous(expand=c(0,0),breaks=c(331,seq(335,430,by=5)))+
  labs(x="",y="structure")+theme_classic(base_size=9)+
  coord_equal()+theme(axis.text.x=element_blank())

p3 <- ggplot(mutants[mutant!="*" & SARS2_site %in% seq(331,431),],aes(SARS2_site,mutant))+geom_tile(aes(fill=bind_avg))+
  scale_fill_gradientn(colours=c("#A94E35","#F48365","#FFFFFF","#7378B9","#383C6C"),limits=c(-5,0.5),values=c(0,2.5/5.5,5/5.5,5.25/5.5,5.5/5.5),na.value="gray")+
  scale_x_continuous(expand=c(0,0),breaks=c(331,seq(335,430,by=5)))+
  labs(x="RBD site",y="mutant")+theme_classic(base_size=9)+
  coord_equal()+
  geom_text(aes(label=wildtype_indicator),size=2,color="gray10")+
  geom_text(aes(label=SARS1_indicator),size=2,color="gray10")


p4 <- ggplot(RBD_sites[site_SARS2 %in% seq(432,531),],aes(site_SARS2,y=1))+geom_tile(aes(fill=ACE2_contact))+
  scale_fill_manual(values=c("0"="white","1"="black"))+
  scale_x_continuous(expand=c(0,0),breaks=c(432,seq(435,530,by=5)))+
  labs(x="",y="ACE2\ncontact")+theme_classic(base_size=9)+
  coord_equal()+
  theme(axis.ticks.y = element_blank(),axis.text.y=element_blank(),axis.text.x=element_blank(),legend.position="none")

p5 <- ggplot(site_RSA_long[site_SARS2 %in% seq(432,531),],aes(site_SARS2,structure))+geom_tile(aes(fill=RSA))+
  scale_fill_gradientn(colors=c("#6a0dad","#ffffff"),limits=c(0,1),values=c(0,1),na.value="gray")+
  scale_x_continuous(expand=c(0,0),breaks=c(432,seq(435,530,by=5)))+
  labs(x="",y="structure")+theme_classic(base_size=9)+
  coord_equal()+theme(axis.text.x=element_blank())

p6 <- ggplot(mutants[mutant!="*" & SARS2_site %in% seq(432,531),],aes(SARS2_site,mutant))+geom_tile(aes(fill=bind_avg))+
  scale_fill_gradientn(colours=c("#A94E35","#F48365","#FFFFFF","#7378B9","#383C6C"),limits=c(-5,0.5),values=c(0,2.5/5.5,5/5.5,5.25/5.5,5.5/5.5),na.value="gray")+
  scale_x_continuous(expand=c(0,0),breaks=c(432,seq(435,530,by=5)))+
  labs(x="RBD site",y="mutant")+theme_classic(base_size=9)+
  coord_equal()+
  geom_text(aes(label=wildtype_indicator),size=2,color="gray10")+
  geom_text(aes(label=SARS1_indicator),size=2,color="gray10")

ggarrange(p1,p2,p3,p4,p5,p6,nrow=6)

invisible(dev.print(pdf, paste(config$structure_function_dir,"/heatmap_all-bind.pdf",sep="")))
```

And next, the same heat map, colored by delta mean fluorescence (expression). I altered the scale such that anything less than -4 gets the darkest red color -- the lowest missense mutant score is -4.07, only nonsense mutants push the scale down to -5, so this helps the relative red scale be better calibrated between the binding and expression measurements w.r.t. single missense mutations. (once again, happy for thoughts.). Obviously if these are shown side by side with the binding values, we can remove the redundant RSA and contact plots:

```{r heatmap_expression_all, fig.width=12,fig.height=8,fig.align="center", dpi=500,dev="png",echo=FALSE}
p1 <- ggplot(RBD_sites[site_SARS2 %in% seq(331,431),],aes(site_SARS2,y=1))+geom_tile(aes(fill=ACE2_contact))+
  scale_fill_manual(values=c("0"="white","1"="black"))+
  scale_x_continuous(expand=c(0,0),breaks=c(331,seq(335,430,by=5)))+
  labs(x="",y="ACE2\ncontact")+theme_classic(base_size=9)+
  coord_equal()+
  theme(axis.ticks.y = element_blank(),axis.text.y=element_blank(),axis.text.x=element_blank(),legend.position="none")

p2 <- ggplot(site_RSA_long[site_SARS2 %in% seq(331,431),],aes(site_SARS2,structure))+geom_tile(aes(fill=RSA))+
  scale_fill_gradientn(colors=c("#6a0dad","#ffffff"),limits=c(0,1),values=c(0,1),na.value="gray")+
  scale_x_continuous(expand=c(0,0),breaks=c(331,seq(335,430,by=5)))+
  labs(x="",y="structure")+theme_classic(base_size=9)+
  coord_equal()+theme(axis.text.x=element_blank())

p3 <- ggplot(mutants[SARS2_site %in% seq(331,431),],aes(SARS2_site,mutant))+geom_tile(aes(fill=expr_avg))+
  scale_fill_gradientn(colours=c("#A94E35","#A94E35","#F48365","#FFFFFF","#7378B9","#383C6C"),limits=c(-5,1),values=c(0,1/6,3/6,5/6,5.5/6,6/6),na.value="gray")+
  scale_x_continuous(expand=c(0,0),breaks=c(331,seq(335,430,by=5)))+
  labs(x="RBD site",y="mutant")+theme_classic(base_size=9)+
  coord_equal()+
  geom_text(aes(label=wildtype_indicator),size=2,color="gray10")+
  geom_text(aes(label=SARS1_indicator),size=2,color="gray10")

p4 <- ggplot(RBD_sites[site_SARS2 %in% seq(432,531),],aes(site_SARS2,y=1))+geom_tile(aes(fill=ACE2_contact))+
  scale_fill_manual(values=c("0"="white","1"="black"))+
  scale_x_continuous(expand=c(0,0),breaks=c(432,seq(435,530,by=5)))+
  labs(x="",y="ACE2\ncontact")+theme_classic(base_size=9)+
  coord_equal()+
  theme(axis.ticks.y = element_blank(),axis.text.y=element_blank(),axis.text.x=element_blank(),legend.position="none")

p5 <- ggplot(site_RSA_long[site_SARS2 %in% seq(432,531),],aes(site_SARS2,structure))+geom_tile(aes(fill=RSA))+
  scale_fill_gradientn(colors=c("#6a0dad","#ffffff"),limits=c(0,1),values=c(0,1),na.value="gray")+
  scale_x_continuous(expand=c(0,0),breaks=c(432,seq(435,530,by=5)))+
  labs(x="",y="structure")+theme_classic(base_size=9)+
  coord_equal()+theme(axis.text.x=element_blank())

p6 <- ggplot(mutants[SARS2_site %in% seq(432,531),],aes(SARS2_site,mutant))+geom_tile(aes(fill=expr_avg))+
  scale_fill_gradientn(colours=c("#A94E35","#F48365","#FFFFFF","#7378B9","#383C6C"),limits=c(-5,1),values=c(0,2.5/6,5/6,5.5/6,6/6),na.value="gray")+
  scale_x_continuous(expand=c(0,0),breaks=c(432,seq(435,530,by=5)))+
  labs(x="RBD site",y="mutant")+theme_classic(base_size=9)+
  coord_equal()+
  geom_text(aes(label=wildtype_indicator),size=2,color="gray10")+
  geom_text(aes(label=SARS1_indicator),size=2,color="gray10")

ggarrange(p1,p2,p3,p4,p5,p6,nrow=6)

invisible(dev.print(pdf, paste(config$structure_function_dir,"/heatmap_all-expr.pdf",sep="")))
```

Also, next to these single-mutant effect heatmaps in a composite paper figure, it might be useful to have heatmaps illustrating the homolog phenotype scores. I only am so skilled at managing single-plot layouts, so I construct these heatmaps separately below.

```{r heatmap_homolog_phenotypes, fig.width=5,fig.height=4,fig.align="center", dpi=500,dev="png",echo=FALSE,message=F}
homologs$homolog <- factor(homologs$homolog,levels=homologs$homolog)

p1 <- ggplot(homologs,aes(homolog,y=1))+geom_tile(aes(fill=bind_avg))+
  scale_fill_gradientn(colours=c("#A94E35","#F48365","#FFFFFF","#7378B9","#383C6C"),limits=c(-5,0.5),values=c(0,2.5/5.5,5/5.5,5.25/5.5,5.5/5.5),na.value="gray")+
  scale_x_discrete(expand=c(0,0),labels=as.character(homologs$homolog))+
  labs(x="homolog",y="")+theme_classic(base_size=9)+
  coord_equal()+theme(axis.ticks.y = element_blank(),axis.text.y=element_blank(),plot.margin=margin(5.5,5.5,5.5,20),
                      axis.text.x=element_text(angle=45,hjust=1,color=homologs$clade_color,face="bold",size=10))

p2 <- ggplot(homologs,aes(homolog,y=1))+geom_tile(aes(fill=expr_avg))+
  scale_fill_gradientn(colours=c("#A94E35","#A94E35","#F48365","#FFFFFF","#7378B9","#383C6C"),limits=c(-5,1),values=c(0,1/6,3/6,5/6,5.5/6,6/6),na.value="gray")+
  scale_x_discrete(expand=c(0,0),labels=as.character(homologs$homolog))+
  labs(x="homolog",y="")+theme_classic(base_size=9)+
  coord_equal()+theme(axis.ticks.y = element_blank(),axis.text.y=element_blank(),plot.margin=margin(5.5,5.5,5.5,20),
                      axis.text.x=element_text(angle=45,hjust=1,color=homologs$clade_color,face="bold",size=10))

ggarrange(p1,p2,nrow=2)

invisible(dev.print(pdf, paste(config$structure_function_dir,"/heatmap_homologs.pdf",sep="")))
```

Though I think these "overall" heatmap(s) will serve well in the paper, I think for some of the sub-observations we want to make, it helps to zoom in particular groupings of columns in the heatmap. I do so through the following series of interpretative summaries, and I imagine if we want to focus on any of these points, we could have a supplementary figure which shows these zoomed in heatmaps that highlight the relevant points that come out from this overall heatmap.

First, let's visualize heatmaps of paired cysteines that form disulfides. The following heatmaps reorder sites by cysteine pair. (I would love to add a line at the top grouping the paired cysteines, or create a small gap between each pair of columns, but I am gg-inept and so this will do for now.)

We can see that cysteines within a disulfide pair have similiar sensitivities to mutation and even similar biochemical preferences, but there are differing patterns of sensitivity for binding and expression. Disulfide pair 4 is the most sensitive to mutation with respect to binding, whereas it only is moderately important for expression -- this is the disulfide pair within the RBM loop that stabilizes regions of the ACE2 interface, consistent with its exacerbated importance for binding versus expression. The remaining three disulfides are in the core RBD domain, and show similar sensitivities for expression and binding -- pair 3 is tolerant to mutation (and may even support replacement with polar amino acids), pair 1 is moderately sensitive, and pair 2 is most sensitive -- though hydrophobic mutations have noticeably decreased deleterious effect for binding, which is interesting to see.

This trend is consistent with a series of Cys->Ala mutations made in SARS-CoV-2 RBD in [Wong et al. JBC 2004](https://www.jbc.org/content/279/5/3197.short) (I report the numbers converted to SARS-CoV-2 numbering): mutations at SARS-CoV-1 RBD positions 379 and 432 severely impaired expression, mutations at 361, 480, and 488 had only mild effects on expression but strongly decreased ACE2-binding, and mutations at 336 and 391 had little effect on expression or binding (this is in 331-524 RBD construct, so lacking cysteine 525).

```{r heatmap_bind_expr_disulfide, fig.width=4,fig.height=3,fig.align="center", dpi=500,dev="png",echo=FALSE}
muts_temp <- mutants[SARS2_site %in% RBD_sites[!is.na(disulfide),site_SARS2],];muts_temp$SARS2_site <- as.factor(muts_temp$SARS2_site)
muts_temp[,disulfide:=RBD_sites[site_SARS2==SARS2_site,disulfide],by=mutation];muts_temp$disulfide <- as.factor(muts_temp$disulfide)


p1 <- ggplot(muts_temp,aes(paste(disulfide,SARS2_site),mutant))+geom_tile(aes(fill=bind_avg))+
  scale_fill_gradientn(colours=c("#A94E35","#F48365","#FFFFFF","#7378B9","#383C6C"),limits=c(-5,0.5),values=c(0,2.5/5.5,5/5.5,5.25/5.5,5.5/5.5),na.value="gray")+
  scale_x_discrete(expand=c(0,0),labels=as.character(unique(muts_temp[order(disulfide,SARS2_site),SARS2_site])))+
  labs(x="RBD site, grouped\nby disulfide pair",y="mutant")+theme_classic(base_size=9)+
  coord_equal()+theme(axis.text.x = element_text(angle=45,hjust=1))+
  geom_text(aes(label=wildtype_indicator),size=2,color="gray10")

p2 <- ggplot(muts_temp,aes(paste(disulfide,SARS2_site),mutant))+geom_tile(aes(fill=expr_avg))+
  scale_fill_gradientn(colours=c("#A94E35","#A94E35","#F48365","#FFFFFF","#7378B9","#383C6C"),limits=c(-5,1),values=c(0,1/6,3/6,5/6,5.5/6,6/6),na.value="gray")+
  scale_x_discrete(expand=c(0,0),labels=as.character(unique(muts_temp[order(disulfide,SARS2_site),SARS2_site])))+
  labs(x="RBD site, grouped\nby disulfide pair",y="mutant")+theme_classic(base_size=9)+
  coord_equal()+theme(axis.text.x = element_text(angle=45,hjust=1))+
  geom_text(aes(label=wildtype_indicator),size=2,color="gray10")

grid.arrange(p1,p2,ncol=2)

invisible(dev.print(pdf, paste(config$structure_function_dir,"/heatmaps_disulfides.pdf",sep="")))
```

Next, let's look at the N-linked glycosylation sites. In particular, we look at mutational effects at NLGS motifs consisting of N331 and T333, and N343 and T345, as well as N370 and A372, which in SARS-CoV-1 is an NST NLGS motif. A prior paper, [Chen et al. 2014](https://www.tandfonline.com/doi/full/10.4161/hv.27464), showed that in SARS-CoV-1 RBD produced in Pichia yeast, yields were lowered when progressively knocking out each of these glycans, suggesting they are important for expression. (They didn't do individual knockouts, it seems.) Finally, on the righthand side, we look at the effects of mutations i+2 from surface asparagines, to see whether potential glycan knockins (mutations to S or T) have any interesting effects.

As expected the two NLGS in the SARS-CoV-2 RBD are important for stability. We can see that mutations to the focal asparagine or the N+2 threonine are universally deleterious with regards to expression, with the exception of T->S mutations which retain the NLGS. Overall, the N343 glycan appears more important to RBD stability than the N331 glycan. Re-introducing the SARS-CoV-1 NLGS at site 370 has just a mildly deleterious effect on expression. None of these three glycan mutants have major impacts on binding, consistent with their distance from the ACE2-interface.

At other non-glycosylated asparagines, I don't see many strong effects when introducing putative NLGS motifs with an i+2 mutation to S/T. Glycosylation of N354, N360, N448, N450, N481, and N487 may have mildly beneficial effects on expression, and glycosylation may be ~neutral or tolerated with only minor detrimental effects with regards to expression at sites 388, 394, 437, 439, 460, and 501. These sites are [visualized on the RBD structure](https://dms-view.github.io/?pdb-url=https%3A%2F%2Fraw.githubusercontent.com%2Fdms-view%2FSARS-CoV-2%2Fmaster%2Fdata%2FSpike%2FBloomLab2020%2F6m0j.pdb&markdown-url=https%3A%2F%2Fraw.githubusercontent.com%2Fdms-view%2FSARS-CoV-2%2Fmaster%2Fdata%2FSpike%2FBloomLab2020%2FBloomLab_rbd.md&data-url=https%3A%2F%2Fraw.githubusercontent.com%2Fdms-view%2FSARS-CoV-2%2Fmaster%2Fdata%2FSpike%2FBloomLab2020%2Fresults%2FBloomLab2020_rbd.csv&condition=natural+frequencies&site_metric=site_entropy&mutation_metric=mut_frequency&selected_sites=354%2C360%2C388%2C394%2C437%2C439%2C448%2C450%2C460%2C481%2C487%2C501), illustrating that quite a few surfaces on the RBD could potentially be masked with glycan introductions, including different portions of the RBM if wanting to target different epitopes with vaccine immunogens or probes for isolating mAbs for particular epitope surfaces.

In contrast to novel glycans that could be tolerated in an engineered RBD construct with respect to stability/expression, we can see that introduction of an NLGS to N501 with mutation of site 503 to T or S may have a detrimental effect on binding, consistent with this interface residue making key ACE2-interactions (site 501 is one of the "key contacts" from the SARS-CoV-1 literature). Knockin of a possible N439 NLGS (via T/S mutations at site 441) also has a mild deleterious effect on affinity, specific to the T/S mutations at this i+2 position; site 439 is sort of "second shell" from the ACE2 interface, but close enough to imagine a glycan could impact affinity. There are other positions where i+2 S/T mutations have large deleterious effects on binding affinity (putatively glycosylating N422 (sort of buried, so might not actually be glycosylated) and N487 (interface!)), but other amino acid mutations at these positions also have strong deleterious effects, so it is hard to know whether the effect of the T/S mutants stems from the addition of a glycan, or loss of the wildtype amino acid at this i+2 residue independent of the glycan effect.

```{r heatmap_bind_expr_NLGS, fig.width=12,fig.height=4,fig.align="center", dpi=500,dev="png",echo=FALSE}
muts_temp <- mutants[SARS2_site %in% RBD_sites[!is.na(NLGS) & NLGS != "surfaceN",site_SARS2],];muts_temp$SARS2_site <- as.factor(muts_temp$SARS2_site)
muts_temp[,NLGS:=RBD_sites[site_SARS2==SARS2_site,NLGS],by=mutation];muts_temp$NLGS <- factor(muts_temp$NLGS, levels=c("putative1","putative2","SARS-CoV-1"))


p1 <- ggplot(muts_temp,aes(paste(NLGS,SARS2_site),mutant))+geom_tile(aes(fill=bind_avg))+
  scale_fill_gradientn(colours=c("#A94E35","#F48365","#FFFFFF","#7378B9","#383C6C"),limits=c(-5,0.5),values=c(0,2.5/5.5,5/5.5,5.25/5.5,5.5/5.5),na.value="gray")+
  scale_x_discrete(expand=c(0,0),labels=as.character(unique(muts_temp[order(NLGS,SARS2_site),SARS2_site])))+
  labs(x="RBD site, grouped\nby NLGS motif",y="mutant")+theme_classic(base_size=9)+
  coord_equal()+theme(axis.text.x = element_text(angle=45,hjust=1))+
  geom_text(aes(label=wildtype_indicator),size=2,color="gray10")

p2 <- ggplot(muts_temp,aes(paste(NLGS,SARS2_site),mutant))+geom_tile(aes(fill=expr_avg))+
  scale_fill_gradientn(colours=c("#A94E35","#A94E35","#F48365","#FFFFFF","#7378B9","#383C6C"),limits=c(-5,1),values=c(0,1/6,3/6,5/6,5.5/6,6/6),na.value="gray")+
  scale_x_discrete(expand=c(0,0),labels=as.character(unique(muts_temp[order(NLGS,SARS2_site),SARS2_site])))+
  labs(x="RBD site, grouped\nby NLGS motif",y="mutant")+theme_classic(base_size=9)+
  coord_equal()+theme(axis.text.x = element_text(angle=45,hjust=1))+
  geom_text(aes(label=wildtype_indicator),size=2,color="gray10")

muts_temp <- mutants[SARS2_site %in% c(RBD_sites[NLGS == "surfaceN",site_SARS2]+2),];muts_temp$SARS2_site <- factor(muts_temp$SARS2_site)

p3 <- ggplot(muts_temp,aes(SARS2_site,mutant))+geom_tile(aes(fill=bind_avg))+
  scale_fill_gradientn(colours=c("#A94E35","#F48365","#FFFFFF","#7378B9","#383C6C"),limits=c(-5,0.5),values=c(0,2.5/5.5,5/5.5,5.25/5.5,5.5/5.5),na.value="gray")+
  scale_x_discrete(expand=c(0,0),labels=as.character(unique(muts_temp$SARS2_site)))+
  labs(x="RBD site,\nN+2 residues",y="mutant")+theme_classic(base_size=9)+
  coord_equal()+theme(axis.text.x = element_text(angle=45,hjust=1))+
  geom_text(aes(label=wildtype_indicator),size=2,color="gray10")

p4 <- ggplot(muts_temp,aes(SARS2_site,mutant))+geom_tile(aes(fill=expr_avg))+
  scale_fill_gradientn(colours=c("#A94E35","#A94E35","#F48365","#FFFFFF","#7378B9","#383C6C"),limits=c(-5,1),values=c(0,1/6,3/6,5/6,5.5/6,6/6),na.value="gray")+
  scale_x_discrete(expand=c(0,0),labels=as.character(unique(muts_temp$SARS2_site)))+
  labs(x="RBD site,\nN+2 residues",y="mutant")+theme_classic(base_size=9)+
  coord_equal()+theme(axis.text.x = element_text(angle=45,hjust=1))+
  geom_text(aes(label=wildtype_indicator),size=2,color="gray10")


ggarrange(p1,p2,p3,p4,nrow=1,widths=c(1,1,2.5,2.5))
invisible(dev.print(pdf, paste(config$structure_function_dir,"/heatmaps_NLGS.pdf",sep="")))

```

How does number of glycans in multi-muts correlate with binding and expression phenotypes?

```{r n_NLGS_phenotype_corr, fig.width=6,fig.height=4,fig.align="center", dpi=300,dev="png",echo=FALSE,message=F,warning=F}
#list of all mutations that add a glycan
NLGS_KI <- c()
for(i in 3:(nrow(RBD_sites)-2)){
  if(!(RBD_sites$NLGS[i] %in% c("putative1","putative2"))){
    if(RBD_sites$amino_acid_SARS2[i] == "N" & RBD_sites$amino_acid_SARS2[i+1] != "P"){
      NLGS_KI <- c(NLGS_KI, paste(RBD_sites$amino_acid_SARS2[i+2],RBD_sites$site_RBD[i+2],"S",sep=""),paste(RBD_sites$amino_acid_SARS2[i+2],RBD_sites$site_RBD[i+2],"T",sep=""))
    }
    if(RBD_sites$amino_acid_SARS2[i] %in% c("S","T") & RBD_sites$amino_acid_SARS2[i-1] != "P"){
      NLGS_KI <- c(NLGS_KI, paste(RBD_sites$amino_acid_SARS2[i-2],RBD_sites$site_RBD[i-2],"N",sep=""))
    }
  }
}

#add annotation to bc_bind about how many NLGS additions are present
bc_bind[,n_NLGS_KI := sum(strsplit(aa_substitutions,split=" ")[[1]] %in% NLGS_KI),by=aa_substitutions]

#not good coverage of multi-NLGS-KI muts

#what about just the raw single mutant effects of NGLS KI's, and compare to other N/S/T mutaitons?
par(mfrow=c(1,2))
boxplot(mutants[mutation_RBD %in% NLGS_KI,bind_avg], mutants[mutant %in% c("N","S","T"),bind_avg],names=c("NLGS KI","all N/S/T muts"),ylab="mutation effect on binding",notch=T)
#wilcox.test(mutants[mutation_RBD %in% NLGS_KI,bind_avg], mutants[mutant %in% c("N","S","T"),bind_avg])

boxplot(mutants[mutation_RBD %in% NLGS_KI,expr_avg], mutants[mutant %in% c("N","S","T"),expr_avg],names=c("NLGS KI","all N/S/T muts"),ylab="mutation effect on expression",notch=T)
#wilcox.test(mutants[mutation_RBD %in% NLGS_KI,expr_avg], mutants[mutant %in% c("N","S","T"),expr_avg])

invisible(dev.print(pdf, paste(config$structure_function_dir,"/boxplots_NLGS-KI.pdf",sep=""),useDingbats=F))
```

 
Next, let's look at expression and binding effects for annotated contact residues. Below, we are looking at the 19 residues that form ACE2 contacts in the SARS-CoV-2 (6m0j) or SARS-CoV-1 (2ajf) ACE2-bound RBD crystal structures, where we annotated a residue as a contact if it contains non-hydrogen atoms within 4 Angstroms of ACE2. 14 residues were annotated as contacts in both structures, 3 contacts (417, 446, 475) are unique to the SARS-CoV-2 structure, and 2 (439, 503) are unique to SARS-CoV-1. We indicate the wildtype SARS-CoV-2 and -1 amino acids with an "x" and "o", respectively. We also add site 494, which although not a direct contact, is on the ACE2-contacting interface of the RBM, and is considered one of the sites of "key adaptation" from the SARS-CoV-1 literature.

Putting this reduced display of binding and expression next to each other is really interesting, as it highlights some potential binding-expression tradeoffs at positions 417, 449, 455, 486, 502, and 505, which are visualized in `dms-view` [here](https://dms-view.github.io/?pdb-url=https%3A%2F%2Fraw.githubusercontent.com%2Fdms-view%2FSARS-CoV-2%2Fmaster%2Fdata%2FSpike%2FBloomLab2020%2F6m0j.pdb&markdown-url=https%3A%2F%2Fraw.githubusercontent.com%2Fdms-view%2FSARS-CoV-2%2Fmaster%2Fdata%2FSpike%2FBloomLab2020%2FBloomLab_rbd.md&data-url=https%3A%2F%2Fraw.githubusercontent.com%2Fdms-view%2FSARS-CoV-2%2Fmaster%2Fdata%2FSpike%2FBloomLab2020%2Fresults%2FBloomLab2020_rbd.csv&condition=natural+frequencies&site_metric=site_entropy&mutation_metric=mut_frequency&selected_sites=417%2C449%2C455%2C486%2C502%2C505). 

For four of these residues (449, 455, 486, 505), binding prefers the wildtype hydrophobic state, though this exposed hydrophobic amino acid is evidently detrimental to expression, as polar amino acid mutations improve expression. Site 417 is the opposite, consistent with its more buried position a bit further from ACE2 -- for expression, keeping this amino acid hydrophobic would be preferred, but mutations to K or R can enhance affinity, presumably because their long side chains can snorkel out and make contact with ACE2 interface. Finally, site 502 is a glycine, which accomodates the ACE2 loop bearing the critical K31 "hotspot" residue, which would clash sterically if mutating site 502 to any amino acid with a side chain, even though most polar side chains at this position would improve exression. These positions are generally solvent-accessible even in the full-Spike trimer RBD "down" conformation, but if we want to follow up on this we should really look more into whether full Spike trimer would ameliorate any of these potential detrimental expression effects of surface-exposed hydrophobic residues.

Within this heatmap is specific information about the "key" contact sites that are often discussed in the literature. These sites primarily came from observations about amino acid changes relating civet- and human-adapted SARS-CoV-1 sequences, and other experimental evolution and structural studies in the SARS-CoV-1 side of the tree. These sites are `r RBD_sites[SARS1_key_adaptation==T,site_SARS2]`. May be worth extra glances when looking at these exploratory data, though other positions may be more important here in the SARS-CoV-2 side of the tree.

```{r heatmap_ACE2_contacts, fig.width=10,fig.height=6,fig.align="center", dpi=500,dev="png",echo=FALSE}
#order mutant as a factor for grouping by rough biochemical grouping
mutants$mutant <- factor(mutants$mutant, levels=c("*","C","P","G","V","M","L","I","A","F","W","Y","T","S","N","Q","E","D","H","K","R"))
#add character vector indicating RaTG13 state for positions that differ from SARS-CoV-2
mutants[,RaTG13_indicator := ""]
for(i in 1:nrow(mutants)){
  RaTG13aa <- RBD_sites[site_SARS2==mutants$SARS2_site[i],amino_acid_RaTG13]
  if(!is.na(RaTG13aa) & mutants$mutant[i] == RaTG13aa & mutants$wildtype_indicator[i]==""){mutants$RaTG13_indicator[i] <- "^"}
}
#add character vector indicating GD-Pangolin state for positions that differ from SARS-CoV-2
mutants[,Pangolin_indicator := ""]
for(i in 1:nrow(mutants)){
  Pangolinaa <- RBD_sites[site_SARS2==mutants$SARS2_site[i],amino_acid_GD_Pangolin]
  if(!is.na(Pangolinaa) & mutants$mutant[i] == Pangolinaa & mutants$wildtype_indicator[i]==""){mutants$Pangolin_indicator[i] <- "#"}
}

muts_temp <- mutants[SARS2_site %in% RBD_sites[SARS2_ACE2_contact==T | SARS1_ACE2_contact==T | SARS1_key_adaptation==T,site_SARS2],];muts_temp$SARS2_site <- as.factor(muts_temp$SARS2_site)

p1 <- ggplot(muts_temp,aes(SARS2_site,mutant))+geom_tile(aes(fill=bind_avg))+
  scale_fill_gradientn(colours=c("#A94E35","#F48365","#FFFFFF","#7378B9","#383C6C"),limits=c(-5,0.5),values=c(0,2.5/5.5,5/5.5,5.25/5.5,5.5/5.5),na.value="gray")+
  scale_x_discrete(expand=c(0,0))+
  labs(x="RBD ACE2-contact site",y="mutant")+theme_classic(base_size=9)+
  coord_equal()+theme(axis.text.x = element_text(angle=45,hjust=1))+
  geom_text(aes(label=wildtype_indicator),size=2,color="gray10")+
  geom_text(aes(label=SARS1_indicator),size=2,color="gray10")+
  geom_text(aes(label=RaTG13_indicator),size=2,color="gray10")+
  geom_text(aes(label=Pangolin_indicator),size=2,color="gray10")

p2 <- ggplot(muts_temp,aes(SARS2_site,mutant))+geom_tile(aes(fill=expr_avg))+
  scale_fill_gradientn(colours=c("#A94E35","#A94E35","#F48365","#FFFFFF","#7378B9","#383C6C"),limits=c(-5,1),values=c(0,1/6,3/6,5/6,5.5/6,6/6),na.value="gray")+
  scale_x_discrete(expand=c(0,0))+
  labs(x="RBD ACE2-contact site",y="mutant")+theme_classic(base_size=9)+
  coord_equal()+theme(axis.text.x = element_text(angle=45,hjust=1))+
  geom_text(aes(label=wildtype_indicator),size=2,color="gray10")+
  geom_text(aes(label=SARS1_indicator),size=2,color="gray10")+
  geom_text(aes(label=RaTG13_indicator),size=2,color="gray10")+
  geom_text(aes(label=Pangolin_indicator),size=2,color="gray10")

grid.arrange(p1,p2,ncol=2)

invisible(dev.print(pdf, paste(config$structure_function_dir,"/heatmaps_contact-residues.pdf",sep="")))
```
 Output heatmaps for illustrating stability/binding tradeoffs in particular
```{r heatmap_stability_binding_tradeoffs, fig.width=6,fig.height=6,fig.align="center", dpi=500,dev="png",echo=FALSE}
muts_temp <- mutants[SARS2_site %in% c(502,449, 455, 486, 505),];muts_temp$SARS2_site <- as.factor(muts_temp$SARS2_site)

p1 <- ggplot(muts_temp,aes(SARS2_site,mutant))+geom_tile(aes(fill=bind_avg))+
  scale_fill_gradientn(colours=c("#A94E35","#F48365","#FFFFFF","#7378B9","#383C6C"),limits=c(-5,0.5),values=c(0,2.5/5.5,5/5.5,5.25/5.5,5.5/5.5),na.value="gray")+
  scale_x_discrete(expand=c(0,0))+
  labs(x="RBD site",y="mutant")+theme_classic(base_size=9)+
  coord_equal()+theme(axis.text.x = element_text(angle=45,hjust=1))+
  geom_text(aes(label=wildtype_indicator),size=2,color="gray10")

p2 <- ggplot(muts_temp,aes(SARS2_site,mutant))+geom_tile(aes(fill=expr_avg))+
  scale_fill_gradientn(colours=c("#A94E35","#A94E35","#F48365","#FFFFFF","#7378B9","#383C6C"),limits=c(-5,1),values=c(0,1/6,3/6,5/6,5.5/6,6/6),na.value="gray")+
  scale_x_discrete(expand=c(0,0))+
  labs(x="RBD site",y="mutant")+theme_classic(base_size=9)+
  coord_equal()+theme(axis.text.x = element_text(angle=45,hjust=1))+
  geom_text(aes(label=wildtype_indicator),size=2,color="gray10")
  
grid.arrange(p1,p2,ncol=2)

invisible(dev.print(pdf, paste(config$structure_function_dir,"/heatmaps_contact-residues_stability-binding-tradeoff.pdf",sep="")))

```

To better represent these contact residue differences among these three CoV isolates and SARS-CoV-2, let's make a more focused plot on interface residues, showing the average effect of the 19 mutations at that site and the effect of the mutation to the homolog state, faceted by the different homolog backgrounds.

```{r stripplot_ACE2_contacts_homologs, fig.width=8,fig.height=3,fig.align="center", dpi=500,dev="png",echo=FALSE}
#make data table with relevant info for grouping mutants. Make separately for each homolog of interest
data.SARS1 <- RBD_sites[(SARS2_ACE2_contact==T | SARS1_ACE2_contact==T | SARS1_key_adaptation==T) & amino_acid_SARS1 != amino_acid_SARS2 & amino_acid_SARS1 != "-",.(site_SARS2,amino_acid_SARS2,amino_acid_SARS1, mean_bind)]
setnames(data.SARS1, "amino_acid_SARS1","amino_acid_homolog")
data.SARS1[,color_homolog := homologs[homolog=="SARS-CoV-1",clade_color]]
data.SARS1[,pch_homolog := homologs[homolog=="SARS-CoV-1",clade_pch]]
data.SARS1[,effect_homolog := mutants[SARS2_site==site_SARS2 & mutant==amino_acid_homolog,bind_avg],by=site_SARS2]
data.SARS1[,homolog:="SARS-CoV-1"]

data.GDPangolin <- RBD_sites[(SARS2_ACE2_contact==T | SARS1_ACE2_contact==T | SARS1_key_adaptation==T) & amino_acid_GD_Pangolin != amino_acid_SARS2 & amino_acid_GD_Pangolin != "-",.(site_SARS2,amino_acid_SARS2,amino_acid_GD_Pangolin, mean_bind)]
setnames(data.GDPangolin, "amino_acid_GD_Pangolin","amino_acid_homolog")
data.GDPangolin[,color_homolog := homologs[homolog=="GD-Pangolin",clade_color]]
data.GDPangolin[,pch_homolog := homologs[homolog=="GD-Pangolin",clade_pch]]
data.GDPangolin[,effect_homolog := mutants[SARS2_site==site_SARS2 & mutant==amino_acid_homolog,bind_avg],by=site_SARS2]
data.GDPangolin[,homolog:="GD-Pangolin"]

data.RaTG13 <- RBD_sites[(SARS2_ACE2_contact==T | SARS1_ACE2_contact==T | SARS1_key_adaptation==T) & amino_acid_RaTG13 != amino_acid_SARS2 & amino_acid_RaTG13 != "-",.(site_SARS2,amino_acid_SARS2,amino_acid_RaTG13, mean_bind)]
setnames(data.RaTG13, "amino_acid_RaTG13","amino_acid_homolog")
data.RaTG13[,color_homolog := homologs[homolog=="RaTG13",clade_color]]
data.RaTG13[,pch_homolog := homologs[homolog=="RaTG13",clade_pch]]
data.RaTG13[,effect_homolog := mutants[SARS2_site==site_SARS2 & mutant==amino_acid_homolog,bind_avg],by=site_SARS2]
data.RaTG13[,homolog:="RaTG13"]

#dt <- rbind(data.SARS1,data.GDPangolin,data.RaTG13)

#make the site columns character vectors for plotting
data.SARS1$site_SARS2 <- factor(as.character(data.SARS1$site_SARS2))
data.GDPangolin$site_SARS2 <- factor(as.character(data.GDPangolin$site_SARS2))
data.RaTG13$site_SARS2 <- factor(as.character(data.RaTG13$site_SARS2))

p1 <- ggplot(data.SARS1,aes(x=site_SARS2))+
  geom_point(aes(y=mean_bind),col="gray50",shape=4,position=position_nudge(x=-0.1),size=2)+
  geom_point(aes(y=effect_homolog),col="black",shape=16,position=position_nudge(x=0.1),size=2)+
  scale_y_continuous(limits=c(-2.5,0.5),expand=c(0,0))+
  theme_classic()+theme(axis.text.x=element_text(angle=90,hjust=1),plot.title=element_text(hjust=0.5),axis.title.x=element_blank())+
  geom_hline(yintercept=0,linetype=2,color="gray50")+
  ylab("effect on binding")+ggtitle("SARS-CoV-1")
  
p2 <- ggplot(data.GDPangolin,aes(x=site_SARS2))+
  geom_point(aes(y=mean_bind),col="gray50",shape=4,position=position_nudge(x=-0.1),size=2)+
  geom_point(aes(y=effect_homolog),col="black",shape=16,position=position_nudge(x=0.1),size=2)+
  scale_y_continuous(limits=c(-2.5,0.5),expand=c(0,0))+
  theme_classic()+theme(axis.text.x=element_text(angle=90,hjust=1),plot.title=element_text(hjust=0.5),axis.title.y=element_blank())+
  geom_hline(yintercept=0,linetype=2,color="gray50")+
  xlab("site of homolog difference")+ggtitle("GD-Pangolin")

p3 <- ggplot(data.RaTG13,aes(x=site_SARS2))+
  geom_point(aes(y=mean_bind),col="gray50",shape=4,position=position_nudge(x=-0.1),size=2)+
  geom_point(aes(y=effect_homolog),col="black",shape=16,position=position_nudge(x=0.1),size=2)+
  scale_y_continuous(limits=c(-2.5,0.5),expand=c(0,0))+
  theme_classic()+theme(axis.text.x=element_text(angle=90,hjust=1),plot.title=element_text(hjust=0.5),axis.title.x=element_blank(),axis.title.y=element_blank())+
  geom_hline(yintercept=0,linetype=2,color="gray50")+
  ggtitle("RaTG13")

ggarrange(p1,p2,p3,nrow=1,widths=c(1,1/6,2/3))

invisible(dev.print(pdf, paste(config$structure_function_dir,"/homolog_differences_jitter.pdf",sep=""),useDingbats=F))
```

## Validation mutants

Based on gazing at heatmaps (from this notebook and in the `sarbecovirus_diversity` notebook), preliminary analysis of circulating SARS-CoV-2 RBD mutants, and structures and prior literature, I have proposed the following validation mutants. As you can see, several positions are prioritized in these panels -- sites 455, and 501 are positions of interest from prior literature on SARS-CoV-1 adaptation; site 502 is the second most constrained position w.r.t. to binding in our dataset and exhibits binding/stability tradeoffs, with the most sensitive position (G431) being more constrained by stability/expression effects rather than binding itself per se. Site 498 exhibits lots of differences amongst the relevant strains I've been looking at (SARS-CoV-2 versus -1, RaTG13, GD-Pangolin) and has large variation in functional effects of mutation, and so is clearly a position of interest for our data. All of these positions are in the RBM and direct or near-direct ACE2 contact positions. Other mutations were selected because they have been sampled at higher rates than other SARS-CoV-2 mutant variants (N439K, V367F, T478I, V483A), and C432D was selected to investigate how core RBD stability effects manifest in pseudovirus phenotypes in a full Spike context.

Taken together, I would propose to validate the following ten mutations in isogenic yeast display experiments:

```{r yeast_display_validation_table, echo=FALSE}
#indicator for wildtype RaTG13 state
mutants[,RaTG13_indicator := ""]
for(i in 1:nrow(mutants)){
  RaTG13aa <- RBD_sites[site_SARS2==mutants$SARS2_site[i],amino_acid_RaTG13]
  if(!is.na(RaTG13aa) & mutants$mutant[i] == RaTG13aa){mutants$RaTG13_indicator[i] <- "^"}
}

#indicator for wildtype GD-Pangolin state
mutants[,GDPangolin_indicator := ""]
for(i in 1:nrow(mutants)){
  GDPangolinaa <- RBD_sites[site_SARS2==mutants$SARS2_site[i],amino_acid_GD_Pangolin]
  if(!is.na(GDPangolinaa) & mutants$mutant[i] == GDPangolinaa){mutants$GDPangolin_indicator[i] <- "#"}
}

kable(mutants[mutation %in% c("N439K","V367F","T478I","V483A","Q498Y","Q498H","N501D","N501T","N501F","G502D"),.(mutation,RBD_site,bind_lib1,bind_lib2,bind_avg,expr_lib1,expr_lib2,expr_avg,SARS1_indicator,RaTG13_indicator,GDPangolin_indicator)])
```

For pseudovirus growth assays, I would propose to validate the following seven mutations:

```{r  pseudovirus_validation_table, echo=FALSE}
kable(mutants[mutation %in% c("C432D","N439K","L455Y","Q498Y","N501D","N501F","G502D"),.(mutation,RBD_site,bind_lib1,bind_lib2,bind_avg,expr_lib1,expr_lib2,expr_avg,SARS1_indicator,RaTG13_indicator,GDPangolin_indicator)])
```


## Epistasis in library double mutants

Do we see evidence for specific epistasis between pairs of mutations found in our (sparse sampling) of double mutants?

First, we take our barcode measurements for double mutants, and add to the data table the component single mutation effect scores.

```{r pairwise_deviations}
#read in per-bc func scores
bc_bind <- data.table(read.csv(file=config$global_epistasis_binding_file,stringsAsFactors = F))[,.(library, target, barcode, variant_call_support, avgcount, log10Ka, delta_log10Ka, log10SE, response, baseline, nMSR, variant_class, aa_substitutions, n_aa_substitutions)]
bc_expr <- data.table(read.csv(file=config$global_epistasis_expr_file,stringsAsFactors = F))[,.(library, target, barcode, variant_call_support, total_count, ML_meanF, delta_ML_meanF, var_ML_meanF, variant_class, aa_substitutions, n_aa_substitutions)]

bc_dbl <- merge(bc_bind[n_aa_substitutions==2 & variant_class == ">1 nonsynonymous",], bc_expr[n_aa_substitutions==2 & variant_class == ">1 nonsynonymous",], sort=F)

#make columns breaking up the two substitutions
bc_dbl[,mut1 := strsplit(aa_substitutions, split=" ")[[1]][1], by=c("library","barcode")]
bc_dbl[,mut2 := strsplit(aa_substitutions, split=" ")[[1]][2], by=c("library","barcode")]

#change mutations to be Spike numbering from RBD numbering
bc_dbl[,mut1 := mutants[mutation_RBD==mut1,mutation],by=c("library","barcode")]
bc_dbl[,mut2 := mutants[mutation_RBD==mut2,mutation],by=c("library","barcode")]

#pull single mutant effects into table
bc_dbl[,c("mut1_bind","mut2_bind") := list(mutants[mutation==mut1,bind_avg],mutants[mutation==mut2,bind_avg]),by=c("library","barcode")]
bc_dbl[,c("mut1_expr","mut2_expr") := list(mutants[mutation==mut1,expr_avg],mutants[mutation==mut2,expr_avg]),by=c("library","barcode")]

```

Next, let's take a look at the distribution of these double mutant binding phenotypes, and their relation to the component single mutational effects. We can see that for many double mutant barcodes, the sum of component singles predicts the double mutant phenotype quite nicely, both for binding and even for expression. Let's focus on the binding phenotypes, as they are better correlated and have less weird shape components (e.g. the censoring is a much tighter/defined boundary, compared to the loose scatter boundary in expression, and we don't have those annoying false-negative observed phenotypes like we do with expression, since these are weeded out by the RBD+ sort). The green lines on the binding plot describe the positive epistasis cutoffs we use in the next section.

```{r pairwise_deviations_plots, fig.width=8.5, fig.height=4.5, fig.align="center",dpi=300,dev="png",echo=F}
par(mfrow=c(1,2))
plot(bc_dbl$mut1_bind + bc_dbl$mut2_bind, bc_dbl$delta_log10Ka,xlab="sum of component single mutations",ylab="double mutant barcode direct phenotype",pch=16,col="#00000020",main="binding")
abline(a=0,b=1,col="red",lty=2,lwd=2)
abline(a=1,b=1,col="green",lty=2,lwd=2)
abline(h=-2,col="green",lty=2,lwd=2)

plot(bc_dbl$mut1_expr + bc_dbl$mut2_expr, bc_dbl$delta_ML_meanF,xlab="sum of component single mutations",ylab="double mutant barcode direct phenotype",pch=16,col="#00000020",main="expression")
abline(a=0,b=1,col="red",lty=2,lwd=2)

```

We might be primarily interested in *positive* epistasis (that is, observed double mutants that bind better than predicted from the component single mutations), and we probably don't care about positive epistasis if the double mutant is still severely deleterious. Therefore, let's check out double mutant combinations whose observed binding phenotype is > -2 that exhibit positive epistasis in which the double mutant binds with delta log<sub>10</sub>(*K*<sub>A,app</sub>) of at least 1 units stronger than predicted from the component singles. (So, the difference in observed - predicted binding is >1). These are points in the plot above in the upper-left quadrant defined by the dashed green lines. 

We see many of these positive epistatic interactions involve prolines and cysteines (e.g. G416P/N422L), which may exhibit epistasis because these larger structural changes engineered by proline and cysteine changes can dramatically alter nearby amino acid preferences. We also see some simple epistatic interactions, such as R355L/D398M (which we actually see in >1 barcode, even), which correspond to a salt bridge interaction, where losing one is deleterious, but substituting both to aliphatic residues compensates this deleterious effect. We also see some interactions between mutations in the RBM and at the ACE2 interface, such as R454S+P491V, and N437K/F497S.

```{r identify_positive_epistasis, echo=F}
#define epistasis metric from observed - predicted binding phenotype
bc_dbl[,epistasis_bind := delta_log10Ka - (mut1_bind+mut2_bind),by=c("library","barcode")]

#table sorted by positive epistasis score, for dbl mutant bcs with measured binding no worse than -2 log10Ka units
kable(bc_dbl[epistasis_bind > 1 & delta_log10Ka > -2,.(avgcount,mut1,mut2,mut1_bind,mut2_bind,delta_log10Ka,epistasis_bind,mut1_expr,mut2_expr,delta_ML_meanF)][order(epistasis_bind,decreasing=T),])

```

Let's see if there's an enrichment of positive epistasis among close contact positions. We use bio3d to return all pairwise distances between RBD residues, and populate our table with the contact distance for the residue pair mutated in each double mutant. We then look at the relationship between epistasis scores and pairwise distance, for all scores, and for those where the observed double mutant delta log<sub>10</sub>(*K*<sub>A,app</sub>) is > -3 (to avoid weirdness with censored/boundary observations). (Expectation from prior DMS work is that negative epistasis is often global and dispersed in terms of residue pair distance, whereas positive epistasis is enriched for spatially proximal, specific compensatory interactions.)

```{r epistasis_versus_residue_pair_distance}
#ouput residue numbers for the mutant pair
bc_dbl[,c("site1","site2") := list(paste(strsplit(mut1,split="")[[1]][2:4],collapse=""),paste(strsplit(mut2,split="")[[1]][2:4],collapse="")),by=c("library","barcode")]

#read in RBD structure
pdb <- read.pdb(file="data/structures/ACE2-bound/6m0j.pdb")
pdb_atoms <- pdb$atom

#make data frame giving pairwise distances -- either Calpha distances, or closest atomic distances
pdb_dists <- expand.grid(site1=RBD_sites$site_SARS2,site2=RBD_sites$site_SARS2)
pdb_dists <- pdb_dists[order(pdb_dists$site1, pdb_dists$site2),]
pdb_dists <- pdb_dists[pdb_dists$site1 < pdb_dists$site2,]

calc.dist <- function(x1,y1,z1,x2,y2,z2){ #function to calculate 3D distance from xyz coordinates
  return(sqrt((x2-x1)^2+(y2-y1)^2+(z2-z1)^2))
}

for(i in 1:nrow(pdb_dists)){
  if(pdb_dists[i,"site1"] %in% unique(pdb_atoms[pdb_atoms$chain=="E" & pdb_atoms$type=="ATOM","resno"]) & pdb_dists[i,"site2"] %in% unique(pdb_atoms[pdb_atoms$chain=="E" & pdb_atoms$type=="ATOM","resno"])){
    atoms1 <- pdb_atoms[pdb_atoms$chain=="E" & pdb_atoms$resno==pdb_dists[i,"site1"],]
    atoms2 <- pdb_atoms[pdb_atoms$chain=="E" & pdb_atoms$resno==pdb_dists[i,"site2"],]
    if(nrow(atoms1)>0 & nrow(atoms2)>0){
      pdb_dists$CA_dist[i] <- calc.dist(atoms1[atoms1$elety=="CA","x"],atoms1[atoms1$elety=="CA","y"],atoms1[atoms1$elety=="CA","z"],
                                        atoms2[atoms2$elety=="CA","x"],atoms2[atoms2$elety=="CA","y"],atoms2[atoms2$elety=="CA","z"])
      all_dists <- c()
      for(i1 in 1:nrow(atoms1)){for(i2 in 1:nrow(atoms2)){
        all_dists <- c(all_dists,calc.dist(atoms1[i1,"x"],atoms1[i1,"y"],atoms1[i1,"z"],atoms2[i2,"x"],atoms2[i2,"y"],atoms2[i2,"z"]))}}
      pdb_dists$min_dist[i] <- min(all_dists)
    }else{pdb_dists$CA_dist[i] <- NA;pdb_dists$min_dist[i] <- NA}
  }else{pdb_dists$CA_dist[i] <- NA;pdb_dists$min_dist[i] <- NA}
}
pdb_dists <- data.table(pdb_dists)

#add distances to dbl mut dataframe
bc_dbl$CA_dist <- sapply(1:nrow(bc_dbl), function(x) return(pdb_dists[site1==bc_dbl[x,site1] & site2==bc_dbl[x,site2],CA_dist]))
bc_dbl$min_dist <- sapply(1:nrow(bc_dbl), function(x) return(pdb_dists[site1==bc_dbl[x,site1] & site2==bc_dbl[x,site2],min_dist]))


```

In the plots below, we can see that plenty of positive epistatic pairs are distributed quite far in the structure, suggesting nonspecific effects (or noise). But plenty, including some we saw in the tables above, are at close 3D contact!

```{r plot_epistasis_v_residue_pair_distance, fig.width=6, fig.height=7, fig.align="center",dpi=300,dev="png",echo=F}
par(mfrow=c(2,2))
plot(bc_dbl$CA_dist, bc_dbl$epistasis_bind, pch=16, col="#00000020",xlab="residue pair Calpha distance (A)",ylab="epistatic deviation, binding",main="all doubles")
plot(bc_dbl$min_dist, bc_dbl$epistasis_bind, pch=16, col="#00000020",xlab="residue pair atomic distance (A)",ylab="epistatic deviation, binding",main="all doubles")

plot(bc_dbl[delta_log10Ka > -3, CA_dist], bc_dbl[delta_log10Ka > -3, epistasis_bind], pch=16, col="#00000020",xlab="residue pair Calpha distance (A)",ylab="epistatic deviation, binding",main="doubles with measured bind > -3")
plot(bc_dbl[delta_log10Ka > -3, min_dist], bc_dbl[delta_log10Ka > -3, epistasis_bind], pch=16, col="#00000020",xlab="residue pair atomic distance (A)",ylab="epistatic deviation, binding",main="doubles with measured bind > -3")


```

Next, let's subset the double mutant info for just the RBM residues, which are focal for ACE2 interaction and also exhibit the most diversity across RBD evolution.

```{r plot_epistasis_v_residue_pair_distance_RBM, fig.width=6, fig.height=10, fig.align="center",dpi=300,dev="png",echo=F}
bc_dbl_RBM <- bc_dbl[site1 %in% RBD_sites[RBM==T,site_SARS2] & site2 %in% RBD_sites[RBM==T,site_SARS2],]

par(mfrow=c(3,2))
plot(bc_dbl_RBM$mut1_bind + bc_dbl_RBM$mut2_bind, bc_dbl_RBM$delta_log10Ka,xlab="sum of component single mutations",ylab="double mutant barcode direct phenotype",pch=16,col="#00000020",main="binding, RBM")
abline(a=0,b=1,col="red",lty=2,lwd=2)
#abline(a=1,b=1,col="green",lty=2,lwd=2)
#abline(h=-2,col="green",lty=2,lwd=2)

plot(bc_dbl_RBM$mut1_expr + bc_dbl_RBM$mut2_expr, bc_dbl_RBM$delta_ML_meanF,xlab="sum of component single mutations",ylab="double mutant barcode direct phenotype",pch=16,col="#00000020",main="expression, RBM")
abline(a=0,b=1,col="red",lty=2,lwd=2)

plot(bc_dbl_RBM[, CA_dist], bc_dbl_RBM[, epistasis_bind], pch=16, col="#00000020",xlab="residue pair Calpha distance (A)",ylab="epistatic deviation, binding",main="all RBM doubles")
plot(bc_dbl_RBM[, min_dist], bc_dbl_RBM[, epistasis_bind], pch=16, col="#00000020",xlab="residue pair atomic distance (A)",ylab="epistatic deviation, binding",main="all RBM doubles")

plot(bc_dbl_RBM[delta_log10Ka > -3, CA_dist], bc_dbl_RBM[delta_log10Ka > -3, epistasis_bind], pch=16, col="#00000020",xlab="residue pair Calpha distance (A)",ylab="epistatic deviation, binding",main="RBM doubles with measured bind > -3")
plot(bc_dbl_RBM[delta_log10Ka > -3, min_dist], bc_dbl_RBM[delta_log10Ka > -3, epistasis_bind], pch=16, col="#00000020",xlab="residue pair atomic distance (A)",ylab="epistatic deviation, binding",main="RBM doubles with measured bind > -3")


```

Let's look at the RBM residue pairs with high positive epistasis scores. We can see that many of the positive epistatic interactions surround the cysteines at positions 480 and 488 which form a key disulfide that stabilizes one of the lateral loops of the RBM. We can see that double cysteine mutants are less deleterious than expected by their single mutations, as we'd expect (knocking out the second cysteine when the other is already gone is not going to incur the same binding cost as the initial breaking of this disulfide). Another cool thing to see is that positive epistasis also emerges by knockout of one cysteine, followed by knock-in of a cysteine at a *different* position in this loop, e.g Q474C or S477C.

```{r RBM_positive_epistasis, echo=F}
#table sorted by positive epistasis score, for dbl mutant bcs w/i RBM 
kable(bc_dbl_RBM[epistasis_bind > 1 & delta_log10Ka > -3, .(avgcount,mut1,mut2,CA_dist,mut1_bind,mut2_bind,delta_log10Ka,epistasis_bind,mut1_expr,mut2_expr,delta_ML_meanF)][order(epistasis_bind,decreasing=T),])

```

These are all just cool little observations -- not sure if we want to dig in further on any of this...