exp1_clusteringComparison.Rmd

---
title: "Cluster comparison of spatial spleen data"
author: "Sara Gosline"
date: "`r Sys.Date()`"
output: html_document
---

```{r setup, include=FALSE}
knitr::opts_chunk$set(echo = TRUE)
library(reticulate)
library(BayesSpace)
library(ggplot2)
library(dplyr)
library(tidyr)

library(RColorBrewer)
library(cowplot)
  
```

## Format data
First grab data and format into single table. Evaluate distributions of values. There are clearly different distributions when we median center the data.

```{r get data, message=F,warning=F}
library(dplyr)
library(purrr)
library(devtools)
#library(MSnSet.utils)
if(!require(MSnSet.utils))
  devtools::install_github('PNNL-Comp-Mass-Spec/MSnSet.utils')

library(ggplot2)
library(ggfortify)
library(cowplot)

source('spleenDataFormatting.R')

```

Now we have a sense of how the data are distributed. Obviously this varies based on the normalization that we use in the original data. This will help guide our future analysis.


## PCA
Here we will use leverage the scater package to to compare how samples are distributed based on metadata.

```{r pca sce}
library(scater)

#define function do do PCA on sce object?
pumap<-scater::runUMAP(sc.prot)%>%
  scater::runPCA()

phumap<-scater::runUMAP(sc.phos)%>%
  scater::runPCA()

##now we can plot each of them to see how they compare based on tissue annotation.
p1<-plotPCA(pumap,color_by='pulpAnnotation')+ggtitle("Proteomics PCA")
p2<-plotUMAP(pumap,color_by='pulpAnnotation')+ggtitle("Proteomics UMAP")

p3<-plotPCA(phumap,color_by='pulpAnnotation')+ggtitle("Phosphoproteomic PCA")
p4<-plotUMAP(phumap,color_by='pulpAnnotation')+ggtitle("Phosphoproteomic UMAP")

p5<-cowplot::plot_grid(p1,p2,p3,p4)
ggsave('spleenUnsupervisedPlot.png',p5)
p5
```

There are notable differences in the clustering - the green are outliers in the PCA plot, while the white separate out on the second principal component. the UMAP is interesting as it clusters things distinctly.


## KMeans clustering

Determine the the optimal number of clusters and assign each sample toa  cluster, then re-plot PCA with cluster labels. Here we try to use the most basic form of clustering by values alone - both protein and phosphoprotein.


```{r kmeans sce}
library(scran)

plotClusters<-function(psce,prefix='Proteomics'){
  
  
  p1<-plotPCA(psce,color_by='rawClust')+ggtitle(paste(prefix,"PCA"))
  p2<-plotUMAP(psce,color_by='rawClust')+ggtitle(paste(prefix,"UMAP"))
  p3<-plotPCA(psce,color_by='pcaClust')+ggtitle(paste(prefix,"PCA"))
  p4<-plotUMAP(psce,color_by='umapClust')+ggtitle(paste(prefix,"UMAP"))
  
  p5<-cowplot::plot_grid(p1,p2,p3,p4)
  ggsave(paste0(prefix,'spleenUnsupervisedClustering.png'),p5)
  p5

}

clusterData<-function(psce){
  
  new.annote<-data.frame(rawClust=scran::clusterCells(psce,assay.type='logcounts'),
                         umapClust=scran::clusterCells(psce,use.dimred='UMAP'),
                         pcaClust=scran::clusterCells(psce,use.dimred='PCA'))

  colData(psce)<-cbind(colData(psce),new.annote)

    psce
}

pumap<-clusterData(pumap)
phumap<-clusterData(phumap)

pp<-plotClusters(pumap)
pph<-plotClusters(phumap,'Phosphoproteomics')

```


The proteomics data clusters more cleanly when dimensionality reduction is applied:
```{r plot prot}
pp
```

However the phosphoproteomics data is a bit less robust

```{r plot phos}
pph
```

We can also move to bayes space to see how they compare

## BayesSpace Cluster and plotting function

We put all the functionality into a single function for now.
Here we do preprocessing, which works, cluster analysis, the clustering. We iterate through 1-4 clusters, but generally always use 3 since that is what we are expecting. However, it seems that 4 also works.  

```{r spatial process, echo=FALSE}

bayesSpaceClusterPlot<-function(sce,prefix){
  
  library(BayesSpace)
  #pre-process data
  set.seed(102)
  panc <- spatialPreprocess(sce, platform="ST", skip.PCA=TRUE,
                              n.PCs=4, n.HVGs=500, log.normalize=FALSE)

  ##identify optimal number of cluseters
  panc <- qTune(panc, qs=seq(1, 8), platform="ST", d=15,use.dimred='UMAP')
  qPlot(panc)
  
  ##get optimal from data above. how do i do this?
  ll<-panc@q.logliks
  optimal<-ll$q[which(ll$loglik==min(ll$loglik,na.rm=T))]
  
  panc1 <- spatialCluster(panc, q=optimal, platform="ST", d=15,
                           init.method="kmeans", model="normal", gamma=2,
                           nrep=2000, burn.in=200,
                           save.chain=TRUE)
  panc3 <- spatialCluster(panc, q=3, platform="ST", d=15,
                           init.method="kmeans", model="normal", gamma=2,
                           nrep=2000, burn.in=200,
                           save.chain=TRUE)
  
    panc4 <- spatialCluster(panc, q=4, platform="ST", d=15,
                           init.method="kmeans", model="normal", gamma=2,
                           nrep=2000, burn.in=200,
                           save.chain=TRUE)
  
  
  #pance2 <-spatialEnhance(panc1,q=optimal,d=15)  

  
  cols<- RColorBrewer::brewer.pal(10,'Dark2')#[c(1,2,3,6,4,5)]

  c0=clusterPlot(panc,
              palette=cols,
              label='pulpAnnotation')+
    ggtitle(paste(prefix,'Data'))
    
    c1=clusterPlot(panc1,
              palette=cols)+
    ggtitle(paste(prefix,'data with optimal clusters'))
    
  
    c2=clusterPlot(panc3,
              palette=cols)+
    ggtitle(paste(prefix,'data with 3 clusters'))
    
    
    c3=clusterPlot(panc4,
              palette=cols)+
    ggtitle(paste(prefix,'data with 4 clusters'))
  
  
  res<-cowplot::plot_grid(c0,c1,c2,c3)
  
  return(res)

}


#do this for proteomics and phosphoproteomics data
p1<-bayesSpaceClusterPlot(pumap,"Global")

ggsave('globalBayesSpaceResults.png',p1)

ph1<-bayesSpaceClusterPlot(phumap,"Phospho")
ggsave('phosphoBayesSpaceResults.png',ph1)

```

This is not really wroking that well?!!?

## Compare clustering algorithms and labels

```{r cluster, message=F,warning=F}

clust<-colData(pumap)%>%
  as.data.frame()%>%
  dplyr::select(pulpAnnotation,rawClust,umapClust,pcaClust)

##just do this manually
cor(apply(clust,2,function(x) as.numeric(as.factor(x))),use='pairwise.complete.obs')

```