full_VAMS.Rmd

---
title: "VAMS_all_solvent"
author: "Philippine Louail"
date: "2024-04-22"
output: html_document
---

```{r setup, include=FALSE}
knitr::opts_chunk$set(echo = TRUE)
```

```{r packages, message=FALSE, warning=FALSE}
library(MsExperiment)
library(xcms)
library(Spectra)
library(RColorBrewer)
library(pander)
library(readxl)
library(MetaboCoreUtils)
library(pheatmap)
library(MsBackendSql)
library(readxl)
library(Biobase)
library(SummarizedExperiment)
library(openxlsx)
library(vioplot)
```

```{r parallel-process}
#' Set up parallel processing using 2 cores
if (.Platform$OS.type == "unix") {
    register(MulticoreParam(2))
} else
    register(SnowParam(2))
```

# Introduction

Here I evaluate how the solvent used for extraction affects the metabolite
profile of the samples.
As both VAMS and DBS samples were run together, I separate the datasets by
device and keep one Rmd file for each.

This markdown is for the VAMS device.
I will compare the results between the solvents throughout the analysis.

Below we load the data with its respective phenodata.

```{r load_data}
#' No Phenodata - to be added
MZML_PATH <- getwd()
pd <- read_xls("phenodata.xls") |>
    as.data.frame()

full <- readMsExperiment(paste0(MZML_PATH, "/", pd$file), sampleData = pd)
full

sampleData(full)|> 
  as.data.frame() |>
  pandoc.table(style = "simple", caption = "Samples from the data set.")
```

# VAMS

This dataset recorded both VAMS and DBS samples, so I filter the data to keep
only the VAMS samples.

```{r}
# Keep only the VAMS samples (all solvents)
full <- full[sampleData(full)$sample_type == "VAMS"]
sampleData(full) |> 
  as.data.frame() |>
  pandoc.table(style = "simple", caption = "Samples from the data set.")

dir <- "VAMS_results/full/"
dir.create(dir, recursive = TRUE, showWarnings = FALSE)
```

```{r quick-checks}
#' Retention time range for entire dataset
spectra(full) |>
    rtime() |>
    range()

# Check number of samples 
length(full)

#' Check Ms level
spectra(full) |> 
    msLevel() |>
    split(fromFile(full)) |>
    lapply(table)
```

```{r}
#' Define colors for the different sample roles
leg_sample <- brewer.pal(8, name = "Dark2")[c(2, 8)]
names(leg_sample) <- unique(sampleData(full)$sample_role)
col_sample <- leg_sample[sampleData(full)$sample_role]

#' Define colors for the different solvents
leg_solvent <- brewer.pal(8, name = "Dark2")[c(3, 5, 6, 7)]
names(leg_solvent) <- unique(sampleData(full)$solvent)
col_solvent <- leg_solvent[sampleData(full)$solvent]
```

Below we compute and plot the BPC of the dataset.

```{r plot_bpc}
#' Restrict the data to the retention time range 40-840 s
full <- filterRt(full, c(40, 840))
#' First extract and plot bpc 
bpc <- chromatogram(full, aggregationFun = "max", msLevel = 1, chunkSize = 2)

plot(bpc, main = "BPC", col = col_sample, 
                    lwd = 1)
grid()
legend("topright", col = leg_sample,
       legend = names(leg_sample), lty = 1, horiz = TRUE, bty = "n")
```

BPC comments: 
There seems to be contamination throughout the entire RT range, but especially
after 500 s. This needs to be taken into account in the analysis.

I import a list of known compounds (a curation of ions annotated with MS-DIAL
using the HMDB database). I will use these compounds throughout the analysis to
evaluate the different preprocessing steps.

I then extract EICs for each of these compounds and evaluate them regarding:

- peak shape and width
- intensity variation related to sample type, ...
- retention time variation

```{r include=FALSE}
#import known_compound list 
compounds <- read_xlsx("Annotation_List.xlsx") |>
    as.data.frame()

# plot eic and coloring per solvent - see if I need to extend rtmin and rtmax 
# full
eics <- chromatogram(
    full,
    rt = as.matrix(compounds[, c("rtmin", "rtmax")]),
    mz = as.matrix(compounds[, c("mzmin", "mzmax")]), msLevel = 1, chunkSize = 2)

fData(eics)$mz <- compounds$Average_Mz
fData(eics)$rt <- compounds$Average_Rt
fData(eics)$name <- compounds$Metabolite_Name
rownames(eics) <- compounds$Metabolite_Name

tmpdr <- paste0(dir, "raw/")
dir.create(tmpdr, recursive = TRUE, showWarnings = FALSE)
for (i in seq_len(nrow(compounds))) {
    png(paste0(tmpdr, "EIC_", fData(eics)$name[i], ".png"),
        width = 12, height = 8, units = "cm", res = 600, pointsize = 4)
    plot(eics[i, ], main = fData(eics)$name[i],
         col = paste0(col_sample, 80))
    grid()
    legend("topright", col = leg_sample,
           legend = names(leg_sample), lty = 1)
    abline(v = fData(eics)$rt[i], col = "red", lty = 3)
    dev.off()
}
```

There is wide intensity variation between the study samples that is not related
to the solvents. This also needs to be taken into account in the rest of the
analysis.

# Preprocessing

The parameters for peak picking are based on previous data that I received and
tested using the list of known ions given by the laboratory. 

The parameters of the `CentWaveParam` method are the following:

- `peakwidth`: the expected peak width in seconds. Looking at our internal
  standard and other peaks in the dataset, we estimate that peaks are between
  10 and 20 seconds wide.
- `ppm`: the accepted m/z deviation in ppm. We set it to 50 ppm.
- `extendLengthMSW`: some peaks did not have enough data points, therefore we
  use `extendLengthMSW = TRUE` to extend these signals.
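
To make the `ppm` setting more concrete, the sketch below converts a relative
tolerance in ppm into an absolute m/z tolerance. The helper `ppm_to_mz()` and
the example m/z values are hypothetical and only serve as an illustration; they
are not part of xcms.

```r
#' Hypothetical helper: absolute m/z tolerance for a relative tolerance in ppm
ppm_to_mz <- function(mz, ppm) {
    mz * ppm / 1e6
}

#' Illustrative m/z values (not taken from the dataset)
mz <- c(100, 500, 1000)
tol <- ppm_to_mz(mz, ppm = 50)

#' Tolerance and resulting m/z window for each value
data.frame(mz = mz, tolerance = tol, mzmin = mz - tol, mzmax = mz + tol)
```

The same relative tolerance corresponds to a wider absolute window at higher
m/z, which is worth keeping in mind when inspecting the EICs.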

```{r}
#' Peakpicking
param <- CentWaveParam(peakwidth = c(10, 20), ppm = 50, integrate = 2, 
                       snthresh = 5, extendLengthMSW = TRUE)

full <- findChromPeaks(full, param = param, chunkSize = 2L)
```

Again I will plot the EICs to check how well the peak picking went.

```{r eic2, include=FALSE}
#full
eics <- chromatogram(
    full,
    rt = as.matrix(compounds[, c("rtmin", "rtmax")]),
    mz = as.matrix(compounds[, c("mzmin", "mzmax")]),
    msLevel = 1,
    chunkSize = 2)

fData(eics)$mz <- compounds$Average_Mz
fData(eics)$rt <- compounds$Average_Rt
fData(eics)$name <- compounds$Metabolite_Name
rownames(eics) <- compounds$Metabolite_Name

tmpdr <- paste0(dir, "chrompeaks/")
dir.create(tmpdr, recursive = TRUE, showWarnings = FALSE)
for (i in seq_len(nrow(compounds))) {
    png(paste0(tmpdr, "EIC_", fData(eics)$name[i], ".png"),
        width = 12, height = 8, units = "cm", res = 600, pointsize = 4)
    plot(eics[i, ], main = fData(eics)$name[i],
         col = paste0(col_sample, 80),
         peakBg = paste0(col_sample, 40)[chromPeaks(eics[i])[, "column"]])
    grid()
    legend("topright", col = leg_sample,
           legend = names(leg_sample), lty = 1)
    abline(v = fData(eics)$rt[i], col = "red", lty = 3)
    dev.off()
}
```

Refinement: merge neighboring peaks. This removes artifacts that can be created
during peak picking. It is especially necessary when there are not enough MS1
data points (which is the case here for some peaks).

- `expandRt` and `expandMz` are the maximum allowed distances between two peaks
  in the retention time and m/z dimensions, respectively. We set them based on
  the `peakwidth` used in the peak picking step and on how well the peak
  picking performed.
- The other parameters are left at their defaults and their definitions can be
  found in the `?MergeNeighboringPeaksParam` documentation.

```{r}
param <- MergeNeighboringPeaksParam(expandRt = 10,
                                    expandMz = 0.01,
                                    ppm = 10,
                                    minProp = 0.75)

full <- refineChromPeaks(full, param = param, chunkSize = 2)

chromPeakData(full)$merged |>
                      table()
```

```{r include=FALSE}
eics <- chromatogram(
    full,
    rt = as.matrix(compounds[, c("rtmin", "rtmax")]),
    mz = as.matrix(compounds[, c("mzmin", "mzmax")]),
    msLevel = 1,
    chunkSize = 2)

fData(eics)$mz <- compounds$Average_Mz
fData(eics)$rt <- compounds$Average_Rt
fData(eics)$name <- compounds$Metabolite_Name
rownames(eics) <- compounds$Metabolite_Name

tmpdr <- paste0(dir, "refine/")
dir.create(tmpdr, recursive = TRUE, showWarnings = FALSE)
for (i in seq_len(nrow(compounds))) {
    png(paste0(tmpdr, "EIC_", fData(eics)$name[i], ".png"),
        width = 12, height = 8, units = "cm", res = 600, pointsize = 4)
    plot(eics[i, ], main = fData(eics)$name[i],
         col = paste0(col_sample, 80),
         peakBg = paste0(col_sample, 40)[chromPeaks(eics[i])[, "column"]])
    grid()
    legend("topright", col = leg_sample,
           legend = names(leg_sample), lty = 1)
    abline(v = fData(eics)$rt[i], col = "red", lty = 3)
    dev.off()
}
```

Alignment: there is very little variation in RT between samples, but it is
always good to run it.

We first need to run a correspondence analysis. This will allow us to select
the best anchor peaks for the alignment. For this we set `minFraction = 1`,
meaning that we only keep peaks that are present in all samples of at least one
solvent group, and we base the alignment on these.

```{r}
#' perform quick correspondence analysis - do not take in account Blanks
f <- factor(sampleData(full)$solvent, 
            levels = unique(sampleData(full)$solvent))
idx_B <- sampleData(full)$sample_role == "Blank"
f[idx_B] <- NA

param <- PeakDensityParam(sampleGroups = f,
                          minFraction = 1, 
                          binSize = 0.01, ppm = 10,
                          bw = 2)

full <- groupChromPeaks(full, param = param)

#' align the data 
param <- PeakGroupsParam(minFraction = 0.75, extraPeaks = 50, span = 0.5)

#' Input in the function
full <- adjustRtime(full, param = param)

#' See retention time variation
plotAdjustedRtime(full, col = paste0(col_sample, 80), peakGroupsPch = 1)
grid()
legend("topright", col = leg_sample,
       legend = names(leg_sample), lty = 1, bty = "n")

full <- applyAdjustedRtime(full)
```

We evaluate the efficacy of the alignment on our internal standard.

```{r include=FALSE}
eics <- chromatogram(
    full,
    rt = as.matrix(compounds[, c("rtmin", "rtmax")]),
    mz = as.matrix(compounds[, c("mzmin", "mzmax")]),
    msLevel = 1,
    chunkSize = 2)

fData(eics)$mz <- compounds$Average_Mz
fData(eics)$rt <- compounds$Average_Rt
fData(eics)$name <- compounds$Metabolite_Name
rownames(eics) <- compounds$Metabolite_Name

tmpdr <- paste0(dir, "aligned/")
dir.create(tmpdr, recursive = TRUE, showWarnings = FALSE)
for (i in seq_len(nrow(compounds))) {
    png(paste0(tmpdr, "EIC_", fData(eics)$name[i], ".png"),
        width = 12, height = 8, units = "cm", res = 600, pointsize = 4)
    plot(eics[i, ], main = fData(eics)$name[i],
         col = paste0(col_sample, 80),
         peakBg = paste0(col_sample, 40)[chromPeaks(eics[i])[, "column"]])
    grid()
    legend("topright", col = leg_sample,
           legend = names(leg_sample), lty = 1)
    abline(v = fData(eics)$rt[i], col = "red", lty = 3)
    dev.off()
}
```

Correspondence step: 

Correspondence is the step where features are defined based on how often a peak
is repeated in the dataset. We set up the following parameters:

- `sampleGroups`: the factor used to group the samples. Here we exclude blanks
  from this step and group the dataset by solvent triplicates.
- `minFraction`: the minimum fraction of samples of a solvent group in which a
  peak must be present to be considered a feature. We set it to 2/3, which means
  in this case that a peak must be present in at least 2 out of 3 samples of
  a solvent group to be considered a feature.
- `bw`: the bandwidth of the kernel density estimation. We set it to 2 after
  testing and checking that it is wide enough to capture the peaks of one
  feature.
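
The `minFraction` rule can be illustrated with a small base R sketch. The data
and the helper `passes_min_fraction()` are made up for this example; the sketch
only illustrates the idea and is not the xcms implementation.

```r
#' Toy example (made-up data): in which samples was a peak detected for one
#' candidate feature?
groups <- factor(c("ACN", "ACN", "ACN", "MEOH", "MEOH", "MEOH"))
has_peak <- c(TRUE, TRUE, FALSE, TRUE, FALSE, FALSE)

#' Hypothetical helper: TRUE if at least one group has a peak in at least
#' min_fraction of its samples. Samples with an NA group are ignored.
passes_min_fraction <- function(has_peak, groups, min_fraction) {
    n_peak <- tapply(has_peak, groups, sum)
    n_total <- tapply(has_peak, groups, length)
    any(n_peak >= n_total * min_fraction)
}

#' 2 of 3 ACN samples have a peak, so the feature is kept with 2/3 ...
passes_min_fraction(has_peak, groups, min_fraction = 2 / 3)
#' ... but not when a peak is required in every sample of a group
passes_min_fraction(has_peak, groups, min_fraction = 1)
```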

```{r}
# correspondence - use same factor as before
param <- PeakDensityParam(sampleGroups = f,
                        minFraction = 2/3, binSize = 0.015, bw = 2.0, ppm = 10) 

plotChromPeakDensity(
    eics["1_18_1_lysophosphatidylcholine"], param = param,
    col = paste0(col_sample, "80"),
    peakCol = col_sample[
        chromPeaks(eics["1_18_1_lysophosphatidylcholine"])[, "column"]],
    peakBg = paste0(col_sample[
        chromPeaks(eics["1_18_1_lysophosphatidylcholine"])[, "column"]], 20),
    peakPch = 16)

plotChromPeakDensity(eics["LysoPhosphatidylcholine_16_0"], param = param,
    col = paste0(col_sample, "80"),
    peakCol = col_sample[
        chromPeaks(eics["LysoPhosphatidylcholine_16_0"])[, "column"]],
    peakBg = paste0(
        col_sample[chromPeaks(
            eics["LysoPhosphatidylcholine_16_0"])[, "column"]], 20),
    peakPch = 16)

# the tests look great, lets apply to entire dataset
full <- groupChromPeaks(full, param = param)

# how many features:
nrow(featureDefinitions(full))
```

We will now use gap filling to fill in missing values in the dataset. We will
use the `ChromPeakAreaParam` method with default parameters for this.

```{r}
#' Gap filling
#' Number of missing values
sum(is.na(featureValues(full)))

full <- fillChromPeaks(full, param = ChromPeakAreaParam(), chunkSize = 2)

#' How many missing values after
sum(is.na(featureValues(full)))
```

Extract the intensity values for the features and save the data. 

```{r}
#' Extract results as a SummarizedExperiment
library(SummarizedExperiment)

res_full <- quantify(full, method = "sum", filled = FALSE)
assays(res_full)$raw_filled <- featureValues(full, method = "sum",
                                        filled = TRUE )
```

# Flag features in blanks

We will now flag features that are highly present in blanks. We use
`filterFeatures` with the `BlankFlag` method to do this. With `threshold = 2`
(the default), a feature is flagged when its mean intensity in the study
samples is less than twice its mean intensity in the blanks.

We finalize preprocessing by generating a `SummarizedExperiment` object. We also 
remove the previously flagged features from the data for subsequent analysis.
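
The flagging rule can be sketched in base R on a toy matrix. The values are
made up, and the code reflects my understanding of the `BlankFlag` criterion
(study-sample mean compared against `threshold` times the blank mean); it is
not the xcms implementation itself.

```r
#' Toy intensity matrix: 3 features x 5 samples (made-up values)
x <- rbind(
    FT1 = c(1000, 1200, 1100, 100, 120),
    FT2 = c(500, 450, 550, 400, 420),
    FT3 = c(800, 900, 850, NA, NA))
is_blank <- c(FALSE, FALSE, FALSE, TRUE, TRUE)
threshold <- 2

mean_study <- rowMeans(x[, !is_blank, drop = FALSE], na.rm = TRUE)
mean_blank <- rowMeans(x[, is_blank, drop = FALSE], na.rm = TRUE)

#' Flag a feature when its study-sample mean is below threshold x blank mean
flag <- mean_study < threshold * mean_blank
#' Features never detected in blanks give a missing value: treat as not flagged
flag[is.na(flag)] <- FALSE
flag
```

In this toy example only `FT2` is flagged, because its signal in the blanks is
almost as high as in the study samples.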

```{r}
idx <- sampleData(full)$sample_role == "Blank"
full <- filterFeatures(full, BlankFlag(blankIndex = idx, qcIndex = !idx))

featureDefinitions(full)$possible_contaminants[is.na(featureDefinitions(full)$possible_contaminants)] <- FALSE
featureDefinitions(full)$possible_contaminants <- as.logical(featureDefinitions(full)$possible_contaminants)

# we actually remove them for downstream analysis in the sumexp object
nrow(res_full)
res_full <- res_full[!featureDefinitions(full)$possible_contaminants, ]
nrow(res_full) # check that it is working

de <- list(values = as.data.frame(featureValues(full)), definitions = featureDefinitions(full))
write.xlsx(de, paste0(dir, "feature_results.xlsx"), rowNames = TRUE)

# just remove blanks also 
full 
full <- full[sampleData(full)$sample_role != "Blank", keepFeatures = TRUE]
full

res_full
res_full <- res_full[, res_full$sample_role != "Blank"]
res_full
```

Save files at the end of preprocessing

```{r}
save(res_full, file = paste0(dir, "SumExp_full_preprocessing.RData"))
save(full, file = paste0(dir, "full_preprocessing.RData"))
load(paste0(dir, "full_preprocessing.RData"))
load(paste0(dir, "SumExp_full_preprocessing.RData"))
```

# Noise comparison

We will compare the amount of noise between the solvents. We first calculate
the overall signal in the dataset and then the signal that is accounted for by
the detected chromatographic peaks. We then subtract the two to get the noise
signal.
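
As a minimal sketch of this calculation, the toy example below uses made-up
per-sample values. It assumes that the total signal and the peak signal are on
a comparable scale, which still needs to be checked for the real data (see the
note on `"into"` in the chunk below).

```r
#' Made-up per-sample values: summed total ion current and summed intensity
#' of all detected chromatographic peaks
total_signal <- c(S1 = 5e8, S2 = 6e8, S3 = 4e8)
peak_signal <- c(S1 = 3e8, S2 = 5e8, S3 = 1e8)

#' Signal that is not explained by detected peaks
noise <- total_signal - peak_signal

#' The same quantity as a fraction of the total signal of each sample
noise_fraction <- noise / total_signal
round(noise_fraction, 2)
```

Looking at the fraction in addition to the absolute difference makes samples
with very different overall intensities easier to compare.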

```{r}
# overall signal in the dataset 
#' - for each file calculate the sum of intensities 
background  <- spectra(full) |>
    split(fromFile(full)) |>
    lapply(tic) |>
    lapply(sum) |>
    unlist()

# Overall signal that is in the detected chromatographic peaks
#' - check the definition of "into" first, it might need to be rescaled
detected <- apply(assay(res_full), 2, function(x) sum(x, na.rm = TRUE))

# Subtract and plot. Blanks were already removed from both objects above.
names(background) <- names(detected) <- res_full$solvent
noise <- background - detected

f <- factor(names(noise), levels = unique(names(noise)))
group <- split(noise, f)

plot(NULL, xlim = c(1, length(group)), ylim = range(unlist(group)), 
     xaxt = "n", xlab = "Solvents", ylab = "Noise", 
     main = "Noise comparison between solvents")
for (i in seq_along(group)) {
  points(rep(i, length(group[[i]])), group[[i]], pch = 19, col = leg_solvent[i])
}
axis(1, at = seq_along(group), labels = names(group))
```

# Normalisation 

```{r extra-packages, message=FALSE, warning=FALSE}
library(ggfortify)
library(SummarizedExperiment)
library(RColorBrewer)

col_solvent <- leg_solvent[res_full$solvent]
```

We first need to evaluate the data distribution and look for any technical
bias.

```{r counts1, fig.height=5, fig.width=5, include=TRUE}
layout(mat = matrix(1:3, ncol = 1), height = c(0.2, 0.2, 0.8))

par(mar = c(0.2, 4.5, 0.2, 3))
barplot(apply(assay(res_full, "raw"), MARGIN = 2, 
              function(x) sum(!is.na(x))),
        col = col_solvent, ylab = "features raw data", xaxt = "n", 
        space = 0.012)
barplot(apply(assay(res_full, "raw_filled"), MARGIN = 2, 
              function(x) sum(!is.na(x))),
        col = col_solvent, ylab = "features filled data", xaxt = "n", 
        space = 0.012)
boxplot(log2(assay(res_full, "raw_filled")), xaxt = "n",
        ylab = expression(log[2]~abundance~filled~data),
        col = col_solvent, outline=FALSE, medlty = "blank", 
        border = col_solvent, varwidth = TRUE)
points(colMedians(log2(assay(res_full, "raw_filled")), 
                  na.rm = TRUE), type = "b", pch = 16) 
grid(nx = NA, ny = NULL)
legend("topright", col = leg_solvent,
       legend = names(leg_solvent), lty=1, lwd = 2, xpd = TRUE, ncol = 4, 
       cex = 0.8,  bty = "n")
```

```{r rla-plot raw and filled1, fig.cap = "RLA plot for the raw data and filled data. Note: outliers are not drawn."}
par(mfrow = c(1, 1), mar = c(0.2, 4.5, 2.5, 3))
boxplot(rowRla(assay(res_full, "raw_filled"),
               group = res_full$solvent),
        cex = 0.5, pch = 16, boxwex = 1,
        col = col_solvent, ylab = "RLA",
        border = paste0(col_solvent, 40),
        outline = FALSE, xaxt = "n", main = "Relative log abundance", 
        cex.main = 1)
grid(nx = NA, ny = NULL)
abline(h = 0, lty=3, lwd = 1, col = "black")
legend("topright", col = leg_solvent,
       legend = names(leg_solvent), lty=1, lwd = 2, xpd = TRUE, ncol = 3,
       cex = 0.8,  bty = "n")
```
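
For readers less familiar with RLA plots, here is a minimal base R sketch of
the calculation on made-up data. It reflects my understanding of what
`rowRla()` computes (log2 abundance relative to the median log2 abundance of
the same feature within the sample's group) and is not meant to replace it.
The function name `rla()` is hypothetical.

```r
#' Toy abundance matrix: 2 features x 4 samples in 2 groups (made-up values)
x <- rbind(FT1 = c(100, 200, 400, 800),
           FT2 = c(50, 50, 100, 25))
group <- c("A", "A", "B", "B")

#' Hypothetical helper: log2 abundance minus the median log2 abundance of the
#' same feature within the group of the sample
rla <- function(x, group) {
    lx <- log2(x)
    for (g in unique(group)) {
        idx <- group == g
        med <- apply(lx[, idx, drop = FALSE], 1, median, na.rm = TRUE)
        lx[, idx] <- lx[, idx, drop = FALSE] - med
    }
    lx
}

res <- rla(x, group)
res
```

Values centered on 0 within a group indicate that the replicates agree; a
sample whose values are shifted away from 0 points to a sample-wide bias.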

It is important to note that for this dataset the samples were not randomized.
Therefore the order of plotting above is also the injection order.
We can see a clear injection-related bias in the dataset. We will run a median
scaling step to see how well we can correct the variation between samples.

## Quick median scaling

```{r}
#' Compute median and generate normalization factor. We compute it per solvent
#' to keep the technical variation related to the solvent.
assays(res_full)$norm <- assay(res_full, "raw_filled")

#' Loop over each solvent once (not once per sample)
for (i in unique(res_full$solvent)) {
    idx <- res_full$solvent == i
    mdns <- apply(assay(res_full, "raw_filled")[, idx], 2, median, na.rm = TRUE)
    nf_mdn <- mdns / median(mdns)
    assays(res_full)$norm[, idx] <- sweep(assay(res_full, "raw_filled")[, idx], MARGIN = 2,
                                          nf_mdn, '/')
}
```

We also do not use the median from blank samples as we don't want them to 
influence our study samples. Because we separated the blanks when defining the
features (correspondence step), we will have far *fewer* features in blanks and
therefore the intensity distribution for these is not reliable. 
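
The effect of the per-solvent median scaling can be checked on a toy matrix.
The sketch below uses made-up values and follows the same steps as the
normalization chunk above, in plain base R.

```r
#' Toy abundance matrix: 3 features x 4 samples, 2 solvents (made-up values)
x <- cbind(S1 = c(10, 20, 30), S2 = c(20, 40, 60),
           S3 = c(100, 200, 300), S4 = c(400, 800, 1200))
solvent <- c("A", "A", "B", "B")

norm <- x
for (s in unique(solvent)) {
    idx <- solvent == s
    #' Median abundance of each sample within this solvent
    mdns <- apply(x[, idx, drop = FALSE], 2, median, na.rm = TRUE)
    #' Normalization factor relative to the median of the sample medians
    nf <- mdns / median(mdns)
    norm[, idx] <- sweep(x[, idx, drop = FALSE], MARGIN = 2, nf, "/")
}
norm
```

After scaling, samples of the same solvent share the same median, while the
difference between the two solvents is preserved.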

```{r rla-plot after norm2, include = TRUE, fig.cap = "RLA plot before and after normalization. Note: outliers are not drawn.", fig.height= 7, fig.width=5.5}
par(mfrow = c(2, 1), mar = c(1, 4, 3, 1))

boxplot(rowRla(assay(res_full, "raw_filled"),
               group = res_full$solvent),
        cex = 0.5, pch = 16, boxwex = 1,
        col = col_solvent, ylab = "RLA",
        border = paste0(col_solvent, 40),
        outline = FALSE, xaxt = "n", 
        main = "Relative log abundance before normalisation", 
        cex.main = 1)
grid(nx = NA, ny = NULL)
abline(h = 0, lty=3, lwd = 1, col = "black")
legend("topright", col = leg_solvent,
       legend = names(leg_solvent), lty=1, lwd = 2, xpd = TRUE, ncol = 3,
       cex = 0.8,  bty = "n")
boxplot(rowRla(assay(res_full, "norm"),
               group = res_full$solvent),
        cex = 0.5, pch = 16, boxwex = 1,
        col = col_solvent, ylab = "RLA",
        border = paste0(col_solvent, 40),
        outline = FALSE, xaxt = "n", 
        main = "Relative log abundance after normalization",
        cex.main = 1, ylim = c(-1.0 , 2.0))
grid(nx = NA, ny = NULL)
abline(h = 0, lty=3, lwd = 1, col = "black")
legend("topright", col = leg_solvent,
       legend = names(leg_solvent), lty=1, lwd = 2, xpd = TRUE, ncol = 3,
       cex = 0.8,  bty = "n")
```

## Coefficient of variation

The coefficient of variation (or relative standard deviation, RSD) evaluates
how close the data points are to each other. It is especially interesting in
our case as we have triplicates.
Therefore we compute the RSD for each feature across the samples of each
solvent.

The RSD table below therefore tells us how close the triplicates are to each
other overall, per solvent.
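
As a small illustration, the sketch below computes the RSD of one made-up
triplicate in two ways. The robust variant assumes that `rowRsd()` with
`mad = TRUE` divides the median absolute deviation by the absolute median,
which is my reading of the MetaboCoreUtils documentation.

```r
#' Made-up abundances of one feature in a triplicate
x <- c(100, 110, 90)

#' Classic RSD: standard deviation relative to the mean
rsd_classic <- sd(x) / abs(mean(x))

#' Robust RSD (assumed to correspond to mad = TRUE): median absolute
#' deviation relative to the median
rsd_robust <- mad(x) / abs(median(x))

c(classic = rsd_classic, robust = rsd_robust)
```

With only three replicates the robust version is less sensitive to a single
outlying value, which is presumably why `mad = TRUE` is used here.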

```{r include=TRUE, results = "asis"}
# indices of each solvents' triplicates
idx_mh  <- res_full$solvent == "MEOH_H2O" 
idx_m  <- res_full$solvent == "MEOH" 
idx_a  <- res_full$solvent == "ACN" 
idx_am <- res_full$solvent == "ACN_MEOH" 

# Compute Rsds on  all data per solvents
sample_res <- cbind(
    Raw_MH = rowRsd(assay(res_full, "raw_filled")[, idx_mh],
                    na.rm = TRUE, mad = TRUE),
    Norm_MH = rowRsd(assay(res_full, "norm")[, idx_mh],
                     na.rm = TRUE, mad = TRUE),
    Raw_M = rowRsd(assay(res_full, "raw_filled")[, idx_m],
                   na.rm = TRUE, mad = TRUE),
    Norm_M = rowRsd(assay(res_full, "norm")[, idx_m],
                    na.rm = TRUE, mad = TRUE),
    Raw_A = rowRsd(assay(res_full, "raw_filled")[, idx_a],
                   na.rm = TRUE, mad = TRUE),
    Norm_A = rowRsd(assay(res_full, "norm")[, idx_a],
                    na.rm = TRUE, mad = TRUE),
    Raw_AM = rowRsd(assay(res_full, "raw_filled")[, idx_am],
                    na.rm = TRUE, mad = TRUE),
    Norm_AM = rowRsd(assay(res_full, "norm")[, idx_am],
                     na.rm = TRUE, mad = TRUE)
)

#' Compute quantile for better data visualisation
res_df <- data.frame(
    Raw_MH = quantile(sample_res[, "Raw_MH"], na.rm = TRUE),
    Norm_MH = quantile(sample_res[, "Norm_MH"], na.rm = TRUE),
    Raw_M = quantile(sample_res[, "Raw_M"], na.rm = TRUE),
    Norm_M = quantile(sample_res[, "Norm_M"], na.rm = TRUE),
    Raw_A = quantile(sample_res[, "Raw_A"], na.rm = TRUE),
    Norm_A = quantile(sample_res[, "Norm_A"], na.rm = TRUE),
    Raw_AM = quantile(sample_res[, "Raw_AM"], na.rm = TRUE),
    Norm_AM = quantile(sample_res[, "Norm_AM"], na.rm = TRUE)
    
)
cpt <- paste0("Distribution of RSD values across samples for the raw and ",
              "normalized data.")
pandoc.table(res_df, style = "rmarkdown", caption = cpt)
```

Both the RLA plot and the CV values show that the normalization step was
successful. The RLA plot shows that the sample medians got closer to each other,
and 75% of our features have a CV below 30% for each solvent, which is great.


# Comparison on overall data

- summary plot: the plot below is one of the summary plots that compare results
after preprocessing and normalization.

```{r fig.height=8, fig.width=6}
#' Quantile of RSD values after norm - non contaminant
res_df <- data.frame(
    MetOH_H2O = quantile(sample_res[, "Norm_MH"], na.rm = TRUE),
    ACN = quantile(sample_res[, "Norm_A"], na.rm = TRUE),
    MetOH = quantile(sample_res[, "Norm_M"], na.rm = TRUE),
    ACN_MetOH = quantile(sample_res[, "Norm_AM"], na.rm = TRUE)
)

# Intensity and missing values 
res_mh <- res_full[, res_full$solvent == "MEOH_H2O"]
res_m <- res_full[, res_full$solvent == "MEOH"]
res_a <- res_full[, res_full$solvent == "ACN"]
res_am <- res_full[, res_full$solvent == "ACN_MEOH"]

idx_fts <- cbind(
    MetOH_H2O = rowSums(is.na(assay(res_mh, "norm"))) < 2,
    ACN = rowSums(is.na(assay(res_a, "norm"))) < 2,
    MetOH = rowSums(is.na(assay(res_m, "norm"))) < 2,
    ACN_MetOH = rowSums(is.na(assay(res_am, "norm"))) < 2
)

res_mh <- res_mh[rowSums(is.na(assay(res_mh, "norm"))) < 2,]
res_m <- res_m[rowSums(is.na(assay(res_m, "norm"))) < 2,]
res_a <- res_a[rowSums(is.na(assay(res_a, "norm"))) < 2,]
res_am <- res_am[rowSums(is.na(assay(res_am, "norm"))) < 2,]

intensity <- cbind(
    MetOH_H2O = log2(as.numeric(assay(res_mh, "norm"))),
    ACN = log2(as.numeric(assay(res_a, "norm"))),
    MetOH = log2(as.numeric(assay(res_m, "norm"))),
    ACN_MetOH = log2(as.numeric(assay(res_am, "norm"))))

num_features <- cbind(
    MetOH_H2O = nrow(res_mh),
    ACN = nrow(res_a),
    MetOH = nrow(res_m),
    ACN_MetOH = nrow(res_am)
)

missing_values <- cbind(
    MetOH_H2O = sum(is.na(assay(res_mh, "raw")))/length(assay(res_mh, "raw")) * 100,
    ACN = sum(is.na(assay(res_a, "raw")))/length(assay(res_a, "raw")) * 100,
    MetOH = sum(is.na(assay(res_m, "raw")))/length(assay(res_m, "raw")) * 100,
    ACN_MetOH = sum(is.na(assay(res_am, "raw")))/length(assay(res_am, "raw")) * 100
)

#General plot - now without flagged features
layout(mat = matrix(1:3, ncol = 1), height = c(0.3, 0.3, 0.8))
par(mar = c(1, 4.5, 1, 3))
barplot(colSums(num_features),
    col = leg_solvent,
    ylab = "Number of features", space = 0.05, ylim = c(0, 5000))
barplot(c(missing_values),
        ylab = "% of missing values", col = leg_solvent, space = 0.05, ylim = c(0, 60))
vioplot(intensity, 
        ylab = "Log2 intensity", col = leg_solvent, space = 0.05)
```

```{r}
cpt <- paste0("RSD values distribution across samples for the ",
              "normalized data for each solvent type.")
pandoc.table(res_df, style = "rmarkdown", caption = cpt)
```

- number of features / intensity per rt slices 

```{r fig.height=9, fig.width=8}
# Bin features per RT slices
vc <- rowData(res_full)$rtmed 
breaks <- seq(0, max(vc, na.rm = TRUE) + 1, length.out = 15) |> 
    round(0)
cuts <- cut(vc, breaks = breaks, include.lowest = TRUE)

table(cuts)

num_features_solvent <- apply(idx_fts, MARGIN = 2, function(x) table(cuts[x]))

idx_fts <- as.data.frame(idx_fts)

ftc <- function(res_solvent, fts_idx) {
    tmp <- rowSums(assay(res_solvent, "norm"), na.rm = TRUE)
    cuts_tmp <- cuts[fts_idx]
    #' Sum the per-feature intensities within each RT slice. No `simplify`
    #' argument here: `sum()` would add it to the total as an extra value.
    split(tmp, cuts_tmp) |>
        vapply(sum, numeric(1), na.rm = TRUE)
}

intensity_solvent <- list(
    MetOH_H2O = ftc(res_mh, idx_fts$MetOH_H2O),
    ACN = ftc(res_a, idx_fts$ACN),
    MetOH = ftc(res_m, idx_fts$MetOH),
    ACN_MetOH = ftc(res_am, idx_fts$ACN_MetOH)
    )

# Transform intensity to log2 scale
intensity_solvent <- lapply(intensity_solvent, log2)

# Plot
layout(mat = matrix(1:2, ncol = 1), heights = c(0.5, 0.5))
par(mar = c(0.5, 4.5, 2, 3))

# Plot number of features
ylim_features <- c(0, max(unlist(num_features_solvent)))
plot(num_features_solvent[,1], col = leg_solvent[1], ylab = "Number of features",
     xlab = "", type = "b", pch = 16, xaxt = "n", ylim = ylim_features,
     main = "Analysis along the RT axis for each solvent")
for (i in 2:ncol(num_features_solvent)) {
  lines(num_features_solvent[,i], col = leg_solvent[i], type = "b", pch = 16)
}
axis(1, at = 1:length(num_features_solvent[,1]), labels = FALSE)
grid()
legend("top", legend = names(intensity_solvent), col = leg_solvent, pch = 16, 
       cex = 1, horiz = TRUE, bty = "n")

# Plot intensity
par(mar = c(5, 4.5, 2, 3))
ylim_intensity <- range(unlist(intensity_solvent))
plot(intensity_solvent[[1]], type = "b", pch = 16, xlab = "",
     ylab = "Log2 intensity", col = leg_solvent[1], xaxt = "n", ylim = ylim_intensity)
for (i in 2:length(intensity_solvent)) {
  lines(intensity_solvent[[i]], type = "b", pch = 16, col = leg_solvent[i])
}
axis(1, at = 1:length(intensity_solvent[[1]]), 
     labels = names(intensity_solvent[[1]]), las = 2, cex.axis = 0.8)
grid()
mtext("Retention time (s)", side = 1, line = 4, cex = 1)
```
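
The binning used above relies on `cut()` and `split()`. The following toy
example, with made-up retention times and intensities, shows the principle on
three slices.

```r
#' Made-up retention times (s) and intensities of 7 features
rt <- c(45, 120, 130, 400, 410, 420, 800)
intensity <- c(1e4, 2e4, 3e4, 1e5, 2e5, 3e5, 5e3)

#' Three equally wide RT slices between 0 and 840 s
breaks <- seq(0, 840, length.out = 4)
slices <- cut(rt, breaks = breaks, include.lowest = TRUE)

#' Number of features and summed intensity per slice
n_per_slice <- table(slices)
int_per_slice <- vapply(split(intensity, slices), sum, numeric(1))

n_per_slice
int_per_slice
```

Because `cut()` returns a factor with one level per slice, empty slices are
kept by `table()` and `split()`, so all solvents end up with the same number
of bins.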

- median of median

```{r fig.height=9, fig.width=8}
ftc <- function(res_solvent, fts_idx) {
    tmp <- rowMedians(assay(res_solvent, "norm"), na.rm = TRUE)
    cuts_tmp <- cuts[fts_idx]
    #' Median of the per-feature medians within each RT slice
    split(tmp, cuts_tmp) |>
        vapply(median, numeric(1), na.rm = TRUE)
}

intensity_solvent <- list(
    MetOH_H2O = ftc(res_mh, idx_fts$MetOH_H2O),
    ACN = ftc(res_a, idx_fts$ACN),
    MetOH = ftc(res_m, idx_fts$MetOH),
    ACN_MetOH = ftc(res_am, idx_fts$ACN_MetOH)
    )

# Transform intensity to log2 scale
intensity_solvent <- lapply(intensity_solvent, log2)

# Plot
layout(mat = matrix(1:2, ncol = 1), heights = c(0.5, 0.5))
par(mar = c(0.5, 4.5, 2, 3))

# Plot number of features
ylim_features <- c(0, max(unlist(num_features_solvent)))
plot(num_features_solvent[,1], col = leg_solvent[1], ylab = "Number of features",
     xlab = "", type = "b", pch = 16, xaxt = "n", ylim = ylim_features,
     main = "Analysis along the RT axis for each solvent")
for (i in 2:ncol(num_features_solvent)) {
  lines(num_features_solvent[,i], col = leg_solvent[i], type = "b", pch = 16)
}
axis(1, at = 1:length(num_features_solvent[,1]), labels = FALSE)
grid()
legend("top", legend = names(intensity_solvent), col = leg_solvent, pch = 16, 
       cex = 1, horiz = TRUE, bty = "n")

# Plot intensity
par(mar = c(5, 4.5, 2, 3))
ylim_intensity <- range(unlist(intensity_solvent))
plot(intensity_solvent[[1]], type = "b", pch = 16, xlab = "",
     ylab = "Log2 intensity", col = leg_solvent[1], xaxt = "n", ylim = ylim_intensity)
for (i in 2:length(intensity_solvent)) {
  lines(intensity_solvent[[i]], type = "b", pch = 16, col = leg_solvent[i])
}
axis(1, at = 1:length(intensity_solvent[[1]]), 
     labels = names(intensity_solvent[[1]]), las = 2, cex.axis = 0.8)
grid()
mtext("Retention time (s)", side = 1, line = 4, cex = 1)
```


- overlap of features between solvents

```{r echo=TRUE}
# Create a data frame for the UpSet plot - need to fix
upset_df <- lapply(idx_fts, as.integer)
names(leg_solvent) <- names(upset_df)
# Plot the UpSet plot
library(UpSetR)
upset(as.data.frame(upset_df),
      sets = c("MetOH_H2O", "ACN", "MetOH", "ACN_MetOH"),
      sets.bar.color = leg_solvent,
      mainbar.y.label = "Number of common features",
      keep.order = TRUE, mainbar.y.max = 3000)
```


```{r}
save(res_full, file = paste0(dir, "SumExp_full_normalisation.RData"))
```

# MS/MS annotation 

First we need to prepare the spectra input.

```{r prep-spectra}
## dataset 
# get spectra data and change their backend 
idx_fts <- rownames(featureDefinitions(full)) %in% rownames(res_full)
#' Flagged contaminants are not in res_full and are therefore not annotated
full_spectra <- featureSpectra(full, msLevel = 2L, features = idx_fts)
#' Remove peaks with an intensity below 5% of the spectrum's base peak
low_int <- function(x, ...) {
    x > max(x, na.rm = TRUE) * 0.05
}
full_spectra <- filterIntensity(full_spectra, intensity = low_int)

length(full_spectra)
full_spectra$feature_id |>
    table() |>
    quantile()

full_spectra |>
    lengths() |>
    quantile()

#' Remove the precursor peak, i.e. peaks with an m/z matching the precursor
#' m/z (within 50 ppm). With `mz = ">="` all peaks at or above the precursor
#' m/z would be removed instead.
full_spectra <- full_spectra |>
    filterPrecursorPeaks(mz = "==", ppm = 50)

#' Remove spectra with a single peak.
full_spectra <- full_spectra[lengths(full_spectra) > 1]

#' Add Spectra index 
full_spectra$spectra_idx <- seq_len(length(full_spectra))

full_spectra <- setBackend(full_spectra, MsBackendMemory())
full_spectra <- applyProcessing(full_spectra)

save(full_spectra, file = paste0(dir, "full_spectra.RData"))
```
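
To illustrate what the `low_int()` filter keeps, the toy example below applies
the same function to a made-up peak list. The m/z and intensity values are
invented for the illustration.

```r
#' Same filter function as above: keep peaks above 5% of the base peak
low_int <- function(x, ...) {
    x > max(x, na.rm = TRUE) * 0.05
}

#' Made-up fragment spectrum
mz <- c(55.1, 73.0, 101.2, 150.1, 203.1)
intensity <- c(20, 1000, 49, 51, 400)

#' The base peak is 1000, so the cut-off is 50
keep <- low_int(intensity)
cbind(mz, intensity)[keep, ]
```

Peaks with an intensity of exactly 5% of the base peak are removed as well,
because the comparison is strict.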

```{r loadlibrary}
library(MetaboAnnotation)
library(MsBackendSql)
library(RSQLite)
```

For this analysis we use the GNPS database. We first load the database from a local sqlite file.

```{r loadgnps}
# "/home/plouail/MsBackendSql.GNPS.matchms.cleaned.v1.sqlite" for cluster
#"C:/Users/plouail/Documents/MsBackendSql.GNPS.matchms.cleaned.v1.sqlite"
#load gnps library
mb <- Spectra(file.path("/home/plouail/pilot_study/MsBackendSql.GNPS.matchms.cleaned.v1.sqlite"),
    drv = SQLite(), source = MsBackendOfflineSql())
```

Below we prepare the database spectra the same way as we prepared our own 
spectra.

```{r filtergnps, echo=TRUE}
mb <- setBackend(mb, MsBackendMemory())
low_int <- function(x, ...) {
    x > max(x, na.rm = TRUE) * 0.05
}
## phili change this to use filterRanges()
#' remove negative polarity 
mb <- mb[mb$polarity == 1]

#' Do same filtering as for our spectra data 
mb <- filterIntensity(mb, intensity = low_int)
mb <- filterPrecursorPeaks(mb, mz = "==", ppm = 50)
mb <- mb[lengths(mb) > 1]
```

We then compare the spectra of our dataset against the database and keep the
matches with a similarity score of at least 0.8.

```{r matching, echo=TRUE}
#' Disable parallel processing
register(SerialParam())

#' Matching
prm <- CompareSpectraParam(ppm = 10, requirePrecursor = TRUE,
                           THRESHFUN = function(x) which(x >= 0.8)) 
mtch_full <- matchSpectra(full_spectra, mb, param = prm)
mtch_full
#' really low percentage of MS2 spectra matched.
length(whichQuery(mtch_full)) / length(mtch_full) * 100

#' for how many features do we have MS2 spectra
length(unique(mtch_full$feature_id))
    
#' Keep only the query that got matches 
mtch_full <- mtch_full[whichQuery(mtch_full)]

#' for how many features do we have MS2 spectra WITH db matches?
length(unique(mtch_full$feature_id))

# Extract results 
md_full <- matchedData(mtch_full, c("rtime", "precursorMz", "feature_id", 
                                "target_inchikey", "target_compound_name", 
                                "score", "spectra_idx"))

md_full

save(md_full, file = "md_full.RData")
```
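
The role of the score threshold can be illustrated with a simplified sketch. It assumes the peaks of the two spectra are already aligned, and it uses a plain cosine similarity in base R. The actual scoring in `matchSpectra()` also maps peaks between spectra by m/z and may weight them differently, so the numbers here are only indicative. `toy_cosine()`, `query` and `targets` are hypothetical names.

```r
# Cosine-type similarity between two intensity vectors whose peaks are
# assumed to be aligned already (same m/z at the same position)
toy_cosine <- function(x, y) {
    sum(x * y) / sqrt(sum(x^2) * sum(y^2))
}

# Made-up query spectrum and three made-up reference spectra
query <- c(100, 50, 10, 0)
targets <- list(
    similar   = c(90, 55, 12, 0),
    different = c(0, 5, 80, 100),
    identical = c(100, 50, 10, 0)
)

scores <- vapply(targets, toy_cosine, numeric(1), x = query)
round(scores, 3)

# Same thresholding as THRESHFUN: indices of references scoring >= 0.8
hits <- which(scores >= 0.8)
hits
```

Only the references passing the threshold are reported as matches. A query without any hit is what gets dropped by the `whichQuery()` subsetting.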

Below we remove duplicate matches: for each feature we keep only the best
scoring match per inchikey.

```{r echo=TRUE}
#' For each feature, keep only the best scoring match per inchikey
rmv_duplicate <- function(md) {
    lapply(split(md, md$feature_id), function(x) {
        lapply(split(x, x$target_inchikey), function(z) {
            z[which.max(z$score), ]
        }) |>
            do.call(what = rbind)
    }) |>
        do.call(what = rbind) |>
        as.data.frame()
}

md_full <- rmv_duplicate(md_full)

md_full

toberefined <- cbind(md_full, assay(res_full)[md_full$feature_id,])
write.csv(toberefined, "toberefined_VAMS_solvent.csv")
```
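
As a sanity check of the deduplication idea, the sketch below applies it to a small made-up table. It is an alternative formulation (sort by score, then drop repeated feature/inchikey pairs) and not the `rmv_duplicate()` function itself. `toy_md` and its values are invented for illustration.

```r
# Made-up match table: FT01 matches KEY_A twice and KEY_B once,
# FT02 matches KEY_A twice
toy_md <- data.frame(
    feature_id = c("FT01", "FT01", "FT01", "FT02", "FT02"),
    target_inchikey = c("KEY_A", "KEY_A", "KEY_B", "KEY_A", "KEY_A"),
    score = c(0.85, 0.93, 0.81, 0.90, 0.88)
)

# Order by decreasing score, then keep the first (best) row of every
# feature/inchikey combination
toy_md <- toy_md[order(toy_md$score, decreasing = TRUE), ]
best <- toy_md[!duplicated(toy_md[, c("feature_id", "target_inchikey")]), ]
best <- best[order(best$feature_id, best$target_inchikey), ]
best
```

A feature can still have several rows after this step if it matches different compounds (different inchikeys).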

The annotations still need to be refined before proper plotting. The plots below
only serve to prepare the code; they are not the actual results!

# Plot resulting compounds

```{r eval=FALSE, include=FALSE}
tmpdr <- paste0(dir, "full/annotation/")
dir.create(tmpdr, recursive = TRUE, showWarnings = FALSE)
for (i in seq_len(nrow(md_full))) {
    chrom <- featureChromatograms(full, features = md_full$feature_id[i])
    png(paste0(tmpdr, "feature_", md_full$feature_id[i], ".png"),
        width = 12, height = 8, units = "cm", res = 600, pointsize = 4)
    plot(chrom, main = paste0(md_full$target_compound_name[i], ": ", md_full$feature_id[i]),
         col = paste0(col_sample, 80), 
         peakBg = paste0(col_sample[chromPeaks(chrom)[, "sample"]], 40))
    grid()
    legend("topright", col = leg_sample,
           legend = names(leg_sample), lty = 1)
    abline(v = md_full$rtime[i], col = "red", lty = 3)
    dev.off()
}
```

# Comparison on annotated data 

- summary plot: the plot below is one of the summary plots that compare the
results after preprocessing and normalization.

```{r fig.height=8, fig.width=6}
fts <- unique(md_full$feature_id)

# Intensity and missing values 
res_mh <- res_full[fts, res_full$solvent == "MEOH_H2O"]
res_m <- res_full[fts, res_full$solvent == "MEOH"]
res_a <- res_full[fts, res_full$solvent == "ACN"]
res_am <- res_full[fts, res_full$solvent == "ACN_MEOH"]

idx_fts <- cbind(
    MetOH_H2O = rowSums(is.na(assay(res_mh, "norm"))) < 2,
    ACN = rowSums(is.na(assay(res_a, "norm"))) < 2,
    MetOH = rowSums(is.na(assay(res_m, "norm"))) < 2,
    ACN_MetOH = rowSums(is.na(assay(res_am, "norm"))) < 2
)

res_mh <- res_mh[rowSums(is.na(assay(res_mh, "norm"))) < 2,]
res_m <- res_m[rowSums(is.na(assay(res_m, "norm"))) < 2,]
res_a <- res_a[rowSums(is.na(assay(res_a, "norm"))) < 2,]
res_am <- res_am[rowSums(is.na(assay(res_am, "norm"))) < 2,]

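#' A list is used because the number of retained features can differ between
#' solvents (cbind() would recycle the shorter vectors).
intensity <- list(
    MetOH_H2O = log2(as.numeric(assay(res_mh, "norm"))),
    ACN = log2(as.numeric(assay(res_a, "norm"))),
    MetOH = log2(as.numeric(assay(res_m, "norm"))),
    ACN_MetOH = log2(as.numeric(assay(res_am, "norm"))))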

num_features <- cbind(
    MetOH_H2O = nrow(res_mh),
    ACN = nrow(res_a),
    MetOH = nrow(res_m),
    ACN_MetOH = nrow(res_am)
)

missing_values <- cbind(
    MetOH_H2O = sum(is.na(assay(res_mh, "raw")))/length(assay(res_mh, "raw")) * 100,
    ACN = sum(is.na(assay(res_a, "raw")))/length(assay(res_a, "raw")) * 100,
    MetOH = sum(is.na(assay(res_m, "raw")))/length(assay(res_m, "raw")) * 100,
    ACN_MetOH = sum(is.na(assay(res_am, "raw")))/length(assay(res_am, "raw")) * 100
)

#General plot - now without flagged features
layout(mat = matrix(1:3, ncol = 1), heights = c(0.3, 0.3, 0.8))
par(mar = c(1, 4.5, 1, 3))
barplot(colSums(num_features),
    col = leg_solvent,
    ylab = "Number of features", space = 0.05)
barplot(c(missing_values),
        ylab = "% of missing values", col = leg_solvent, space = 0.05, 
        ylim = c(0, 60))
vioplot(intensity, 
        ylab = "Log2 intensity", col = leg_solvent, space = 0.05)
```
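
The missing value filter used for the summary plot can be sketched on a toy matrix. This assumes a single made-up matrix `toy_norm` (features in rows, replicates in columns), whereas the analysis itself filters on the "norm" assay and counts missing values on the "raw" assay.

```r
# Made-up intensity matrix: 3 features x 3 replicates
toy_norm <- matrix(
    c(10, 12, NA,
      NA, NA,  8,
       5,  6,  7),
    nrow = 3, byrow = TRUE,
    dimnames = list(c("FT01", "FT02", "FT03"), c("rep1", "rep2", "rep3"))
)

# Keep features with fewer than 2 missing values
n_missing <- rowSums(is.na(toy_norm))
keep <- n_missing < 2
toy_kept <- toy_norm[keep, , drop = FALSE]

# Number of retained features and percentage of missing values among them
nrow(toy_kept)
pct_missing <- sum(is.na(toy_kept)) / length(toy_kept) * 100
pct_missing
```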

- number of features / intensity per RT slice

```{r fig.height=9, fig.width=8}
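# Bin features per RT slices
#' idx_fts and the per-solvent objects only contain the annotated features
#' (fts), so the retention times are restricted to the same features.
vc <- rowData(res_full)[fts, "rtmed"]
breaks <- seq(0, max(vc, na.rm = TRUE) + 1, length.out = 15) |> 
    round(0)
cuts <- cut(vc, breaks = breaks, include.lowest = TRUE)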

table(cuts)

num_features_solvent <- apply(idx_fts, MARGIN = 2, function(x) table(cuts[x]))

idx_fts <- as.data.frame(idx_fts)

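ftc <- function(res_solvent, fts_idx) {
    tmp <- rowSums(assay(res_solvent, "norm"), na.rm = TRUE)
    cuts_tmp <- cuts[fts_idx]
    #' Summed intensity per RT bin. Empty bins are set to NA so that they do
    #' not end up as -Inf after the log2 transformation.
    vapply(split(tmp, cuts_tmp), function(z) {
        if (length(z)) sum(z, na.rm = TRUE) else NA_real_
    }, numeric(1))
}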

intensity_solvent <- list(
    MetOH_H2O = ftc(res_mh, idx_fts$MetOH_H2O),
    ACN = ftc(res_a, idx_fts$ACN),
    MetOH = ftc(res_m, idx_fts$MetOH),
    ACN_MetOH = ftc(res_am, idx_fts$ACN_MetOH)
    )

# Transform intensity to log2 scale
intensity_solvent <- lapply(intensity_solvent, log2)

# Plot
layout(mat = matrix(1:2, ncol = 1), heights = c(0.5, 0.5))
par(mar = c(0.5, 4.5, 2, 3))

# Plot number of features
ylim_features <- c(0, max(unlist(num_features_solvent)))
plot(num_features_solvent[,1], col = leg_solvent[1], ylab = "Number of features",
     xlab = "", type = "b", pch = 16, xaxt = "n", ylim = ylim_features,
     main = "Analysis along the RT axis for each solvent")
for (i in 2:ncol(num_features_solvent)) {
  lines(num_features_solvent[,i], col = leg_solvent[i], type = "b", pch = 16)
}
axis(1, at = 1:length(num_features_solvent[,1]), labels = FALSE)
grid()
legend("top", legend = names(intensity_solvent), col = leg_solvent, pch = 16, 
       cex = 1, horiz = TRUE, bty = "n")

# Plot intensity
par(mar = c(5, 4.5, 2, 3))
ylim_intensity <- range(unlist(intensity_solvent), na.rm = TRUE)
plot(intensity_solvent[[1]], type = "b", pch = 16, xlab = "",
     ylab = "Log2 intensity", col = leg_solvent[1], xaxt = "n", ylim = ylim_intensity)
for (i in 2:length(intensity_solvent)) {
  lines(intensity_solvent[[i]], type = "b", pch = 16, col = leg_solvent[i])
}
axis(1, at = 1:length(intensity_solvent[[1]]), 
     labels = names(intensity_solvent[[1]]), las = 2, cex.axis = 0.8)
grid()
mtext("Retention time (s)", side = 1, line = 4, cex = 1)
```
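
The retention time binning can be illustrated with a few made-up features. The sketch below only uses base R (`cut()`, `table()`, `split()`), and all `toy_*` values are invented.

```r
# Made-up retention times (seconds) and summed intensities of 6 features
toy_rt <- c(12, 45, 61, 130, 170, 178)
toy_int <- c(1000, 2000, 4000, 500, 1500, 2500)

# Three equally wide RT bins: 0-60, 60-120 and 120-180 seconds
toy_breaks <- seq(0, 180, length.out = 4)
toy_cuts <- cut(toy_rt, breaks = toy_breaks, include.lowest = TRUE)

# Number of features and summed intensity per bin
n_per_bin <- table(toy_cuts)
n_per_bin
sum_per_bin <- vapply(split(toy_int, toy_cuts), sum, numeric(1))
sum_per_bin
```

Because `cut()` returns a factor with one level per bin, `table()` and `split()` also report bins that contain no feature. This keeps the x axis identical for all solvents.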

- median of the per-feature median intensities per RT slice

```{r fig.height=9, fig.width=8}
ftc <- function(res_solvent, fts_idx) {
    tmp <- rowMedians(assay(res_solvent, "norm"), na.rm = TRUE)
    cuts_tmp <- cuts[fts_idx]
    #' Median of the per-feature medians in each RT bin (NA for empty bins)
    vapply(split(tmp, cuts_tmp), median, numeric(1), na.rm = TRUE)
}

intensity_solvent <- list(
    MetOH_H2O = ftc(res_mh, idx_fts$MetOH_H2O),
    ACN = ftc(res_a, idx_fts$ACN),
    MetOH = ftc(res_m, idx_fts$MetOH),
    ACN_MetOH = ftc(res_am, idx_fts$ACN_MetOH)
    )

# Transform intensity to log2 scale
intensity_solvent <- lapply(intensity_solvent, log2)

# Plot
layout(mat = matrix(1:2, ncol = 1), heights = c(0.5, 0.5))
par(mar = c(0.5, 4.5, 2, 3))

# Plot number of features
ylim_features <- c(0, max(unlist(num_features_solvent)))
plot(num_features_solvent[,1], col = leg_solvent[1], ylab = "Number of features",
     xlab = "", type = "b", pch = 16, xaxt = "n", ylim = ylim_features,
     main = "Analysis along the RT axis for each solvent")
for (i in 2:ncol(num_features_solvent)) {
  lines(num_features_solvent[,i], col = leg_solvent[i], type = "b", pch = 16)
}
axis(1, at = 1:length(num_features_solvent[,1]), labels = FALSE)
grid()
legend("top", legend = names(intensity_solvent), col = leg_solvent, pch = 16, 
       cex = 1, horiz = TRUE, bty = "n")

# Plot intensity
par(mar = c(5, 4.5, 2, 3))
ylim_intensity <- range(unlist(intensity_solvent), na.rm = TRUE)
plot(intensity_solvent[[1]], type = "b", pch = 16, xlab = "",
     ylab = "Log2 intensity", col = leg_solvent[1], xaxt = "n", ylim = ylim_intensity)
for (i in 2:length(intensity_solvent)) {
  lines(intensity_solvent[[i]], type = "b", pch = 16, col = leg_solvent[i])
}
axis(1, at = 1:length(intensity_solvent[[1]]), 
     labels = names(intensity_solvent[[1]]), las = 2, cex.axis = 0.8)
grid()
mtext("Retention time (s)", side = 1, line = 4, cex = 1)
```
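
A small worked example of the "median of medians" summary, on made-up data. For simplicity it uses base R `apply()` in place of `rowMedians()`, and a hypothetical two-level factor `toy_bin` in place of the RT bins.

```r
# Made-up intensities: 4 features x 3 replicates
toy_norm <- matrix(
    c(100, 120, 110,
      200, 180, 220,
       10,  14,  12,
       50,  NA,  70),
    nrow = 4, byrow = TRUE,
    dimnames = list(paste0("FT0", 1:4), paste0("rep", 1:3))
)
# Hypothetical RT bin of each feature
toy_bin <- factor(c("early", "early", "late", "late"),
                  levels = c("early", "late"))

# Step 1: median across replicates for each feature
feature_median <- apply(toy_norm, 1, median, na.rm = TRUE)
# Step 2: median of these feature medians within each bin
bin_median <- vapply(split(feature_median, toy_bin), median, numeric(1))
log2(bin_median)
```

Compared to the summed intensity, this summary is less driven by a few very abundant features within a bin.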


- overlap of features between solvents

```{r echo=TRUE}
# Create a data frame for the UpSet plot - need to fix
upset_df <- lapply(idx_fts, as.integer)
names(leg_solvent) <- names(upset_df)
# Plot the UpSet plot
library(UpSetR)
upset(as.data.frame(upset_df), sets = c("MetOH_H2O", "ACN", "MetOH", "ACN_MetOH"),
      sets.bar.color = leg_solvent, mainbar.y.label = "Number of common features", keep.order = TRUE, mainbar.y.max= 100)
```
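
The numbers behind an UpSet plot can be checked by hand. The sketch below counts, for a made-up logical table `toy_idx`, how many features fall into each exact combination of solvents. It uses base R only and is not how `UpSetR` computes its bars internally.

```r
# Made-up detection table: one row per feature, one column per solvent
toy_idx <- data.frame(
    MetOH_H2O = c(TRUE, TRUE, FALSE, TRUE, FALSE, TRUE),
    ACN       = c(TRUE, FALSE, FALSE, TRUE, TRUE, FALSE),
    MetOH     = c(TRUE, TRUE, FALSE, FALSE, FALSE, TRUE),
    ACN_MetOH = c(TRUE, FALSE, TRUE, TRUE, FALSE, FALSE)
)

# Set sizes (horizontal bars of the UpSet plot)
set_size <- colSums(toy_idx)
set_size

# Exact solvent combination of each feature, then count per combination
combo <- apply(toy_idx, 1, function(z) {
    paste(names(toy_idx)[z], collapse = " & ")
})
overlap <- sort(table(combo), decreasing = TRUE)
overlap
```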


```{r}
fts <- md_full$feature_id
#' Subset per solvent without the missing value filter applied above, so that
#' every annotated feature is available for every solvent.
sub_mh <- res_full[fts, res_full$solvent == "MEOH_H2O"]
sub_m <- res_full[fts, res_full$solvent == "MEOH"]
sub_a <- res_full[fts, res_full$solvent == "ACN"]
sub_am <- res_full[fts, res_full$solvent == "ACN_MEOH"]
Summary_table <- cbind(md_full[, c("feature_id", "rtime", "precursorMz", "target_compound_name")], 
                       CV_MetOH_H2O = rowRsd(assay(sub_mh, "norm"), na.rm = TRUE, mad = TRUE),
                       CV_MetOH = rowRsd(assay(sub_m, "norm"), na.rm = TRUE, mad = TRUE),
                       CV_ACN = rowRsd(assay(sub_a, "norm"), na.rm = TRUE, mad = TRUE),
                       CV_ACN_MetOH = rowRsd(assay(sub_am, "norm"), na.rm = TRUE, mad = TRUE),
                       Average_int_MetOH_H2O = rowMeans(assay(sub_mh, "norm"), na.rm = TRUE),
                       Average_int_MetOH = rowMeans(assay(sub_m, "norm"), na.rm = TRUE),
                       Average_int_ACN = rowMeans(assay(sub_a, "norm"), na.rm = TRUE),
                       Average_int_ACN_MetOH = rowMeans(assay(sub_am, "norm"), na.rm = TRUE),
                       Missing_values_MetOH_H2O = rowSums(is.na(assay(sub_mh, "raw"))),
                       Missing_values_MetOH = rowSums(is.na(assay(sub_m, "raw"))),
                       Missing_values_ACN = rowSums(is.na(assay(sub_a, "raw"))),
                       Missing_values_ACN_MetOH = rowSums(is.na(assay(sub_am, "raw")))) |>
    as.data.frame()
                       
cpt <- "Summary table of the annotated compounds for each solvent type."
pandoc.table(head(Summary_table), style = "rmarkdown", caption = cpt, split.tables = 150)
write.csv(Summary_table, file = paste0(dir, "Summary_table_VAMS.csv"))
```
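
To show why the CV columns use `mad = TRUE`, here is a hedged sketch of a robust relative standard deviation. It assumes that the robust variant is MAD divided by the absolute median and the classical one is SD divided by the absolute mean, which is my understanding of `rowRsd()` but should be checked against the MetaboCoreUtils documentation. `toy_rsd()` is a hypothetical helper, not the package function.

```r
# Hypothetical re-implementation for illustration only
toy_rsd <- function(x, mad = FALSE, na.rm = TRUE) {
    if (mad) {
        stats::mad(x, na.rm = na.rm) / abs(stats::median(x, na.rm = na.rm))
    } else {
        stats::sd(x, na.rm = na.rm) / abs(mean(x, na.rm = na.rm))
    }
}

# Made-up replicate intensities with one outlier
toy_x <- c(100, 102, 98, 101, 500)

rsd_classic <- toy_rsd(toy_x)
rsd_robust <- toy_rsd(toy_x, mad = TRUE)
c(classic = rsd_classic, robust = rsd_robust)
```

With a single outlying replicate the classical RSD becomes very large, while the MAD-based value still reflects the spread of the remaining replicates.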

Session and version info: 

```{r}
sessionInfo()
```