In the part of the workshop, we will reproduce some of the analysis from the paper Statistical approaches for differential expression analysis in metatranscriptomics. In this paper, the authors use different synthetic metatranscriptomic dataset to test a number of filtering, normalization and modeling strategies to identify the best analysis method.
- Understand the different filtering and normalization procedures for METAT data
- Get hands-on experience modeling gene expression data
- Understand the need for matching METAG data in METAT experiments
The authors first generate a number of synthetic datasets. Specifically, they simulated microbial communities consisting of 100 different species across 100 samples. For details about the simulation, please refer to the original paper. The samples were than randomly assigned to a binary Phenotype, and 'spiked' with certain genes to be associated with the phenotype. For this workshop we will work with true-exp
dataset, which has 10% of the transcripts strongly associated with the phenotype. All of the synthetic data generated in the paper can be found here.
We will reproduce 3 different models from the paper:
Model | Design | Description |
---|---|---|
M1: Naive RNA | TSS ~ Phenotype | Community total sum scaled (TSS) abundance of a gene is modeled as a function of phenotype |
M2: Within-taxon RNA | taxon TSS ~ Phenotype | Taxon TSS abundance ofa gene is modeled as a function of phenotype |
M4: RNA/DNA ratio | TSS(RNA)/TSS(DNA) ~ Phenotype | Ratio of TSS RNA to TSS DNA is modeled as a function of phenotype |
Author test out a few different filtering strategies on the data prior to running the models. We will use semi-strict
filtering, and then time permitting, re-run the analysis with strict
filtering and compare the results.
Semi-strict: Remove sample if both transcript abundance and species abundance are 0 (for models M1 and M2) or if both transcript and gene abundance are 0 (for model M4)
Strict: Remove sample if both transcript abundance or species abundance are 0 (for models M1 and M2), or if transcript or gene abundance are 0 (for model M4)
The easiest way to follow along is to clone this repository and navigate to it in RStudio.
git clone https://github.com/MicrobiologyETHZ/nccr_metat_ws
The data can be found in the datasets/part3
folder. There are 3 files:
true-exp.mtx_abunds.tsv.gz
contains simulated transcriptomic counts for each gene as well as phenotype information (1 or 0) across 100 samples.true-exp.mgx_abunds.tsv.gz
contains metagenomic counts for each gene across 100 samplestrue-exp.mtx_spiked.tsv.gz
contains 'ground truth' information about with genes are diffrentially expressed between the phenotypic groups
# Setup
library(tidyverse)
library(stats)
library(stats4)
library(caret)
Here we're loading both metatranscriptomic (METAT) and metagenomic (METAG) abundances. For the Naive RNA (M1) and Within-taxon RNA (M2) models, we will only need the METAT data. For RNA/DNA ratio (M4), we will need both tables.
# METAT: metatranscriptomic data
# METAG: metagenomic data
# For models M1 and M2 we are only using METAT data. For model M4, we will need both METAT and METAG data
metat <- read.table("../datasets/part3/true-exp/true-exp.mtx_abunds.tsv.gz", header = TRUE,
sep = "\t", row.names=1)
metag <- read.table("../datasets/part3/true-exp/true-exp.mgx_abunds.tsv.gz", header = TRUE,
sep="\t", row.names=1)
METAT table contains Phenotype information as well as counts for 100 samples. We will separate this into 2 tables later. There are 2 phenotypes - 0 and 1. Each gene name
is composed of 'species' name (ex. BUG0001), and 'orthologous group' name (ex. GROUP000001)
# Look at the data
# Each `gene name` is composed of 'species' name (ex. BUG0001), and 'orthologous group' name (ex. GROUP000001)
metat[0:5, 0:5]
Ground truth data contains a list of genes
that are differentially expressed between phenotypes. Overall, we have information on 100000 genes, of which ~10000 are 'differentially expressed'. -1 represents downregulation, and +1 upregulation.
ground_truth <- read.table("../datasets/part3/true-exp/true-exp.mtx_spiked.tsv.gz",
header = FALSE,sep = "\t",
col.names = c('gene_name', 'GT'))
head(ground_truth)
dim(ground_truth)
Here we first create a separte data containing all the meatadata about our samples (i.e. the Phenotype) and remove that from counts table (both METAT and METAG). We will create a new bug
column that will show species designation for each transcript. Finally we create a vector of sample names.
# Create metatdata table (i.e. phenotype for each sample)
metat_pheno <- metat[1:2, ] %>%
t() %>%
data.frame()
# Remove metadata from METAT column
metat <- metat[-c(1, 2), ]
# Save METAG sequencing depth, we will need it later
metag_seqdepth <- metag[2,] %>% t()
colnames(metag_seqdepth) <- c('metag_seqdepth')
# Remove metadata from METAG table
metag <- metag[-c(1, 2), ]
# Assign each 'transcript' to a 'species'
metat$bug <- str_split_fixed(rownames(metat), "_", 2)[, 1]
# Get sample names
sample_cols <- grep("SAMPLE", names(metat), value = TRUE)
For the Within-taxon model (M2), we will need to know the total abundance of each species, i.e. we need to sum all reads belonging to each species.
# Create table with 'taxonomic' abundances based on METAT data
# We will need this for model M2.
bug_metat <- metat %>%
group_by(bug) %>%
summarize(across(all_of(sample_cols), sum)) %>%
pivot_longer(cols = -bug, names_to = "sample_id", values_to = "bug_abundance")
head(bug_metat)
Here we subset METAT and METAG data to only include information about 1 gene (BUG0013_GROUP001489
). We also add the total species abundance of BUG0013
and phenotype data to get gene_df
.
example_gene = "BUG0013_GROUP001489"
example_gene_bug <- str_split_fixed(example_gene, "_", 2)[, 1]
# Get METAT data for the gene
gene_df <- metat[example_gene, sample_cols] %>% t() %>% as.data.frame()
# Get METAG data for the gene
gene_metag <- metag[example_gene, sample_cols] %>% t() %>% as.data.frame()
colnames(gene_metag) <- c("metag")
# Get species abundance
gene_bug_abundance <- bug_metat %>% filter(bug == example_gene_bug) %>%
column_to_rownames('sample_id')
# Put all of this together with phenotype data
gene_df <- cbind(metat_pheno, gene_df, gene_metag, gene_bug_abundance, metag_seqdepth)
colnames(gene_df) <- c('Phenotype', 'METAT_seqdepth', "transcript_abundance",
"gene_abundance", 'species', 'METAT_species_abundance', 'METAG_seqdepth')
head(gene_df)
# look if there's a correlation between bug abundance and gene abundance across samples?
- Let's see if there's a relationship between transcript abundance and gene or species abundances
# look if there's a correlation between bug abundance and gene abundance across samples?
ggplot(data=gene_df,aes(x=gene_abundance,y=transcript_abundance)) +
geom_point() +
scale_x_log10()+
scale_y_log10()+
theme_bw()
ggplot(data=gene_df,aes(x=METAT_species_abundance,y=transcript_abundance)) +
geom_point() +
scale_x_log10()+
scale_y_log10()+
theme_bw()
For Naive RNA and Withing-taxon models (M1 and M2), semi-strict filtering removes samples if both transcript abundances and species abundances are 0 (i.e. keep samples if either transcript abundance, or species abundances is > 0). For RNA/DNA ratio model, semi-strict filtering removes samples if both transcript and gene abundances are 0 (i.e. keep samples, if either transcript or gene abundance has be > 0 )
# Semi-strict: Remove sample if both gene abundance and bug abundance are 0 (for models M1 and M2)
gene_df_semi <- gene_df %>% filter(transcript_abundance > 0 | METAT_species_abundance > 0) # 82 samples
# For M4, we filter based on METAG abundance of that gene
gene_df_semistrict_metag<- gene_df %>% filter(transcript_abundance > 0 | gene_abundance > 0)
For strict filtering, both transcript abundances and taxonomic (or gene abundances) must be > 0.
# Strict M1 and M2: remove sample if either gene abundance and species abundance are 0 (for models M1 and M2)
gene_df_strict <- gene_df %>% filter(transcript_abundance > 0 & METAT_species_abundance > 0) # 49 samples
# For M4, we filter based on METAG abundance of that gene
gene_df_strict_metag<- gene_df %>% filter(transcript_abundance > 0 & gene_abundance > 0)
- We will do this example using strictly filtered dataframe
- For this model, all we need to do is to normalize for sequencing depth, i.e. divide transcript abundance by total number of reads for each sample (total sum scaling, TSS)
- Add a pseudocount, which in the paper was defined as half of the lowest transcript abundance and log transform
gene_df_strict['tss'] <- gene_df_strict['transcript_abundance']/gene_df_strict['METAT_seqdepth']
# Add pseudo count and log transform
pseudo_count <- min(gene_df_strict[gene_df_strict[, 'tss'] > 0, 'tss']) / 2
gene_df_strict['M1_naive'] <- log10(gene_df_strict[, 'tss'] + pseudo_count)
- For this model, we normalize transcript abundance of by the 'species abundance' for each sample
- As before, we then add pseudocount and transform
# Normalising by species abundance
gene_df_strict['M2_within_taxon'] <- gene_df_strict['transcript_abundance']/gene_df_strict['METAT_species_abundance']
# Add pseudo count and log transform
pseudo_count <- min(gene_df_strict[gene_df_strict[, 'M2_within_taxon'] > 0, 'M2_within_taxon'], na.rm=TRUE) / 2
gene_df_strict['M2_within_taxon'] <- log10(gene_df_strict[, 'M2_within_taxon'] + pseudo_count)
- Note that here we're using
gene_df_strict_metag
- First we perform TSS on transcript and gene abundances
- We then calculate ration of trancript to gene abundances
- Finally, we add a pseudocount and log transform
M4_col <- "M4_RNA_DNA_ratio"
# Note using a different dataframe here, because filtering was done differently
# First perform total sum scaling for both METAT and METAG data
gene_df_strict_metag['tss'] <- gene_df_strict_metag['transcript_abundance']/gene_df_strict_metag['METAT_seqdepth']
gene_df_strict_metag['metag_tss'] = gene_df_strict_metag['gene_abundance']/gene_df_strict_metag['METAG_seqdepth']
# Calculate ratio of METAT/METAG abundance
gene_df_strict_metag[M4_col] <- gene_df_strict_metag['tss']/gene_df_strict_metag['metag_tss']
# Add pseudocount and log transform
pseudo_count <- min(gene_df_strict_metag[gene_df_strict_metag[, M4_col] > 0, M4_col], na.rm=TRUE) / 2
gene_df_strict_metag[M4_col] <- log10(gene_df_strict_metag[, M4_col] + pseudo_count)
Let's look at how these models differ
toplot<-gene_df_strict_metag %>%
rownames_to_column("sample") %>%
full_join(rownames_to_column(gene_df_strict,"sample"),by="sample") %>%
select(sample,M1_naive,M2_within_taxon,M4_RNA_DNA_ratio )
ggplot(data=toplot,aes(x=M1_naive,y=M2_within_taxon)) +
geom_point() +
theme_bw()
ggplot(data=toplot,aes(x=M1_naive,y=M4_RNA_DNA_ratio)) +
geom_point() +
theme_bw()
ggplot(data=toplot,aes(x=M2_within_taxon,y=M4_RNA_DNA_ratio)) +
geom_point() +
theme_bw()
# Run linear model
fixed_effects <- c("Phenotype")
# This function gets the coefficients, std error and p-values from the fit linear model
summary_function <- function(fit) {
lm_summary <- summary(fit)$coefficients
para <- as.data.frame(lm_summary)[-1, -3]
para$name <- rownames(lm_summary)[-1]
return(para)
}
-
For model M1 we are modelling log trnasformed TSS expression values (M1_naive) against the phenotype
-
For model M2 are modelling log transformed taxon normalised expression values (M2_within_taxon) against the phenotype
model_cols <- c("M1_naive", "M2_within_taxon")
models <- list()
for (model_col in model_cols){
formula <- paste0(model_col, " ~ ", paste(fixed_effects, collapse = " + "))
print(formula)
md <- glm(formula = formula, data = gene_df_strict, family = gaussian()) %>%
suppressWarnings(suppressMessages(update(., .~.)))
models[[as.symbol(model_col)]] <- summary_function(md)
}
- For model M4 we are modelling RNA/DNA ratio (M4_col) against the phenotype
formula <- paste0(M4_col, "~", "Phenotype")
print(formula)
md <- glm(formula = formula, data = gene_df_strict_metag, family = gaussian()) %>%
suppressWarnings(suppressMessages(update(., .~.)))
models[[as.symbol(M4_col)]] <- summary_function(md)
summary <- do.call(rbind, models)
summary
- To run this on all of the genes, we're putting this together into a function
# Run the model for each gene
model_gene <- function(gene_df, expr_col, filter_strategy='strict', fixed_effects){
# Filter
if (filter_strategy == 'strict'){
gene_df <- gene_df %>% filter((!!as.symbol(expr_col)) > 0 & bug_abundance > 0)
gene_df_metag <- gene_df %>% filter((!!as.symbol(expr_col)) > 0 & metag > 0)
}
else if (filter_strategy == 'semistrict'){
gene_df <- gene_df %>% filter((!!as.symbol(expr_col)) > 0 | bug_abundance > 0)
gene_df_metag <- gene_df %>% filter((!!as.symbol(expr_col)) > 0 | metag > 0)
}
else {print("No filtering applied")}
# Normalize
# M1: divide by sequencing depth
M1_col <- paste0(expr_col, "_M1")
gene_df[M1_col] <- gene_df[expr_col]/gene_df['SeqDepth']
pseudo_count <- min(gene_df[gene_df[, M1_col] > 0, M1_col]) / 2
gene_df[M1_col] <- log10(gene_df[, M1_col] + pseudo_count)
#M2: divide by bug abundance
M2_col <- paste0(expr_col, "_M2")
gene_df[M2_col] <- gene_df[expr_col]/gene_df['bug_abundance']
pseudo_count <- min(gene_df[gene_df[, M2_col] > 0, M2_col], na.rm=TRUE) / 2
gene_df[M2_col] <- log10(gene_df[, M2_col] + pseudo_count)
M4_col <- paste0(expr_col, "_M4")
gene_df_metag['tss'] <- gene_df_metag[expr_col]/gene_df_metag['SeqDepth']
pseudo_count <- min(gene_df_metag[gene_df_metag[, 'tss'] > 0, 'tss'], na.rm=TRUE) / 2
gene_df_metag['tss'] <- gene_df_metag['tss'] + pseudo_count
gene_df_metag['metag_tss'] = gene_df_metag['metag']/gene_df_metag['metag_seqdepth']
pseudo_count <- min(gene_df_metag[gene_df_metag[, 'metag_tss'] > 0, 'metag_tss'], na.rm=TRUE) / 2
gene_df_metag['metag_tss'] <- gene_df_metag['metag_tss'] + pseudo_count
gene_df_metag[M4_col] <- gene_df_metag['tss']/gene_df_metag['metag_tss']
gene_df_metag[M4_col] <- log10(gene_df_metag[, M4_col])
# Run the model
expr_cols <- c(M1_col, M2_col)
if (nrow(gene_df) > 0 && sum(gene_df[[expr_col]]) > 0) {
models <- list()
for (model_col in expr_cols){
formula <- paste0(model_col, " ~ ", paste(fixed_effects, collapse = " + "))
print(formula)
md <- glm(formula = formula, data = gene_df, family = gaussian()) %>%
suppressWarnings(suppressMessages(update(., .~.)))
models[[as.symbol(model_col)]] <- summary_function(md)
}
formula <- paste0(M4_col, " ~ ", "Phenotype")
print(formula)
if (nrow(gene_df_metag) > 0&& sum(gene_df_metag[[expr_col]]) > 0){
md <- glm(formula = formula, data = gene_df_metag, family = gaussian()) %>%
suppressWarnings(suppressMessages(update(., .~.)))
models[[as.symbol(M4_col)]] <- summary_function(md)}
return (models)
}
}
- To make the code run faster, we're only going to test a subset of genes
- Feel free to try this on the full dataframe
test_df <- metat %>% select(all_of(sample_cols)) %>% sample_n(1000, replace=FALSE)
outputs <- list()
for (i in 1:nrow(test_df)) {
#Subset to gene dataframe and transform
gene_df <- test_df[i, ] %>% t() %>% as.data.frame()
expr_col <- colnames(gene_df)
gene_metag <- metag[expr_col, sample_cols] %>% t() %>% as.data.frame()
colnames(gene_metag) <- c("metag")
current_bug <- str_split_fixed(expr_col, "_", 2)[, 1]
# Get information about the bug abundance for that gene across samples
gene_bug_abundance <- bug_metat %>% filter(bug == current_bug) %>% column_to_rownames('sample_id')
gene_df <- cbind(metat_pheno, gene_df, gene_metag, gene_bug_abundance, metag_seqdepth)
outputs <- append(outputs, model_gene(gene_df, expr_col, 'semistrict', fixed_effects))
}
outputs <- do.call(rbind, outputs)
outputs$model <- str_split_fixed(rownames(outputs), "_", 3)[, 3]
outputs$gene_name <- str_split_fixed(rownames(outputs), "_M", 2)[, 1]
res <- outputs%>% select(gene_name, "Pr(>|t|)", model) %>% pivot_wider(names_from = model, values_from = "Pr(>|t|)")
res %>% head()
- We're not done yet. now we need to perform multiple testing correction
- Here we perform Benjamini-Hochberg correction
# Multiple testing correction
res$M1_qvals <- as.numeric(p.adjust(res[["M1"]], method = "BH"))
res$M2_qvals <- as.numeric(p.adjust(res[["M2"]], method = "BH"))
res$M4_qvals <- as.numeric(p.adjust(res[["M4"]], method = "BH"))
Here we're going to see whether our linear models identify DE genes correctly by comparing our results to ground truth
- To calculate FPR and TPR we assign all 'real' differentially expressed genes 1 and the rest 0
- TPR (== Recall == Sensitivity, out of all actually DE genes, how many are we detecting)
- FPR (== 1 - Specificity, out of all not DE genes, how many are we predicting as DE)
- For model predictions, we assign 1 to all genes with q-val < 0.01, 0 otherwise
res <- res %>%
full_join(ground_truth, by='gene_name')
res <- res %>% mutate(GT = abs(coalesce(as.numeric(GT), 0))) %>%
mutate(M1_predicted = as.numeric(M1_qvals < 0.05), M2_predicted = as.numeric(M2_qvals < 0.05),
M4_predicted = as.numeric(M4_qvals < 0.05))
calculate_rates <- function(predicted, reference){
u <- union(predicted, reference)
xtab <- table(factor(predicted, u), factor(reference, u))
cm <- caret::confusionMatrix(xtab, positive = "1")
}
reference <- res$GT
metrics <- list()
cm_M1 <- calculate_rates(res$M1_predicted, reference)
metrics$M1 <- c(cm_M1$byClass[["Sensitivity"]], 1-cm_M1$byClass[['Specificity']])
names(metrics$M1) <- c('TPR', 'FPR')
cm_M2 <- calculate_rates(res$M2_predicted, reference)
metrics$M2 <- c(cm_M2$byClass[["Sensitivity"]], 1-cm_M2$byClass[['Specificity']])
names(metrics$M2) <- c('TPR', 'FPR')
cm_M4 <- calculate_rates(res$M4_predicted, reference)
metrics$M4 <- c(cm_M4$byClass[["Sensitivity"]], 1-cm_M4$byClass[['Specificity']])
names(metrics$M4) <- c('TPR', 'FPR')
final_df <- data.frame(metrics) %>% rownames_to_column('metric') %>%
pivot_longer(cols=-c(metric), names_to='Model', values_to = 'Proportion')
- Let's visualize the results with a basic bar plot
- How does this compare to the published result?
- If there's time,
p<-ggplot(data=final_df, aes(x=Model, y=Proportion)) +
geom_bar(stat="identity")+ facet_wrap('metric') +
theme_bw()
p
Using what you've learned today, reproduce the remaining models (M3, M5, and M6).