Skip to content

MicrobiologyETHZ/nccr_metat_ws

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 

History

18 Commits
 
 
 
 
 
 
 
 
 
 
 
 

Repository files navigation

Binder link: Binder

Differential expression analysis with linear models

In the part of the workshop, we will reproduce some of the analysis from the paper Statistical approaches for differential expression analysis in metatranscriptomics. In this paper, the authors use different synthetic metatranscriptomic dataset to test a number of filtering, normalization and modeling strategies to identify the best analysis method.

Learning objectives

  • Understand the different filtering and normalization procedures for METAT data
  • Get hands-on experience modeling gene expression data
  • Understand the need for matching METAG data in METAT experiments

Data

The authors first generate a number of synthetic datasets. Specifically, they simulated microbial communities consisting of 100 different species across 100 samples. For details about the simulation, please refer to the original paper. The samples were than randomly assigned to a binary Phenotype, and 'spiked' with certain genes to be associated with the phenotype. For this workshop we will work with true-exp dataset, which has 10% of the transcripts strongly associated with the phenotype. All of the synthetic data generated in the paper can be found here.

Models

We will reproduce 3 different models from the paper:

Model Design Description
M1: Naive RNA TSS ~ Phenotype Community total sum scaled (TSS) abundance of a gene is modeled as a function of phenotype
M2: Within-taxon RNA taxon TSS ~ Phenotype Taxon TSS abundance ofa gene is modeled as a function of phenotype
M4: RNA/DNA ratio TSS(RNA)/TSS(DNA) ~ Phenotype Ratio of TSS RNA to TSS DNA is modeled as a function of phenotype

Filtering strategies

Author test out a few different filtering strategies on the data prior to running the models. We will use semi-strict filtering, and then time permitting, re-run the analysis with strict filtering and compare the results.

Semi-strict: Remove sample if both transcript abundance and species abundance are 0 (for models M1 and M2) or if both transcript and gene abundance are 0 (for model M4)

Strict: Remove sample if both transcript abundance or species abundance are 0 (for models M1 and M2), or if transcript or gene abundance are 0 (for model M4)

Download the data and code

The easiest way to follow along is to clone this repository and navigate to it in RStudio.

git clone https://github.com/MicrobiologyETHZ/nccr_metat_ws

The data can be found in the datasets/part3 folder. There are 3 files:

  • true-exp.mtx_abunds.tsv.gz contains simulated transcriptomic counts for each gene as well as phenotype information (1 or 0) across 100 samples.
  • true-exp.mgx_abunds.tsv.gz contains metagenomic counts for each gene across 100 samples
  • true-exp.mtx_spiked.tsv.gz contains 'ground truth' information about with genes are diffrentially expressed between the phenotypic groups

Setup

Loading libraries

# Setup

library(tidyverse)
library(stats)
library(stats4)
library(caret)

Loading the data

Here we're loading both metatranscriptomic (METAT) and metagenomic (METAG) abundances. For the Naive RNA (M1) and Within-taxon RNA (M2) models, we will only need the METAT data. For RNA/DNA ratio (M4), we will need both tables.


# METAT: metatranscriptomic data
# METAG: metagenomic data

# For models M1 and M2 we are only using METAT data. For model M4, we will need both METAT and METAG data

metat <- read.table("../datasets/part3/true-exp/true-exp.mtx_abunds.tsv.gz", header = TRUE,
                 sep = "\t", row.names=1)

metag <- read.table("../datasets/part3/true-exp/true-exp.mgx_abunds.tsv.gz", header = TRUE,
                    sep="\t", row.names=1)

Take a look at the data

METAT table contains Phenotype information as well as counts for 100 samples. We will separate this into 2 tables later. There are 2 phenotypes - 0 and 1. Each gene name is composed of 'species' name (ex. BUG0001), and 'orthologous group' name (ex. GROUP000001)

# Look at the data
# Each `gene name` is composed of 'species' name (ex. BUG0001), and 'orthologous group' name (ex. GROUP000001)

metat[0:5, 0:5]

Load the ground truth data

Ground truth data contains a list of genes that are differentially expressed between phenotypes. Overall, we have information on 100000 genes, of which ~10000 are 'differentially expressed'. -1 represents downregulation, and +1 upregulation.


ground_truth <- read.table("../datasets/part3/true-exp/true-exp.mtx_spiked.tsv.gz", 
                           header = FALSE,sep = "\t", 
                           col.names = c('gene_name', 'GT'))
head(ground_truth)
dim(ground_truth)

Pre-process the data

Here we first create a separte data containing all the meatadata about our samples (i.e. the Phenotype) and remove that from counts table (both METAT and METAG). We will create a new bug column that will show species designation for each transcript. Finally we create a vector of sample names.

# Create metatdata table (i.e. phenotype  for each sample)
metat_pheno <- metat[1:2, ] %>%
  t() %>%
  data.frame() 

# Remove metadata from METAT column
metat <- metat[-c(1, 2), ]

# Save METAG sequencing depth, we will need it later
metag_seqdepth <- metag[2,] %>% t()
colnames(metag_seqdepth) <- c('metag_seqdepth')

# Remove metadata from METAG table
metag <- metag[-c(1, 2), ]


# Assign each 'transcript' to a 'species'
metat$bug <- str_split_fixed(rownames(metat), "_", 2)[, 1]

# Get sample names
sample_cols <- grep("SAMPLE", names(metat), value = TRUE)

Calculate abundances of each taxa based on METAT data

For the Within-taxon model (M2), we will need to know the total abundance of each species, i.e. we need to sum all reads belonging to each species.

# Create table with 'taxonomic' abundances based on METAT data 
# We will need this for model M2.

bug_metat <- metat %>%
  group_by(bug) %>%
  summarize(across(all_of(sample_cols), sum)) %>% 
  pivot_longer(cols = -bug, names_to = "sample_id", values_to = "bug_abundance") 

head(bug_metat)

Go through example of building all the models for one gene

Put together all the information we need about 1 gene

Here we subset METAT and METAG data to only include information about 1 gene (BUG0013_GROUP001489). We also add the total species abundance of BUG0013 and phenotype data to get gene_df.

example_gene = "BUG0013_GROUP001489"
example_gene_bug <- str_split_fixed(example_gene, "_", 2)[, 1]

# Get METAT data for the gene
gene_df <- metat[example_gene, sample_cols] %>% t() %>% as.data.frame()

# Get METAG data for the gene
gene_metag <- metag[example_gene, sample_cols] %>% t() %>% as.data.frame() 
colnames(gene_metag) <- c("metag")

# Get species abundance
gene_bug_abundance <- bug_metat %>% filter(bug == example_gene_bug) %>% 
                      column_to_rownames('sample_id')

# Put all of this together with phenotype data
gene_df <- cbind(metat_pheno, gene_df, gene_metag, gene_bug_abundance, metag_seqdepth)
colnames(gene_df) <- c('Phenotype', 'METAT_seqdepth', "transcript_abundance",
                        "gene_abundance", 'species', 'METAT_species_abundance', 'METAG_seqdepth')
head(gene_df)

# look if there's a correlation between bug abundance and gene abundance across samples?

  • Let's see if there's a relationship between transcript abundance and gene or species abundances

# look if there's a correlation between bug abundance and gene abundance across samples?
ggplot(data=gene_df,aes(x=gene_abundance,y=transcript_abundance)) +
  geom_point() +
  scale_x_log10()+
  scale_y_log10()+
  theme_bw()


ggplot(data=gene_df,aes(x=METAT_species_abundance,y=transcript_abundance)) +
  geom_point() +
  scale_x_log10()+
  scale_y_log10()+
  theme_bw()

Filtering

For Naive RNA and Withing-taxon models (M1 and M2), semi-strict filtering removes samples if both transcript abundances and species abundances are 0 (i.e. keep samples if either transcript abundance, or species abundances is > 0). For RNA/DNA ratio model, semi-strict filtering removes samples if both transcript and gene abundances are 0 (i.e. keep samples, if either transcript or gene abundance has be > 0 )

# Semi-strict: Remove sample if both gene abundance and bug abundance are 0 (for models M1 and M2)
gene_df_semi <- gene_df %>% filter(transcript_abundance > 0 | METAT_species_abundance > 0) # 82 samples

# For M4, we filter based on METAG abundance of that gene
gene_df_semistrict_metag<- gene_df %>% filter(transcript_abundance > 0 | gene_abundance > 0)

For strict filtering, both transcript abundances and taxonomic (or gene abundances) must be > 0.

# Strict M1 and M2: remove sample if either gene abundance and species abundance are 0 (for models M1 and M2)
gene_df_strict <- gene_df %>% filter(transcript_abundance > 0  & METAT_species_abundance > 0) # 49 samples

# For M4, we filter based on METAG abundance of that gene
gene_df_strict_metag<- gene_df %>% filter(transcript_abundance > 0 & gene_abundance > 0)

Model 1: Naive RNA

  • We will do this example using strictly filtered dataframe
  • For this model, all we need to do is to normalize for sequencing depth, i.e. divide transcript abundance by total number of reads for each sample (total sum scaling, TSS)
  • Add a pseudocount, which in the paper was defined as half of the lowest transcript abundance and log transform
gene_df_strict['tss'] <- gene_df_strict['transcript_abundance']/gene_df_strict['METAT_seqdepth']
# Add pseudo count and log transform
pseudo_count <- min(gene_df_strict[gene_df_strict[, 'tss'] > 0, 'tss']) / 2
gene_df_strict['M1_naive'] <- log10(gene_df_strict[, 'tss'] + pseudo_count)

Model 2: Within-taxon normalization

  • For this model, we normalize transcript abundance of by the 'species abundance' for each sample
  • As before, we then add pseudocount and transform

# Normalising by species abundance
gene_df_strict['M2_within_taxon'] <- gene_df_strict['transcript_abundance']/gene_df_strict['METAT_species_abundance']
# Add pseudo count and log transform
pseudo_count <- min(gene_df_strict[gene_df_strict[, 'M2_within_taxon'] > 0, 'M2_within_taxon'], na.rm=TRUE) / 2
gene_df_strict['M2_within_taxon'] <- log10(gene_df_strict[, 'M2_within_taxon'] + pseudo_count)

Model 4: RNA/DNA ratio

  • Note that here we're using gene_df_strict_metag
  • First we perform TSS on transcript and gene abundances
  • We then calculate ration of trancript to gene abundances
  • Finally, we add a pseudocount and log transform
M4_col <- "M4_RNA_DNA_ratio"

# Note using a different dataframe here, because filtering was done differently
# First perform total sum scaling for both METAT and METAG data
gene_df_strict_metag['tss'] <- gene_df_strict_metag['transcript_abundance']/gene_df_strict_metag['METAT_seqdepth']
gene_df_strict_metag['metag_tss'] = gene_df_strict_metag['gene_abundance']/gene_df_strict_metag['METAG_seqdepth']

# Calculate ratio of METAT/METAG abundance
gene_df_strict_metag[M4_col] <- gene_df_strict_metag['tss']/gene_df_strict_metag['metag_tss']

# Add pseudocount and log transform
pseudo_count <- min(gene_df_strict_metag[gene_df_strict_metag[, M4_col] > 0, M4_col], na.rm=TRUE) / 2
gene_df_strict_metag[M4_col] <- log10(gene_df_strict_metag[, M4_col] + pseudo_count)

Let's look at how these models differ

toplot<-gene_df_strict_metag %>%
  rownames_to_column("sample") %>%
  full_join(rownames_to_column(gene_df_strict,"sample"),by="sample") %>%
  select(sample,M1_naive,M2_within_taxon,M4_RNA_DNA_ratio )


ggplot(data=toplot,aes(x=M1_naive,y=M2_within_taxon)) +
  geom_point() +
  theme_bw()

ggplot(data=toplot,aes(x=M1_naive,y=M4_RNA_DNA_ratio)) +
  geom_point() +
  theme_bw()
ggplot(data=toplot,aes(x=M2_within_taxon,y=M4_RNA_DNA_ratio)) +
  geom_point() +
  theme_bw()

Run linear models

# Run linear model

fixed_effects <- c("Phenotype")

# This function gets the coefficients, std error and p-values from the fit linear model
summary_function <- function(fit) {
  lm_summary <- summary(fit)$coefficients
  para <- as.data.frame(lm_summary)[-1, -3]
  para$name <- rownames(lm_summary)[-1]
  return(para)
}

  • For model M1 we are modelling log trnasformed TSS expression values (M1_naive) against the phenotype

  • For model M2 are modelling log transformed taxon normalised expression values (M2_within_taxon) against the phenotype


model_cols <- c("M1_naive", "M2_within_taxon")
models <- list()
for (model_col in model_cols){
    formula <- paste0(model_col, " ~ ", paste(fixed_effects, collapse = " + "))
    print(formula)
    md <- glm(formula = formula, data = gene_df_strict, family = gaussian()) %>%
      suppressWarnings(suppressMessages(update(., .~.)))
    models[[as.symbol(model_col)]] <- summary_function(md)
}
  • For model M4 we are modelling RNA/DNA ratio (M4_col) against the phenotype
formula <- paste0(M4_col, "~", "Phenotype")
print(formula)
md <- glm(formula = formula, data = gene_df_strict_metag, family = gaussian()) %>%
          suppressWarnings(suppressMessages(update(., .~.)))
models[[as.symbol(M4_col)]] <- summary_function(md)
summary <- do.call(rbind, models)

summary

Run models for all genes

  • To run this on all of the genes, we're putting this together into a function

# Run the model for each gene

model_gene <- function(gene_df, expr_col, filter_strategy='strict', fixed_effects){
  
  # Filter 
  if (filter_strategy == 'strict'){
    gene_df <- gene_df %>% filter((!!as.symbol(expr_col)) > 0 & bug_abundance > 0)
    gene_df_metag <- gene_df %>% filter((!!as.symbol(expr_col)) > 0 & metag > 0)
  }
  else if (filter_strategy == 'semistrict'){
    gene_df <- gene_df %>% filter((!!as.symbol(expr_col)) > 0 | bug_abundance > 0)
    gene_df_metag <- gene_df %>% filter((!!as.symbol(expr_col)) > 0 | metag > 0) 
  }
  else {print("No filtering applied")}
  
  # Normalize
  # M1: divide by sequencing depth
  M1_col <- paste0(expr_col, "_M1")
  gene_df[M1_col] <- gene_df[expr_col]/gene_df['SeqDepth']
  pseudo_count <- min(gene_df[gene_df[, M1_col] > 0, M1_col]) / 2
  gene_df[M1_col] <- log10(gene_df[, M1_col] + pseudo_count)
  
  #M2: divide by bug abundance
  M2_col <- paste0(expr_col, "_M2")
  gene_df[M2_col] <- gene_df[expr_col]/gene_df['bug_abundance']
  pseudo_count <- min(gene_df[gene_df[, M2_col] > 0, M2_col], na.rm=TRUE) / 2
  gene_df[M2_col] <- log10(gene_df[, M2_col] + pseudo_count)
  
  
  M4_col <- paste0(expr_col, "_M4")

  gene_df_metag['tss'] <- gene_df_metag[expr_col]/gene_df_metag['SeqDepth']
  pseudo_count <- min(gene_df_metag[gene_df_metag[, 'tss'] > 0, 'tss'], na.rm=TRUE) / 2
  gene_df_metag['tss'] <- gene_df_metag['tss'] + pseudo_count
  gene_df_metag['metag_tss'] = gene_df_metag['metag']/gene_df_metag['metag_seqdepth']
  pseudo_count <- min(gene_df_metag[gene_df_metag[, 'metag_tss'] > 0, 'metag_tss'], na.rm=TRUE) / 2
  gene_df_metag['metag_tss'] <-  gene_df_metag['metag_tss'] + pseudo_count
  gene_df_metag[M4_col] <- gene_df_metag['tss']/gene_df_metag['metag_tss']
  gene_df_metag[M4_col] <- log10(gene_df_metag[, M4_col])
  
  
  # Run the model
  expr_cols <- c(M1_col, M2_col)
  if (nrow(gene_df) > 0 && sum(gene_df[[expr_col]]) > 0) {
    models <- list()
    for (model_col in expr_cols){
      formula <- paste0(model_col, " ~ ", paste(fixed_effects, collapse = " + "))
      print(formula)
      md <- glm(formula = formula, data = gene_df, family = gaussian()) %>%
      suppressWarnings(suppressMessages(update(., .~.)))
      models[[as.symbol(model_col)]] <- summary_function(md)
    }
    
    formula <- paste0(M4_col, " ~ ", "Phenotype")
    print(formula)
    if (nrow(gene_df_metag) > 0&& sum(gene_df_metag[[expr_col]]) > 0){
    md <- glm(formula = formula, data = gene_df_metag, family = gaussian()) %>%
      suppressWarnings(suppressMessages(update(., .~.)))
    models[[as.symbol(M4_col)]] <- summary_function(md)}
    return (models)
  }
  
}


  • To make the code run faster, we're only going to test a subset of genes
  • Feel free to try this on the full dataframe
test_df <- metat %>% select(all_of(sample_cols)) %>% sample_n(1000, replace=FALSE)
outputs <- list()
for (i in 1:nrow(test_df)) {
  #Subset to gene dataframe and transform
  gene_df <- test_df[i, ] %>% t() %>% as.data.frame()
  expr_col <- colnames(gene_df)
  gene_metag <- metag[expr_col, sample_cols] %>% t() %>% as.data.frame() 
  colnames(gene_metag) <- c("metag")
  current_bug <- str_split_fixed(expr_col, "_", 2)[, 1]
  # Get information about the bug abundance for that gene across samples
  gene_bug_abundance <- bug_metat %>% filter(bug == current_bug) %>% column_to_rownames('sample_id')
  gene_df <- cbind(metat_pheno, gene_df, gene_metag, gene_bug_abundance, metag_seqdepth)
  
  outputs <- append(outputs, model_gene(gene_df, expr_col, 'semistrict', fixed_effects))
}

outputs <- do.call(rbind, outputs)
outputs$model <- str_split_fixed(rownames(outputs), "_", 3)[, 3]
outputs$gene_name <- str_split_fixed(rownames(outputs), "_M", 2)[, 1]
res <- outputs%>% select(gene_name, "Pr(>|t|)", model) %>% pivot_wider(names_from = model, values_from = "Pr(>|t|)")
res %>% head()

# Multiple testing correction 
res$M1_qvals <- as.numeric(p.adjust(res[["M1"]], method = "BH"))
res$M2_qvals <- as.numeric(p.adjust(res[["M2"]], method = "BH"))
res$M4_qvals <- as.numeric(p.adjust(res[["M4"]], method = "BH"))

Evaluate the results

Here we're going to see whether our linear models identify DE genes correctly by comparing our results to ground truth

  • To calculate FPR and TPR we assign all 'real' differentially expressed genes 1 and the rest 0
  • TPR (== Recall == Sensitivity, out of all actually DE genes, how many are we detecting)
  • FPR (== 1 - Specificity, out of all not DE genes, how many are we predicting as DE)
  • For model predictions, we assign 1 to all genes with q-val < 0.01, 0 otherwise
res <- res %>% 
        full_join(ground_truth, by='gene_name')


res <- res %>% mutate(GT = abs(coalesce(as.numeric(GT), 0))) %>% 
      mutate(M1_predicted = as.numeric(M1_qvals < 0.05), M2_predicted = as.numeric(M2_qvals < 0.05),
             M4_predicted = as.numeric(M4_qvals < 0.05))

calculate_rates <- function(predicted, reference){
  u <- union(predicted, reference)
  xtab <- table(factor(predicted, u), factor(reference, u))
  cm <- caret::confusionMatrix(xtab, positive = "1")
  
}


reference <- res$GT

metrics <- list()
cm_M1 <- calculate_rates(res$M1_predicted, reference)
metrics$M1 <- c(cm_M1$byClass[["Sensitivity"]], 1-cm_M1$byClass[['Specificity']])
names(metrics$M1) <- c('TPR', 'FPR')

cm_M2 <- calculate_rates(res$M2_predicted, reference)
metrics$M2 <- c(cm_M2$byClass[["Sensitivity"]], 1-cm_M2$byClass[['Specificity']])
names(metrics$M2) <- c('TPR', 'FPR')

cm_M4 <- calculate_rates(res$M4_predicted, reference)
metrics$M4 <- c(cm_M4$byClass[["Sensitivity"]], 1-cm_M4$byClass[['Specificity']])
names(metrics$M4) <- c('TPR', 'FPR')


final_df <- data.frame(metrics) %>% rownames_to_column('metric') %>%
  pivot_longer(cols=-c(metric), names_to='Model', values_to = 'Proportion')
  • Let's visualize the results with a basic bar plot
  • How does this compare to the published result?
  • If there's time,
p<-ggplot(data=final_df, aes(x=Model, y=Proportion)) +
  geom_bar(stat="identity")+ facet_wrap('metric') +
  theme_bw()
p

Homework

Using what you've learned today, reproduce the remaining models (M3, M5, and M6).

About

No description, website, or topics provided.

Resources

Stars

Watchers

Forks

Releases

No releases published

Packages

No packages published

Languages