Skip to content

BoostDiff - Inference of differential networks from gene expression data

License

Notifications You must be signed in to change notification settings

scibiome/boostdiff_inference

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 

History

56 Commits
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 

Repository files navigation

BoostDiff

BoostDiff (Boosted Differential Trees) - Tree-based Inference of Differential Networks from Gene Expression Data

General info

BoostDiff is a tree-based method for inferring differential networks from large-scale transcriptomics data by simultaneously comparing gene expression from two biological conditions (e.g. disease vs. control conditions). The network is inferred by building modified AdaBoost ensembles of differential trees as base learners. BoostDiff modifies regression trees to use differential variance improvement (DVI) as the novel splitting criterion.

Installation

We create the conda environment from the environment.yml provided and install the package from git:

git clone https://github.com/gihannagalindez/boostdiff_inference.git && cd boostdiff_inference

conda env create --name bdenv --file environment.yml
conda activate bdenv

pip install .

Data input

BoostDiff accepts two filenames corresponding to gene expression matrices from the disease and control conditions. Each input is a tab-separated file. The rows correspond to genes, while the columns correspond to samples. Note that the two inputs should have the same set of features, where the first column should be named "Gene".

Disease expression data:

Gene Disease1 Disease2 Disease3
ACE2 0.345 0.147 0.267
APP 0.567 0.794 0.590

Control expression data:

Gene Control1 Control2 Control3
ACE2 0.084 0.147 0.91
APP 0.567 0.794 0.590

Parameters:

  1. run(file_disease, file_control, output_folder, n_estimators,
    n_features, n_subsamples, keyword=keyword, n_processes=n_processes)
  • file_disease: string, path to the disease expression data
  • file_control: string, path to the control expression data
  • output_folder: string, the folder where the output will be generated
  • n_estimators: int, the no. of trees to be built in the modified AdaBoost ensemble
  • n_features: int, the no. of regulators (predictor genes/features) to be used
  • n_subsamples: int, the no. of bootstrapped samples to be used per condition (disease samples and control samples) when building the tree ensemble
  • min_samples_leaf: int (default = 2), the minimum no. of samples in a leaf
  • min_samples_split: int (default = 6), the minimum no. of samples required to split an internal node
  • max_depth: int (default = 2), the maximum depth of a differential tree; max_depth=0 means tree stumps will be built
  • min_samples: int (default = 15), the minimum no. of samples required per condition (disease samples and control samples)
  • learning_rate: float (default = 1.0), the minimum no. of samples required per condition (disease samples and control samples)
  • loss: string (default = "square"), the loss used for AdaBoost {"square","exponential","linear"}
  • n_processes int (default = 1"), for parallel computing
  • keyword string, keyword used for naming the output files
  • regulators: the string 'all' if all genes can be considered potential predictors or a list of gene names to be used as predictors
  • normalize string or bool:, if False, then no normalization will be performed; default is "unit_variance" normalization

Datasets and preprocessing

All datasets in the manuscript are publicly available. The COVID-19 and Crohn's disease datasets are available in the Gene Expression Omnibus (GEO) database with accession numbers GSE156063 and GSE126124, respectively. The TCGA datasets can be downloaded from https://xenabrowser.net/. The B. subtilis datasets can be found in GEO with the accession number GSE27219. Scripts for preprocessing the COVID-19, Crohn's disease, and TCGA breast cancer and prostate adenocarcinoma datasets can be found in tutorial > preprocessing.

If you want to analyze other datasets that are available on the GEO database, you can also use the GEOquery package in R. Depending on the study and format of the raw data, you might need to map the genes to the desired genes IDs, filter, and/or preprocess them to select specific samples to analyze for your study. Here is an example on how to download the Crohn's disease dataset that was used in the paper:

library(GEOquery)
library(biomaRt)
library(dplyr)

# ====================Crohn's disease===================
# Data source: GEO database

# Get dataset
gset <- getGEO("GSE126124", GSEMatrix =TRUE)
res = pData(gset[[1]])
df_expr = data.frame(exprs(gset[[1]]))

# Get data from colon tissue
df_tiss = res[res$`tissue:ch1` == "colon biopsy",]

names_ctrl = rownames(df_tiss[df_tiss$`disease type:ch1` == "Control",])
names_crohns = rownames(df_tiss[df_tiss$`disease type:ch1` == "Crohn's Disease",])

# Map and aggregate
# Retrieve the mappings from biomart
mart <- useMart(biomart = "ENSEMBL_MART_ENSEMBL", host = "https://www.ensembl.org", path = "/biomart/martservice", dataset = "hsapiens_gene_ensembl")
hgnc_mapping <- getBM(attributes = c("affy_hugene_1_0_st_v1", "hgnc_symbol"), filters = "affy_hugene_1_0_st_v1", values=rownames(df_expr), mart = mart)
indicesLookup <- match(rownames(df_expr), hgnc_mapping[["affy_hugene_1_0_st_v1"]])
df_expr$Gene <- hgnc_mapping[indicesLookup, "hgnc_symbol"]
df_new <- df_expr %>% mutate_if(is.character, na_if, c('')) %>% na.omit
df_new <- df_new %>% group_by(hgnc) %>% summarise_all("mean")

# Create expression data for Crohn's and control
df_crohns = df_new[, c("hgnc", names_crohns)]
df_ctrl = df_new[, c("hgnc", names_ctrl)]

# Write expression data
file_crohns = "df_crohns.txt"
file_ctrl_crohns = "df_ctrl_crohns.txt"
write.table(df_crohns, file = file_crohns, append = FALSE, quote = FALSE, sep = "\t", row.names = FALSE)
write.table(df_ctrl_crohns, file = file_ctrl_crohns, append = FALSE, quote = FALSE, sep = "\t", row.names = FALSE)

Running BoostDiff

Note: BoostDiff was tested on linux. To import the BoostDiff package:

from boostdiff.main_boostdiff import BoostDiff

The ideal number of samples per condition as input to BoostDiff would be at least 30 samples each. For real transcriptomics datasets, we recommend to set a relatively low number of base learners, e.g. set n_estimators=50, so that 50 adaptively boosted differential trees will be built to avoid overfitting.

For the demo, the Crohn's disease dataset is too large and will need a high-performance computing cluster to run. So in this example, we run BoostDiff on the smaller B. subtilis salt stress response dataset. The script can be found in tutorial > 01_boostdiff_on_bsubt.py.

from boostdiff.main_boostdiff import BoostDiff
import pandas as pd
import numpy as np
import os

file_etha = "../data/b_subtilis_salt/exprs_salt.txt"
file_ctrl = "../data/b_subtilis_salt/exprs_smm.txt"

n_processes = 4
n_estimators = 50
n_features = 2500
n_subsamples = 20
keyword = "saltvssmm"
normalize = True
loss = "exponential"

output_folder = "../outputs_{}".format(keyword)
if not os.path.exists(output_folder):
    os.makedirs(output_folder)

model = BoostDiff()
model.run(file_etha, file_ctrl, output_folder, n_estimators, 
                  n_features, n_subsamples, n_processes=n_processes, 
                  keyword=keyword,normalize=normalize,
                  regulators="all", loss=loss)

BoostDiff will output two subfolders, each containing two txt files.

Note that BoostDiff runs the algorithm twice:
Run 1: The disease condition will be used as the target condition (with control condition as baseline). Results will be generated in the subfolder "disease".
Run 2: The control/healthy condition will be used as the target condition (with disease condition as baseline). Results will be generated in the subfolder "control".

In each subfolder, the first txt file shows the data for the mean difference in prediction error between the disease and control samples after training the boosted differential trees. The second txt file contains the raw output network.

Postprocessing: Filtering and Visualization

For postprocessing, we will create a different conda environment using the environment_vis.yml:

conda create --name bd_post --file environment_vis.yml
conda activate bd_post

To obtain the final differential network, the raw network should be filtered for target genes in which BoostDiff found a more predictive model for the target condition. This additional step is crucial and part of the pipeline, as a trained model will not always be more predictive of a target condition. For edge filtering, we recommend to start with p=3 to filter for 3rd percentile of the top target genes with the largest differences in prediction errors between the disease and control conditions.

The notebook containing the code for filtering and visualization can be found in tutorial > 02_tutorial_visualization.ipynb.

import networkx as nx
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np
import seaborn as sns
import os

# Load functions for filtering and visualization
import sys
sys.path.insert(1, "../filtering_visualization")
from postprocessing import *


output_folder = "../outputs_tutorial" # where the BoostDiff outputs are located
keyword = "saltvsmm" # specify the keyword used when running BoostDiff

file_diff_dis = os.path.join(output_folder, "disease", "differences_{}.txt".format(keyword))
file_diff_con = os.path.join(output_folder, "control", "differences_{}.txt".format(keyword))

# Specify the output files containing the network after running BoostDiff
file_net_dis = os.path.join(output_folder, "disease", "boostdiff_network_{}.txt".format(keyword))
file_net_con = os.path.join(output_folder, "control", "boostdiff_network_{}.txt".format(keyword))

plot_histogram(output_folder, keyword)

mean_diff_pred

# Filter the network
df_dis = filter_network(file_net_dis, file_diff_dis, p=3, n_top_edges=150)
df_con = filter_network(file_net_con, file_diff_con, p=3, n_top_edges=150)

# Map the gene IDs to gene names for visualizing the network
file_mapping = "../data/bsubt_map.txt"
df_map = pd.read_csv(file_mapping, sep="\t")
df_map = df_map[["<strong>ID</strong><strong>","GeneSymbol</strong><strong>"]]
df_map.columns = ["id","symbol"]
dict_map = dict(zip(df_map.id, df_map.symbol))
dict_map = {int(float(k)):v for k,v in dict_map.items()}

df_dis = df_dis.replace(dict_map)
df_con = df_con.replace(dict_map)

# Plot the resulting network using pygraphviz layouts
# You can use any of the layouts: gv_layout = ["fdp","neato","sfdp"]
file_grn = r"../outputs_tutorial/diff_grn.svg"
plot_grn(df_dis, df_con, layout="graphviz_layout", gv_layout="fdp", show_conflicting=True, 
         fontsize=18, filename=None)

grn_vis

# Plot correlation distributions per target condition

# Load the expression data
file_dis_expr = r"../data/b_subtilis_salt/exprs_salt.txt"
file_con_expr = r"../data/b_subtilis_salt/exprs_smm.txt"

# Get correlations of top edges and randomly selected edges
df_dis, df_con, df_rand = get_correlations(file_dis_expr, file_con_expr, output_folder, 
                                           percentile=3, keyword=keyword, index="Gene", 
                                           n_edges=150)

# Visualize using violin plot
title = "Bacillus subtilis salt stress response"
file_grn = r"../outputs_tutorial/corr_dists_salt_stress.svg"
plot_corr_dists(df_dis, df_con, df_rand, title=title, filename=file_grn, bw_adjust=0.07)

corr_dists

Memory and runtime

The memory requirements depend on the number of genes in the expession dataset and the number of samples in each condition. For the real transcriptomics datasets on human disease, BoostDiff should be run on a high-performance computing cluster. For reference, the runtime for the COVID-19 dataset (GSE156063) used in the paper with ~15,000 genes and around 100 samples per condition with n_processes=10 is ~9hrs on a local server with 64 cores and 500GB RAM. For small datasets, such as the B. subtilis dataset with ~1,000 genes and 48 samples per condition, the runtime is ~13mins for n_processes=4.

Citation

Galindez, G. G., List, M., Baumbach, J., Blumenthal, D. B., & Kacprowski, T. (2022). Inference of differential gene regulatory networks from gene expression data using boosted differential trees. bioRxiv. doi: https://doi.org/10.1101/2022.09.26.509450.

References

[1] Pedregosa, F., et al. (2011). Scikit-learn: Machine learning in Python. the Journal of machine Learning research, 12, 2825-2830.
[2] Huynh-Thu, V. A., Irrthum, A., Wehenkel, L., & Geurts, P. (2010). Inferring regulatory networks from expression data using tree-based methods. PloS one, 5(9), e12776.
[3] Bhuva, D. D., Cursons, J., Smyth, G. K., & Davis, M. J. (2019). Differential co-expression-based detection of conditional relationships in transcriptional data: comparative analysis and application to breast cancer. Genome biology, 20(1), 1-21.

Contact

Gihanna Galindez: [email protected]

About

BoostDiff - Inference of differential networks from gene expression data

Resources

License

Stars

Watchers

Forks