Author : Benoît BAILLIF email : [email protected]
This folder contains code related to the Frontiers in Chemistry publication : Exploring the Use of Compound-Induced Transcriptomic Data Generated From Cell Lines to Predict Compound Activity Toward Molecular Targets
The goal of this code is to preprocess data coming from the LINCS (CMap/L1000) and Pubchem (meta)data and to produce the figures, tables and most importantly models presented in the publication.
- GEO pages:
- GSE70138
- GSE92742
- Pubchem using an available Bioassay SQLite extract along with corresponding R package for data extraction
- LINCS data portal: to find additional Pubchem CID of profiled compounds ; used links are "outdated" and cannot be found currently
- Broad Institute Drug Repurposing Hub: to find TUBB active compounds that are not in Pubchem
Scripts were written using Jupyter Notebook from conda 4.8.3, with Python 3.7.6
-
download_raw_data.ipynb To download the required sources
-
perturbagen_and_related_signatures_metadata_processing.ipynb Compile the 2 GSE metadata Select compound perturbagens Find used compounds, meaning compounds having a 10 µM and 24 h signature in the 8 chosen core cell lines
-
pubchem_cid_extraction Find all available Pubchem CID for used compounds in the analysis
-
target_data_processing Produce the final activity matrix to be used downstream
-
pubchem_bioactivity_matrix_extraction.R Compute the bioactivity matrix using the bioassayR package along with the pubchem protein only SQLite file
-
signature_extraction.ipynb Extract signatures of used compounds from the gctx archives
-
morgan_fingerprints_and_signatures_tsne Compute t-SNE embeddings for used compounds and signature, to later plot the chemical and biological spaces
-
produce_space_plots Produce figures corresponding to chemical and biological space plots
-
TODO models Compute random forest models, store performances in csv files
-
TODO distance plots Produce quadrant plots and statistics for the modeled targets