HassounLab/PROXIMAL2

PROXIMAL2 PIPELINE

Read more about the method in our [paper].

1. INPUT GENERATION

GenerateInput.py: script that generates the metabolite and reaction files used to build the operators. The queries must be created separately and independently. The script analyzes the reactions and removes the cofactors from the substrates and products. A reaction is removed if it represents a transformation only between cofactors, or if all of its substrates or all of its products are cofactors. To annotate the metabolites with chemical structures, the script parses the following databases: PubChem, HMDB, KEGG, MetaNetX, and RetroRules.
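The cofactor rule described above can be illustrated with a minimal sketch. This is not the code in GenerateInput.py: the function name `strip_cofactors`, the cofactor names, and the representation of a reaction as two lists of metabolite names are all assumptions made for illustration.

```python
# Hypothetical cofactor names; the repository ships its own cofactor list.
COFACTORS = {"ATP", "ADP", "NAD+", "NADH", "H2O"}


def strip_cofactors(substrates, products, cofactors=COFACTORS):
    """Remove cofactors from both sides of a reaction.

    Returns (substrates, products) without cofactors, or None when the
    reaction should be dropped.
    """
    kept_substrates = [m for m in substrates if m not in cofactors]
    kept_products = [m for m in products if m not in cofactors]
    # Drop the reaction when either side contains only cofactors. This also
    # covers transformations that happen purely between cofactors.
    if not kept_substrates or not kept_products:
        return None
    return kept_substrates, kept_products


print(strip_cofactors(["glucose", "ATP"], ["glucose-6-phosphate", "ADP"]))
print(strip_cofactors(["ATP", "H2O"], ["ADP"]))
```

The first call keeps the reaction with the cofactors stripped; the second returns None because nothing but cofactors is involved.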

The script takes two INPUT files, which should be saved in the input folder:

  • reactions: tab-separated .csv file of the reactions of interest. It must have the following columns: "id", "formula", "EC".
  • metabolites: tab-separated .csv file of the metabolites that can be involved in the reactions. It must have the following columns, even if they are empty: "name", "hmdb", "kegg", "metanetx". If the file is empty, or a metabolite that appears in the reactions is missing from the file, only the name from the reaction and any information available in RetroRules are used; failing that, the metabolite is excluded.
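Before running GenerateInput.py it can help to check that both files carry the required columns. The sketch below uses only the standard library; `missing_columns` and `REQUIRED_COLUMNS` are hypothetical helpers, not part of PROXIMAL2, and the formula string in the usage example is a placeholder whose syntax is not taken from the repository.

```python
import csv
import io

# Column names as listed above for each input file.
REQUIRED_COLUMNS = {
    "reactions": ["id", "formula", "EC"],
    "metabolites": ["name", "hmdb", "kegg", "metanetx"],
}


def missing_columns(handle, kind):
    """Return the required columns absent from a tab-separated file's header."""
    reader = csv.reader(handle, delimiter="\t")
    header = next(reader, [])  # an empty file yields an empty header
    return [col for col in REQUIRED_COLUMNS[kind] if col not in header]


# Usage with in-memory text; open("input/<your file>.csv", newline="") works the same way.
reactions_text = "id\tformula\tEC\nR1\tA = B\t1.1.1.1\n"
print(missing_columns(io.StringIO(reactions_text), "reactions"))
print(missing_columns(io.StringIO("name\thmdb\n"), "metabolites"))
```

An empty list means the header is complete; otherwise the list names the columns to add.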

The OUTPUT files are also saved in the input folder, named:

  • reachableMolecules.csv: .csv containing the structures of the molecules included in the templates.
  • templateReactions.csv: .csv containing the definitions of the reaction templates.

2. ENZYME PROMISCUITY ANALYSIS

runPROX2.py: main script to run; it imports all the needed functions.

The proximal_functions folder contains 4 code files:

  • Operators2: creates the operators.
  • Products2: checks whether the operators can be applied to the query.
  • GenerateMolFiles2: creates the predicted products.
  • Common2: collects a few functions needed in the other steps.

PROXIMAL2 FILES AND GENERAL EXPLANATION

INPUT

The input files are imported in the APPLICATION FILES section.

  • molecules_of_interest = queries of interest, defined in a tab-separated .csv file with no header. The algorithm uses SMILES. If a query is given as InChI, the conversion to SMILES is already implemented (comment out that line if it is not needed).
  • metabolites = molecules included in the reaction templates, with their structures defined as SMILES. Generated by the GenerateInput.py script.
  • reaction_list = list of reactions of interest. Generated by the GenerateInput.py script.

The same section also defines the paths for the outputs (operators and products) and for the final list of compound pairs:

  • OP_CACHE_DIRECTORY: path to the operator output folder.
  • OUTPUT_DIRECTORY: path to the product output folder.
  • path_finalReactions: path where the final list of compound pairs is stored.

OUTPUT

A folder is created for each query of interest, labeled with the generated ID "MetX", where X is the number of the query in counting order. Within the query folder, a folder is created for each applied pair, containing the related products.
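As a small illustration of the naming scheme, the sketch below builds the expected query folder paths. The helper `query_folders` is hypothetical, and whether the counting starts at 0 or 1 is not stated here, so the start value is left as a parameter.

```python
from pathlib import Path


def query_folders(output_directory, n_queries, start=1):
    """Build the 'MetX' folder path for each query, in counting order.

    `start` is an assumption: check the actual output to see whether
    PROXIMAL2 numbers the first query 0 or 1.
    """
    base = Path(output_directory)
    return [base / f"Met{i}" for i in range(start, start + n_queries)]


for folder in query_folders("products", 3):
    print(folder.as_posix())
```

With three queries and `start=1` this lists products/Met1 through products/Met3.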

Each product output, stored in the folder of the related pair, is in JSON format and contains the following fields:

  • GeneratedProduct:
    • smiles: may not be generated. This means the algorithm can apply the modifications, but RDKit cannot make chemical sense of the prediction.
    • mol: mol text generated by the algorithm.
  • TemplateReaction:
    • ec: enzyme related to the template reaction pair.
    • ID: reaction IDs used for the prediction.
    • Substrate: Substrate of the used pair.
    • Product: Product of the used pair.
  • QueryInformation:
    • name: name of the molecule query.
    • ID: ID of the molecule query.
    • smiles: original Smiles of the query.
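The field layout above can be sketched as a JSON record and read back with the standard library. All values below are made up for illustration (an ethanol query and an invented reaction ID), the exact value types are assumptions (for example, ID may be a list in real output), and `has_product_smiles` is a hypothetical helper.

```python
import json

# Illustrative record only: the values are placeholders, not PROXIMAL2 output.
EXAMPLE_RECORD = """
{
  "GeneratedProduct": {"smiles": "CC=O", "mol": "<mol text>"},
  "TemplateReaction": {
    "ec": "1.1.1.1",
    "ID": "R1",
    "Substrate": "<template substrate>",
    "Product": "<template product>"
  },
  "QueryInformation": {"name": "ethanol", "ID": "Met1", "smiles": "CCO"}
}
"""


def has_product_smiles(record):
    """True when a product SMILES is present and non-empty.

    A missing, null, or empty value is treated as 'not generated'.
    """
    return bool(record.get("GeneratedProduct", {}).get("smiles"))


record = json.loads(EXAMPLE_RECORD)
print(record["QueryInformation"]["name"], has_product_smiles(record))
```

A check like this is useful for filtering out predictions for which only the mol text exists.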

STEPS

  • ExtractPairs: pair extraction to generate the proper associations and remove redundancy (the definitive list of reaction pairs is saved in the input folder).
  • GenerateOperators: operator generation.
  • GenerateProducts: check possible application to query.
  • GenerateMolFiles: generate the final product.

FILES INCLUDED

The input folder contains the following files and folders:

  • cofactors: .csv file with the name and InChI of some cofactors, used to remove them from the reactions.

INSTALLATION, REQUIREMENTS AND TEST APPLICATION

Set up the environment:

conda create -n p2 -c conda-forge -c bioconda rdkit pubchempy bioservices
conda activate p2
pip install kcfconvoy
conda install -c anaconda pandas scikit-learn
conda install -c conda-forge networkx=2.5

Optional: on an Intel CPU, install the performance enhancements:

conda install -c conda-forge scikit-learn-intelex

After downloading the PROXIMAL2 folder, download the RetroRules database (https://retrorules.org/dl/retrorules_dump) and extract it into the input folder before running the algorithm.

TEST

To run the algorithm with the test files, run runPROX2.py as it is. The products will be saved in the test/TESToutput/products folder.

Note that building the operators takes a couple of minutes for some reactions.

APPLICATION OF OTHER DATASETS

To investigate the promiscuity of other reactions, define the inputs as explained in the INPUT section above, comment out lines 16 to 22 of runPROX2.py, and modify line 29 as follows:

if True:
