Read more about the method in our [paper].
GenerateInput.py: script that generates the metabolite and reaction files used to build the operators. The queries must be created separately and independently. The script analyzes the reactions and removes the cofactors from the substrates and products. A reaction is removed if it only describes a transformation between cofactors, or if all of its substrates or all of its products are cofactors. To annotate the metabolites with chemical structures, the script parses the following databases: PubChem, HMDB, KEGG, MetaNetX, and RetroRules.
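The cofactor filtering described above can be sketched as follows. This is only an illustration, not the code in GenerateInput.py: the function name `strip_cofactors`, the representation of a reaction as two lists of metabolite names, the case-insensitive name matching, and the example cofactor set are all assumptions made for the example.

```python
from typing import Iterable, List, Optional, Tuple


def strip_cofactors(
    substrates: Iterable[str],
    products: Iterable[str],
    cofactors: Iterable[str],
) -> Optional[Tuple[List[str], List[str]]]:
    """Remove cofactors from both sides of a reaction.

    Returns the cleaned (substrates, products) lists, or None when the
    reaction should be discarded because one side (or both) contains
    only cofactors.
    """
    # Assumption: cofactors are matched by name, ignoring case.
    known = {name.lower() for name in cofactors}
    kept_substrates = [s for s in substrates if s.lower() not in known]
    kept_products = [p for p in products if p.lower() not in known]
    # An empty side means all substrates or all products were cofactors.
    if not kept_substrates or not kept_products:
        return None
    return kept_substrates, kept_products


# Hypothetical cofactor set, for illustration only.
example_cofactors = {"NAD+", "NADH", "H+", "ATP", "ADP"}

# Cofactors are stripped and the reaction is kept.
print(strip_cofactors(["ethanol", "NAD+"], ["acetaldehyde", "NADH", "H+"], example_cofactors))

# A transformation between cofactors only: the reaction is discarded.
print(strip_cofactors(["ATP"], ["ADP"], example_cofactors))
```

Returning `None` for a discarded reaction keeps the "remove the reaction" decision explicit for the caller.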
The script takes two files as INPUT, both of which must be saved in the input folder:
- reactions: .csv file of the reactions of interest, tab-separated. It must have the following columns: "id", "formula", "EC".
- metabolites: .csv file of the metabolites that can be involved in the reactions, tab-separated. It must have the following columns, even if they are empty: "name", "hmdb", "kegg", "metanetx". If the file is empty, or if a metabolite in the reactions is not present in the file, only the name from the reaction and any information available in RetroRules are used; otherwise, the metabolite is excluded.
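Before running GenerateInput.py, it can help to confirm that both input files carry the required columns. The check below is a minimal sketch using only the Python standard library; the helper name `missing_columns` and the two constants are hypothetical and are not part of PROXIMAL2. The column names themselves are the ones listed above.

```python
import csv
from typing import List, Sequence

# Required columns, as listed in the INPUT description.
REACTION_COLUMNS = ("id", "formula", "EC")
METABOLITE_COLUMNS = ("name", "hmdb", "kegg", "metanetx")


def missing_columns(path: str, required: Sequence[str]) -> List[str]:
    """Return the required column names absent from the header row
    of a tab-separated file."""
    with open(path, newline="", encoding="utf-8") as handle:
        # The first row is the header; a file with no rows yields an empty header.
        header = next(csv.reader(handle, delimiter="\t"), [])
    return [name for name in required if name not in header]
```

For example, `missing_columns(path_to_reactions, REACTION_COLUMNS)` returns an empty list when the reactions file has all three columns, where `path_to_reactions` stands for wherever the file was saved.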
The OUTPUT files are also saved in the input folder, named as follows:
- reachableMolecules.csv: .csv file containing the structures of the molecules included in the templates.
- templateReactions.csv: .csv file containing the definitions of the reaction templates.
runPROX2.py: main script to run; it imports all the required functions.
The proximal_functions folder contains 4 code files:
- Operators2: creates the operators.
- Products2: checks whether the operators can be applied to the query.
- GenerateMolFiles2: creates the predicted products.
- Common2: collects a few functions needed by the other steps.
The APPLICATION FILES section imports the files:
- molecules_of_interest = queries of interest, defined in a tab-separated .csv file with no header. The algorithm uses SMILES. If a query is given as InChI, the script already includes a line that generates the SMILES (comment that line out if it is not needed).
- metabolites = molecules included in the reaction templates, with their structures defined as SMILES. Generated by the GenerateInput.py script.
- reaction_list = list of reactions of interest. Generated by the GenerateInput.py script.
In addition, the script defines the paths for the outputs (operators and products) and for the final list of compound pairs:
- OP_CACHE_DIRECTORY: path to the operator output folder.
- OUTPUT_DIRECTORY: path to the product output folder.
- path_finalReactions: path where the final list of compound pairs is stored.
A folder is created for each query of interest, labeled with the generated ID "MetX", where X is the sequential number of the query. Within the query folder, one folder is created for each applied pair, containing the products related to that pair.
Each product written to the folder of its pair is stored in JSON format, with the following fields:
- GeneratedProduct:
  - smiles: may be missing. In that case the algorithm could apply the modifications, but RDKit could not make chemical sense of the prediction.
  - mol: mol text generated by the algorithm.
- TemplateReaction:
  - ec: enzyme related to the template reaction pair.
  - ID: reaction IDs used for the prediction.
  - Substrate: substrate of the pair used.
  - Product: product of the pair used.
- QueryInformation:
  - name: name of the query molecule.
  - ID: ID of the query molecule.
  - smiles: original SMILES of the query.
- ExtractPairs: extracts the pairs, generating the proper associations and removing redundancy (the definitive list of reaction pairs is saved in the input folder).
- GenerateOperators: generates the operators.
- GenerateProducts: checks whether the operators can be applied to the query.
- GenerateMolFiles: generates the final products.
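The product JSON fields listed above can be read back with a short script, for example to keep only the predictions for which a SMILES was generated. This is a sketch under stated assumptions: the function names `collect_products` and `valid_products` are hypothetical, the product files are assumed to have a `.json` extension, and each file is assumed to hold one JSON object with the fields nested under GeneratedProduct, TemplateReaction and QueryInformation.

```python
import json
from pathlib import Path
from typing import Dict, Iterator, List, Optional


def collect_products(output_dir: str) -> Iterator[Dict[str, Optional[str]]]:
    """Yield one summary per product JSON file found under output_dir.

    Assumption: products live in files ending in .json somewhere below
    the query ("MetX") and pair folders.
    """
    for json_path in sorted(Path(output_dir).rglob("*.json")):
        with open(json_path, encoding="utf-8") as handle:
            record = json.load(handle)
        product = record.get("GeneratedProduct", {})
        template = record.get("TemplateReaction", {})
        query = record.get("QueryInformation", {})
        yield {
            "file": str(json_path),
            "query": query.get("name"),
            "ec": template.get("ec"),
            # A missing or empty SMILES is normalized to None.
            "smiles": product.get("smiles") or None,
        }


def valid_products(output_dir: str) -> List[Dict[str, Optional[str]]]:
    """Keep only the products for which a SMILES was generated."""
    return [p for p in collect_products(output_dir) if p["smiles"]]
```

Calling `valid_products` on the folder configured as OUTPUT_DIRECTORY would then list the predictions that RDKit could convert to SMILES, while `collect_products` also reports those without one.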
The input folder contains the following files and folders:
- cofactors: .csv file with the name and InChI of some cofactors, used to remove them from the reactions.
Set up the environment:
conda create -n p2 -c conda-forge -c bioconda rdkit pubchempy bioservices
conda activate p2
pip install kcfconvoy
conda install -c anaconda pandas scikit-learn
conda install -c conda-forge networkx=2.5
Optional: on an Intel CPU, install the performance enhancements:
conda install -c conda-forge scikit-learn-intelex
After downloading the PROXIMAL2 folder, download the RetroRules database (https://retrorules.org/dl/retrorules_dump) and extract it into the input folder before running the algorithm.
To run the algorithm with the test files, run runPROX2.py as it is. The products will be saved in the test/TESToutput/products folder.
Note that building the operators can take a couple of minutes for some reactions.
To investigate the promiscuity of other reactions, define the inputs as explained in the INPUT section above, comment out lines 16 to 22, and modify line 29 of runPROX2.py as follows:
if True: