rxnutils is a collection of routines for working with reactions, reaction templates and template extraction
The package is currently divided into four sub-packages:
+chem - chemistry routines like template extraction or reaction cleaning
data - routines for manipulating various reaction data sources
pipeline - routines for building and executing simple pipelines for modifying and analyzing reactions
routes - routines for handling synthesis routes
Auto-generated API documentation is available, as well as guides for common tasks. See the menu to the left.
+For most users it is as simple as
+pip install reaction-utils
+
For developers, first clone the repository using Git.
+Then execute the following commands in the root of the repository
+conda env create -f env-dev.yml
+conda activate rxn-env
+poetry install
+
The rxnutils package is now installed in editable mode.
Lastly, make sure to install the pre-commit hooks that are run on every commit
+pre-commit install
+
Some old RDKit wheels on PyPI did not include the Contrib folder, preventing the usage of the rdkit_RxnRoleAssignment action
The pipeline for the Open Reaction Database requires some additional dependencies; see the documentation for this pipeline
Using the data pipelines for the USPTO and Open Reaction Database requires you to set up a second Python environment
The RInChI capabilities are not supported on macOS
rxnutils contains two pipelines that together import and prepare the reaction data from the Open Reaction Database so that it can be used for modelling.
It is a complete end-to-end pipeline that is designed to be transparent and reproducible.
The pipeline is divided into two blocks because the dependencies of the atom-mapper package (rxnmapper) are incompatible with the dependencies of the rxnutils package. Therefore, to be able to use the full pipeline, you need to set up two Python environments.
Install rxnutils
according to the instructions in the README-file
Install the ord-schema package in the rxnutils environment
conda activate rxn-env
python -m pip install ord-schema
+
Download/clone the ord-data repository according to the instructions here: https://github.com/Open-Reaction-Database/ord-data
git clone https://github.com/open-reaction-database/ord-data.git .
+
Note down the path to the repository as this needs to be given to the preparation pipeline
+Install rxnmapper
according to the instructions in the repo: https://github.com/rxn4chemistry/rxnmapper
conda create -n rxnmapper python=3.6 -y
+conda activate rxnmapper
+conda install -c rdkit rdkit=2020.03.3.0
+python -m pip install rxnmapper
+
Install Metaflow
and rxnutils
in the new environment
python -m pip install metaflow
+python -m pip install --no-deps --ignore-requires-python .
+
Create a folder for the ORD data and in that folder execute this command in the rxnutils
environment
conda activate rxn-env
+python -m rxnutils.data.ord.preparation_pipeline run --nbatches 200 --max-workers 8 --max-num-splits 200 --ord-data ORD_DATA_REPO_PATH
+
and then, in the environment with rxnmapper installed, run
conda activate rxnmapper
+python -m rxnutils.data.mapping_pipeline run --data-prefix ord --nbatches 200 --max-workers 8 --max-num-splits 200
+
The --max-workers flag should be set to the number of CPUs available.
On 8 CPUs and 1 GPU the pipeline takes a couple of hours.
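As a small illustration of the advice above, the worker count can be taken from the machine instead of being hard-coded. The sketch below only assembles the argument list for the mapping command shown earlier; the helper name build_mapping_command is hypothetical and not part of rxnutils.

```python
import os
from typing import List


def build_mapping_command(nbatches: int = 200) -> List[str]:
    """Assemble the mapping-pipeline command with one worker per available CPU."""
    # os.cpu_count() may return None on some platforms, so fall back to one worker
    workers = os.cpu_count() or 1
    return [
        "python", "-m", "rxnutils.data.mapping_pipeline", "run",
        "--data-prefix", "ord",
        "--nbatches", str(nbatches),
        "--max-workers", str(workers),
        "--max-num-splits", str(nbatches),
    ]


command = build_mapping_command()
print(" ".join(command))
```

The resulting list could be handed to subprocess.run in the rxnmapper environment, or simply printed and pasted into a shell.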
The pipelines create a number of tab-separated CSV files:
- ord_data.csv is the imported ORD data
- ord_data_cleaned.csv is the cleaned and filtered data
- ord_data_mapped.csv is the atom-mapped, modelling-ready data
Ignore extended SMILES information in the SMILES strings
Remove molecules not sanitizable by RDKit
Remove reactions without any reactants or products
Move all reagents to reactants
Remove the existing atom-mapping
Remove reactions with more than 200 atoms when summing reactants and products
(the last is a requirement of rxnmapper, which was trained on a maximum token size roughly corresponding to 200 atoms)
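Two of the cleaning steps listed above operate purely on the text of the reaction SMILES and can be sketched without RDKit. The functions below are an illustrative re-implementation, not the code used by rxnutils, and they assume the SMILES has the plain reactants>agents>products form with any extended SMILES information already trimmed.

```python
def reagents_to_reactants(reaction_smiles: str) -> str:
    """Move everything in the agent block into the reactant block."""
    reactants, reagents, products = reaction_smiles.split(">")
    # Skip empty blocks so that no stray "." ends up in the result
    merged = ".".join(part for part in (reactants, reagents) if part)
    return f"{merged}>>{products}"


def has_reactants_and_products(reaction_smiles: str) -> bool:
    """Return False for reactions that lack either reactants or products."""
    reactants, _, products = reaction_smiles.split(">")
    return bool(reactants) and bool(products)


print(reagents_to_reactants("CCO>O>CC=O"))
print(has_reactants_and_products(">>CC=O"))
```

The 200-atom limit and the sanitization check need a chemistry toolkit to count atoms and parse molecules, so they are left out of this sketch.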
The ord_data_mapped.csv file will have the following columns:
- ID - unique ID from the original database
- Dataset - the name of the dataset from which this reaction is taken
- Date - the date of the experiment as given in the database
- ReactionSmiles - the original reaction SMILES
- Yield - the yield of the first product of the first outcome, if provided
- ReactionSmilesClean - the reaction SMILES after cleaning
- BadMolecules - molecules not sanitizable by RDKit
- ReactantSize - number of atoms in reactants
- ProductSize - number of atoms in products
- mapped_rxn - the mapped reaction SMILES
- confidence - the confidence of the mapping as provided by rxnmapper
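Because the output is tab-separated, it can be read with the standard csv module by setting the delimiter. The sketch below keeps only rows whose mapping confidence reaches a threshold. The two sample rows and the 0.8 threshold are invented for illustration; only the column names come from the list above.

```python
import csv
import io

# Invented sample standing in for a few columns of ord_data_mapped.csv
SAMPLE = (
    "ID\tmapped_rxn\tconfidence\n"
    "example-1\t[CH3:1][OH:2]>>[CH3:1][O-:2]\t0.95\n"
    "example-2\t[CH3:1][Cl:2]>>[CH3:1][OH:3]\t0.40\n"
)


def read_confident_reactions(handle, threshold=0.8):
    """Return the rows whose rxnmapper confidence is at least the threshold."""
    reader = csv.DictReader(handle, delimiter="\t")
    return [row for row in reader if float(row["confidence"]) >= threshold]


confident = read_confident_reactions(io.StringIO(SAMPLE))
print([row["ID"] for row in confident])
```

For a real file, the io.StringIO object would be replaced by an open file handle, for example open("ord_data_mapped.csv", newline="").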
rxnutils provides a simple pipeline to perform simple tasks on reaction SMILES and templates in a CSV-file.
The pipeline works on tab-separated CSV files (TSV files).
To exemplify the pipeline capabilities, we will have a look at the pipeline used to clean the USPTO data.
The input to the pipeline is a simple YAML-file that specifies each action to take. The actions will be executed sequentially, one after the other, and each action takes a number of input arguments.
This is the YAML-file used to clean the USPTO data:
+trim_rxn_smiles:
+ in_column: ReactionSmiles
+ out_column: ReactionSmilesClean
+remove_unsanitizable:
+ in_column: ReactionSmilesClean
+ out_column: ReactionSmilesClean
+reagents2reactants:
+ in_column: ReactionSmilesClean
+ out_column: ReactionSmilesClean
+remove_atom_mapping:
+ in_column: ReactionSmilesClean
+ out_column: ReactionSmilesClean
+reactantsize:
+ in_column: ReactionSmilesClean
+productsize:
+ in_column: ReactionSmilesClean
+query_dataframe1:
+ query: "ReactantSize>0"
+query_dataframe2:
+ query: "ProductSize>0"
+query_dataframe3:
+ query: "ReactantSize+ProductSize<200"
+
The first action is called trim_rxn_smiles and two arguments are given: in_column, specifying which column to use as input, and out_column, specifying which column to use as output.
The following actions, remove_unsanitizable, reagents2reactants, remove_atom_mapping, reactantsize and productsize, work the same way, but might use other columns for output.
The last three actions are actually the same action but executed with different arguments. They therefore have to be postfixed with 1, 2 and 3.
The action query_dataframe takes a query argument and removes the rows not matching the query.
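The postfix is needed because the keys of a YAML mapping have to be unique, so the same action cannot appear twice under the same name. The sketch below shows one way a postfixed key could be mapped back to its action name, and applies the three queries to a few invented rows held as plain dictionaries. It is a hypothetical illustration and not necessarily how rxnutils resolves the names internally.

```python
import re
from typing import Callable, Dict


def base_action_name(key: str) -> str:
    """Strip a trailing numeric postfix, leaving digits inside the name alone."""
    return re.sub(r"\d+$", "", key)


# Invented rows; the real pipeline works on a pandas dataframe
rows = [
    {"ReactantSize": 12, "ProductSize": 10},
    {"ReactantSize": 0, "ProductSize": 10},
    {"ReactantSize": 150, "ProductSize": 120},
]

# Python equivalents of the three query strings in the YAML-file
filters: Dict[str, Callable[[dict], bool]] = {
    "query_dataframe1": lambda row: row["ReactantSize"] > 0,
    "query_dataframe2": lambda row: row["ProductSize"] > 0,
    "query_dataframe3": lambda row: row["ReactantSize"] + row["ProductSize"] < 200,
}

for key, keep in filters.items():
    rows = [row for row in rows if keep(row)]
    print(f"{key} -> {base_action_name(key)}: {len(rows)} rows left")
```

Only the first row survives all three filters: the second has no reactant atoms and the third exceeds the combined size limit.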
If we save this to clean_pipeline.yml
and given that we have a tab-separated file with USPTO data called uspto_data.csv
we can run the following command
python -m rxnutils.pipeline.runner --pipeline clean_pipeline.yml --data uspto_data.csv --output uspto_cleaned.csv
+
Alternatively, we can run it from Python like this
+from rxnutils.pipeline.runner import main as validation_runner
+
+validation_runner(
+ [
+ "--pipeline",
+ "clean_pipeline.yml",
+ "--data",
+ "uspto_data.csv",
+ "--output",
+ "uspto_cleaned.csv",
+ ]
+)
+
To find out what actions are available, you can type
+python -m rxnutils.pipeline.runner --list
+
New actions can easily be added to the pipeline framework. All of the actions are implemented in one of four modules:
- rxnutils.pipeline.actions.dataframe_mod - actions that modify the dataframe, e.g., removing rows or columns
- rxnutils.pipeline.actions.reaction_mod - actions that modify reaction SMILES
- rxnutils.pipeline.actions.dataframe_props - actions that compute properties from reaction SMILES
- rxnutils.pipeline.actions.templates - actions that process reaction templates
To exemplify, let’s have a look at the productsize
action
from dataclasses import dataclass
from typing import ClassVar

import pandas as pd
from rdkit import Chem

# Import path assumed from the package layout; adjust if it differs in your version
from rxnutils.pipeline.base import action, global_apply


@action
@dataclass
class ProductSize:
    """Action for counting product size"""

    pretty_name: ClassVar[str] = "productsize"
    in_column: str
    out_column: str = "ProductSize"

    def __call__(self, data: pd.DataFrame) -> pd.DataFrame:
        smiles_col = global_apply(data, self._row_action, axis=1)
        return data.assign(**{self.out_column: smiles_col})

    def __str__(self) -> str:
        return f"{self.pretty_name} (number of heavy atoms in product)"

    def _row_action(self, row: pd.Series) -> int:
        # A reaction SMILES has the form reactants>agents>products
        _, _, products = row[self.in_column].split(">")
        products_mol = Chem.MolFromSmiles(products)

        # MolFromSmiles returns None when the SMILES cannot be parsed
        if products_mol:
            product_atom_count = products_mol.GetNumHeavyAtoms()
        else:
            product_atom_count = 0

        return product_atom_count
+
The action is defined as a class ProductSize that has two class decorators.
The first, @action, will register the action in a global action list, and the second, @dataclass, is the dataclass decorator from the standard library.
The pretty_name class variable is used to identify the action in the pipeline; it is what you specify in the YAML-file.
The other two, in_column and out_column, are the arguments you can specify in the YAML file for executing the action. They can have default values in case they don’t need to be specified in the YAML file.
When the action is executed by the pipeline, the __call__ method is invoked with the current Pandas dataframe as the only argument. This method should return the modified dataframe.
Lastly, it is nice to implement a __str__ method, which is used by the pipeline to print useful information about the action that is executed.
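The registration mechanism described above can be sketched with the standard library alone. In the sketch below, the ACTIONS registry, the action decorator, create_action and the ReactionLength action are all hypothetical stand-ins written for illustration; they mirror the pattern and are not the rxnutils implementation. The action works on a list of dictionaries rather than a dataframe to keep the example dependency-free.

```python
from dataclasses import dataclass
from typing import ClassVar, Dict, List

# Global registry mapping pretty names to action classes
ACTIONS: Dict[str, type] = {}


def action(cls):
    """Register the class under its pretty_name and return it unchanged."""
    ACTIONS[cls.pretty_name] = cls
    return cls


@action
@dataclass
class ReactionLength:
    """Toy action that stores the length of the reaction SMILES string."""

    pretty_name: ClassVar[str] = "reaction_length"
    in_column: str
    out_column: str = "ReactionLength"

    def __call__(self, rows: List[dict]) -> List[dict]:
        # Return new rows instead of modifying the input in place
        return [{**row, self.out_column: len(row[self.in_column])} for row in rows]

    def __str__(self) -> str:
        return f"{self.pretty_name} (number of characters in the reaction SMILES)"


def create_action(name: str, **kwargs):
    """Look up an action by the name used in the YAML-file and instantiate it."""
    return ACTIONS[name](**kwargs)


step = create_action("reaction_length", in_column="ReactionSmiles")
result = step([{"ReactionSmiles": "CCO>>CC=O"}])
print(step, result)
```

Because pretty_name is annotated as a ClassVar, the dataclass decorator does not turn it into a constructor argument, which is why only in_column and out_column can be given as keyword arguments.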
rxnutils
contains routines to analyse synthesis routes. There are a number of readers that can be used to read routes from a number of
+formats, and there are routines to score the different routes.
The simplest route format supported is a text file, where each reaction is written as a reaction SMILES on its own line. Routes are separated by an empty line.
For instance:
+CC(C)N.Clc1cccc(Nc2ccoc2)n1>>CC(C)Nc1cccc(Nc2ccoc2)n1
+Brc1ccoc1.Nc1cccc(Cl)n1>>Clc1cccc(Nc2ccoc2)n1
+
+Nc1cccc(NC(C)C)n1.Brc1ccoc1>>CC(C)Nc1cccc(Nc2ccoc2)n1
+CC(C)N.Nc1cccc(Cl)n1>>Nc1cccc(NC(C)C)n1
+
If this is saved to routes.txt, these can be read into route objects with
from rxnutils.routes.readers import read_reaction_lists
routes = read_reaction_lists("routes.txt")
+
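To make the file layout concrete, the text format can be split into routes with a few lines of plain Python. This is a minimal sketch of the format only: it returns lists of reaction SMILES strings, not route objects, and parse_reaction_lists is a hypothetical helper, not the reader provided by rxnutils.

```python
from typing import List


def parse_reaction_lists(text: str) -> List[List[str]]:
    """Group non-empty lines into routes, using empty lines as separators."""
    routes, current = [], []
    for line in text.splitlines():
        line = line.strip()
        if line:
            current.append(line)
        elif current:
            # An empty line closes the route collected so far
            routes.append(current)
            current = []
    if current:
        routes.append(current)
    return routes


TEXT = """CC(C)N.Clc1cccc(Nc2ccoc2)n1>>CC(C)Nc1cccc(Nc2ccoc2)n1
Brc1ccoc1.Nc1cccc(Cl)n1>>Clc1cccc(Nc2ccoc2)n1

Nc1cccc(NC(C)C)n1.Brc1ccoc1>>CC(C)Nc1cccc(Nc2ccoc2)n1
CC(C)N.Nc1cccc(Cl)n1>>Nc1cccc(NC(C)C)n1
"""

reaction_lists = parse_reaction_lists(TEXT)
print(len(reaction_lists), [len(route) for route in reaction_lists])
```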
If you have an environment with rxnmapper
installed and the NextMove software namerxn
in your PATH then you can
+add atom-mapping and reaction classes to these routes with
# This can be set on the command-line as well
+import os
+os.environ["RXNMAPPER_ENV_PATH"] = "/home/username/miniconda/envs/rxnmapper/"
+
+for route in routes:
+ route.assign_atom_mapping(only_rxnmapper=True)
+routes[1].remap(routes[0])
+
The last line of code also makes sure that the second route shares its mapping with the first route.
+Other readers are available
+read_aizynthcli_dataframe
- for reading routes from aizynthcli output dataframe
read_reactions_dataframe
- for reading routes stored as reactions in a dataframe
For instance, to read routes from a dataframe with reactions, you can do something like what follows.
The dataframe has a column reaction_smiles that holds the reaction SMILES, and the individual routes are identified by a target_smiles and a route_id column. The dataframe also has a column classification, holding the NextMove classification. The dataframe is called data.
from rxnutils.routes.readers import read_reactions_dataframe
+routes = read_reactions_dataframe(
+ data,
+ "reaction_smiles",
+ group_by=["target_smiles", "route_id"],
+ metadata_columns=["classification"]
+)
+
Routines for augmenting chemical reactions
Augment single-reactant reaction with additional reagent if possible, based on the classification of the reaction
:param smiles: the reaction SMILES to augment
:param classification: the classification of the reaction or an empty string
:return: the processed SMILES
+smiles (str)
classification (str)
str
+Wrapper class for the CGRTools library
+Bases: object
The Condensed Graph of Reaction (CGR) representation of a reaction
+reaction_container – the CGRTools container of the reaction
cgr_container – the CGRTools container of the CGR
reaction (ChemicalReaction) – the reaction composed of RDKit molecule to start from
+ValueError – if it is not possible to create the CGR from the reaction
+Returns the number of broken bonds in the reaction
+Returns the number of broken or formed bonds in the reaction
+Returns the number of formed bonds in the reaction
+Returns the number of atom and bond centers in the reaction
Returns the chemical distance between two reactions, i.e. the absolute difference between the total number of centers.
+Used for some atom-mapping comparison statistics
+other (CondensedGraphReaction) – the reaction to compare to
+the computed distance
+int
+Module containing a class to handle chemical reactions
+Bases: Exception
Custom exception raised when failing operations on a chemical reaction
+Bases: object
Representation of chemical reaction
+smiles (str) – the reaction SMILES
id – an optional database ID of the reaction
clean_smiles (bool) – if True, will standardize the reaction SMILES
id_ (str)
Gives all the agents as strings
+Gives the canonical (forward) template
+Gives all products as strings
+Gives pseudo RInChI
+Gives a pseudo reaction InChI key
+Gives a reaction hashkey based on Reaction SMILES & reaction id.
+Gives all reactants as strings
+Gives the retro template
+Gives the reaction InChI
+Gives the long reaction InChI key
+Gives the short reaction InChI key
Extract un-mapped product atoms as extra reactant fragments
Extracts the forward (canonical) and retro reaction templates with the specified radius.
https://github.com/connorcoley/ochem_predict_nn/blob/master/data/generate_reaction_templates.py
https://github.com/connorcoley/rdchiral/blob/master/templates/template_extractor.py
radius (int) – the radius refers to the number of atoms away from the reaction centre to be extracted (the environment), i.e. radius = 1 (default) returns the first neighbours around the reaction centre
expand_ring (bool) – if True will include all atoms in the same ring as the reaction centre in the template
expand_hetero (bool) – if True will extend the template with all bonded hetero atoms
the canonical and retrosynthetic templates
+Tuple[ReactionTemplate, ReactionTemplate]
+Check product atom mapping.
+bool
+Check that the product is not among the reactants
+bool
+Checks to see if the product appears in the reactant set.
+Compares InChIs to rule out possible variations in SMILES notation.
True if the product is present in the reactants set, else False
+bool
+Checks to see if there is fuzziness in the reaction.
+True if there is fuzziness, False otherwise
+bool
+Checks if the reactant and product mol objects can be sanitized in RDKit.
The actual sanitization is carried out when the reaction is instantiated; this method will only check that all molecule objects were created.
+True if all the molecule objects were successfully created, else False
+bool
Checks whether the canonical template produces an outcome
+bool
+Checks whether the retrosynthetic template produces an outcome
+bool
+Checks whether the recorded reactants belong to the set of generated precursors.
selectivity, i.e. the fraction of generated precursors matching the recorded precursors:
1.0 - match or match.match or match.match.match etc.
0.5 - match.none or match.none.match.none etc.
0.0 - none
float
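Reading the examples above as one entry per generated outcome, the selectivity is the number of matching outcomes divided by the total number of outcomes. The sketch below implements that reading; it is an interpretation of the docstring rather than the library code, and the choice to return 0.0 for an empty list is an assumption.

```python
from typing import Sequence


def selectivity(outcomes: Sequence[bool]) -> float:
    """Fraction of generated precursor sets that match the recorded reactants."""
    if not outcomes:
        # No outcome at all is treated like "none" (assumption)
        return 0.0
    # True counts as 1 and False as 0 when summed
    return sum(outcomes) / len(outcomes)


print(selectivity([True, True]))
print(selectivity([True, False, True, False]))
print(selectivity([False]))
```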
+Module containing useful representations of templates
+Bases: object
Representation of a molecule created from a SMARTS string
+rd_mol (Mol) – the RDKit molecule to be represented by this class
smarts (str)
Generate the atom object of this molecule
+the next atom object
+Iterator[Atom]
+Calculate invariants on similar properties as in RDKit but ignore mass and add aromaticity
+a list of the atom invariants
+List[int]
+Return a dictionary with atomic properties
import pandas
pandas.DataFrame(my_mol.atom_properties())
+Dict[str, List[object]]
+Calculate the unique fingerprint bits
+Will sanitize molecule if necessary
+radius (int) – the radius of the Morgan calculation
use_chirality (bool) – determines if chirality should be taken into account
the set of unique bits
+Set[int]
Calculate the fingerprint bit vector
+Will sanitize molecule if necessary
+radius (int) – the radius of the Morgan calculation
nbits (int) – the length of the bit vector
use_chirality (bool) – determines if chirality should be taken into account
the bit vector
+ndarray
Copy over some properties from the SMARTS specification to the atom object:
1. Set the IsAromatic flag if a lower-case a is in the SMARTS
2. Fix formal charges
3. Explicit number of hydrogen atoms
Also extract the explicit degree from the SMARTS and store it in the comp_degree property.
+None
+Create a hash of the template based on a cleaned-up template SMILES string
+the hash string
+str
+Create a hash of the template based on a cleaned-up template SMARTS string
+the hash string
+str
+Remove the atom mappings from the molecule
+None
Will do selective sanitization - skip some procedures that cause problems due to “hanging” aromatic atoms
+SANITIZE_ADJUSTHS +SANITIZE_ALL +SANITIZE_CLEANUP +SANITIZE_CLEANUPCHIRALITY +SANITIZE_FINDRADICALS +SANITIZE_KEKULIZE +SANITIZE_NONE +SANITIZE_PROPERTIES +SANITIZE_SETAROMATICITY +SANITIZE_SETCONJUGATION +SANITIZE_SETHYBRIDIZATION +SANITIZE_SYMMRINGS
+None
+Bases: object
Representation of a reaction template created with RDChiral
+smarts (str) – the SMARTS string representation of the reaction
direction (str) – if equal to “retro” reverse the meaning of products and reactants
Applies the template on the given molecule
+mols (str) – the molecule as a SMILES
+the list of reactants
+Tuple[Tuple[str, …], …]
+Calculate the difference count of the fingerprint bits set of the reactants and products
+radius (int) – the radius of the Morgan calculation
use_chirality (bool) – determines if chirality should be taken into account
a dictionary of the difference count for each bit
+Dict[int, int]
+Calculate the difference fingerprint vector
+radius (int) – the radius of the Morgan calculation
nbits (int) – the length of the bit vector
use_chirality (bool) – determines if chirality should be taken into account
the bit vector
+ndarray
+Create a hash of the template based on the difference counts of the fingerprint bits
+radius (int) – the radius of the Morgan calculation
use_chirality (bool) – determines if chirality should be taken into account
the hash string
+str
+Create a hash of the template based on a cleaned-up template SMILES string
+the hash string
+str
+Create a hash of the template based on a cleaned-up template SMARTS string
+the hash string
+str
+Checks if the template is valid in RDKit
+bool
+Module containing various chemical utility routines
Given an RDKit molecule, this function returns a list of tuples, where each tuple contains the AtomIdx’s for a special group of atoms which should be included in a fragment all together. This should only be done for the reactants, otherwise the products might end up with mapping mismatches. We draw a distinction between atoms in groups that trigger that whole group to be included, and “unimportant” atoms in the groups that will not be included if another atom matches.
+List[Tuple[Tuple[int, …], Tuple[int, …]]]
+Returns True if a molecule has atom mapping, else False.
+smiles (str) – the SMILES/SMARTS representing the molecule
is_smarts (bool) – if True, will interpret the SMILES as a SMARTS
sanitize (bool) – if True, will sanitize the molecule
True if the SMILES string has atom-mapping, else False
+bool
+Returns a molecule without atom mapping
+smiles (str) – the SMILES/SMARTS representing the molecule
is_smarts (bool) – if True, will interpret the SMILES as a SMARTS
sanitize (bool) – if True, will sanitize the molecule
canonical (bool) – if False, will not canonicalize (applies to SMILES)
the molecule without atom-mapping
+str
+Remove atom mapping from a template SMARTS string
+template_smarts (str)
+str
+Neutralize a set of molecules using RDKit routines
+smiles_list (List[str]) – the molecules as SMILES
+the neutralized molecules
+List[str]
+Remove salts from a set of molecules using RDKit routines
+smiles_list (List[str]) – the molecules as SMILES
keep_something (bool) – if True will keep at least one salt
the desalted molecules
+List[str]
Test if two molecules are the same. First, the numbers of atoms and bonds are compared to guard the potentially more expensive substructure match. If mol1 is a substructure of mol2 and vice versa, the molecules are considered to be the same.
+mol1 – First molecule
mol2 – Second molecule for comparison
if the molecules match
+bool
+Return the numbers in the atom mapping
+smiles (str) – the molecule as SMILES
+the atom mapping numbers
+List[int]
Reassign reaction’s atom mapping. Remove atom maps for atoms in reactants and reagents not found in product’s atoms.
+rsmi (str) – Reaction SMILES
as_smiles (bool) – Return reaction SMILES or SMARTS, defaults to False
Reaction SMILES or SMARTS
+str
Join a part of reaction SMILES, e.g. reactants and products into components. Intra-molecular complexes are bracketed with parentheses
+smiles_list (List[str]) – the SMILES components
+the joined list
+str
Split a part of reaction SMILES, e.g. reactants or products, into components, taking care of intra-molecular complexes
Taken from RDKit: https://github.com/rdkit/rdkit/blob/master/Code/GraphMol/ChemReactions/DaylightParser.cpp
+smiles (str) – the SMILES/SMARTS
+the individual components.
+List[str]
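The splitting rule can be illustrated with a small parser that only splits on dots outside parentheses and unwraps a component that is entirely enclosed in one pair of parentheses. This is a simplified sketch written for illustration, with hypothetical function names; it is not the rxnutils or RDKit implementation and performs no validation of the SMILES.

```python
from typing import List


def _strip_group(component: str) -> str:
    """Remove the outer parentheses if they wrap the whole component."""
    if not component.startswith("("):
        return component
    depth = 0
    for pos, char in enumerate(component):
        if char == "(":
            depth += 1
        elif char == ")":
            depth -= 1
            if depth == 0:
                # Unwrap only when the first parenthesis closes at the very end
                return component[1:-1] if pos == len(component) - 1 else component
    return component


def split_components(smiles: str) -> List[str]:
    """Split on dots that are not nested inside parentheses."""
    components, depth, start = [], 0, 0
    for pos, char in enumerate(smiles):
        if char == "(":
            depth += 1
        elif char == ")":
            depth -= 1
        elif char == "." and depth == 0:
            components.append(smiles[start:pos])
            start = pos + 1
    components.append(smiles[start:])
    return [_strip_group(comp) for comp in components if comp]


print(split_components("CC(C)N.Clc1ccccn1"))
print(split_components("([Na+].[Cl-]).CCO"))
```

Ordinary branch parentheses such as the one in CC(C)N never contain a dot, so they do not affect where the string is split.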
Return reaction centre atoms, provided that the bonding partners actually change when comparing the environment in the reactant and the product
Inspired by code from Greg Landrum’s tutorial: set up array to remove atoms from the reaction centers by comparing the atom mapping in the reactant vs the products
+Original implementation by Christoph Bauer
+rxn (ChemicalReaction) – the initialized RDKit reaction
+tuple of reaction centre atoms, filtered by connectivity criterion
+Tuple[List[int], …]
+Module for downloading InChI Trust Reaction InChI.
+Bases: Exception
Exception raised by RInChI API
Check if Reaction InChI application is present. Download it if it’s required to do so.
Path of the folder containing the appropriate command line executable based on system type.
+str
+Module containing an API to the Reaction InChI program
+alias of RInChI
Generate RInChI from Reaction SMILES.
+reaction_smiles (str) – Reaction SMILES
+RInChIError – When there is an error with RInChI generation.
+Namedtuple with the generated RInChI.
+RInChI
+Module containing base class for data pipelines
+Bases: FlowSpec
Base-class for pipelines for processing data
+Bases: DataBaseFlow
Base pipeline for preparing datasets and doing clean-up
+Count and return the number of lines in a file
+filename (str)
+int
+filename (str)
nbatches (int)
read_func (Any)
write_func (Any)
combine_func (Any)
None
+Combine CSV batches to one master file
+The batch files are removed from disc
+filename (str) – the filename of the master file
nbatches (int) – the number of batches
None
+Combine sparse matrix batches to one master file
+The batch files are removed from disc
+filename (str) – the filename of the master file
nbatches (int) – the number of batches
None
Create batches for reading a split CSV-file
+Batch index
Start index
End index
filename (str) – the CSV file to make batches of
nbatches (int) – the number of batches
output_filename (str | None)
the created batches
+List[Tuple[int, int, int]]
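The shape of the returned batches, (batch index, start index, end index), can be illustrated with a sketch that divides a known number of lines as evenly as possible. The function name make_batches, the exclusive end index and the handling of the remainder are assumptions made for this example; the actual function may differ in these details.

```python
from typing import List, Tuple


def make_batches(nlines: int, nbatches: int) -> List[Tuple[int, int, int]]:
    """Divide nlines rows into at most nbatches contiguous (index, start, end) batches."""
    if nlines <= 0 or nbatches <= 0:
        return []
    # Never create more batches than there are lines
    nbatches = min(nbatches, nlines)
    size, remainder = divmod(nlines, nbatches)
    batches, start = [], 0
    for idx in range(nbatches):
        # The first `remainder` batches take one extra line each
        end = start + size + (1 if idx < remainder else 0)
        batches.append((idx, start, end))
        start = end
    return batches


print(make_batches(10, 3))
```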
+Read parts of a CSV file as specified by a batch
+filename (str) – the path to the CSV file on disc
batch (Tuple[int, ...] | None) – the batch specification as returned by create_csv_batches
kwargs (Any)
DataFrame
+Module containing script to atom-map USPTO or ORD reactions
+Function for command-line tool
+input_args (Sequence[str] | None)
+None
Module containing pipeline for mapping with rxnmapper. This needs to be run in an environment with rxnmapper installed
+Bases: DataBaseFlow
Pipeline for atom-mapping USPTO or ORD data with rxnmapper
+Setup batches for mapping
+Perform atom-mapping of reactions
+Join batches from mapping
+Final step, just print information
+Module containing script to import ORD dataset to a CSV file
+Function for command-line tool
+input_args (Sequence[str] | None)
+None
Module containing pipeline for extracting, transforming and cleaning Open Reaction Database data. This needs to be run in an environment with rxnutils installed
+Bases: DataPreparationBaseFlow
Pipeline for extracting ORD data and doing some clean-up
+Import ORD data
+Setup cleaning
+Perform cleaning of data
Combine cleaned batches of data
+Final step, just print information
+Module containing script to combine raw USPTO files
+preserve the ReactionSmiles and Year columns
create an ID from PatentNumber and ParagraphNum and row index in the original file
Function for command-line tool
+args (Sequence[str] | None)
+None
Module containing a script to download USPTO files from Figshare
+Function for command-line tool
+args (Sequence[str] | None)
+None
Module containing pipeline for downloading, transforming and cleaning USPTO data. This needs to be run in an environment with rxnutils installed
+Bases: DataPreparationBaseFlow
Pipeline for downloading USPTO source files, combining them and doing some clean-up
+Download USPTO data from Figshare
+Combine USPTO data files and add IDs
+Setup cleaning
+Perform cleaning of data
Combine cleaned batches of data
+Final step, just print information
+Code for curating USPTO yields.
+Inspiration from this code: https://github.com/DocMinus/Yield_curation_USPTO
This could potentially be an action, but since it only makes sense to use it with USPTO data, it resides here for now.
+Bases: object
Action for curating USPTO yield columns
+text_yield_column (str)
calc_yield_column (str)
out_column (str)
ReactionException
ChemicalReaction
ChemicalReaction.agents_list
ChemicalReaction.canonical_template
ChemicalReaction.products_list
ChemicalReaction.pseudo_rinchi
ChemicalReaction.pseudo_rinchi_key
ChemicalReaction.hashed_rid
ChemicalReaction.reactants_list
ChemicalReaction.retro_template
ChemicalReaction.rinchi
ChemicalReaction.rinchi_key_long
ChemicalReaction.rinchi_key_short
ChemicalReaction.generate_coreagent()
ChemicalReaction.generate_reaction_template()
ChemicalReaction.has_partial_mapping()
ChemicalReaction.is_complete()
ChemicalReaction.no_change()
ChemicalReaction.is_fuzzy()
ChemicalReaction.sanitization_check()
ChemicalReaction.canonical_template_generate_outcome()
ChemicalReaction.retro_template_generate_outcome()
ChemicalReaction.retro_template_selectivity()
TemplateMolecule
TemplateMolecule.atoms()
TemplateMolecule.atom_invariants()
TemplateMolecule.atom_properties()
TemplateMolecule.fingerprint_bits()
TemplateMolecule.fingerprint_vector()
TemplateMolecule.fix_atom_properties()
TemplateMolecule.hash_from_smiles()
TemplateMolecule.hash_from_smarts()
TemplateMolecule.remove_atom_mapping()
TemplateMolecule.sanitize()
ReactionTemplate
+SynthesisRoute
SynthesisRoute.mapped_root_smiles
SynthesisRoute.nsteps
SynthesisRoute.atom_mapped_reaction_smiles()
SynthesisRoute.assign_atom_mapping()
SynthesisRoute.chains()
SynthesisRoute.image()
SynthesisRoute.is_solved()
SynthesisRoute.leaves()
SynthesisRoute.reaction_data()
SynthesisRoute.reaction_ngrams()
SynthesisRoute.reaction_smiles()
SynthesisRoute.remap()
smiles2inchikey()
Module containing actions that modify the dataframe in some way
+Bases: object
Drops columns specified in ‘columns’
+yaml example:
+NRingChange
RingBondMade
columns (list[str])
+Bases: object
Action for dropping duplicates from dataframe
+key_columns (list[str])
+Bases: object
Drops rows according to boolean in ‘indicator_columns’. True => Keep, False => Drop.
+yaml example:
+is_sanitizable
is_sanitizable2
indicator_columns (list[str])
+Bases: object
Drops columns not specified in ‘columns’
+yaml example:
+id
classification
rsmi_processed
columns (list[str])
+Bases: object
Renames columns specified in ‘in_columns’ to the names specified in ‘out_columns’
+yaml example:
+column1
column2
column1_renamed
column2_renamed
in_columns (list[str])
out_columns (list[str])
Bases: object
Uses dataframe query to produce a new (smaller) dataframe. Query must conform to pandas.query()
yaml file example: Keeping only rows where the has_stereo column is True:
+query: has_stereo == True
+query (str)
+Bases: object
Stacks the specified in_columns under a new column name (out_column), multiplies the rest of the columns as appropriate
+yaml control file example:
+rsmi_processed
rsmi_inverted_stereo
out_column: rsmi_processed
+in_columns (list[str])
out_column (str)
Bases: object
Stacks the specified target_columns on top of the stack_columns after renaming.
+Example Yaml:
+rsmi_inverted_stereo
PseudoHash_inverted_stereo
rsmi_processed
PseudoHash
stack_columns (list[str])
target_columns (list[str])
Module containing actions on reactions that modify the reaction in some way
+Bases: ReactionActionMixIn
Action for desalting molecules
+in_column (str)
out_column (str)
keep_something (bool)
Bases: object
Action for calling namerxn
+in_column (str)
options (str)
nm_rxn_column (str)
nmc_column (str)
Bases: ReactionActionMixIn
Action for neutralizing molecules
+in_column (str)
out_column (str)
Bases: ReactionActionMixIn
Action for converting reactants to reagents
+in_column (str)
out_column (str)
Bases: object
Action for converting reagents to reactants
+in_column (str)
out_column (str)
Bases: ReactionActionMixIn
Action for removing all atom mapping
+in_column (str)
out_column (str)
Bases: object
Action for removing stereo information
+in_column (str)
out_column (str)
Bases: object
Action for inverting stereo information
+in_column (str)
out_column (str)
Bases: object
Action creating and modifying isotope information
+in_column (str)
isotope_column (str)
out_column (str)
match_regex (str)
sub_regex (str)
Bases: ReactionActionMixIn
Action for removing extra atom mapping
+in_column (str)
out_column (str)
Bases: ReactionActionMixIn
Compares the products with the reagents and reactants and remove unchanged products.
Protonation is considered a difference. As an example, if there’s a HCl in the reagents and a Cl- in the products, it will not be removed.
+in_column (str)
out_column (str)
Bases: ReactionActionMixIn
Action for removing unsanitizable reactions
+in_column (str)
out_column (str)
bad_column (str)
Bases: object
Action for assigning roles based on RDKit algorithm
+in_column (str)
out_column (str)
Bases: object
Action for mapping reactions with the RXNMapper tool
+in_column (str)
out_column (str)
rxnmapper_command (str)
Bases: ReactionActionMixIn
Action for splitting reaction into components
+in_column (str)
out_columns (list[str])
Bases: ReactionActionMixIn
Action for tagging disconnection site in products with atom-map ‘[<atom>:1]’.
+in_column (str)
out_column (str)
Bases: ReactionActionMixIn
Action for converting atom-map tagging to exclamation mark tagging.
+yaml example:
+in_column_tagged: products_atom_map_tagged
+in_column_untagged: products
+out_column_tagged: products_tagged
+out_column_reconstructed: products_reconstructed
+in_column (str)
out_column_tagged (str)
out_column_reconstructed (str)
Bases: object
Action for trimming reaction SMILES
+in_column (str)
out_column (str)
smiles_column_index (int)
Module containing actions on reactions that don’t modify the reactions but only compute properties of them
+Bases: ReactionActionMixIn
Action for counting reaction components
+in_column (str)
nreactants_column (str)
nmapped_reactants_column (str)
nreagents_column (str)
nmapped_reagents_column (str)
nproducts_column (str)
nmapped_products_column (str)
Bases: object
Action for counting elements in reactants
+in_column (str)
out_column (str)
Bases: object
Action for checking stereo info
+in_column (str)
out_column (str)
Bases: object
Action for flagging if reaction has any unmapped radical atoms
+in_column (str)
out_column (str)
Bases: object
Action for flagging if reaction has any unsanitizable reactants
+rsmi_column (str)
bad_columns (List[str])
out_column (str)
Bases: object
Action for determining if a CGR can be created from the reaction smiles
+in_column (str)
out_column (str)
Bases: object
Action for calculating the number of dynamic bonds
+in_column (str)
out_column (str)
Bases: object
Action for collecting statistics of product atom mapping
+in_column (str)
unmapped_column (str)
widow_column (str)
Bases: object
Action for counting product size
+in_column (str)
out_column (str)
Bases: object
Action for creating a reaction hash based on InChI keys
+in_column (str)
out_column (str)
no_reagents (bool)
Bases: object
Action for creating a reaction hash based on SMILES
+in_column (str)
out_column (str)
Bases: object
Action for computing atom balance
+in_column (str)
out_column (str)
Bases: object
Action for counting reactant size
+in_column (str)
out_column (str)
Bases: object
Action for calculating the maximum number of rings in either the product or the reactant.
+For a reaction without reactants or products, it returns 0 to enable easy arithmetic comparison.
+in_column (str)
out_column (str)
Bases: object
Action for calculating if reaction has change in number of rings
+A positive number from this action implies that a ring was formed during the reaction
+in_column (str)
out_column (str)
Bases: object
Action for flagging if reaction has made a ring bond in the product
+in_column (str)
out_column (str)
Bases: object
Action for computing the size of a newly formed ring
+in_column (str)
out_column (str)
Bases: object
Action for counting SMILES length
+in_column (str)
out_column (str)
Bases: object
Action for checking if SMILES are sanitizable
+in_column (str)
out_column (str)
Bases: object
Flags reactions where non-stereo compounds (No “@”s in SMILES) +turn into stereo compounds (containing “@”)
+in_column (str)
out_column (str)
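The stereo-invention flag described above can be approximated with plain string operations. The sketch below assumes a reaction SMILES of the form reactants>reagents>products and treats the presence of "@" as the only stereo marker. The function name `invents_stereo` is hypothetical, and the actual action may differ in details such as how reagents are treated.

```python
def invents_stereo(reaction_smiles: str) -> bool:
    """Flag a reaction whose reactants lack '@' but whose products contain it."""
    # A reaction SMILES has the form reactants>reagents>products.
    reactants, _reagents, products = reaction_smiles.split(">")
    return "@" not in reactants and "@" in products


# A ketone reduced to a chiral alcohol is flagged ...
print(invents_stereo("CC(=O)CC>>C[C@H](O)CC"))
# ... but a reaction that already starts from a chiral reactant is not.
print(invents_stereo("C[C@H](O)CC>>C[C@H](Cl)CC"))
```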
Bases: object
Action for checking if a stereogenic centre in the reaction centre changes during the reaction
+A boolean column indicating True or False if it has stereochanges
A description of the stereo information before and after the reaction
in_column (str)
out_column (str)
stereo_changes_column (str)
Bases: object
Action for checking if reagent has stereo centres
+in_column (str)
out_column (str)
Bases: object
Action for checking if stereo centre is created during reaction
+in_column (str)
out_column (str)
Bases: object
Action for checking if stereo centre is removed during reaction
+in_column (str)
out_column (str)
Bases: object
Action for checking if there is a potential stereo centre in the reaction
+Does not consider changes to bond stereochemistry
+in_column (str)
out_column (str)
Bases: object
Action for checking if there is a stereo centre outside the reaction centre
+in_column (str)
out_column (str)
Bases: object
Action for checking if the product is a meso compound
+in_column (str)
out_column (str)
Module containing template validation actions
+Bases: object
Action for counting template components
+in_column (str)
nreactants_column (str)
nreagents_column (str)
nproducts_column (str)
Bases: object
Action for checking template reproduction
+template_column (str)
smiles_column (str)
expected_reactants_column (str)
other_reactants_column (str)
noutcomes_column (str)
DropColumns
+DropDuplicates
+DropRows
+KeepColumns
+RenameColumns
+QueryDataframe
+StackColumns
+StackMultiColumns
+DesaltMolecules
+NameRxn
+NeutralizeMolecules
+ReactantsToReagents
+ReagentsToReactants
+RemoveAtomMapping
+RemoveStereoInfo
+InvertStereo
+IsotopeInfo
+RemoveExtraAtomMapping
+RemoveUnchangedProducts
+RemoveUnsanitizable
+RDKitRxnRoles
+RxnMapper
+SplitReaction
+AtomMapTagDisconnectionSite
+ConvertAtomMapDisconnectionTag
+TrimRxnSmiles
+CountComponents
+CountElements
+HasStereoInfo
+HasUnmappedRadicalAtom
+HasUnsanitizableReactants
+CgrCreated
+CgrNumberOfDynamicBonds
+ProductAtomMappingStats
+ProductSize
+PseudoReactionHash
+PseudoSmilesHash
+ReactantProductAtomBalance
+ReactantSize
+MaxRingNumber
+RingNumberChange
+RingBondMade
+RingMadeSize
+SmilesLength
+SmilesSanitizable
+StereoInvention
+StereoCentreChanges
+StereoHasChiralReagent
+StereoCenterIsCreated
+StereoCenterIsRemoved
+StereoCenterInReactantPotential
+StereoCenterOutsideReaction
+StereoMesoProduct
+Module containing routines for the validation framework
+Decorator that registers a callable as a validation action.
+An action will be called with a pandas.DataFrame object +and return a new pandas.DataFrame object.
+An action needs to have an attribute pretty_name.
+obj (Callable[[DataFrame], DataFrame]) – the callable to register as an action
+the same as obj.
+Callable[[DataFrame], DataFrame]
+List all available actions in a nice table
+short (bool)
+None
+Create an action that can be called
+pretty_name (str) – the name of the action
args (Any)
kwargs (Any)
the instantiated actions
+Callable[[DataFrame], DataFrame]
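The register-then-create pattern described above can be sketched as follows. This is a minimal illustration and not the rxnutils implementation: the names `register_action`, `make_action` and `SmilesLengthSketch` are hypothetical, and a list of dictionaries stands in for the pandas.DataFrame so that the example has no dependencies.

```python
from typing import Any, Callable, Dict, List

Rows = List[Dict[str, Any]]  # stand-in for pandas.DataFrame in this sketch
_REGISTRY: Dict[str, Callable[..., Any]] = {}


def register_action(obj):
    """Register a callable under its pretty_name and return it unchanged."""
    if not hasattr(obj, "pretty_name"):
        raise ValueError("An action needs to have an attribute pretty_name")
    _REGISTRY[obj.pretty_name] = obj
    return obj


def make_action(pretty_name: str, *args: Any, **kwargs: Any) -> Callable[[Rows], Rows]:
    """Instantiate a registered action from its name and options."""
    if pretty_name not in _REGISTRY:
        raise KeyError(f"Unknown action: {pretty_name}")
    return _REGISTRY[pretty_name](*args, **kwargs)


@register_action
class SmilesLengthSketch:
    """Toy action that adds a column with the length of a SMILES string."""

    pretty_name = "smiles_length_sketch"

    def __init__(self, in_column: str, out_column: str = "SmilesLength") -> None:
        self.in_column = in_column
        self.out_column = out_column

    def __call__(self, rows: Rows) -> Rows:
        # Return new rows rather than modifying the input in place.
        return [{**row, self.out_column: len(row[self.in_column])} for row in rows]


action = make_action("smiles_length_sketch", in_column="smiles")
print(action([{"smiles": "CCO"}]))
```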
+Bases: object
Mixin class with standard routines for splitting and joining reaction SMILES
+Join list of components into a reaction SMILES
+reactants_list (List[str]) – the list of reactant SMILES
reagents_list (List[str]) – the list of reagent SMILES
products_list (List[str]) – the list of product SMILES
the concatenated reaction SMILES
+str
+Join component SMILES into a reaction SMILES
+reactants (str) – the reactant SMILES
reagents (str) – the reagent SMILES
products (str) – the product SMILES
the concatenated reaction SMILES
+str
+Split a reaction SMILES into list of component SMILES
+row (Series) – the row with the SMILES
+the list of SMILES of the components
+Tuple[List[str], List[str], List[str]]
+Split a reaction SMILES into components SMILES
+row (Series) – the row with the SMILES
+the SMILES of the components
+Tuple[str, str, str]
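The splitting and joining routines above follow from the layout of a reaction SMILES, reactants>reagents>products, with molecules separated by dots. The sketch below works directly on strings rather than on a pandas row, and the function names are hypothetical. It also assumes that any extended SMILES suffix has already been removed.

```python
from typing import List, Tuple


def split_components(rsmi: str) -> Tuple[str, str, str]:
    """Split a reaction SMILES into reactant, reagent and product SMILES."""
    reactants, reagents, products = rsmi.split(">")
    return reactants, reagents, products


def split_lists(rsmi: str) -> Tuple[List[str], List[str], List[str]]:
    """Split a reaction SMILES into lists of molecule SMILES."""
    reactants, reagents, products = split_components(rsmi)
    # An empty component becomes an empty list, not [""].
    return (
        reactants.split(".") if reactants else [],
        reagents.split(".") if reagents else [],
        products.split(".") if products else [],
    )


def join_lists(reactants: List[str], reagents: List[str], products: List[str]) -> str:
    """Join lists of molecule SMILES back into a reaction SMILES."""
    return ">".join(".".join(items) for items in (reactants, reagents, products))


parts = split_lists("CC(=O)Cl.OCC>c1ccncc1>CC(=O)OCC")
print(parts)
print(join_lists(*parts))
```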
+Module containing routines and interface to run pipelines
+Run a given pipeline on a dataset
+The actions are applied sequentially as they are defined in the pipeline
+The intermediate results of the pipeline will be written to separate +tab-separated CSV files.
+data (DataFrame) – the dataset
pipeline (Dict[str, Any]) – the action specifications
filename (str) – path to the final output file
save_intermediates (bool) – if True will save intermediate results
the dataset after completing the pipeline
+DataFrame
+Function for command line argument
+args (Sequence[str] | None)
+None
+Contains a class encapsulating a synthesis route, +as well as routines for assigning proper atom-mapping +and drawing the route
+Bases: object
This encapsulates a synthesis route or a reaction tree. It provides convenient methods for assigning atom-mapping to the reactions, and for providing reaction-level data of the route.
+It is typically initialized by one of the readers in the rxnutils.routes.readers module.
+The tree depth and the forward step is automatically assigned +to each reaction node.
+The max_depth attribute holds the longest-linear-sequence (LLS)
+reaction_tree (Dict[str, Any]) – the tree structure representing the route
+Return the atom-mapped SMILES of the root compound
+Will raise an exception if the route is just a single compound, or if the route has not been assigned atom-mapping.
+Return the number of reactions in the route
+Returns a list of the atom-mapped reaction SMILES in the route
+List[str]
+Assign atom-mapping to each reaction in the route and +ensure that it is consistent from root compound and throughout +the route.
+It will use NameRxn to assign classification and possibly atom-mapping, as well as rxnmapper to assign atom-mapping in case NameRxn cannot classify a reaction.
+overwrite (bool) – if True will overwrite existing mapping
only_rxnmapper (bool) – if True will disregard NameRxn mapping and use only rxnmapper
None
+Returns linear sequences or chains extracted from the route.
+Each chain is a list of dictionaries representing the molecules. Only the most complex molecule is kept for each reaction, making the chain a sequence of molecule-to-molecule transformations.
+The first chain will be the longest linear sequence (LLS), and the second chain will be the longest branch if this is a convergent route. This branch will be processed further, but the other branches can probably be discarded as they have not been investigated thoroughly.
+complexity_func (Callable[[str], float]) – a function that takes a SMILES and returns a +complexity metric of the molecule
+a list of chains where each chain is a list of molecules
+List[List[Dict[str, Any]]]
+Depict the route.
+show_atom_mapping (bool) – if True, will show the atom-mapping
factory_kwargs (Dict[str, Any] | None) – additional keyword arguments sent to the RouteImageFactory
the image of the route
+Image
+Find if this route is solved, i.e. if all starting materials are in stock.
+To be accurate, each molecule node needs to have an extra boolean property called in_stock.
+bool
+Extract a set with the SMILES of all the leaf nodes, i.e. +starting material
+a set of SMILES strings
+Set[str]
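Both the solved check and the leaf extraction described above amount to a walk over the tree down to the molecule nodes without children. The sketch below assumes an AiZynthFinder-like nested dictionary with "type", "smiles", "in_stock" and "children" keys. The helper names are hypothetical, and a leaf without an in_stock property is treated here as not in stock, which is an assumption.

```python
from typing import Any, Dict, Iterator, Set


def iter_leaves(node: Dict[str, Any]) -> Iterator[Dict[str, Any]]:
    """Yield all molecule nodes that have no children (starting material)."""
    children = node.get("children") or []
    if node.get("type") == "mol" and not children:
        yield node
    for child in children:
        yield from iter_leaves(child)


def leaf_smiles(tree: Dict[str, Any]) -> Set[str]:
    """Collect the SMILES of all leaf molecules."""
    return {leaf["smiles"] for leaf in iter_leaves(tree)}


def is_solved(tree: Dict[str, Any]) -> bool:
    """A route is solved if every leaf molecule is marked as in stock."""
    return all(leaf.get("in_stock", False) for leaf in iter_leaves(tree))


example_route = {
    "type": "mol",
    "smiles": "CC(=O)OCC",
    "children": [
        {
            "type": "reaction",
            "children": [
                {"type": "mol", "smiles": "CC(=O)Cl", "in_stock": True},
                {"type": "mol", "smiles": "OCC", "in_stock": False},
            ],
        }
    ],
}
print(leaf_smiles(example_route))
print(is_solved(example_route))
```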
+Returns a list of dictionaries for each reaction +in the route. This is metadata of the reactions +augmented with reaction SMILES and depth of the reaction
+List[Dict[str, Any]]
+Extract an n-gram representation of the route by building up n-grams +of the reaction metadata.
+nitems (int) – the length of the gram
metadata_key (str) – the metadata to extract
the collected n-grams
+List[Tuple[Any, …]]
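As an illustration of the n-gram idea, the sketch below builds n-grams from a flat list of metadata values, as one would get from a purely linear route. It is a simplification: the actual method works on the reaction tree, and the function name and the template hash labels are hypothetical.

```python
from typing import Any, List, Sequence, Tuple


def linear_ngrams(values: Sequence[Any], nitems: int) -> List[Tuple[Any, ...]]:
    """Return all consecutive windows of length nitems."""
    if nitems < 1:
        raise ValueError("nitems must be at least 1")
    # A sequence shorter than nitems gives no n-grams at all.
    return [tuple(values[i : i + nitems]) for i in range(len(values) - nitems + 1)]


# Hypothetical template hashes of three consecutive reactions
hashes = ["t1", "t2", "t3"]
print(linear_ngrams(hashes, 2))
```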
+Returns a list of the un-mapped reaction SMILES
+augment (bool) – if True will add reagents to single-reactant reagents whenever possible
+List[str]
+Remap the reaction so that it follows the mapping of a +1) root compound in a reference route, 2) a ref compound given +as a SMILES, or 3) using a raw mapping
+other (SynthesisRoute | str | Dict[int, int]) – the reference for re-mapping
+None
+Converts a SMILES to an InChI key
+smiles (str)
ignore_stereo (bool)
str
+Contains routines for computing route similarities
+Returns the geometric mean of the simple bond forming similarity, and +the atom matching bonanza similarity
+routes (Sequence[SynthesisRoute]) – the sequence of routes to compare
+the pairwise similarity
+ndarray
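The combination described above is an element-wise geometric mean of two pairwise similarity matrices. The sketch below uses nested lists instead of NumPy arrays to stay dependency-free; the function name and the example values are made up for illustration.

```python
import math
from typing import List

Matrix = List[List[float]]


def geometric_mean_similarity(bond_sim: Matrix, atom_sim: Matrix) -> Matrix:
    """Element-wise geometric mean of two equally shaped similarity matrices."""
    return [
        [math.sqrt(b * a) for b, a in zip(bond_row, atom_row)]
        for bond_row, atom_row in zip(bond_sim, atom_sim)
    ]


bond = [[1.0, 0.25], [0.25, 1.0]]
atom = [[1.0, 1.0], [1.0, 1.0]]
print(geometric_mean_similarity(bond, atom))
```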
+Calculates the pairwise similarity of a sequence of routes +based on the overlap of the atom-mapping numbers of the compounds +in the routes.
+routes (Sequence[SynthesisRoute]) – the sequence of routes to compare
+the pairwise similarity
+ndarray
+Calculates the pairwise similarity of a sequence of routes +based on the overlap of formed bonds in the reactions.
+routes (Sequence[SynthesisRoute]) – the sequence of routes to compare
+the pairwise similarity
+ndarray
+Return a callable that, given a list of routes as dictionaries, calculates the squared distance matrix
+model (str) – the route distance model name
kwargs (Any) – additional keyword arguments for the model
the appropriate route distances calculator
+Callable[[Sequence[SynthesisRoute]], ndarray]
+This module contains a collection of routines to produce pretty images
+Create a pretty image of a molecule, +with a colored frame around it
+mol (Chem.rdchem.Mol) – the molecule
frame_color (PilColor) – the color of the frame
size (int) – the size of the image
the produced image
+PilImage
+Create pretty images of molecules with a colored frame around each one of them.
+The molecules will be resized to be of similar sizes.
+smiles_list – the molecules
frame_colors (Sequence[PilColor]) – the color of the frame for each molecule
size (int) – the sub-image size
draw_kwargs (Dict[str, Any]) – additional keyword-arguments sent to MolsToGridImage
mols (Sequence[Chem.rdchem.Mol])
the produced images
+List[PilImage]
+Crop an image by removing white space around it
+img (PilImage) – the image to crop
margin (int) – padding, defaults to 20
the cropped image
+PilImage
+Draw a rounded rectangle around an image
+img (PilImage) – the image to draw upon
color (PilColor) – the color of the rectangle
arc_size (int) – the size of the corner, defaults to 20
the new image
+PilImage
+Bases: object
Factory class for drawing a route
+route (Dict[str, Any]) – the dictionary representation of the route
in_stock_colors (FrameColors) – the colors around molecules, defaults to {True: “green”, False: “orange”}
show_all (bool) – if True, also show nodes that are marked as hidden
margin (int) – the margin between images
mol_size (int) – the size of the molecule
mol_draw_kwargs (Dict[str, Any]) – additional arguments sent to the drawing routine
replace_mol_func (Callable[[Dict[str, Any]], None]) – an optional function to replace molecule images
Routines for reading routes from various formats
+Read one or more simple lists of reactions into one or more +retrosynthesis trees.
+Each list of reactions should be separated by an empty line. +Each row of each reaction should contain the reaction SMILES (reactants>>products) +and nothing else.
+Example:
+A.B>>C
+D.E>>B
+
+A.X>>Y
+Z>>X
+A
+E
+the path to the file with the reactions
+the list of the created trees
+filename (str)
+List[SynthesisRoute]
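The file layout described above, with lists of reactions separated by an empty line, can be parsed with a few lines of code. The sketch below only groups the reaction SMILES into blocks and does not build the retrosynthesis trees; the function name is hypothetical.

```python
from typing import List


def parse_reaction_blocks(text: str) -> List[List[str]]:
    """Group non-empty lines into blocks separated by empty lines."""
    blocks: List[List[str]] = []
    current: List[str] = []
    for line in text.splitlines():
        line = line.strip()
        if line:
            current.append(line)
        elif current:
            # An empty line closes the current list of reactions.
            blocks.append(current)
            current = []
    if current:
        blocks.append(current)
    return blocks


content = "A.B>>C\nD.E>>B\n\nA.X>>Y\nZ>>X\n"
print(parse_reaction_blocks(content))
```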
+Read routes as produced by the aizynthcli tool of the AiZynthFinder package.
+data (DataFrame) – the dataframe as output by aizynthcli
+the created routes
+Series
+Read routes from reactions stored in a pandas dataframe. The different +routes are groupable by one or more column. Additional metadata columns +can be extracted from the dataframe as well.
+The dataframe is grouped by the columns specified by group_by and then one route is extracted from each subset dataframe. The function returns a series with the routes, which is indexable by the columns in the group_by list.
+data (DataFrame) – the dataframe with reaction data
smiles_column (str) – the column with the reaction SMILES
group_by (List[str]) – the columns that uniquely identify each route
metadata_column – additional columns to be added as metadata to each route
metadata_columns (List[str] | None)
the created series with route.
+Series
+Convert a list of reactions into a retrosynthesis tree
+This is based on matching partial InChI keys of the reactants in one +reaction with the partial InChI key of a product.
+list of reaction SMILES
+the created trees
+reactions (Sequence[str])
metadata (Sequence[Dict[str, Any]] | None)
filename (str)
+Routines for scoring synthesis routes
+Scores and sorts a list of routes. Returns a tuple of the sorted routes and their scores.
+routes (List[SynthesisRoute]) – the routes to score
scorer (Callable[[...], float]) – the scorer function
kwargs (Any) – additional argument given to the scorer
the sorted routes and their scores
+Tuple[List[SynthesisRoute], List[float]]
+Compute the rank of route scores. Rank starts at 1
+scores (List[float]) – the route scores
+a list of ranks for each route
+List[int]
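A ranking that starts at 1 can be sketched as below. Two things are assumptions here because the text above does not specify them: a higher score is treated as better, and tied scores share the same rank (dense ranking). The function name is hypothetical.

```python
from typing import List


def dense_ranks(scores: List[float]) -> List[int]:
    """Rank scores from 1, with the highest score first and ties sharing a rank."""
    distinct = sorted(set(scores), reverse=True)
    rank_of = {score: rank for rank, score in enumerate(distinct, start=1)}
    return [rank_of[score] for score in scores]


print(dense_ranks([0.9, 0.5, 0.9, 0.1]))
```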
+Calculate the score of a route using the method from (Badowski et al. Chem Sci. 2019, 10, 4640).
+The reaction cost is constant and the yield is an average yield. +The starting materials are assigned a cost based on whether they are in +stock or not. By default starting material in stock is assigned a +cost of 1 and starting material not in stock is assigned a cost of 10.
+To be accurate, each molecule node needs to have an extra boolean property called in_stock.
+route (SynthesisRoute) – the route to analyze
mol_costs (Dict[bool, float] | None) – the starting material cost
average_yield (float) – the average yield, defaults to 0.8
reaction_cost (float) – the reaction cost, defaults to 1.0
the computed cost
+float
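To make the roles of the parameters concrete, the sketch below shows one way a constant reaction cost, an average yield and stock-dependent starting material costs can be combined recursively. The recursion used here, in which the cost of a product is the reaction cost plus the cost of its precursors, divided by the yield, is an assumption for illustration and is not taken from the rxnutils source or verified against the paper. The route is assumed to be an AiZynthFinder-like nested dictionary, and `route_cost_sketch` is a hypothetical name.

```python
from typing import Any, Dict, Optional


def route_cost_sketch(
    node: Dict[str, Any],
    mol_costs: Optional[Dict[bool, float]] = None,
    average_yield: float = 0.8,
    reaction_cost: float = 1.0,
) -> float:
    """Recursive cost of a molecule node (illustrative recursion only)."""
    costs = mol_costs or {True: 1.0, False: 10.0}
    reactions = node.get("children") or []
    if not reactions:
        # Starting material: the cost depends only on the in_stock flag.
        return costs[bool(node.get("in_stock", False))]
    precursors = reactions[0].get("children") or []
    total = reaction_cost + sum(
        route_cost_sketch(child, costs, average_yield, reaction_cost)
        for child in precursors
    )
    return total / average_yield


one_step = {
    "type": "mol",
    "smiles": "CC(=O)OCC",
    "children": [
        {
            "type": "reaction",
            "children": [
                {"type": "mol", "smiles": "CC(=O)Cl", "in_stock": True},
                {"type": "mol", "smiles": "OCC", "in_stock": False},
            ],
        }
    ],
}
print(route_cost_sketch(one_step))
```

With the default values, the single step above costs (1 + 1 + 10) / 0.8, so a starting material that is not in stock dominates the result.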
+Contains routines for creating, reading, and writing n-gram collections
+Can be run as a module to create a collection from a set of routes:
+python -m rxnutils.routes.retro_bleu.ngram_collection --filename routes.json --output ngrams.json --nitems 2 --metadata template_hash
+
Bases: object
Class to create, read, and write a collection of n-grams
+nitems (int) – the length of each n-gram
metadata_key (str) – the key used to extract the n-grams from the reactions
ngrams (Set[Tuple[str, ...]]) – the extracted n-grams
Read an n-gram collection from a JSON-file
+filename (str) – the path to the file
+the n-gram collection
+Make a n-gram collection by extracting them from a collection of +synthesis routes.
+filename (str) – the path to a file with a list of synthesis routes
nitems (int) – the length of the gram
metadata_key (str) – the metadata to extract
the n-gram collection
+Save an n-gram collection to a JSON-file
+filename (str) – the path to the file
+None
+Contains routines to score routes according to the Retro-BLEU paper
+Calculate the fractional n-gram overlap of the n-grams in the given +route and the reference n-gram collection
+route (SynthesisRoute) – the route to score
ref (NgramCollection) – the reference n-gram collection
the calculated score
+float
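A fractional overlap of this kind can be sketched as the share of the route n-grams that also occur in the reference collection. Whether repeated n-grams are counted once or several times, and what is returned for a route without n-grams, is not stated above. The sketch counts every occurrence and returns 0.0 for an empty route, both of which are assumptions. The function name is hypothetical.

```python
from typing import List, Set, Tuple

Ngram = Tuple[str, ...]


def fractional_overlap(route_ngrams: List[Ngram], ref_ngrams: Set[Ngram]) -> float:
    """Fraction of the route n-grams that are found in the reference set."""
    if not route_ngrams:
        return 0.0
    hits = sum(1 for ngram in route_ngrams if ngram in ref_ngrams)
    return hits / len(route_ngrams)


reference = {("t1", "t2"), ("t2", "t3")}
print(fractional_overlap([("t1", "t2"), ("t2", "t9")], reference))
```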
+Calculate the Retro-BLEU score according to the paper:
+Li, Junren, Lei Fang, and Jian-Guang Lou. “Retro-BLEU: quantifying chemical plausibility of retrosynthesis routes through reaction template sequence analysis”. Digital Discovery 3, no. 3 (2024): 482–90. https://doi.org/10.1039/D3DD00219E.
+route (SynthesisRoute) – the route to score
ref (NgramCollection) – the reference n-gram collection
ideal_steps (int) – a length-penalization hyperparameter (see Eq 2 in ref)
the calculated score
+float
+Module containing a method to compute a distance matrix using the tree edit distance (TED)
+Compute the TED distances between each pair of routes
+routes (Sequence[SynthesisRoute]) – the routes to calculate pairwise distance on
content (str) – determine what part of the synthesis trees to include in the calculation, +options ‘molecules’, ‘reactions’ and ‘both’, default ‘both’
timeout (int | None) – if given, raises an exception if the calculation takes longer than the timeout
the square distance matrix
+ndarray
+Module containing helper classes to compute the distance between two reaction trees using the APTED method. Since APTED is based on ordered trees and the reaction trees are unordered, plenty of heuristics are implemented to deal with this.
+Bases: object
Wrapper for a reaction tree that can calculate distances between +trees.
+route (SynthesisRoute) – the synthesis route to wrap
content (Union[str, TreeContent]) – the content of the route to consider in the distance calculation, default ‘molecules’
exhaustive_limit (int) – if the number of possible ordered trees are below this limit create them all, default 20
fp_factory (Callable[[StrDict, Optional[StrDict]], None]) – the factory of the fingerprint, Morgan fingerprint for molecules and reactions by default
dist_func (Callable[[np.ndarray, np.ndarray], float]) – the distance function to use when renaming nodes
Return a dictionary with internal information about the wrapper
+Return the first created ordered tree
+Return a list of all created ordered trees
+Iterate over all distances computed between this and another tree
+There are three possible ways to enumerate the distances, depending on the number of possible ordered trees for the two routes that are compared:
+If the product of the number of possible ordered trees for both routes is below exhaustive_limit, compute the distance between all pairs of trees
If both self and other have been fully enumerated (i.e. all ordered trees have been created), compute the distances between all trees of the route with the most ordered trees and the first tree of the other route
Compute exhaustive_limit number of distances by shuffling the child order for each of the routes.
The rules are applied top-to-bottom.
+other (ReactionTreeWrapper) – another tree to calculate distance to
exhaustive_limit (int) – used to determine what type of enumeration to do
the next computed distance between self and other
+Iterable[float]
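The three rules above can be read as a simple decision function that is evaluated top-to-bottom. The sketch below only selects the enumeration strategy and does not compute any distances. The function name, the argument names and the returned labels are hypothetical.

```python
def choose_enumeration(
    n_trees_self: int,
    n_trees_other: int,
    self_fully_enumerated: bool,
    other_fully_enumerated: bool,
    exhaustive_limit: int = 20,
) -> str:
    """Pick the enumeration strategy; the first matching rule wins."""
    # Rule 1: few enough combinations to compare every pair of ordered trees.
    if n_trees_self * n_trees_other < exhaustive_limit:
        return "all_pairs"
    # Rule 2: both routes are fully enumerated, so compare all trees of the
    # larger route against the first tree of the other route.
    if self_fully_enumerated and other_fully_enumerated:
        return "all_vs_first"
    # Rule 3: fall back to a fixed number of random child orderings.
    return "shuffle"


print(choose_enumeration(3, 4, True, True))
print(choose_enumeration(10, 10, True, True))
print(choose_enumeration(1000, 1000, False, False))
```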
+Calculate the minimum distance from this route to another route
+Enumerate the distances using distance_iter.
+other (ReactionTreeWrapper) – another tree to calculate distance to
exhaustive_limit (int) – used to determine what type of enumeration to do
the minimum distance
+float
+Compute the distance to another tree by simply sorting the children of both trees. This is not guaranteed to return the minimum distance.
+other (ReactionTreeWrapper) – another tree to calculate distance to
+the distance
+float
+Module containing utilities for TED calculations
+Bases: str, Enum
Possibilities for distance calculations on reaction trees
+Bases: Config
This is a helper class for the tree edit distance +calculation. It defines how the substitution +cost is calculated and how to obtain children nodes.
+randomize (bool) – if True, the children will be shuffled
sort_children (bool) – if True, the children will be sorted
dist_func (Callable[[np.ndarray, np.ndarray], float]) – the distance function used for renaming nodes, Jaccard by default
Calculates the cost of renaming the label of the source node +to the label of the destination node
+node1 (Dict[str, Any])
node2 (Dict[str, Any])
float
+Returns children of node
+node (Dict[str, Any])
+List[Dict[str, Any]]
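The default renaming cost mentioned above is a Jaccard distance between the fingerprints of two nodes. A minimal sketch on plain bit lists is given below. The function name is hypothetical, and returning 0.0 for two all-zero fingerprints is an assumption made to avoid a division by zero.

```python
from typing import Sequence


def jaccard_distance(fp1: Sequence[int], fp2: Sequence[int]) -> float:
    """1 - |intersection| / |union| of the bits that are set."""
    both = sum(1 for a, b in zip(fp1, fp2) if a and b)
    either = sum(1 for a, b in zip(fp1, fp2) if a or b)
    if either == 0:
        return 0.0
    return 1.0 - both / either


print(jaccard_distance([1, 1, 0, 0], [1, 0, 1, 0]))
```

Identical fingerprints give a cost of 0 and fingerprints without any shared bits give a cost of 1, so the renaming cost stays between 0 and 1.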
+Bases: object
Calculate Morgan fingerprint for molecules, and difference fingerprints for reactions
+radius (int) – the radius of the fingerprint
nbits (int) – the fingerprint lengths
Module containing routines to validate AiZynthFinder-like input dictionaries
+Bases: BaseModel
Node representing a reaction
+data (Any)
+A dictionary of computed field names and their corresponding ComputedFieldInfo objects.
+Configuration for the model, should be a dictionary conforming to [ConfigDict][pydantic.config.ConfigDict].
+Metadata about the fields defined on the model, +mapping of field names to [FieldInfo][pydantic.fields.FieldInfo] objects.
+This replaces Model.__fields__ from Pydantic V1.
+Bases: BaseModel
Node representing a molecule
+smiles (str)
type (Annotated[str, StringConstraints(strip_whitespace=None, to_upper=None, to_lower=None, strict=None, min_length=None, max_length=None, pattern=^mol$)])
children (Annotated[List[ReactionNode], Len(min_length=1, max_length=1)] | None)
A dictionary of computed field names and their corresponding ComputedFieldInfo objects.
+Configuration for the model, should be a dictionary conforming to [ConfigDict][pydantic.config.ConfigDict].
+Metadata about the fields defined on the model, +mapping of field names to [FieldInfo][pydantic.fields.FieldInfo] objects.
+This replaces Model.__fields__ from Pydantic V1.
+Check that the route dictionary is a valid structure
+dict – the route as dictionary
dict_ (Dict[str, Any])
None
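The constraints listed for the molecule node (a SMILES string, a type equal to "mol", and at most a single reaction as child) can be checked recursively without pydantic, as sketched below. The layout of the reaction node, with a type equal to "reaction" and a list of molecule children, is an assumption, since only its data argument is listed above. The function names are hypothetical.

```python
from typing import Any, Dict


def check_molecule(node: Dict[str, Any]) -> None:
    """Raise ValueError if a molecule node breaks the constraints."""
    if node.get("type") != "mol":
        raise ValueError("Molecule node must have type 'mol'")
    if not isinstance(node.get("smiles"), str):
        raise ValueError("Molecule node must have a SMILES string")
    children = node.get("children")
    if children is None:
        return  # leaf molecule, nothing more to check
    if len(children) != 1:
        raise ValueError("Molecule node must have exactly one reaction child")
    check_reaction(children[0])


def check_reaction(node: Dict[str, Any]) -> None:
    """Raise ValueError if a reaction node breaks the assumed layout."""
    if node.get("type") != "reaction":
        raise ValueError("Reaction node must have type 'reaction'")
    for child in node.get("children") or []:
        check_molecule(child)


good_route = {
    "type": "mol",
    "smiles": "CC(=O)OCC",
    "children": [
        {
            "type": "reaction",
            "children": [
                {"type": "mol", "smiles": "CC(=O)Cl"},
                {"type": "mol", "smiles": "OCC"},
            ],
        }
    ],
}
check_molecule(good_route)  # passes silently
```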
+reaction utils contains routines for extracting reaction templates using the RDchiral package. This code is based on the work of Thakkar et al. (Chem. Sci., 2019) but with some refactoring and other additions.
+Let’s start with this atom-mapped reaction
+ +CCN(CC)CC.CCOCC.Cl[S:3]([CH2:2][CH3:1])(=[O:4])=[O:5].[OH:6][CH2:7][CH2:8][Br:9]>>[CH3:1][CH2:2][S:3](=[O:4])(=[O:5])[O:6][CH2:7][CH2:8][Br:9]
+
First we create a ChemicalReaction object that encapsulates the reaction and provides some simple curation routines.
from rxnutils.chem.reaction import ChemicalReaction
+
+reaction = "CCN(CC)CC.CCOCC.Cl[S:3]([CH2:2][CH3:1])(=[O:4])=[O:5].[OH:6][CH2:7][CH2:8][Br:9]>>[CH3:1][CH2:2][S:3](=[O:4])(=[O:5])[O:6][CH2:7][CH2:8][Br:9]"
+rxn = ChemicalReaction(reaction)
+
If you inspect the reactants_list property, you will see that two of the reactants from the reaction SMILES have been moved to the list of agents because they are not mapped.
rxn.reactants_list
+>> ['Cl[S:3]([CH2:2][CH3:1])(=[O:4])=[O:5]', '[OH:6][CH2:7][CH2:8][Br:9]']
+
+rxn.agents_list
+>> ['CCN(CC)CC', 'CCOCC']
+
Now we can extract a reaction template
+rxn.generate_reaction_template(radius=1)
+
+rxn.retro_template
+>> <rxnutils.chem.template.ReactionTemplate at 0x7fe4e9488d90>
+
+rxn.retro_template.smarts
+>> '[C:2]-[S;H0;D4;+0:1](=[O;D1;H0:3])(=[O;D1;H0:4])-[O;H0;D2;+0:6]-[C:5]>>Cl-[S;H0;D4;+0:1](-[C:2])(=[O;D1;H0:3])=[O;D1;H0:4].[C:5]-[OH;D1;+0:6]'
+
The radius is an optional argument, specifying the radius of the template.
The reaction template, either the canonical (forward) or the retro template, is encapsulated in a ReactionTemplate object that can be used to apply the template to a molecule or to generate fingerprints or hash strings.
+Let’s see if the template generated above is capable of re-generating the expected reactants.
+smiles="CCS(=O)(=O)OCCBr"
+reactant_list = rxn.retro_template.apply(smiles)
+reactant_list
+>> (('CCS(=O)(=O)Cl', 'OCCBr'),)
+
We see that the returned list (technically a tuple) contains one item, implying that the template was specific and only produced one set of reactants. These reactants, as you can see, are identical to the reactants in the reaction SMILES above.
+To create a hash string for the template, there are a number of routines
+rxn.retro_template.hash_from_bits()
+>> 'a1727cc9ed68a6411bfd02873c1615c22baa1af4957f14ae942e2c85caf9adb5'
+
+rxn.retro_template.hash_from_smarts()
+>> '4cb9be0738a3a84e7ed4fb661d2efb73c099fc7d6c532a4b294c8d0d'
+
+rxn.retro_template.hash_from_smiles()
+>> '5b2ff2a69fb7bd6a032938e468684773bcc668928b037bbec0ac8335'
+
The first one creates the hash string from the fingerprint bits that are set to one, whereas the other two create it by hashing the SMARTS and the SMILES string, respectively.
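To illustrate what hashing a SMARTS string means, the sketch below uses the standard hashlib module. The 56-character hashes shown above have the length of a SHA-224 hex digest, but which hash function and which string normalization the package uses is an assumption here, so the output is not expected to reproduce the values above. The function name is hypothetical.

```python
import hashlib


def hash_string_sketch(text: str) -> str:
    """Hash a string, e.g. a template SMARTS, into a fixed-length hex digest."""
    return hashlib.sha224(text.encode("utf-8")).hexdigest()


smarts = (
    "[C:2]-[S;H0;D4;+0:1](=[O;D1;H0:3])(=[O;D1;H0:4])-[O;H0;D2;+0:6]-[C:5]"
    ">>Cl-[S;H0;D4;+0:1](-[C:2])(=[O;D1;H0:3])=[O;D1;H0:4].[C:5]-[OH;D1;+0:6]"
)
digest = hash_string_sketch(smarts)
print(len(digest), digest)
```

The same input always gives the same digest, which is what makes such a hash usable as a compact identifier for a template.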
+A Morgan fingerprint can be computed for a reaction template:
+rxn.retro_template.fingerprint_vector(radius=2, nbits=1024)
+>> array([0., 0., 0., ..., 0., 0., 0.])
+
rxnutils contains two pipelines that together download and prepare the USPTO reaction data so that it can be used for modelling. It is a complete end-to-end pipeline that is designed to be transparent and reproducible.
+The pipeline is divided into two blocks because the dependencies of the atom-mapper package (rxnmapper) are incompatible with the dependencies of the rxnutils package. Therefore, to be able to use the full pipeline, you need to set up two Python environments.
Install rxnutils according to the instructions in the README file
Install rxnmapper according to the instructions in the repo: https://github.com/rxn4chemistry/rxnmapper
conda create -n rxnmapper python=3.6 -y
+conda activate rxnmapper
+conda install -c rdkit rdkit=2020.03.3.0
+python -m pip install rxnmapper
+
Install Metaflow and rxnutils in the new environment
python -m pip install metaflow
+python -m pip install --no-deps --ignore-requires-python .
+
Create a folder for the USPTO data and, in that folder, execute this command in the rxnutils environment
conda activate rxn-env
+python -m rxnutils.data.uspto.preparation_pipeline run --nbatches 200 --max-workers 8 --max-num-splits 200
+
and then, in the environment with rxnmapper, run
conda activate rxnmapper
+python -m rxnutils.data.mapping_pipeline run --data-prefix uspto --nbatches 200 --max-workers 8 --max-num-splits 200
+
The --max-workers flag should be set to the number of CPUs available.
On 8 CPUs and 1 GPU the pipeline takes a couple of hours.
+The pipelines create a number of tab-separated CSV files:
+- 1976_Sep2016_USPTOgrants_smiles.rsmi and 2001_Sep2016_USPTOapplications_smiles.rsmi are the original USPTO data downloaded from Figshare
+- uspto_data.csv is the combined USPTO data, with selected columns and a unique ID for each reaction
+- uspto_data_cleaned.csv is the cleaned and filtered data
+- uspto_data_mapped.csv is the atom-mapped, modelling-ready data
Ignore extended SMILES information in the SMILES strings
Remove molecules not sanitizable by RDKit
Remove reactions without any reactants or products
Move all reagents to reactants
Remove the existing atom-mapping
Remove reactions with more than 200 atoms when summing reactants and products
(the last is a requisite for rxnmapper, which was trained on a maximum token size roughly corresponding to 200 atoms)
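Two of the cleaning steps listed above, ignoring the extended SMILES information and moving all reagents to the reactants, can be illustrated with plain string handling. The sketch below assumes that the extended information follows the reaction SMILES after a space, and the function name is hypothetical. The other steps need RDKit and are not covered here.

```python
def clean_reaction_smiles(raw: str) -> str:
    """Drop extended SMILES information and merge reagents into reactants."""
    # Everything after the first space is treated as extended information.
    core = raw.strip().split(" ")[0]
    reactants, reagents, products = core.split(">")
    # Skip empty components so that no stray dots are produced.
    merged = ".".join(part for part in (reactants, reagents) if part)
    return f"{merged}>>{products}"


print(clean_reaction_smiles("CC(=O)Cl.OCC>c1ccncc1>CC(=O)OCC |f:0.1|"))
```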
The uspto_data_mapped.csv file will have the following columns:
+- ID - unique ID created by concatenating the patent number, paragraph and row index in the original data file
+- Year - the year of the patent filing
+- ReactionSmiles - the original reaction SMILES
+- ReactionSmilesClean - the reaction SMILES after cleaning
+- BadMolecules - molecules not sanitizable by RDKit
+- ReactantSize - number of atoms in reactants
+- ProductSize - number of atoms in products
+- mapped_rxn - the mapped reaction SMILES
+- confidence - the confidence of the mapping as provided by rxnmapper