molfilevalidator.py
molfilevalidator.py validates .mol files contained in gzipped tar files. The program supports two modes, selected by the first argument on the command line, which are described below.
NOTE: molfilevalidator.py requires the OpenEye toolkit with a valid license.
In genmoleculedb mode a database of molecules is generated from one of two sources: either a directory of .mol files (specified by --moldir) or a CSV file containing ligand IDs and SMILES strings (specified by --molcsv).
Example invocation using --moldir on .mol files in the foo/ directory:
ls foo/
DVWB-FXR_10-1.mol DVWB-FXR_11-1.mol
export OE_LICENSE="/home/$USER/oe_license.txt"
molfilevalidator.py genmoleculedb --moldir ./foo --outputfile moleculedb.pickle
The above command writes the database to the pickle file moleculedb.pickle.
TODO show example generating database from CSV file
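Until that example is filled in, the sketch below illustrates how the CSV column options are documented to behave: --molcsvligandcol and --molcsvsmilecol are 0-offset column indexes that default to 0 and 1. The function name `read_ligand_smiles` and the sample rows are hypothetical, and the real script may treat headers or malformed rows differently.

```python
import csv
import io


def read_ligand_smiles(csv_file, ligand_col=0, smiles_col=1):
    """Return {ligand_id: smiles} from CSV rows, using 0-offset columns."""
    ligands = {}
    for row in csv.reader(csv_file):
        # Skip blank or short rows rather than raising IndexError.
        if len(row) <= max(ligand_col, smiles_col):
            continue
        ligands[row[ligand_col].strip()] = row[smiles_col].strip()
    return ligands


# Hypothetical CSV content: ligand ID in column 0, SMILES in column 1.
sample = io.StringIO("FXR_10,CCO\nFXR_11,c1ccccc1\n")
print(read_ligand_smiles(sample))
```

Going by the documented flags, the matching invocation would presumably be `molfilevalidator.py genmoleculedb --molcsv ligands.csv --outputfile moleculedb.pickle`, where ligands.csv is a hypothetical file name.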
In validate mode the program takes the molecule database produced by genmoleculedb (passed in via the --moleculedb flag) and validates all mol files found in the tarfile specified by the --usersubmission flag. Any issues found are output to standard out/error.
molfilevalidator.py validate --usersubmission someusersubmission.tgz \
--moleculedb moleculedb.pickle --skipligand XXX_33
Example of output:
Molecule Errors
------------
In file: blah-XXX_9-1.mol ligand: XXX_9
Unable to parse file: OEReadMolecule returned False when trying to read mol file
In file: blah-XXX_21-1.mol ligand: XXX_21
Number of heavy atoms and or molecular weight did not match
Expected 226 for non hydrogen atomic weight, but got 215
Expected atom map { atomic #: # atoms,...} {8: 1, 9: 2, 7: 3, 6: 27, 17: 1}, but got {8: 1, 9: 2, 6: 28, 7: 3}
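The numbers in this sample output are consistent with the "non hydrogen atomic weight" being the sum of atomic numbers over all heavy atoms: the expected map sums to 226 and the submitted one to 215. This is inferred from the sample output alone, not from the script's source. The helper names `atomic_number_sum` and `compare_atom_maps` below are hypothetical and only sketch such a check.

```python
def atomic_number_sum(atom_map):
    """Sum atomic number * atom count over a {atomic #: # atoms} map."""
    return sum(z * count for z, count in atom_map.items())


def compare_atom_maps(expected, actual):
    """Return {atomic #: (expected count, actual count)} where counts differ."""
    diffs = {}
    for z in sorted(set(expected) | set(actual)):
        if expected.get(z, 0) != actual.get(z, 0):
            diffs[z] = (expected.get(z, 0), actual.get(z, 0))
    return diffs


# Atom maps copied from the sample output above.
expected = {8: 1, 9: 2, 7: 3, 6: 27, 17: 1}
actual = {8: 1, 9: 2, 6: 28, 7: 3}

print(atomic_number_sum(expected))          # 226
print(atomic_number_sum(actual))            # 215
print(compare_atom_maps(expected, actual))  # {6: (27, 28), 17: (1, 0)}
```

Read this way, the submitted molecule has one extra carbon and is missing the chlorine.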
molfilevalidator.py groups errors into two categories: Ligand Errors and Molecule Errors. Ligand Errors involve problems extracting the ligand name from the mol file name, or major parsing problems such as a zero-size mol file. Molecule Errors pertain to problems with the mol file itself, such as a parsing error or a difference in the counts or total atomic weight of non-hydrogen atoms.
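As a rough illustration of the ligand name extraction, the sketch below takes the text between the first and second `-` of each .mol file name inside a gzipped tar file. The function names and the `skip` parameter are hypothetical; the real script may handle edge cases such as nested directories or odd file names differently.

```python
import tarfile


def ligand_from_filename(filename):
    """Return the text between the first and second '-' of the base name, or None."""
    base = filename.rsplit("/", 1)[-1]
    parts = base.split("-")
    if not base.endswith(".mol") or len(parts) < 3 or not parts[1]:
        return None
    return parts[1]


def mol_ligands_in_tar(tar_path, skip=()):
    """Map each .mol member name to its ligand ID, omitting skipped ligands."""
    result = {}
    with tarfile.open(tar_path, "r:gz") as tar:
        for member in tar.getmembers():
            if not member.isfile() or not member.name.endswith(".mol"):
                continue
            ligand = ligand_from_filename(member.name)
            if ligand is not None and ligand not in skip:
                result[member.name] = ligand
    return result


print(ligand_from_filename("blah-XXX_21-1.mol"))  # XXX_21
print(ligand_from_filename("DVWB-FXR_10-1.mol"))  # FXR_10
```

A file name with no second `-` returns None here, which corresponds loosely to the Ligand Errors category.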
usage: molfilevalidator.py [-h] [--moldir MOLDIR] [--molcsv MOLCSV]
[--molcsvligandcol MOLCSVLIGANDCOL]
[--molcsvsmilecol MOLCSVSMILECOL]
[--skipligand SKIPLIGAND]
[--usersubmission USERSUBMISSION]
[--outputfile OUTPUTFILE] [--moleculedb MOLECULEDB]
[--log {DEBUG,INFO,ERROR,ERROR,CRITICAL}]
[--version]
{validate,genmoleculedb}
Version 1.9.2
Performs mol file validation on files with .mol
in gzipped tar files.
This script runs in two modes: genmoleculedb & validate
These modes are set via the first argument passed into this script.
In general 'genmoleculedb' mode is run first and 'validate'
mode is run multiple times to perform the validation.
'genmoleculedb' mode takes a directory of .mol files or a CSV
file with SMILES strings and generates a
molecule database. This database is a pickle file
and is used to validate the mol files. The output
database is specified by the --outputfile flag.
This database basically is a dictionary of
Ligand names as parsed from the mol file
name XXX-####-XXX.mol where the ligand name is expected
to be the value between first and second - character.
Any problems found are output to standard out/err and
a non 0 exit code is returned.
'validate' mode takes the molecule database from genmoleculedb
(which is passed in via --moleculedb flag) and
validates all mol files found in the tarfile
specified by --usersubmission flag. It is assumed
all mol files have a file name format like this:
XXX-####-XXX.mol where #### between 1st and second -
is considered to be the Ligand ID.
Validation is done by comparing number and atomic
weight of non hydrogen atoms against the database.
Any problems found are output to standard out/err and
a non 0 exit code is returned.
For more information visit: http://www.drugdesigndata.org
positional arguments:
{validate,genmoleculedb}
Sets what mode script will run in. validate mode
checks a usersubmission set by --usersubmission flag
and genmoleculedb mode generates the molecule database
writing it to --outputfile
optional arguments:
-h, --help show this help message and exit
--moldir MOLDIR Directory containing mol files used to generate
database from
--molcsv MOLCSV CSV file sent to participants containingligand id and
Smile string for molecules.Used to generate molecule
database
--molcsvligandcol MOLCSVLIGANDCOL
Column containing ligand id in csv fileset via
--molcsv. 0 offset so 1st columnis 0 (default 0)
--molcsvsmilecol MOLCSVSMILECOL
Column containing SMILE string in csv fileset via
--molcsv. 0 offset so 1st columnis 0 (default 1)
--skipligand SKIPLIGAND
comma delimited list of ligands to skip
--usersubmission USERSUBMISSION
tar.gz file containing .mol files to validate
--outputfile OUTPUTFILE
Destination file to write molecule database
--moleculedb MOLECULEDB
Molecule database generated by --genmoleculedb mode
--log {DEBUG,INFO,ERROR,ERROR,CRITICAL}
Set the logging level (default ERROR)
--version show program's version number and exit
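Because problems are reported through a non-zero exit code, validation can be driven from another script. The sketch below assembles a validate-mode command from the flags listed above and checks the return code. It assumes molfilevalidator.py is on the PATH; the helper names `build_validate_command` and `submission_is_valid` are hypothetical.

```python
import subprocess


def build_validate_command(submission, moleculedb, skip_ligands=()):
    """Assemble the validate-mode argument list from the documented flags."""
    cmd = ["molfilevalidator.py", "validate",
           "--usersubmission", submission,
           "--moleculedb", moleculedb]
    if skip_ligands:
        # --skipligand takes a comma delimited list of ligands.
        cmd += ["--skipligand", ",".join(skip_ligands)]
    return cmd


def submission_is_valid(submission, moleculedb, skip_ligands=()):
    """Run the validator and return True when it exits with code 0."""
    cmd = build_validate_command(submission, moleculedb, skip_ligands)
    completed = subprocess.run(cmd, capture_output=True, text=True)
    # Problems are written to standard out/err alongside a non-zero exit code.
    if completed.returncode != 0:
        print(completed.stdout + completed.stderr)
    return completed.returncode == 0


print(build_validate_command("someusersubmission.tgz", "moleculedb.pickle", ["XXX_33"]))
```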