
molfilevalidator.py

Chris Churas edited this page Oct 4, 2017 · 19 revisions

molfilevalidator.py validates .mol files contained in gzipped tar files. The program runs in one of two modes, selected by the first command-line argument; both are described below.

NOTE: molfilevalidator.py requires the OpenEye toolkits and a valid OpenEye license.

Molecule database generation mode (genmoleculedb)

In this mode a database of molecules is generated from one of two sources: a directory of .mol files (specified by --moldir) or a CSV file containing ligand IDs and SMILES strings (specified by --molcsv).

Example invocation using --moldir on the .mol files in the foo/ directory:

ls foo/
DVWB-FXR_10-1.mol  DVWB-FXR_11-1.mol

export OE_LICENSE="/home/$USER/oe_license.txt"

molfilevalidator.py genmoleculedb --moldir ./foo --outputfile moleculedb.pickle

The above command writes the database to the pickle file moleculedb.pickle.

TODO show example generating database from CSV file
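
To illustrate the CSV input, here is a minimal Python sketch of how ligand IDs and SMILES strings might be read using the zero-based column defaults documented for --molcsvligandcol (0) and --molcsvsmilecol (1). It is not the program's own code: the function name, the sample rows, and the assumption that the file has no header row are all hypothetical.

```python
import csv
import io

def read_ligand_smiles(csv_text, ligand_col=0, smiles_col=1):
    """Map ligand ID -> SMILES string using zero-based column indexes."""
    ligands = {}
    for row in csv.reader(io.StringIO(csv_text)):
        # Skip blank lines and rows too short to hold both columns.
        if len(row) <= max(ligand_col, smiles_col):
            continue
        ligands[row[ligand_col].strip()] = row[smiles_col].strip()
    return ligands

# Hypothetical sample data; real ligand IDs and SMILES strings will differ.
SAMPLE = "FXR_10,CCO\nFXR_11,c1ccccc1\n"
ligand_smiles = read_ligand_smiles(SAMPLE)
print(ligand_smiles)
```

If the columns are in a different order, pass the matching indexes, just as --molcsvligandcol and --molcsvsmilecol would be set on the command line.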

Validation mode (validate)

In this mode the program takes the molecule database produced by genmoleculedb mode (passed in via the --moleculedb flag) and validates all .mol files found in the tar file specified by the --usersubmission flag. Any issues found are written to standard output/error, and a non-zero exit code is returned.

molfilevalidator.py validate --usersubmission someusersubmission.tgz \
--moleculedb moleculedb.pickle  --skipligand XXX_33
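
To check which files in a submission would be candidates for validation before running the command, the .mol members of a gzipped tar file can be listed with Python's standard tarfile module. This is a sketch under assumptions, not the program's implementation: the helper names are hypothetical, and the archive is built in memory only so that the example runs without real data.

```python
import io
import tarfile

def mol_members(fileobj):
    """Return names of regular .mol files inside a gzipped tar archive."""
    with tarfile.open(fileobj=fileobj, mode="r:gz") as tar:
        return sorted(m.name for m in tar.getmembers()
                      if m.isfile() and m.name.endswith(".mol"))

def build_sample_archive():
    """Build a tiny in-memory .tgz containing placeholder files."""
    buf = io.BytesIO()
    with tarfile.open(fileobj=buf, mode="w:gz") as tar:
        for name in ("sub/blah-XXX_9-1.mol", "sub/notes.txt"):
            data = b"placeholder"
            info = tarfile.TarInfo(name)
            info.size = len(data)
            tar.addfile(info, io.BytesIO(data))
    buf.seek(0)
    return buf

found = mol_members(build_sample_archive())
print(found)
```

For a file on disk, open it in binary mode and pass the file object, for example `mol_members(open("someusersubmission.tgz", "rb"))`.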

Example of output:

Molecule Errors
------------

In file: blah-XXX_9-1.mol ligand: XXX_9
	Unable to parse file: OEReadMolecule returned False when trying to read mol file

In file: blah-XXX_21-1.mol ligand: XXX_21 Number of heavy atoms and or molecular weight did not match 
	Expected 226 for non hydrogen atomic weight, but got 215
	Expected atom map { atomic #: # atoms,...} {8: 1, 9: 2, 7: 3, 6: 27, 17: 1}, but got {8: 1, 9: 2, 6: 28, 7: 3}
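
The numbers in the XXX_21 report can be reproduced from the two atom maps: 226 and 215 are consistent with summing atomic number times atom count over the non-hydrogen atoms. The Python sketch below only illustrates that arithmetic and the per-element comparison; the function names are hypothetical, and the program may compute these values differently.

```python
def heavy_atom_sum(atom_map):
    """Sum atomic number * atom count over a {atomic number: count} map."""
    return sum(z * n for z, n in atom_map.items())

def atom_map_diff(expected, actual):
    """Return {atomic number: (expected count, actual count)} for mismatches."""
    return {z: (expected.get(z, 0), actual.get(z, 0))
            for z in sorted(set(expected) | set(actual))
            if expected.get(z, 0) != actual.get(z, 0)}

# Atom maps copied from the XXX_21 error in the example output.
expected = {8: 1, 9: 2, 7: 3, 6: 27, 17: 1}
actual = {8: 1, 9: 2, 6: 28, 7: 3}

print(heavy_atom_sum(expected), heavy_atom_sum(actual))
print(atom_map_diff(expected, actual))
```

Read this way, the submitted file has one more carbon (atomic number 6) than expected and is missing the chlorine (atomic number 17).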

molfilevalidator.py groups errors into two categories: Ligand Errors and Molecule Errors. Ligand Errors involve problems extracting the ligand name from the mol file name or major parsing problems such as a zero-size mol file. Molecule Errors pertain to problems with the mol file contents, such as a parsing error or a difference in the count or total atomic weight of non-hydrogen atoms.
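
The help text states that the ligand name is the value between the first and second "-" characters of a file name like XXX-####-XXX.mol. A minimal Python sketch of that rule follows; the function name and the choice to return None for non-matching names are assumptions made for illustration, not the program's actual behavior.

```python
import os

def ligand_from_filename(path):
    """Return the text between the first and second '-' of a .mol file name,
    or None when the name does not fit that pattern."""
    name = os.path.basename(path)
    if not name.endswith(".mol"):
        return None
    parts = name[:-len(".mol")].split("-")
    # At least two '-' characters are needed to delimit the ligand name.
    if len(parts) < 3 or not parts[1]:
        return None
    return parts[1]

print(ligand_from_filename("blah-XXX_9-1.mol"))
print(ligand_from_filename("foo/DVWB-FXR_10-1.mol"))
print(ligand_from_filename("noligand.mol"))
```

A name that yields None here corresponds to the kind of problem the program reports as a Ligand Error.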

Full command documentation

usage: molfilevalidator.py [-h] [--moldir MOLDIR] [--molcsv MOLCSV]
                           [--molcsvligandcol MOLCSVLIGANDCOL]
                           [--molcsvsmilecol MOLCSVSMILECOL]
                           [--skipligand SKIPLIGAND]
                           [--usersubmission USERSUBMISSION]
                           [--outputfile OUTPUTFILE] [--moleculedb MOLECULEDB]
                           [--log {DEBUG,INFO,ERROR,ERROR,CRITICAL}]
                           [--version]
                           {validate,genmoleculedb}

              Version 1.9.2

              Performs mol file validation on files with .mol
              in gzipped tar files.

              This script runs in two modes: genmoleculedb & validate

              These modes are set via the first argument passed into this script.

              In general 'genmoleculedb' mode is run first and 'validate'
              mode is run multiple times to perform the validation.

              'genmoleculedb' mode takes a directory of .mol files or a CSV
                            file with SMILES strings and generates a
                            molecule database. This database is a pickle file
                            and is used to validate the mol files. The output
                            database is specified by the --outputfile flag.
                            This database basically is a dictionary of
                            Ligand names as parsed from the mol file
                            name XXX-####-XXX.mol where the ligand name is expected
                            to be the value between first and second - character.

                            Any problems found are output to standard out/err and
                            a non 0 exit code is returned.

              'validate' mode takes the molecule database from genmoleculedb
                              (which is passed in via --moleculedb flag) and
                              validates all mol files found in the  tarfile
                              specified by --usersubmission flag. It is assumed
                              all mol files have a file name format like this:
                              XXX-####-XXX.mol where #### between 1st and second -
                              is considered to be the Ligand ID.

                              Validation is done by comparing number and atomic
                              weight of non hydrogen atoms against the database.

                              Any problems found are output to standard out/err and
                              a non 0 exit code is returned.

              For more information visit: http://www.drugdesigndata.org

              

positional arguments:
  {validate,genmoleculedb}
                        Sets what mode script will run in. validate mode
                        checks a usersubmission set by --usersubmission flag
                        and genmoleculedb mode generates the molecule database
                        writing it to --outputfile

optional arguments:
  -h, --help            show this help message and exit
  --moldir MOLDIR       Directory containing mol files used to generate
                        database from
  --molcsv MOLCSV       CSV file sent to participants containing ligand id and
                        Smile string for molecules. Used to generate molecule
                        database
  --molcsvligandcol MOLCSVLIGANDCOL
                        Column containing ligand id in csv file set via
                        --molcsv. 0 offset so 1st column is 0 (default 0)
  --molcsvsmilecol MOLCSVSMILECOL
                        Column containing SMILE string in csv file set via
                        --molcsv. 0 offset so 1st column is 0 (default 1)
  --skipligand SKIPLIGAND
                        comma delimited list of ligands to skip
  --usersubmission USERSUBMISSION
                        tar.gz file containing .mol files to validate
  --outputfile OUTPUTFILE
                        Destination file to write molecule database
  --moleculedb MOLECULEDB
                        Molecule database generated by --genmoleculedb mode
  --log {DEBUG,INFO,ERROR,ERROR,CRITICAL}
                        Set the logging level (default ERROR)
  --version             show program's version number and exit