Add command line interface #29

Open
wants to merge 2 commits into
base: master

marinegor

Hi everyone,

As mentioned in #14, I've added a command-line interface for standardizing SMILES strings (namely, from input files containing SMILES in their first column). I also added an option to filter compounds using the PAINS filters in RDKit, as here. It might be better to switch it off by default, if you think that's more appropriate for this package.

The interface is as follows:

usage: chembl_std [-h] [-s] [-p] [-A] [-B] [-C] [--strict] [--header] [--verbose] [--stderr] INPUT

Sanitize smiles using chembl_structure_pipeline and RDKit PAINS filters

positional arguments:
  INPUT              Input file (with SMILES as first column)

optional arguments:
  -h, --help         show this help message and exit
  -s, --standartize  Whether to perform standardization of input SMILES (default: True)
  -p                 Filter molecules using all PAINS filters together (default: True)
  -A                 Filter molecules using all PAINS_A filter separately (default: False)
  -B                 Filter molecules using all PAINS_B filter separately (default: False)
  -C                 Filter molecules using all PAINS_C filter separately (default: False)
  --strict           Whether to raise an exception on first error (default: False)
  --header           Indicate that the input file contains header (default: False)
  --verbose          Whether to print all RDKit warnings to stdout (default: False)
  --stderr           Whether to print filtered molecules to stderr (default: False)
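
For readers who want to see how a help screen like this could be produced, here is a minimal `argparse` sketch. It is a reconstruction from the help text only, not the code in this PR: the `dest` names (`standardize`, `pains`, `pains_a`, ...) are hypothetical, and it assumes that flags documented with `default: True` switch the behaviour off when passed, which the help text does not confirm.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Build a parser that mirrors the help text above (hypothetical reconstruction)."""
    parser = argparse.ArgumentParser(
        prog="chembl_std",
        description="Sanitize smiles using chembl_structure_pipeline and RDKit PAINS filters",
        # Appends "(default: ...)" to the help string of every option.
        formatter_class=argparse.ArgumentDefaultsHelpFormatter,
    )
    parser.add_argument("input", metavar="INPUT",
                        help="Input file (with SMILES as first column)")
    # Assumption: options that default to True are switched OFF by passing the flag.
    parser.add_argument("-s", "--standartize", dest="standardize", action="store_false",
                        help="Whether to perform standardization of input SMILES")
    parser.add_argument("-p", dest="pains", action="store_false",
                        help="Filter molecules using all PAINS filters together")
    # Options that default to False are switched ON by passing the flag.
    for letter in "ABC":
        parser.add_argument(f"-{letter}", dest=f"pains_{letter.lower()}", action="store_true",
                            help=f"Filter molecules using the PAINS_{letter} filter separately")
    parser.add_argument("--strict", action="store_true",
                        help="Whether to raise an exception on first error")
    parser.add_argument("--header", action="store_true",
                        help="Indicate that the input file contains header")
    parser.add_argument("--verbose", action="store_true",
                        help="Whether to print all RDKit warnings to stdout")
    parser.add_argument("--stderr", action="store_true",
                        help="Whether to print filtered molecules to stderr")
    return parser


# Example: parse the command line used in the test.smi example.
args = build_parser().parse_args(["--header", "test.smi"])
```

With this sketch, `args.header` is `True` and `args.input` is `"test.smi"`, while the standardization and combined PAINS filtering stay enabled unless `-s` or `-p` is passed.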

So in order to filter test.smi, one should do the following:

$ cat test.smi
smiles
c1ccccc1N=Nc1ccccc1
c1ccccc1N
CCO
$ chembl_std --header test.smi
smiles
c1ccccc1N
CCO

The downside is that it prints a lot of logging messages to stdout, and I could not completely disable them. For example, if I run chembl_std --header test.smi > out.smi, I get:

$ cat out.smi
smiles
c1ccccc1N
CCO
[01:33:17] Initializing Normalizer

The current workaround is to run chembl_std --header test.smi | grep -v Normalizer > out.smi. If someone knows a better way to handle this, I'd appreciate it.
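
A slightly more general version of the grep workaround, sketched in Python: drop any line that starts with a `[HH:MM:SS] ` timestamp, instead of matching the single word "Normalizer". The line shape is an assumption taken from the one example above, and the function name `drop_log_lines` is hypothetical. This only filters the text after the fact; it does not stop the messages from being emitted.

```python
import re
from typing import Iterable, Iterator

# Assumed log-line shape, based on the example above: "[01:33:17] Initializing Normalizer".
LOG_LINE = re.compile(r"^\[\d{2}:\d{2}:\d{2}\] ")


def drop_log_lines(lines: Iterable[str]) -> Iterator[str]:
    """Yield only the lines that do not look like timestamped log messages."""
    for line in lines:
        if not LOG_LINE.match(line):
            yield line


# Example with the contents of out.smi shown above.
raw = ["smiles\n", "c1ccccc1N\n", "CCO\n", "[01:33:17] Initializing Normalizer\n"]
cleaned = list(drop_log_lines(raw))
```

The same generator could be applied to `sys.stdin` in a small filter script. Bracket atoms such as `[Na+]` are not affected, because the pattern requires the full timestamp followed by a space.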

@UnixJunkie

I think this could be merged.

@UnixJunkie

An -o option specifying where the molecules that pass standardization should be written would be nice.

@UnixJunkie

-o FILENAME

@UnixJunkie

mol_std should be printed out (as SMILES), rather than the SMILES line from the input file that passed standardization.
I guess people are more interested in the molecules after standardization than in which molecules from the input file passed it.
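
To make the two suggestions concrete (an `-o FILENAME` option, and writing the standardized SMILES rather than the input line), here is a rough sketch. It is not the code in this PR: `write_standardized` and `main` are hypothetical names, the `standardize` callable is a stand-in for whatever produces the SMILES of mol_std (returning `None` for rejected molecules), and whitespace-separated columns are assumed.

```python
import argparse
import sys
from typing import Callable, Iterable, Optional, Sequence, TextIO


def write_standardized(lines: Iterable[str],
                       standardize: Callable[[str], Optional[str]],
                       out: TextIO,
                       header: bool = False) -> int:
    """Write standardized SMILES (not the input SMILES) to `out`; return the count written."""
    pending_header = header
    written = 0
    for line in lines:
        line = line.rstrip("\n")
        if not line.strip():
            continue  # skip blank lines
        if pending_header:
            out.write(line + "\n")  # pass the header line through unchanged
            pending_header = False
            continue
        fields = line.split()  # assumption: whitespace-separated columns
        std_smiles = standardize(fields[0])
        if std_smiles is None:
            continue  # molecule rejected by the (stand-in) standardizer
        # Replace the first column with the standardized SMILES, keep the rest.
        out.write(" ".join([std_smiles] + fields[1:]) + "\n")
        written += 1
    return written


def main(argv: Optional[Sequence[str]] = None,
         standardize: Callable[[str], Optional[str]] = lambda smi: smi) -> int:
    parser = argparse.ArgumentParser(prog="chembl_std")
    parser.add_argument("input", metavar="INPUT")
    parser.add_argument("-o", dest="output", metavar="FILENAME", default=None,
                        help="Where to write molecules passing standardization")
    parser.add_argument("--header", action="store_true")
    args = parser.parse_args(argv)
    with open(args.input) as src:
        if args.output is None:
            return write_standardized(src, standardize, sys.stdout, args.header)
        with open(args.output, "w") as dst:
            return write_standardized(src, standardize, dst, args.header)
```

Writing results to a file given by `-o` would also keep them separate from anything the libraries print to stdout, assuming those messages are not routed through the same file handle.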
