Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

issue with aromatic rings in write.molecules() #108

Open
zachcp opened this issue Jun 29, 2020 · 2 comments
Open

issue with aromatic rings in write.molecules() #108

zachcp opened this issue Jun 29, 2020 · 2 comments

Comments

@zachcp
Copy link
Contributor

zachcp commented Jun 29, 2020

From my email:

Hi, 

I'm running into grief with write.molecules() in rcdk.  Any molecules 
that I am inputting with an aromatic ring containing double bonds, have 
those double bonds delocalised in the output .sdf file.  

See example of the pubchem sdf for salicylic acid (Structure2D_CID_338.sdf),
 which I'm using as input.  Using load.molecules() to read this in, then write.molecules() 
to output it as Structure2D_CID_338_cdk.sdf changes the bond coding from fixed to
 delocalised in the ring.  Taking this .sdf as input in openbabel turns it back to  Structure2D_CID_338_cdk_openbabel.sdf, which has fixed position aromatic bonds again.

Is there a way to control this behavior in the write to sdf file? I have some downstream programs that
can't deal with delocalised bonds in input sdf files (Compound DIscoverer)


thanks
Tony

> sessionInfo()
R version 3.6.2 (2019-12-12)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: Ubuntu 18.04.4 LTS

Matrix products: default
BLAS:   /shared/storage/biology/rsrch/tf-PAB/lab/trl1/bin_research0/lib/R/lib/libRblas.so
LAPACK: /shared/storage/biology/rsrch/tf-PAB/lab/trl1/bin_research0/lib/R/lib/libRlapack.so

locale:
 [1] LC_CTYPE=en_GB          LC_NUMERIC=C            LC_TIME=en_GB
 [4] LC_COLLATE=en_GB        LC_MONETARY=en_GB       LC_MESSAGES=en_GB
 [7] LC_PAPER=en_GB          LC_NAME=en_GB           LC_ADDRESS=en_GB
[10] LC_TELEPHONE=en_GB      LC_MEASUREMENT=en_GB    LC_IDENTIFICATION=en_GB

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base

other attached packages:
[1] rcdk_3.5.0   rcdklibs_2.3 rJava_0.9-11

loaded via a namespace (and not attached):
[1] BiocManager_1.30.10 compiler_3.6.2      parallel_3.6.2
[4] fingerprint_3.5.7   tools_3.6.2         iterators_1.0.12
[7] itertools_0.1-3     png_0.1-7

Structure2D_CID_338.sdf

338
  CDK     06292018032D

 16 16  0  0  0  0  0  0  0  0999 V2000
    2.5369   -0.0600    0.0000 O   0  0  0  0  0  0  0  0  0  0  0  0
    5.1350    1.4400    0.0000 O   0  0  0  0  0  0  0  0  0  0  0  0
    3.4030    1.4400    0.0000 O   0  0  0  0  0  0  0  0  0  0  0  0
    4.2690   -0.0600    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
    3.4030   -0.5600    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
    5.1350   -0.5600    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
    3.4030   -1.5600    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
    5.1350   -1.5600    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
    4.2690   -2.0600    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
    4.2690    0.9400    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
    5.6720   -0.2500    0.0000 H   0  0  0  0  0  0  0  0  0  0  0  0
    2.8660   -1.8700    0.0000 H   0  0  0  0  0  0  0  0  0  0  0  0
    5.6720   -1.8700    0.0000 H   0  0  0  0  0  0  0  0  0  0  0  0
    4.2690   -2.6800    0.0000 H   0  0  0  0  0  0  0  0  0  0  0  0
    2.0000   -0.3700    0.0000 H   0  0  0  0  0  0  0  0  0  0  0  0
    5.1350    2.0600    0.0000 H   0  0  0  0  0  0  0  0  0  0  0  0
  1  5  1  0  0  0  0 
  1 15  1  0  0  0  0 
  2 10  1  0  0  0  0 
  2 16  1  0  0  0  0 
  3 10  2  0  0  0  0 
  4  5  4  0  0  0  0 
  4  6  4  0  0  0  0 
  4 10  1  0  0  0  0 
  5  7  4  0  0  0  0 
  6  8  4  0  0  0  0 
  6 11  1  0  0  0  0 
  7  9  4  0  0  0  0 
  7 12  1  0  0  0  0 
  8  9  4  0  0  0  0 
  8 13  1  0  0  0  0 
  9 14  1  0  0  0  0 
M  END
> <PUBCHEM_COMPOUND_CID>
338

> <PUBCHEM_COMPOUND_CANONICALIZED>
1

> <PUBCHEM_CACTVS_COMPLEXITY>
133

> <PUBCHEM_CACTVS_HBOND_ACCEPTOR>
3

> <PUBCHEM_CACTVS_HBOND_DONOR>
2

> <PUBCHEM_CACTVS_ROTATABLE_BOND>
1

> <PUBCHEM_CACTVS_SUBSKEYS>
AAADcYBgMAAAAAAAAAAAAAAAAAAAAAAAAAAwAAAAAAAAAAABAAAAGgAACAAADASAmAAwDoAAAgCIAiDSCAACAAAkIAAIiAEGCMgIJzaCFRKAcUAl4BEImYeIyCCOAAAAAAAIAAAAAAAAABAAAAAAAAAAAA==

> <PUBCHEM_IUPAC_OPENEYE_NAME>
2-hydroxybenzoic acid

> <PUBCHEM_IUPAC_CAS_NAME>
2-hydroxybenzoic acid

> <PUBCHEM_IUPAC_NAME_MARKUP>
2-hydroxybenzoic acid

> <PUBCHEM_IUPAC_NAME>
2-hydroxybenzoic acid

> <PUBCHEM_IUPAC_SYSTEMATIC_NAME>
2-oxidanylbenzoic acid

> <PUBCHEM_IUPAC_TRADITIONAL_NAME>
salicylic acid

> <PUBCHEM_IUPAC_INCHI>
InChI=1S/C7H6O3/c8-6-4-2-1-3-5(6)7(9)10/h1-4,8H,(H,9,10)

> <PUBCHEM_IUPAC_INCHIKEY>
YGSDEFSMJLZEOE-UHFFFAOYSA-N

> <PUBCHEM_XLOGP3>
2.3

> <PUBCHEM_EXACT_MASS>
138.031694

> <PUBCHEM_MOLECULAR_FORMULA>
C7H6O3

> <PUBCHEM_MOLECULAR_WEIGHT>
138.12

> <PUBCHEM_OPENEYE_CAN_SMILES>
C1=CC=C(C(=C1)C(=O)O)O

> <PUBCHEM_OPENEYE_ISO_SMILES>
C1=CC=C(C(=C1)C(=O)O)O

> <PUBCHEM_CACTVS_TPSA>
57.5

> <PUBCHEM_MONOISOTOPIC_WEIGHT>
138.031694

> <PUBCHEM_TOTAL_CHARGE>
0

> <PUBCHEM_HEAVY_ATOM_COUNT>
10

> <PUBCHEM_ATOM_DEF_STEREO_COUNT>
0

> <PUBCHEM_ATOM_UDEF_STEREO_COUNT>
0

> <PUBCHEM_BOND_DEF_STEREO_COUNT>
0

> <PUBCHEM_BOND_UDEF_STEREO_COUNT>
0

> <PUBCHEM_ISOTOPIC_ATOM_COUNT>
0

> <PUBCHEM_COMPONENT_COUNT>
1

> <PUBCHEM_CACTVS_TAUTO_COUNT>
-1

> <PUBCHEM_COORDINATE_TYPE>
1
5
255

> <PUBCHEM_BONDANNOTATIONS>
4  5  8
4  6  8
5  7  8
6  8  8
7  9  8
8  9  8

$$$$

Structure2D_CID_338_cdk_openbabel.sdf

338
 OpenBabel06292018072D

 16 16  0  0  0  0  0  0  0  0999 V2000
    2.5369   -0.0600    0.0000 O   0  0  0  0  0  0  0  0  0  0  0  0
    5.1350    1.4400    0.0000 O   0  0  0  0  0  0  0  0  0  0  0  0
    3.4030    1.4400    0.0000 O   0  0  0  0  0  0  0  0  0  0  0  0
    4.2690   -0.0600    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
    3.4030   -0.5600    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
    5.1350   -0.5600    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
    3.4030   -1.5600    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
    5.1350   -1.5600    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
    4.2690   -2.0600    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
    4.2690    0.9400    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
    5.6720   -0.2500    0.0000 H   0  0  0  0  0  0  0  0  0  0  0  0
    2.8660   -1.8700    0.0000 H   0  0  0  0  0  0  0  0  0  0  0  0
    5.6720   -1.8700    0.0000 H   0  0  0  0  0  0  0  0  0  0  0  0
    4.2690   -2.6800    0.0000 H   0  0  0  0  0  0  0  0  0  0  0  0
    2.0000   -0.3700    0.0000 H   0  0  0  0  0  0  0  0  0  0  0  0
    5.1350    2.0600    0.0000 H   0  0  0  0  0  0  0  0  0  0  0  0
  1  5  1  0  0  0  0
  1 15  1  0  0  0  0
  2 10  1  0  0  0  0
  2 16  1  0  0  0  0
  3 10  2  0  0  0  0
  4  5  2  0  0  0  0
  4  6  1  0  0  0  0
  4 10  1  0  0  0  0
  5  7  1  0  0  0  0
  6  8  2  0  0  0  0
  6 11  1  0  0  0  0
  7  9  2  0  0  0  0
  7 12  1  0  0  0  0
  8  9  1  0  0  0  0
  8 13  1  0  0  0  0
  9 14  1  0  0  0  0
M  END
>  <PUBCHEM_COMPOUND_CID>
338

>  <PUBCHEM_COMPOUND_CANONICALIZED>
1

>  <PUBCHEM_CACTVS_COMPLEXITY>
133

>  <PUBCHEM_CACTVS_HBOND_ACCEPTOR>
3

>  <PUBCHEM_CACTVS_HBOND_DONOR>
2

>  <PUBCHEM_CACTVS_ROTATABLE_BOND>
1

>  <PUBCHEM_CACTVS_SUBSKEYS>
AAADcYBgMAAAAAAAAAAAAAAAAAAAAAAAAAAwAAAAAAAAAAABAAAAGgAACAAADASAmAAwDoAAAgCIAiDSCAACAAAkIAAIiAEGCMgIJzaCFRKAcUAl4BEImYeIyCCOAAAAAAAIAAAAAAAAABAAAAAAAAAAAA==

>  <PUBCHEM_IUPAC_OPENEYE_NAME>
2-hydroxybenzoic acid

>  <PUBCHEM_IUPAC_CAS_NAME>
2-hydroxybenzoic acid

>  <PUBCHEM_IUPAC_NAME_MARKUP>
2-hydroxybenzoic acid

>  <PUBCHEM_IUPAC_NAME>
2-hydroxybenzoic acid

>  <PUBCHEM_IUPAC_SYSTEMATIC_NAME>
2-oxidanylbenzoic acid

>  <PUBCHEM_IUPAC_TRADITIONAL_NAME>
salicylic acid

>  <PUBCHEM_IUPAC_INCHI>
InChI=1S/C7H6O3/c8-6-4-2-1-3-5(6)7(9)10/h1-4,8H,(H,9,10)

>  <PUBCHEM_IUPAC_INCHIKEY>
YGSDEFSMJLZEOE-UHFFFAOYSA-N

>  <PUBCHEM_XLOGP3>
2.3

>  <PUBCHEM_EXACT_MASS>
138.031694

>  <PUBCHEM_MOLECULAR_FORMULA>
C7H6O3

>  <PUBCHEM_MOLECULAR_WEIGHT>
138.12

>  <PUBCHEM_OPENEYE_CAN_SMILES>
C1=CC=C(C(=C1)C(=O)O)O

>  <PUBCHEM_OPENEYE_ISO_SMILES>
C1=CC=C(C(=C1)C(=O)O)O

>  <PUBCHEM_CACTVS_TPSA>
57.5

>  <PUBCHEM_MONOISOTOPIC_WEIGHT>
138.031694

>  <PUBCHEM_TOTAL_CHARGE>
0

>  <PUBCHEM_HEAVY_ATOM_COUNT>
10

>  <PUBCHEM_ATOM_DEF_STEREO_COUNT>
0

>  <PUBCHEM_ATOM_UDEF_STEREO_COUNT>
0

>  <PUBCHEM_BOND_DEF_STEREO_COUNT>
0

>  <PUBCHEM_BOND_UDEF_STEREO_COUNT>
0

>  <PUBCHEM_ISOTOPIC_ATOM_COUNT>
0

>  <PUBCHEM_COMPONENT_COUNT>
1

>  <PUBCHEM_CACTVS_TAUTO_COUNT>
-1

>  <PUBCHEM_COORDINATE_TYPE>
1
5
255

>  <PUBCHEM_BONDANNOTATIONS>
4  5  8
4  6  8
5  7  8
6  8  8
7  9  8
8  9  8

$$$$

@trljcl
Copy link

trljcl commented Jun 30, 2020

further tests and input/output files attached. The problem seems to be that implicit H typing is lost in the write.molecules() process.

files.zip

library("rcdk")
Loading required package: rcdklibs
Loading required package: rJava

load structure from downloaded pubchem .sdf

mols <- load.molecules("Structure2D_CID_338.sdf")

plot structure

img <- view.image.2d(mols[[1]])
plot.new()
rasterImage(img, 0, 0, 1, 1)
savePlot("Structure2D_CID_338.png")
dev.off()
null device
1

write to sdf

write.molecules(mols, "Structure2D_CID_338_cdk.sdf", write.props = TRUE)

load and re plot

mols2 <- load.molecules("Structure2D_CID_338_cdk.sdf")

img2 <- view.image.2d(mols2[[1]])
Error in .jcall(mi, "[B", "getBytes", as.integer(depictor$getWidth()), :
java.lang.NullPointerException: One or more atoms had an undefined number of implicit hydrogens
plot.new()
rasterImage(img2, 0, 0, 1, 1)
Error in rasterImage(img2, 0, 0, 1, 1) : object 'img2' not found
savePlot("Structure2D_CID_338_cdk.png")
dev.off()
null device
1

sessionInfo()
R version 3.6.2 (2019-12-12)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: Ubuntu 18.04.4 LTS

Matrix products: default
BLAS: /shared/storage/biology/rsrch/tf-PAB/lab/trl1/bin_research0/lib/R/lib/libRblas.so
LAPACK: /shared/storage/biology/rsrch/tf-PAB/lab/trl1/bin_research0/lib/R/lib/libRlapack.so

locale:
[1] LC_CTYPE=en_GB LC_NUMERIC=C LC_TIME=en_GB
[4] LC_COLLATE=en_GB LC_MONETARY=en_GB LC_MESSAGES=en_GB
[7] LC_PAPER=en_GB LC_NAME=en_GB LC_ADDRESS=en_GB
[10] LC_TELEPHONE=en_GB LC_MEASUREMENT=en_GB LC_IDENTIFICATION=en_GB

attached base packages:
[1] stats graphics grDevices utils datasets methods base

other attached packages:
[1] rcdk_3.5.0 rcdklibs_2.3 rJava_0.9-11

loaded via a namespace (and not attached):
[1] compiler_3.6.2 parallel_3.6.2 fingerprint_3.5.7 iterators_1.0.12
[5] itertools_0.1-3 png_0.1-7

@zachcp
Copy link
Contributor Author

zachcp commented Jul 20, 2020

@trljcl apologies for the delay

I am looking at this again and agree that the hydrogens are lost. The write command is a wrapper of CDK's SDFWriter. Can you post this issue on the CDK user list? If it is explained/fixed in CDK, I can help get equivalent functionality in rCDK.

# suport for your theory
mols[[1]]$getAtomCount()
# 16
mols2[[1]]$getAtomCount()
# 10

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants