Re-define compound table #35

Open
3 tasks
jorainer opened this issue Dec 3, 2018 · 5 comments

Comments

@jorainer
Member

jorainer commented Dec 3, 2018

The purpose of the compound table:

  1. contain a unique entry for each compound
  2. allow grouping of, e.g., multiple MS2 spectra into a single entity.

The question however is how to define a compound. What is a compound? An entity with its unique, own InChI? Structure == compound?

For the HMDB database this was pretty straightforward, as HMDB provides compound identifiers. MoNa (issue #23) and Massbank (issue #34), however, are more complicated, as they don't allow the data to be unified.

What we should do:

  • For HMDB: check if each compound ID has its own InChI.
  • Check PubChem (ftp://ftp.ncbi.nlm.nih.gov/pubchem/Compound/CURRENT-Full/)
  • Check ChEBI
@jorainer
Member Author

jorainer commented Dec 3, 2018

For ChEBI (2018-12-03):

  • No. of compounds: 46765
  • No. of compounds without InChI: 7075
  • No. of compounds with InChI: 39690
  • No. of unique InChIs: 38946
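The counts above can be reproduced with a few lines of code. The following is only a minimal sketch in Python, assuming the data is available as `(compound_id, inchi)` pairs; the `RECORDS` list and the `summarize` helper are hypothetical names, and the `TOY:*` entries are made-up placeholders rather than real ChEBI records.

```python
from collections import defaultdict

# InChI shared by several ChEBI entries (copied from the listing in this thread).
SHARED_INCHI = (
    "InChI=1S/C5H4N4O3/c10-3-1-2(7-4(11)6-1)8-5(12)9-3"
    "/h(H4,6,7,8,9,10,11,12)"
)

# Hypothetical input: (compound_id, inchi); None marks a missing InChI.
RECORDS = [
    ("CHEBI:17775", SHARED_INCHI),
    ("CHEBI:46814", SHARED_INCHI),
    ("CHEBI:46817", SHARED_INCHI),
    ("TOY:1", "InChI=1S/H2O/h1H2"),  # toy entry with its own InChI
    ("TOY:2", None),                 # toy entry without an InChI
    ("TOY:3", None),
]


def summarize(records):
    """Count entries with/without an InChI and collect shared InChIs."""
    ids_by_inchi = defaultdict(list)
    missing = 0
    for compound_id, inchi in records:
        if inchi is None:
            missing += 1
        else:
            ids_by_inchi[inchi].append(compound_id)
    # InChIs that are used by more than one compound ID.
    shared = {k: v for k, v in ids_by_inchi.items() if len(v) > 1}
    return {
        "compounds": len(records),
        "without_inchi": missing,
        "with_inchi": len(records) - missing,
        "unique_inchis": len(ids_by_inchi),
        "shared": shared,
    }


summary = summarize(RECORDS)
for key in ("compounds", "without_inchi", "with_inchi", "unique_inchis"):
    print(key, summary[key])
```

The `shared` entry maps each InChI that occurs more than once to the IDs that use it, which is the same grouping the HMDB check would need.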

So, we don't have an InChI for all of them, and we have compounds with the same InChI! Apart from the name and the ID, however, these compounds are identical:

      compound_id                            compound_name
8564  CHEBI:17775   7,9-dihydro-1H-purine-2,6,8(3H)-trione
18506 CHEBI:46811 2,6-dihydroxy-7,9-dihydro-8H-purin-8-one
18507 CHEBI:46814                    9H-purine-2,6,8-triol
18509 CHEBI:46817                    7H-purine-2,6,8-triol
18513 CHEBI:46823                    1H-purine-2,6,8-triol
27249 CHEBI:62589     6-hydroxy-1H-purine-2,8(7H,9H)-dione
                                                                         inchi
8564  InChI=1S/C5H4N4O3/c10-3-1-2(7-4(11)6-1)8-5(12)9-3/h(H4,6,7,8,9,10,11,12)
18506 InChI=1S/C5H4N4O3/c10-3-1-2(7-4(11)6-1)8-5(12)9-3/h(H4,6,7,8,9,10,11,12)
18507 InChI=1S/C5H4N4O3/c10-3-1-2(7-4(11)6-1)8-5(12)9-3/h(H4,6,7,8,9,10,11,12)
18509 InChI=1S/C5H4N4O3/c10-3-1-2(7-4(11)6-1)8-5(12)9-3/h(H4,6,7,8,9,10,11,12)
18513 InChI=1S/C5H4N4O3/c10-3-1-2(7-4(11)6-1)8-5(12)9-3/h(H4,6,7,8,9,10,11,12)
27249 InChI=1S/C5H4N4O3/c10-3-1-2(7-4(11)6-1)8-5(12)9-3/h(H4,6,7,8,9,10,11,12)
                        inchi_key  formula    mass
8564  LEHOTFFKMJEONL-UHFFFAOYSA-N C5H4N4O3 168.028
18506 LEHOTFFKMJEONL-UHFFFAOYSA-N C5H4N4O3 168.028
18507 LEHOTFFKMJEONL-UHFFFAOYSA-N C5H4N4O3 168.028
18509 LEHOTFFKMJEONL-UHFFFAOYSA-N C5H4N4O3 168.028
18513 LEHOTFFKMJEONL-UHFFFAOYSA-N C5H4N4O3 168.028
27249 LEHOTFFKMJEONL-UHFFFAOYSA-N C5H4N4O3 168.028

The question is whether these compounds would have different MS2 spectra. If so, it would not make sense to combine them!

Some of the compounds without an InChI are listed below:

     compound_id            compound_name inchi inchi_key
3    CHEBI:10003     ribostamycin sulfate  <NA>      <NA>
15   CHEBI:10036                wax ester  <NA>      <NA>
91   CHEBI:10283     2-hydroxy fatty acid  <NA>      <NA>
140  CHEBI:10545                 electron  <NA>      <NA>
148  CHEBI:10583        kappa-carrageenan  <NA>      <NA>
154 CHEBI:106304 sphingomyelin d18:1/16:0  <NA>      <NA>
                     formula    mass
3       C17H34N4O10.(H2O4S)n      NA
15                     CO2R2  43.990
91  C2H3O3R __ C2H3O3R(CH2)n  75.008
140                     <NA>   0.000
148            (C12H17O12S)n      NA
154              C39H79N2O6P 702.568

@SiggiSmara
Collaborator

SiggiSmara commented Dec 3, 2018

At first glance, CHEBI:46814 and CHEBI:46817, for instance (and I suspect the rest of them), are not the same chemical (see below: the location of a hydrogen differs), but they are in fact tautomers of each other. This is also indicated in the ChEBI entries of some of them if you look them up. That means they readily convert from one to the other without any external input (energy or otherwise) and thus should really be thought of as a mixture of all of them. The MS2 spectra "should" be similar if not identical, but the actual ionization conditions (pH, buffer ions, etc.) might also have a big effect, leading to different MS2 spectra.

Here I would suggest getting input from people who actually work with tautomers to hear what they have to say about it.

[Images: structures of CHEBI:46814 and CHEBI:46817]

@jorainer
Member Author

jorainer commented Dec 4, 2018

Thanks for your input, @SiggiSmara! I'll try to get some input from people actually working with MS2 spectra and identification.

@stanstrup
Collaborator

I have no experience with tautomers, but one option could be to use the SMILES, where the tautomeric form is explicit. You can also generate a non-standard InChI with the fixed-H layer from the SMILES.

@jorainer
Member Author

jorainer commented Dec 4, 2018

I also got feedback from Steffen. They use the same approach as PubChem: a compound table with unique InChIs and a substance table with additional annotations (possibly multiple entries per compound).
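To make the compound/substance split concrete, here is a minimal sketch using Python's built-in `sqlite3` module. The table names, column names and the `build_tables` helper are assumptions made for illustration only; they are not the schema of this package or of PubChem. The three rows are taken from the ChEBI listing earlier in this thread.

```python
import sqlite3

SHARED_INCHI = (
    "InChI=1S/C5H4N4O3/c10-3-1-2(7-4(11)6-1)8-5(12)9-3"
    "/h(H4,6,7,8,9,10,11,12)"
)

# (source_id, name, inchi) rows from the ChEBI listing in this thread.
SUBSTANCES = [
    ("CHEBI:17775", "7,9-dihydro-1H-purine-2,6,8(3H)-trione", SHARED_INCHI),
    ("CHEBI:46814", "9H-purine-2,6,8-triol", SHARED_INCHI),
    ("CHEBI:46817", "7H-purine-2,6,8-triol", SHARED_INCHI),
]


def build_tables(substances):
    """Create hypothetical compound/substance tables in an in-memory DB."""
    con = sqlite3.connect(":memory:")
    con.executescript(
        """
        CREATE TABLE compound (
            compound_id INTEGER PRIMARY KEY,
            inchi       TEXT NOT NULL UNIQUE   -- one row per unique InChI
        );
        CREATE TABLE substance (
            source_id   TEXT PRIMARY KEY,      -- ID in the source database
            name        TEXT,
            compound_id INTEGER NOT NULL REFERENCES compound (compound_id)
        );
        """
    )
    for source_id, name, inchi in substances:
        # Insert the InChI only if it is not already present.
        con.execute("INSERT OR IGNORE INTO compound (inchi) VALUES (?)", (inchi,))
        (compound_id,) = con.execute(
            "SELECT compound_id FROM compound WHERE inchi = ?", (inchi,)
        ).fetchone()
        con.execute(
            "INSERT INTO substance VALUES (?, ?, ?)", (source_id, name, compound_id)
        )
    return con


con = build_tables(SUBSTANCES)
n_compounds = con.execute("SELECT COUNT(*) FROM compound").fetchone()[0]
n_substances = con.execute("SELECT COUNT(*) FROM substance").fetchone()[0]
print(n_compounds, "compound(s),", n_substances, "substance(s)")
```

In this sketch the three ChEBI entries collapse into a single compound row while keeping their own names and IDs as substances. Entries without an InChI would violate the `NOT NULL` constraint, so how to handle them remains an open decision.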
