Import data from Massbank #34

Open
jorainer opened this issue Nov 19, 2018 · 5 comments

Comments

@jorainer
Member

Import open data from Massbank (https://github.com/MassBank/MassBank-data).

The data seems to be in nicely structured txt files, so the import should be straightforward.

@jorainer
Member Author

As with MoNa, we will have to reduce redundancies in the compound table here too. At first glance, however, it seems we can use the "CH$LINK" entries (e.g. those providing PubChem identifiers) to define unique compounds and link the MS2 spectra to those.

@jorainer
Member Author

Compound information we can extract with the corresponding field name:

  • compound_id: no explicit compound identifier here, but we could use one of
    the external database links.
  • compound_name: there can be multiple "CH$NAME: " entries; use one here and
    the others as synonyms (see below).
  • inchi: "CH$IUPAC: ".
  • formula: "CH$FORMULA: ".
  • mass: "CH$EXACT_MASS: ".
  • synonyms: the remaining "CH$NAME: " entries.

Additional fields we might want to get:

  • inchi_key: "CH$LINK: INCHIKEY ".
  • additional identifiers: "CH$LINK: CHEBI ", "CH$LINK: KEGG ", "CH$LINK: PUBCHEM ", "CH$LINK: CHEMSPIDER ".

We could use a self-generated identifier and collapse entries that share the
same value for any one of the identifiers above.
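
A minimal sketch of the extraction and collapsing described above, written in Python purely for illustration (the actual import would presumably be done in R). The function names `parse_compound` and `collapse_by_link`, the dictionary layout, and the sample record are assumptions; the record is a made-up fragment with placeholder values, not a real MassBank entry.

```python
from collections import defaultdict

# Made-up record fragment with placeholder values (not a real MassBank entry).
EXAMPLE_RECORD = """\
CH$NAME: Example compound
CH$NAME: Example synonym
CH$FORMULA: C6H12O6
CH$EXACT_MASS: 180.06339
CH$IUPAC: InChI=1S/placeholder
CH$LINK: INCHIKEY PLACEHOLDERKEY-XXXXXXXXXX-N
CH$LINK: PUBCHEM CID:0000
"""

# Tags that map one-to-one onto a compound field (names as listed above).
SINGLE_FIELDS = {"CH$IUPAC": "inchi", "CH$FORMULA": "formula"}


def parse_compound(record_text):
    """Extract the compound fields from the text of one record."""
    compound = {"compound_name": None, "synonyms": [], "links": {}}
    for line in record_text.splitlines():
        tag, sep, value = line.partition(": ")
        if not sep:
            continue  # not a "TAG: value" line (e.g. peak rows)
        value = value.strip()
        if tag == "CH$NAME":
            # First name becomes compound_name, the rest are synonyms.
            if compound["compound_name"] is None:
                compound["compound_name"] = value
            else:
                compound["synonyms"].append(value)
        elif tag == "CH$LINK":
            # "CH$LINK: <DATABASE> <identifier>"
            database, _, identifier = value.partition(" ")
            compound["links"][database] = identifier.strip()
        elif tag == "CH$EXACT_MASS":
            compound["mass"] = float(value)
        elif tag in SINGLE_FIELDS:
            compound[SINGLE_FIELDS[tag]] = value
    return compound


def collapse_by_link(compounds, database="INCHIKEY"):
    """Group parsed compounds by one external identifier.

    Entries lacking that identifier are collected under the key None.
    """
    groups = defaultdict(list)
    for compound in compounds:
        groups[compound["links"].get(database)].append(compound)
    return groups
```

Each group returned by `collapse_by_link` could then be assigned one self-generated `compound_id`, with the spectra of all its members linked to that identifier.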

Next we need to read the full data to check how to best reduce the information:

  • does every entry have an inchi key?
  • is there an external identifier present in all entries, e.g. PubChem?
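
The two questions above could be answered with a small coverage count. This is a sketch in Python under the assumption that each record is available as a text string; `link_databases` and `link_coverage` are hypothetical helper names.

```python
from collections import Counter

LINK_PREFIX = "CH$LINK: "


def link_databases(record_text):
    """Return the set of database names found in the CH$LINK lines."""
    databases = set()
    for line in record_text.splitlines():
        if line.startswith(LINK_PREFIX):
            parts = line[len(LINK_PREFIX):].split()
            if parts:
                databases.add(parts[0])
    return databases


def link_coverage(records):
    """Fraction of records that carry each kind of external link."""
    counts = Counter()
    total = 0
    for record_text in records:
        total += 1
        counts.update(link_databases(record_text))
    return {database: n / total for database, n in counts.items()}
```

A database with a coverage of 1.0 is present in every record and would be a candidate for defining unique compounds.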

@jorainer jorainer mentioned this issue Dec 3, 2018
3 tasks
@michaelwitting
Collaborator

Import should now be possible with https://github.com/michaelwitting/MsBackendMassbank.
It's working fine so far.

@michaelwitting
Collaborator

Do you need the field names exactly as you named them? I could adjust them in MsBackendMassBank. Should the mass field contain the adduct mass or the neutral mass?

@jorainer
Member Author

jorainer commented May 4, 2020

Regarding the field names: for the compounds table, if your names differ we can try to agree on common ones. For the msms_spectrum table, ideally use the names I chose: I took the name of the corresponding attribute in Spectra (such as precursorMz) and replaced each capital letter with _<lower case>, i.e. precursorMz -> precursor_mz.
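
The naming rule can be sketched as a one-line substitution. This is a Python illustration only; `to_snake_case` is a hypothetical helper name and not part of Spectra or any of the packages mentioned here.

```python
import re


def to_snake_case(name):
    """Replace every capital letter with an underscore plus its lower case."""
    return re.sub(r"[A-Z]", lambda match: "_" + match.group(0).lower(), name)
```

Note that this literal rule would also prefix a leading capital with an underscore, so it assumes attribute names that start in lower case, as in precursorMz.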

The mass should ideally contain the neutral monoisotopic mass. The adduct mass (m/z) should then be calculated with mass2mz, or vice versa.
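
To illustrate the relation between the two values, here is a rough Python sketch for protonated adducts only ([M+nH]n+). It is not the mass2mz implementation; the function names and the restriction to proton adducts are assumptions made for this example.

```python
# Approximate mass of a proton in Da.
PROTON_MASS = 1.00727646


def mass_to_mz(mass, n_protons=1):
    """m/z of the [M+nH]n+ ion for a neutral monoisotopic mass."""
    return (mass + n_protons * PROTON_MASS) / n_protons


def mz_to_mass(mz, n_protons=1):
    """Neutral monoisotopic mass from the m/z of an [M+nH]n+ ion."""
    return mz * n_protons - n_protons * PROTON_MASS
```

Storing the neutral mass keeps the compound table independent of the adduct, since the m/z for any adduct can be derived from it when needed.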

Let me know if something is unclear or we need to adapt.
