DreaMS (Deep Representations Empowering the Annotation of Mass Spectra) is a transformer-based neural network designed to interpret tandem mass spectrometry (MS/MS) data. Pre-trained in a self-supervised way on millions of unannotated spectra from our new GeMS (GNPS Experimental Mass Spectra) dataset, DreaMS acquires rich molecular representations by predicting masked spectral peaks and chromatographic retention orders. When fine-tuned for tasks such as spectral similarity, chemical property prediction, and fluorine detection, DreaMS achieves state-of-the-art performance. The DreaMS Atlas, a comprehensive molecular network comprising 201 million MS/MS spectra annotated with DreaMS representations, is publicly accessible along with the pre-trained models and training datasets for further research and development. 🚀
This repository provides the code and tutorials to:
- 🔥 Generate DreaMS representations of MS/MS spectra and utilize them for downstream tasks such as spectral similarity prediction or molecular networking.
- 🤖 Fine-tune DreaMS for your specific tasks of interest.
- 💎 Access and utilize the extensive GeMS dataset of unannotated MS/MS spectra.
- 🌐 Explore the DreaMS Atlas, a molecular network of 201 million MS/MS spectra from diverse MS experiments annotated with DreaMS representations and metadata, such as studied species, experiment descriptions, etc.
- ⭐ Efficiently cluster MS/MS spectra in linear time using locality-sensitive hashing (LSH).
Additionally, for further research and development:
- 🔄 Convert conventional MS/MS data formats into our new, ML-friendly HDF5-based format.
- 📊 Split MS/MS datasets into training and validation folds using Murcko histograms of molecular structures.
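To give an intuition for why LSH-based clustering runs in linear time, here is a minimal sketch using random-hyperplane hashing on generic vectors. It is an illustration under stated assumptions, not the repository's implementation: the function name `lsh_clusters`, its parameters, and the choice of random hyperplanes are all hypothetical.

```python
from collections import defaultdict

import numpy as np


def lsh_clusters(vectors, n_hyperplanes=16, seed=0):
    """Group vectors whose random-hyperplane hashes coincide (hypothetical helper)."""
    rng = np.random.default_rng(seed)
    vectors = np.asarray(vectors, dtype=float)
    # One random hyperplane per hash bit.
    hyperplanes = rng.standard_normal((vectors.shape[1], n_hyperplanes))
    # Each bit records on which side of a hyperplane the vector lies.
    bits = (vectors @ hyperplanes) >= 0
    clusters = defaultdict(list)
    # A single pass over the data: each vector is hashed once and bucketed.
    for idx, row in enumerate(bits):
        clusters[row.tobytes()].append(idx)
    return list(clusters.values())


# Toy input: the second vector is a scaled copy of the first,
# the third points in the opposite direction.
toy_vectors = [[1.0, 2.0, 3.0], [2.0, 4.0, 6.0], [-1.0, -2.0, -3.0]]
toy_clusters = lsh_clusters(toy_vectors)
print(toy_clusters)
```

Each vector is hashed exactly once, so the cost grows linearly with the number of vectors, at the price of occasionally separating similar vectors that fall on opposite sides of a hyperplane.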
📚 Please refer to our documentation and paper "Emergence of molecular structures from repository-scale self-supervised learning on tandem mass spectra" for more details.
Run the following code from the command line.
# Download this repository
git clone https://github.com/pluskal-lab/DreaMS.git
cd DreaMS
# Create conda environment
conda create -n dreams python==3.11.0 --yes
conda activate dreams
# Install DreaMS
pip install -e .
If you are not familiar with conda or do not have it installed, please refer to the official documentation.
To compute DreaMS representations for MS/MS spectra from an .mgf file, run the following Python code.
from dreams.api import dreams_embeddings
embs = dreams_embeddings('data/examples/example_5_spectra.mgf')
The resulting embs object is a matrix with 5 rows and 1024 columns: one 1024-dimensional DreaMS representation for each of the 5 input spectra stored in the .mgf file.
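As a hedged illustration of how such a matrix could be used for spectral similarity, the sketch below computes pairwise cosine similarities with NumPy. It assumes the embeddings behave like a NumPy array. The helper `cosine_similarity_matrix` is hypothetical and not part of the DreaMS API, and a random matrix stands in for the real embeddings so that the snippet runs without the package installed.

```python
import numpy as np


def cosine_similarity_matrix(embs):
    """Return pairwise cosine similarities between rows (hypothetical helper)."""
    embs = np.asarray(embs, dtype=float)
    norms = np.linalg.norm(embs, axis=1, keepdims=True)
    # Guard against division by zero for all-zero rows.
    unit = embs / np.clip(norms, 1e-12, None)
    return unit @ unit.T


# Stand-in for the output of dreams_embeddings: 5 spectra, 1024 dimensions.
rng = np.random.default_rng(0)
embs = rng.standard_normal((5, 1024))

sims = cosine_similarity_matrix(embs)
print(sims.shape)  # (5, 5)
```

Entry (i, j) of the result is the similarity between spectra i and j, so the diagonal is 1 and the matrix is symmetric.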
- Paper: https://chemrxiv.org/engage/chemrxiv/article-details/6626775021291e5d1d61967f.
- Documentation and tutorials: https://dreams-docs.readthedocs.io/.
- Weights of pre-trained models: https://zenodo.org/records/10997887.
- Datasets:
  - GeMS dataset: https://huggingface.co/datasets/roman-bushuiev/GeMS/tree/main/data.
  - DreaMS Atlas: https://huggingface.co/datasets/roman-bushuiev/GeMS/tree/main/data/DreaMS_Atlas.
  - Labeled MS/MS spectra: https://huggingface.co/datasets/roman-bushuiev/GeMS/tree/main/data/auxiliary.
If you use DreaMS in your research, please cite the following paper:
@article{bushuiev2024emergence,
author = {Bushuiev, Roman and Bushuiev, Anton and Samusevich, Raman and Brungs, Corinna and Sivic, Josef and Pluskal, Tomáš},
title = {Emergence of molecular structures from repository-scale self-supervised learning on tandem mass spectra},
journal = {ChemRxiv},
doi = {10.26434/chemrxiv-2023-kss3r-v2},
year = {2024}
}