I'm an award-winning data scientist bridging cheminformatics and metabolomics, with a focus on small-molecule discovery and mass spectrometry data science (see my award news from the Metabolomics Association of North America (MANA) and my presentation details here).
I've built multiple computational pipelines for untargeted mass spectrometry data processing across diverse research domains, including metabolomics, lipidomics, exposomics, and environmental studies. My software development philosophy emphasizes maximal automation, highest precision, multi-platform compatibility, and user-friendly interfaces to minimize lab-based experiments.
I am always driven to advance next-generation AI for chemistry and biological applications.
Developing AI-Powered Digital Twins for Bioreactors at Aropha
I am currently leading the development of digital twins for bioreactors at Aropha, using advanced AI models to simulate bioprocesses. By creating virtual replicas of our bioreactor systems, we aim to predict performance and scale up the company’s capacity effectively. This work integrates cutting-edge AI engines with bioprocess engineering.
Mass Spectrometry Data Processing Workflows at the Integrated Data Science Laboratory for Metabolomics and Exposomics
The tools shown in this diagram form a comprehensive pipeline for a full-scale untargeted metabolomics workflow that efficiently processes and annotates large-scale mass spectrometry data. The integration of peak detection, formula annotation, fragmentation analysis, and data parsing supports multi-omics and untargeted compound discovery projects.
- IDSL_MINT (Mass INTerpretator) uses deep learning and cheminformatics to interpret MS/MS data.
- IDSL.IPA (Intrinsic Peak Analysis) is a chromatographic peak-picking software capable of detecting low-intensity signals (S/N > 2), pairing isotopologues with a fixed mass distance (e.g. ΔC = ¹³C - ¹²C = 1.003354835336 Da), correcting retention time drifts, aligning peaks across large studies (N > 200), filling gaps, and visualizing extracted and total ion chromatograms.
- IDSL.FSA (Fragmentation Spectra Analysis) rapidly annotates fragmentation data files (.msp and .mgf) using spectral entropy or cosine similarity, even without reliable precursor values, and can process bottom-up proteomics data.
- IDSL.CSA (Composite Spectra Analysis) deconvolutes fragmentation spectra from various acquisition methods such as DDA and DIA (SWATH-MS, MSE, AIF).
- IDSL.UFA (United Formula Annotation) and its exhaustive version IDSL.UFAx annotate chromatographic peaks with molecular formulas using isotopic profile matching. IDSL.UFA handles up to 10⁸ formulas efficiently, while IDSL.UFAx can screen 10²⁷ formulas using 15 elements, though it is computationally slower.
- IDSL.SUFA simplifies isotopic profile and adduct formula calculations without dependencies on other R packages.
- IDSL.NPA (Nominal Peak Analysis) processes nominal mass spectrometry data to create and annotate .msp files for untargeted MS/MS workflows.
- IDSL.MXP (Mass Spectrometry Parser) is a lightweight and fast parser for mass spectrometry data files, capable of reading corrupted files.
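To make the isotopologue-pairing idea concrete, here is a minimal Python sketch of matching peaks separated by the fixed ¹³C - ¹²C distance quoted above. It is an illustration only and not the IDSL.IPA implementation (the IDSL tools are R packages). The function name `pair_isotopologues`, the tolerance `tol_da`, and the example m/z values are hypothetical, and the sketch assumes singly charged ions and a sorted m/z list.

```python
from bisect import bisect_left, bisect_right

# 13C - 12C mass difference in Da, as quoted in the text above.
DELTA_C = 1.003354835336


def pair_isotopologues(mz_values, tol_da=0.005, delta=DELTA_C):
    """Return (i, j) index pairs whose m/z difference is within tol_da of delta.

    Assumptions (hypothetical, for illustration): mz_values is sorted in
    ascending order, ions are singly charged, and tol_da is an absolute
    mass tolerance in Da.
    """
    pairs = []
    for i, mz in enumerate(mz_values):
        # Binary search for the window [mz + delta - tol, mz + delta + tol].
        lo = bisect_left(mz_values, mz + delta - tol_da)
        hi = bisect_right(mz_values, mz + delta + tol_da)
        for j in range(lo, hi):
            pairs.append((i, j))
    return pairs


# Made-up m/z values: two monoisotopic/13C pairs and one unpaired peak.
mz = [180.0634, 181.0668, 255.2319, 256.2353, 300.1000]
print(pair_isotopologues(mz))  # [(0, 1), (2, 3)]
```

A real peak picker would also need to check intensity ratios, charge states, and co-elution in retention time before accepting a pair; the sketch covers the mass-distance criterion only.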
Computational mass spectrometry pipelines for environmental cheminformatics projects as part of my doctoral research
- An IPDC (Isotopic Profile Deconvoluted Chromatogram) algorithm to screen biologically complex environmental matrices for unknown contaminants using chemometric methods. The IPDC algorithm was successfully employed in five different projects during my PhD.