Talk

Transformer Model for Structure Elucidation from Tandem Mass Spectroscopy data

Abstract

Tandem mass spectrometry (MS/MS) is a crucial technique in multiple fields, including pharmaceutical analysis, drug discovery, diagnostics, metabolomics, proteomics, and environmental science. It enables the identification of molecules present in complex mixtures. However, the annotation and interpretation of MS/MS data remain challenging due to the complexity of the data and the vastness of chemical space. To annotate MS/MS data, experimental spectra are typically compared against existing spectral databases to find the most similar matches. This approach is strongly limited by the size and diversity of the reference database and cannot elucidate the structures of novel compounds. In contrast, previous work has shown that Artificial Intelligence (AI) models can annotate MS/MS data without database retrieval or direct comparison to existing references. Due to the difficulty of the task and the scarcity of high-quality data, current state-of-the-art AI models adopt a sequential approach: they first predict molecular fingerprints and then generate molecules as SMILES (Simplified Molecular-Input Line-Entry System) strings. While this two-step strategy partially mitigates the difficulty of annotating MS/MS data, it also adds complexity, and current state-of-the-art models still perform poorly and fail to generalize beyond the training set.
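As an illustration of the database-matching baseline described above, the following is a minimal sketch of spectral library search. It assumes spectra are given as lists of (m/z, intensity) pairs and are compared by cosine similarity after fixed-width binning, which is one common scoring choice and not necessarily the one used in the work discussed here. The function names, bin width, compound names, and peak values are all hypothetical.

```python
import math
from collections import defaultdict


def bin_spectrum(peaks, bin_width=1.0):
    """Sum peak intensities into fixed-width m/z bins (assumed preprocessing)."""
    binned = defaultdict(float)
    for mz, intensity in peaks:
        binned[int(mz // bin_width)] += intensity
    return dict(binned)


def cosine_similarity(a, b):
    """Cosine similarity between two binned spectra stored as {bin: intensity}."""
    dot = sum(value * b.get(key, 0.0) for key, value in a.items())
    norm_a = math.sqrt(sum(value * value for value in a.values()))
    norm_b = math.sqrt(sum(value * value for value in b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def search_library(query_peaks, library, bin_width=1.0):
    """Rank library entries by similarity to the query, best match first."""
    query = bin_spectrum(query_peaks, bin_width)
    scored = [
        (cosine_similarity(query, bin_spectrum(peaks, bin_width)), name)
        for name, peaks in library.items()
    ]
    return sorted(scored, reverse=True)


# Toy reference library with made-up compounds and peaks.
library = {
    "compound_A": [(50.0, 10.0), (77.0, 100.0), (105.0, 40.0)],
    "compound_B": [(43.0, 100.0), (58.0, 30.0)],
}
query = [(50.1, 12.0), (77.2, 95.0), (105.0, 35.0)]
ranking = search_library(query, library)
```

A search of this kind can only ever return a compound that is already in `library`, which is the limitation for novel compounds noted above.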

We propose a novel Transformer model that predicts SMILES directly from input MS/MS spectra in an end-to-end manner. This approach eliminates the need for intermediate steps, such as predicting molecular fingerprints. Leveraging transfer learning from simulated data, we achieve state-of-the-art performance on the challenging MassSpecGym dataset. This dataset is known for its extreme difficulty: the best previously reported Top-1 zero-shot accuracy is 2.3%. Our model surpasses this result by 1.1 percentage points, setting a new state of the art of 3.4% Top-1 zero-shot accuracy on the test set.
These results highlight the potential of Transformer-based models for simplifying and improving structure elucidation from MS/MS data.
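For readers unfamiliar with the metric, the following is a minimal sketch of how Top-k accuracy can be computed for generated structures. It assumes each prediction is a ranked list of SMILES strings and that all strings are already canonicalized, so exact string comparison is meaningful; the actual benchmark evaluation may match structures differently. The function name and the example molecules are hypothetical.

```python
def top_k_accuracy(predictions, targets, k=1):
    """Fraction of spectra whose true SMILES appears among the top-k candidates.

    predictions: one ranked candidate list (best first) per spectrum.
    targets: the true SMILES for each spectrum, assumed canonicalized.
    """
    if len(predictions) != len(targets):
        raise ValueError("predictions and targets must have the same length")
    if not targets:
        return 0.0
    hits = sum(
        1 for ranked, truth in zip(predictions, targets) if truth in ranked[:k]
    )
    return hits / len(targets)


# Toy example with three spectra and two ranked candidates each.
predictions = [
    ["CCO", "CCN"],       # correct at rank 1
    ["c1ccccc1", "CC"],   # correct at rank 2
    ["CC(=O)O", "CCO"],   # never correct
]
targets = ["CCO", "CC", "CCC"]

top1 = top_k_accuracy(predictions, targets, k=1)
top2 = top_k_accuracy(predictions, targets, k=2)
```

Under this definition, a Top-1 accuracy of 3.4% means the single highest-ranked prediction is the correct structure for about 34 out of every 1,000 test spectra.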

Related