DiffMS: Diffusion Generation of Molecules Conditioned on Mass Spectra

Published: 01 May 2025, Last Modified: 18 Jun 2025, Venue: ICML 2025 poster, Readers: Everyone, License: CC BY-NC-SA 4.0
TL;DR: DiffMS incorporates discrete graph diffusion for de novo generation from mass spectra and reaches state-of-the-art performance.
Abstract: Mass spectrometry plays a fundamental role in elucidating the structures of unknown molecules and subsequent scientific discoveries. One formulation of the structure elucidation task is the conditional *de novo* generation of molecular structure given a mass spectrum. Toward a more accurate and efficient scientific discovery pipeline for small molecules, we present DiffMS, a formula-restricted encoder-decoder generative network that achieves state-of-the-art performance on this task. The encoder utilizes a transformer architecture and models mass spectra domain knowledge such as peak formulae and neutral losses, and the decoder is a discrete graph diffusion model restricted by the heavy-atom composition of a known chemical formula. To develop a robust decoder that bridges latent embeddings and molecular structures, we pretrain the diffusion decoder with fingerprint-structure pairs, which are available in virtually infinite quantities, compared to structure-spectrum pairs that number in the tens of thousands. Extensive experiments on established benchmarks show that DiffMS outperforms existing models on *de novo* molecule generation. We provide several ablations to demonstrate the effectiveness of our diffusion and pretraining approaches and show consistent performance scaling with increasing pretraining dataset size. DiffMS code is publicly available at https://github.com/coleygroup/DiffMS.
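As a rough illustration of what "restricted by the heavy-atom composition of a known chemical formula" could mean in practice, the sketch below turns a formula string into the fixed multiset of non-hydrogen atoms that a graph decoder would treat as its node set. This is not taken from the DiffMS codebase: the function names `heavy_atom_counts` and `node_labels` are hypothetical, and the parser assumes a flat formula with no parentheses, charges, or isotope labels.

```python
import re
from collections import Counter

# Matches an element symbol followed by an optional count, e.g. "C9", "H8", "Cl".
# Assumption: flat formulas only (no parentheses, charges, or isotopes).
_TOKEN = re.compile(r"([A-Z][a-z]?)(\d*)")


def heavy_atom_counts(formula: str) -> Counter:
    """Count non-hydrogen atoms in a flat chemical formula (hypothetical helper)."""
    counts = Counter()
    for symbol, number in _TOKEN.findall(formula):
        if symbol == "H":
            continue  # hydrogens are implicit; only heavy atoms become graph nodes
        counts[symbol] += int(number) if number else 1
    return counts


def node_labels(formula: str) -> list[str]:
    """Expand the heavy-atom counts into one label per graph node, sorted by symbol."""
    counts = heavy_atom_counts(formula)
    return [symbol for symbol in sorted(counts) for _ in range(counts[symbol])]
```

Under these assumptions, `node_labels("C2H6O")` yields the three heavy-atom nodes of ethanol or dimethyl ether. The node set is then fixed, and a generative model only has to decide which bonds connect those nodes.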
Lay Summary: Identifying unknown small molecules is a major bottleneck in scientific discovery. When scientists analyze a sample, they often use a technique called mass spectrometry, which breaks molecules apart and then measures the weights and frequencies of the pieces, resulting in a mass spectrum that represents the structure. The challenge is that it is very difficult to work backward from this spectrum to the original molecule's structure, especially since many different molecules can produce similar spectra. To tackle this, we created DiffMS, a machine learning algorithm that learns to generate a molecule's structure given its mass spectrum and its chemical formula (which is usually easier to determine). DiffMS uses a specialized "encoder" to extract information from the spectrum and a "decoder" that takes the structural information from the encoder and generates candidate molecules that may match the observed spectrum. Our findings show that while DiffMS does not always generate the correct molecule, it often generates structurally similar molecules, which chemists can still use to determine the true structure manually.
Application-Driven Machine Learning: This submission is on Application-Driven Machine Learning.
Link To Code: https://github.com/coleygroup/DiffMS
Primary Area: Applications->Chemistry, Physics, and Earth Sciences
Keywords: AI4Science, Mass Spectra, Diffusion, Graph Neural Networks
Submission Number: 5469