TL;DR: We develop a retrieval-augmented generation approach with neural graph matching for mass spectrum prediction, raising top-1 retrieval accuracy from 19% to 27%.
Abstract: Molecular machine learning has gained popularity with the advancements of geometric deep learning. In parallel, retrieval-augmented generation has become a principled approach commonly used with language models. However, the optimal integration of retrieval augmentation into molecular machine learning remains unclear. Graph neural networks stand to benefit from clever matching to understand the structural alignment of retrieved molecules to a query molecule. Neural graph matching offers a compelling solution by explicitly modeling node and edge affinities between two structural graphs while employing a noise-robust, end-to-end neural network to learn affinity metrics. We apply this approach to mass spectrum simulation and introduce MARASON, a novel model that incorporates neural graph matching to enhance a fragmentation-based neural network. Experimental results highlight the effectiveness of our design, with MARASON achieving 27% top-1 accuracy, a substantial improvement over the non-retrieval state-of-the-art accuracy of 19%. Moreover, MARASON outperforms both naive retrieval-augmented generation methods and traditional graph matching approaches. Code is publicly available at https://github.com/coleygroup/ms-pred.
Lay Summary: Retrieval-augmented generation (RAG) enhances the accuracy of text generation, particularly in large language models (LLMs), although its potential in molecular machine learning remains to be understood.
In this paper, we find that the structural alignment between reference and target molecules is an important factor in RAG-aided tasks, whereas a naive retrieval-augmented generation method has a negative impact. Neural graph matching instead offers a noise-robust, end-to-end solution. Focusing on mass spectrum simulation, we present MARASON, a mass spectrum prediction network that integrates reference intensity information through graph matching of the reference and target structures.
We highlight the effectiveness of our design, with MARASON achieving 27% top-1 accuracy in structural identification, a substantial improvement over the non-retrieval state-of-the-art accuracy of 19%. MARASON also outperforms traditional graph matching approaches.
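The graph matching idea described above can be illustrated with a minimal sketch. This is not MARASON's implementation: the cosine-similarity affinity stands in for the learned neural affinity, Sinkhorn normalization is one common way to obtain a soft matching, and every name here (`node_affinity`, `sinkhorn`, `ref_values`, the toy features) is hypothetical. The sketch scores every pair of nodes from a target graph and a reference graph, converts the scores into a soft one-to-one matching, and uses that matching to carry per-node reference values over to the target.

```python
import numpy as np

def node_affinity(x_target, x_ref):
    """Cosine similarity between every target node and every reference node.
    Assumption: a fixed similarity stands in for a learned affinity network."""
    a = x_target / np.linalg.norm(x_target, axis=1, keepdims=True)
    b = x_ref / np.linalg.norm(x_ref, axis=1, keepdims=True)
    return a @ b.T

def _logsumexp(z, axis):
    """Numerically stable log-sum-exp along one axis, keeping dimensions."""
    m = z.max(axis=axis, keepdims=True)
    return m + np.log(np.exp(z - m).sum(axis=axis, keepdims=True))

def sinkhorn(scores, n_iters=50, tau=0.1):
    """Alternate row and column normalization in log space so the result
    approaches a doubly-stochastic soft matching (square input assumed)."""
    log_p = scores / tau
    for _ in range(n_iters):
        log_p = log_p - _logsumexp(log_p, axis=1)  # rows sum to 1
        log_p = log_p - _logsumexp(log_p, axis=0)  # columns sum to 1
    return np.exp(log_p)

# Toy data: the target graph's nodes are a permutation of the reference nodes.
x_ref = np.eye(3)                      # hypothetical node features
perm = [2, 0, 1]
x_target = x_ref[perm]

match = sinkhorn(node_affinity(x_target, x_ref))

# Hypothetical per-node reference signal carried over to the target nodes.
ref_values = np.array([0.2, 0.5, 0.3])
transferred = match @ ref_values
```

In this toy case each row of `match` concentrates on the reference node that the corresponding target node was copied from, so `transferred` closely follows `ref_values[perm]`. With noisy or partially overlapping structures the matching would be softer, which is where a learned affinity is expected to help.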
Application-Driven Machine Learning: This submission is on Application-Driven Machine Learning.
Link To Code: https://github.com/coleygroup/ms-pred
Primary Area: Applications->Chemistry, Physics, and Earth Sciences
Keywords: Retrieval Augmented Generation, Graph Matching, Mass Spectrometry, AI for Science
Submission Number: 2652