Lang2Mol-Diff: A Diffusion-Based Generative Model for Language-to-Molecule Translation Leveraging SELFIES Representation

Nguyen Doan Hieu Nguyen; Nhat Truong Pham; Duong Thanh Tran; Balachandran Manavalan

Published: 06 Jul 2024, Last Modified: 28 Jul 2024

Venue: Language and Molecules ACL 2024 Poster

License: CC BY 4.0

Keywords: Generative Pre-trained Language Models, Diffusion Model, Language-to-Molecule Translation, SELFIES Representation

TL;DR: This paper introduces a diffusion-based language-to-molecule generative model (Lang2Mol-Diff) using SELFIES representation to address potential issues with molecule validity and the limitations of autoregressive models.

Abstract: Generating de novo molecules from textual descriptions is challenging due to potential issues with molecule validity in SMILES representation and limitations of autoregressive models. This work introduces Lang2Mol-Diff, a diffusion-based language-to-molecule generative model using the SELFIES representation. Specifically, Lang2Mol-Diff leverages the strengths of two state-of-the-art molecular generative models: BioT5 and TGM-DLM. By employing BioT5 to tokenize the SELFIES representation, Lang2Mol-Diff addresses the validity issues associated with SMILES strings. Additionally, it incorporates a text diffusion mechanism from TGM-DLM to overcome the limitations of autoregressive models in this domain. To the best of our knowledge, this is the first study to leverage the diffusion mechanism for text-based de novo molecule generation using the SELFIES molecular string representation. Performance evaluation on the L+M-24 benchmark dataset shows that Lang2Mol-Diff outperforms all existing methods for molecule generation in terms of validity. Our code and pre-processed data are available at https://github.com/nhattruongpham/mol-lang-bridge/tree/lang2mol/.
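
To make the validity argument in the abstract concrete, the following is a minimal sketch and not the authors' pipeline (which relies on the BioT5 tokenizer and the TGM-DLM diffusion mechanism). It contrasts the two string formats for illustration only: a SMILES string can be syntactically broken by a single missing character, whereas a SELFIES string is a flat sequence of bracketed symbols. The helper names `split_selfies` and `smiles_syntax_ok` are hypothetical, and the SMILES check is a toy that covers only parentheses and single-digit ring closures, not real chemical validation.

```python
import re
from typing import List


def split_selfies(selfies: str) -> List[str]:
    """Split a SELFIES string into its bracketed symbols (illustrative helper)."""
    tokens = re.findall(r"\[[^\[\]]+\]", selfies)
    # Every character must belong to some bracketed symbol.
    if "".join(tokens) != selfies:
        raise ValueError(f"not a well-formed SELFIES string: {selfies!r}")
    return tokens


def smiles_syntax_ok(smiles: str) -> bool:
    """Toy syntactic check: balanced branches and paired ring-closure digits.

    Assumption: two-digit ring closures (%nn) and valence rules are ignored.
    """
    depth = 0            # current nesting depth of '(' branches
    open_rings = set()   # ring-closure digits seen an odd number of times
    in_bracket = False   # inside a [...] atom, digits are not ring closures
    for ch in smiles:
        if ch == "[":
            in_bracket = True
        elif ch == "]":
            in_bracket = False
        elif in_bracket:
            continue
        elif ch == "(":
            depth += 1
        elif ch == ")":
            depth -= 1
            if depth < 0:
                return False
        elif ch.isdigit():
            open_rings ^= {ch}  # toggle: first occurrence opens, second closes
    return depth == 0 and not open_rings


if __name__ == "__main__":
    print(split_selfies("[C][C][O]"))    # ethanol as SELFIES symbols
    print(smiles_syntax_ok("c1ccccc1"))  # benzene, ring closed
    print(smiles_syntax_ok("c1ccccc"))   # ring never closed
```

A generative model that emits SMILES character by character can produce strings like the last one, while any sequence of SELFIES symbols is designed to decode to some valid molecule, which is the property the abstract relies on.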

Archival Option: The authors of this submission want it to appear in the archival proceedings.

Submission Number: 25
