Mol2Lang-VLM: Vision- and Text-Guided Generative Pre-trained Language Models for Advancing Molecule Captioning through Multimodal Fusion

Published: 06 Jul 2024, Last Modified: 28 Jul 2024 · Language and Molecules ACL 2024 Poster · CC BY 4.0
Keywords: Generative Pre-trained Language Models, Molecule Captioning, Multimodal Fusion, Molecule-to-Language Translation
TL;DR: This paper presents vision- and text-guided pre-trained language models for the molecule-to-language translation task by introducing third sub-layers into both the encoder and decoder of the Transformer-based architecture.
Abstract: This paper introduces Mol2Lang-VLM, an enhanced method for refining generative pre-trained language models for molecule captioning using multimodal features to achieve more accurate caption generation. Our approach extends the encoder and decoder blocks of the Transformer-based architecture by introducing a third sub-layer into each. Specifically, the inserted sub-layer in the encoder fuses features from SELFIES strings and molecular images, while the one in the decoder fuses features from SMILES strings and their corresponding descriptions. Moreover, cross multi-head attention is employed instead of standard multi-head attention to enable the decoder to attend to the encoder's output, thereby integrating the encoded contextual information for more accurate caption generation. Performance evaluation on the ChEBI-20 and L+M-24 benchmark datasets demonstrates Mol2Lang-VLM's superiority, achieving higher accuracy and quality in caption generation compared to existing methods. Our code and pre-processed data are available at https://github.com/nhattruongpham/mol-lang-bridge/tree/mol2lang/.
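To make the abstract's architectural idea concrete, below is a minimal, illustrative sketch (not the authors' implementation; all class, argument, and dimension names are hypothetical) of an encoder layer extended with a third sub-layer that fuses a second modality via cross multi-head attention, as described above:

```python
# Hypothetical sketch of a Transformer encoder layer with an added third
# sub-layer for multimodal fusion. Names and dimensions are illustrative;
# refer to the official repository for the actual implementation.
import torch
import torch.nn as nn


class FusionEncoderLayer(nn.Module):
    def __init__(self, d_model=512, n_heads=8, d_ff=2048, dropout=0.1):
        super().__init__()
        # Sub-layer 1: self-attention over the text-stream features (e.g. SELFIES tokens).
        self.self_attn = nn.MultiheadAttention(d_model, n_heads, dropout=dropout, batch_first=True)
        # Inserted sub-layer: cross multi-head attention to the second modality
        # (e.g. molecular image patch features).
        self.fusion_attn = nn.MultiheadAttention(d_model, n_heads, dropout=dropout, batch_first=True)
        # Final sub-layer: position-wise feed-forward network.
        self.ffn = nn.Sequential(nn.Linear(d_model, d_ff), nn.ReLU(), nn.Linear(d_ff, d_model))
        self.norm1 = nn.LayerNorm(d_model)
        self.norm2 = nn.LayerNorm(d_model)
        self.norm3 = nn.LayerNorm(d_model)
        self.dropout = nn.Dropout(dropout)

    def forward(self, text_feats, image_feats):
        # text_feats:  (batch, seq_len, d_model)   -- token-level features
        # image_feats: (batch, n_patches, d_model) -- image-level features
        x = text_feats
        attn_out, _ = self.self_attn(x, x, x)
        x = self.norm1(x + self.dropout(attn_out))
        # Fusion sub-layer: queries from the text stream, keys/values from the image stream.
        fused, _ = self.fusion_attn(x, image_feats, image_feats)
        x = self.norm2(x + self.dropout(fused))
        x = self.norm3(x + self.dropout(self.ffn(x)))
        return x


if __name__ == "__main__":
    layer = FusionEncoderLayer()
    text = torch.randn(2, 64, 512)    # e.g. SELFIES token embeddings
    image = torch.randn(2, 49, 512)   # e.g. image patch embeddings
    print(layer(text, image).shape)   # torch.Size([2, 64, 512])
```

The decoder-side fusion described in the abstract would follow the same pattern, with the cross-attention sub-layer attending to encoder outputs and description features instead of image patches.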
Archival Option: The authors of this submission want it to appear in the archival proceedings.
Submission Number: 21