STR-Bamba: Multimodal Molecular Textual Representation Encoder-Decoder Foundation Model

Published: 20 Sept 2025, Last Modified: 29 Oct 2025 · AI4Mat-NeurIPS-2025 Poster · CC BY 4.0
Keywords: Foundation Model, Transformer, Mamba-2, SMILES, SELFIES, InChI, IUPAC Name, Molecular Formula, Polymer SMILES, Electrolyte Formulation
Abstract: Most large-scale chemical language models are trained on a single textual molecular representation using self-supervised learning over large unlabeled corpora. These models excel in tasks such as property prediction and molecule generation by learning contextualized representations of input tokens. However, relying solely on one representation may result in the loss of structural or semantic information captured by alternative formats and may limit the model's ability to generalize across diverse molecular encodings. To address this limitation, we incorporate multiple textual molecular representations, including SMILES, SELFIES, molecular formula, IUPAC name, International Chemical Identifier (InChI), serialized polymer graph (SPG), and electrolyte formulations, into a unified vocabulary to harness the unique strengths of each format. Here, we introduce a large encoder-decoder chemical foundation model based on the Bamba architecture, a hybrid of Transformer and Mamba-2 layers, designed to support multi-representational inputs. The model is pre-trained in a BERT-style manner on 588 million samples, yielding a corpus of approximately 29 billion molecular tokens. These models serve as a foundation for chemical language research, supporting complex tasks such as molecular property prediction, classification, and molecular translation. Furthermore, extensive studies of the multimodal molecular latent space indicate cross-representation alignment and reveal how different textual encodings of the same molecule can converge toward a unified semantic representation. This shared space may facilitate deeper insights into molecular structure, enhance generalization, and support a broad range of downstream applications.
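
As an illustration of the multi-representation input idea described in the abstract, the sketch below generates several textual encodings of the same molecule and prefixes each with a representation marker before tokenization against a shared vocabulary. This is a minimal sketch assuming RDKit and the selfies package are available; the marker tokens (e.g. `[SMILES]`, `[INCHI]`) and the helper `to_multimodal_views` are illustrative assumptions, not the paper's exact vocabulary or preprocessing pipeline.

```python
# Minimal sketch (not the paper's exact pipeline): build several textual
# views of one molecule and prefix each with an assumed representation tag,
# mimicking a unified multi-representation vocabulary.
from rdkit import Chem
from rdkit.Chem import rdMolDescriptors
import selfies as sf  # assumed available: pip install selfies


def to_multimodal_views(smiles: str) -> dict[str, str]:
    """Return tagged textual representations of a molecule (illustrative helper)."""
    mol = Chem.MolFromSmiles(smiles)
    if mol is None:
        raise ValueError(f"Could not parse SMILES: {smiles}")
    canonical = Chem.MolToSmiles(mol)
    return {
        "smiles": "[SMILES] " + canonical,                              # canonical SMILES
        "selfies": "[SELFIES] " + sf.encoder(canonical),                # robust SELFIES string
        "inchi": "[INCHI] " + Chem.MolToInchi(mol),                     # InChI identifier
        "formula": "[FORMULA] " + rdMolDescriptors.CalcMolFormula(mol), # molecular formula
    }


if __name__ == "__main__":
    for tag, text in to_multimodal_views("CC(=O)Oc1ccccc1C(=O)O").items():  # aspirin
        print(f"{tag:8s} {text}")
```

Under this scheme, each tagged string can be tokenized against the same shared vocabulary, so that different textual views of one molecule enter the encoder-decoder through a common input space, which is the setting in which the cross-representation alignment analysis is carried out.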
Submission Track: Paper Track (Full Paper)
Submission Category: AI-Guided Design
Institution Location: Rio de Janeiro, Brazil; San Jose, United States; Tokyo, Japan
AI4Mat Journal Track: Yes
Submission Number: 73