MolTextQA: A Question-Answering Dataset and Benchmark for Evaluating Multimodal Architectures and LLMs on Molecular Structure–Text Understanding
Abstract: Recent advancements in AI have greatly improved molecular representation learning for property prediction and molecule design. However, leveraging the vast textual molecular data from databases and literature remains challenging. While recent research has explored Large Language Models (LLMs) and multi-modal architectures to link text with molecular structures, existing datasets lack evaluation specificity and comprehensive benchmarking. To address this, we introduce a dataset of 500,000 question-answer pairs covering 240,000 molecules from PubChem, designed for structure-directed questions and text-based molecule retrieval. Moreover, we benchmark various architectural classes fine-tuned on this dataset, including multi-modal architectures, large language models, and large reasoning models, uncovering several insights. Among the non-LLM baselines, BioT5 and MoleculeSTM achieved the highest performance on the Molecule QA and Molecule Retrieval tasks, respectively, with accuracies approaching 70%. While traditional LLMs struggled with general molecular understanding, our experiments show that fine-tuning LLMs can significantly improve their performance on molecular tasks. Furthermore, large reasoning models, particularly the GPT-o3 series, outperform their non-reasoning counterparts and multi-modal architectures, highlighting the importance of explicit reasoning for effective structure–text learning. We have made both the dataset and the fine-tuned models publicly available.
Certifications: Dataset Certification
Keywords: molecule property prediction, molecule-text relationship learning, scientific language models
Code: https://github.com/siddharthal/MolTextQA
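To illustrate the kind of data the abstract describes, below is a minimal Python sketch of what a single multiple-choice QA record for a PubChem molecule might look like and how an exact-match answer could be scored. The field names ("cid", "smiles", "question", "options", "answer") and the example values are illustrative assumptions, not the repository's confirmed schema.

# Hypothetical record layout; field names are assumptions, not the dataset's actual schema.
record = {
    "cid": 2244,  # PubChem compound identifier (example value)
    "smiles": "CC(=O)OC1=CC=CC=C1C(=O)O",
    "question": "Which functional group is present in this molecule?",
    "options": ["ester", "amine", "thiol", "nitrile"],
    "answer": "ester",
}

def is_correct(predicted_option: str, rec: dict) -> bool:
    """Exact-match scoring for one multiple-choice QA record."""
    return predicted_option.strip().lower() == rec["answer"].strip().lower()

print(is_correct("Ester", record))  # True

Accuracy over the benchmark would then simply be the fraction of records for which is_correct returns True, matching the percentage-accuracy figures reported in the abstract.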
Assigned Action Editor: ~Mykola_Pechenizkiy1
Submission Number: 103