MolTextQA: A Curated Question-Answering Dataset and Benchmark for Molecular Structure-Text Relationship Learning

27 Sept 2024 (modified: 10 Dec 2024) · ICLR 2025 Conference Withdrawn Submission · CC BY 4.0
Keywords: molecule-text learning, question answering, datasets, benchmark, large language models
TL;DR: A benchmark dataset for molecular structure-text relationship learning
Abstract: Recent advancements in AI have significantly enhanced molecular representation learning, which is crucial for predicting molecule properties and designing new molecules. Despite these advances, effectively utilizing the vast amount of molecular data available in textual form from databases and scholarly articles remains a challenge. Recently, a large body of research has focused on utilizing Large Language Models (LLMs) and multi-modal architectures to interpret textual information and link it with molecular structures. Nevertheless, existing datasets often lack specificity in evaluation, as well as direct comparisons and comprehensive benchmarking across different models and model classes. In this work, we construct a dataset specifically designed for evaluating models on structure-directed questions and textual description-based molecule retrieval, featuring over 500,000 question-answer pairs related to approximately 240,000 molecules from PubChem. Its multiple-choice format enhances the specificity and precision of evaluation. Moreover, we benchmark various architectural classes fine-tuned on this dataset, including multi-modal architectures and large language models, uncovering several insights. Our experiments indicate that the Galactica and BioT5 models are the top performers on the Molecule QA and Molecule Retrieval tasks, respectively, achieving about 70% accuracy. We have made both the dataset and the fine-tuned models publicly available.
Primary Area: datasets and benchmarks
Code Of Ethics: I acknowledge that I and all co-authors of this work have read and commit to adhering to the ICLR Code of Ethics.
Submission Guidelines: I certify that this submission complies with the submission instructions as described on https://iclr.cc/Conferences/2025/AuthorGuide.
Reciprocal Reviewing: I understand the reciprocal reviewing requirement as described on https://iclr.cc/Conferences/2025/CallForPapers. If none of the authors are registered as a reviewer, it may result in a desk rejection at the discretion of the program chairs. To request an exception, please complete this form at https://forms.gle/Huojr6VjkFxiQsUp6.
Anonymous Url: I certify that there is no URL (e.g., github page) that could be used to find authors’ identity.
No Acknowledgement Section: I certify that there is no acknowledgement section in this submission for double blind review.
Submission Number: 12545