MolecularIQ: Characterizing Chemical Reasoning Capabilities Through Symbolic Verification on Molecular Graphs
Keywords: chemical language model, chemical reasoning model, chemistry, large language model, molecular graph, molecular structure
TL;DR: We propose MolecularIQ, a molecular structure reasoning benchmark focused exclusively on symbolically verifiable tasks.
Abstract: Large Language Models (LLMs) are increasingly applied to chemistry, tackling tasks such as molecular name conversion, captioning, text-guided generation, and property or reaction prediction. A molecule’s properties are fundamentally determined by its composition and structure, encoded in its molecular graph; thus, reasoning about molecular properties requires understanding of, and reasoning over, that structure. Yet most existing benchmarks emphasize general chemical knowledge, rely on literature or surrogate labels that risk leakage or bias, or reduce evaluation to multiple-choice questions.
We introduce MolecularIQ, a molecular structure reasoning benchmark focused exclusively on symbolically verifiable tasks.
MolecularIQ spans three orthogonal axes — molecular complexity, multi-task load, and reasoning complexity — covering feature counting, index-based feature attributions, and constrained generation.
MolecularIQ enables fine-grained evaluation of reasoning over molecular graphs and produces capability fingerprints that localize model failures to specific tasks and molecular regimes. This provides actionable insights into the strengths and limitations of current chemistry LLMs and guides the development of models that reason faithfully over molecular structure.
On MolecularIQ, large MoE models with higher reasoning budgets lead across categories, while chemistry-tuned LLMs underperform their generalist base models, indicating limited transfer from narrow-task fine-tuning.
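To make "symbolically verifiable" concrete, here is a minimal toy sketch (not the paper's code) of how a feature-counting answer can be checked symbolically against a molecular graph. The graph encoding (a heavy-atom dict plus a bond list for ethanol) and the helper names `count_feature` and `verify` are illustrative assumptions, not part of MolecularIQ.

```python
# Toy sketch (assumption, not the benchmark's implementation):
# symbolically verify a feature-counting answer on a molecular graph.
# Ethanol, heavy atoms only; hydrogens omitted for brevity.
atoms = {0: "C", 1: "C", 2: "O"}   # index -> element symbol
bonds = [(0, 1), (1, 2)]           # single bonds between atom indices

def count_feature(symbol: str) -> int:
    """Ground-truth count of atoms of a given element, computed
    directly from the graph rather than from any surrogate label."""
    return sum(1 for s in atoms.values() if s == symbol)

def verify(model_answer: int, symbol: str) -> bool:
    """A model's answer is correct iff it matches the symbolic count."""
    return model_answer == count_feature(symbol)

print(verify(2, "C"))  # True: ethanol has two carbons
print(verify(1, "C"))  # False
```

Because the ground truth is recomputed from the graph itself, this kind of check avoids the label leakage and bias risks the abstract attributes to literature-derived benchmarks.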
Supplementary Material: zip
Primary Area: datasets and benchmarks
Submission Number: 23851