Benchmarking MLLMs on Topological Reasoning of Chemical Reaction Diagrams

02 Sept 2025 (modified: 11 Feb 2026). Submitted to ICLR 2026. Readers: Everyone. License: CC BY 4.0
Keywords: VQA Benchmark, Multimodal Large Language Model, Chemical Reaction Diagram Understanding
Abstract: Chemical reaction diagrams are visual representations of complex process graphs, in which understanding the overall pathway, including its branches, cycles, and flow, is crucial. While Multimodal Large Language Models (MLLMs) have shown proficiency in recognizing the individual nodes of these graphs, such as molecules and reagents, their ability to reason topologically about the entire structure remains underexplored. To address this gap, we introduce **ReactBench**, a benchmark of 1,618 question-answer pairs designed to measure MLLM performance on a hierarchy of tasks, from component recognition to complex topological analysis. Our evaluation of state-of-the-art models reveals a significant deficit: GPT-4o achieves 79.71% accuracy on node-level identification tasks, but only 49.5% on questions that require topological reasoning about the pathway. As the first benchmark focused on this skill, ReactBench provides a rigorous methodology for diagnosing a key failure mode in MLLMs and for guiding the development of models that can comprehend the full, structured processes depicted in scientific diagrams.
Primary Area: datasets and benchmarks
Submission Number: 1142