Decoding Scientific Experimental Images: The SPUR Benchmark for Perception, Understanding, and Reasoning
Keywords: scientific experimental image reasoning, multimodal large language models benchmark, AI for science
Abstract: We introduce **SPUR**, a benchmark for scientific experimental image perception, understanding, and reasoning. SPUR features three key innovations: (1) **Panel-Level Fine-Grained Perception**: Assessing MLLMs' visual perception abilities across three core dimensions (i.e., numerical perception, morphological perception, and information localization) on six fine-grained panel types; (2) **Cross-Panel Relation Understanding**: Leveraging complex scientific experimental images with an average of 14.3 panels per sample, we design QA pairs to evaluate MLLMs' ability to understand cross-panel relations; (3) **Expert-Level Reasoning**: We design qualitative and quantitative reasoning questions across five experiment types to assess whether models, like human experts, can infer experimental conclusions from complex scientific images. Evaluation of 20 MLLMs and 4 MCoT methods shows that they fall far short of the expert-level perception, understanding, and reasoning that scientific experimental images demand in AI for Science (AI4S) research. Data and code are available at: https://anonymous.4open.science/r/SPUR-1797.
Paper Type: Long
Research Area: Resources and Evaluation
Research Area Keywords: benchmarking, evaluation, reproducibility
Contribution Types: Model analysis & interpretability, Data resources, Data analysis
Languages Studied: English
Submission Number: 3642