Decoding Scientific Experimental Images: The SPUR Benchmark for Perception, Understanding, and Reasoning
Keywords: scientific experimental image reasoning, multimodal large language models benchmark, AI for science
Abstract: We introduce **SPUR**, a benchmark for scientific experimental image perception, understanding, and reasoning. SPUR features three key innovations: (1) **Panel-Level Fine-Grained Perception**: Assessing MLLMs' visual perception abilities across three core dimensions (i.e., numerical perception, morphological perception, and information localization) on six fine-grained panel types; (2) **Cross-Panel Relation Understanding**: Leveraging complex scientific experimental images with an average of 14.3 panels per sample, we design QA pairs to evaluate MLLMs' ability to understand cross-panel relations; (3) **Expert-Level Reasoning**: We design qualitative and quantitative reasoning questions across five experiment types to assess whether models, like human experts, can infer experimental conclusions from complex scientific images. Evaluation of 20 MLLMs and 4 MCoT methods shows that they fall far short of the expert-level perception, understanding, and reasoning that scientific experimental images demand in AI for Science (AI4S) research. Data and code are available at: https://anonymous.4open.science/r/SPUR-1797.
Paper Type: Long
Research Area: Resources and Evaluation
Research Area Keywords: benchmarking, evaluation, reproducibility
Contribution Types: Model analysis & interpretability, Data resources, Data analysis
Languages Studied: English
Submission Number: 3642