ATOM-Bench: From Atoms to Conclusions in Objective Evaluation of Large Multimodal Models Reasoning

18 Sept 2025 (modified: 15 Nov 2025) · ICLR 2026 Conference Withdrawn Submission · CC BY 4.0
Keywords: Multimodal Large Language Models, benchmark, chain of thought
TL;DR: We introduce ATOM-Bench, a diagnostic benchmark for evaluating Chain-of-Thought reasoning in Large Multimodal Models via objective atomic questions, comprising 2,920 QAs over 570 real-world images, to address the challenge of reasoning reliability.
Abstract: Chain-of-Thought (CoT) reasoning has significantly enhanced the ability of Large Multimodal Models (LMMs) to tackle complex image–text tasks, establishing itself as a cornerstone of multimodal learning. Despite this progress, the impact of CoT on LMMs still lacks objective evaluation and in-depth study. Current CoT evaluation paradigms rely on powerful LLMs as judges of free-form text, but this introduces bias and hallucination from the evaluator itself. Moreover, it may penalize models for stylistic variation rather than genuine reasoning failures, undermining the fairness and reliability of the assessment. To address this gap, we introduce ATOM-Bench, a CoT evaluation framework built on objective atomic questions. ATOM-Bench decomposes complex reasoning tasks into a series of atomic nodes, covering 570 high-resolution real-world images and 2,920 questions across 4 cognitive dimensions and 12 domains, including architecture, text, transportation, culture, climate, and geology. Our benchmark introduces three novel quantitative metrics to objectively analyze reasoning faithfulness, consistency, and robustness. Extensive experiments with 22 LMMs validate the effectiveness of our framework. The results reveal that even the strongest models often exhibit a mismatch between the surface-level correctness of their final answers and their underlying comprehension of the evidence, while also exposing cognitive rigidity when confronted with objective facts. We believe that ATOM-Bench, as a more objective and diagnostic tool, will advance LMMs toward more reliable and faithful reasoning.
Primary Area: datasets and benchmarks
Submission Number: 11801