MME-CoT: Benchmarking Chain-of-Thought in Large Multimodal Models for Reasoning Quality, Robustness, and Efficiency

Published: 01 May 2025 · Last Modified: 18 Jun 2025 · ICML 2025 poster · CC BY 4.0
TL;DR: A Comprehensive Evaluation Suite for Multimodal Reasoning
Abstract: Answering questions with Chain-of-Thought (CoT) has significantly enhanced the reasoning capabilities of Large Language Models (LLMs), yet its impact on Large Multimodal Models (LMMs) still lacks a systematic assessment and in-depth investigation. In this paper, we introduce **MME-CoT**, a specialized benchmark evaluating the CoT reasoning performance of LMMs, spanning six domains: math, science, OCR, logic, space-time, and general scenes. As the first comprehensive study in this area, we propose a thorough evaluation suite incorporating three novel metrics that assess reasoning quality, robustness, and efficiency at a fine-grained level. Leveraging curated high-quality data and a unique evaluation strategy, we conduct an in-depth analysis of state-of-the-art LMMs, uncovering several key insights: *1)* Models with a reflection mechanism demonstrate superior CoT quality, with Kimi k1.5 outperforming GPT-4o and achieving the highest-quality results; *2)* CoT prompting often degrades LMM performance on perception-heavy tasks, suggesting a potentially harmful overthinking behavior; and *3)* Although their CoT quality is high, LMMs with reflection exhibit significant inefficiency in both the normal response and self-correction phases. We hope MME-CoT serves as a foundation for advancing multimodal reasoning in LMMs.
Lay Summary: This paper introduces MME-CoT, a new benchmark for testing whether AI systems that process both text and images can reason through problems step by step, rather than only checking final answers. Evaluating 16 AI models across math, science, and visual reasoning tasks, the researchers found that self-correcting models generally perform better, but that step-by-step reasoning often hurts performance on simple visual tasks (the AI "overthinks" easy problems) and that about 40% of self-correction attempts do not actually help solve the problem, revealing important flaws in how current AI systems think.
Link To Code: https://github.com/MME-Benchmarks/MME-CoT
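For intuition on insight *2)* (CoT prompting degrading performance on perception-heavy tasks), the sketch below shows the kind of comparison involved: scoring the same model with a direct-answer prompt and with a CoT prompt, then taking the accuracy gap. This is not the benchmark's official evaluation code; `query_lmm`, the prompt wordings, and the answer-extraction heuristic are illustrative placeholders, and the repository linked above is the authoritative implementation.

```python
# Minimal sketch: direct-answer vs. CoT-prompted accuracy on multiple-choice
# perception questions. `query_lmm` is a placeholder for whatever inference
# call your model of choice exposes: (image_path, prompt) -> response text.

from typing import Callable, Dict, List

DIRECT_PROMPT = "Answer with the option letter only."
COT_PROMPT = "Think step by step, then give the option letter on the last line."


def extract_choice(response: str) -> str:
    """Take the last capital letter A-D appearing in the response as the answer."""
    letters = [c for c in response if c in "ABCD"]
    return letters[-1] if letters else ""


def accuracy(
    query_lmm: Callable[[str, str], str],
    samples: List[Dict],  # each: {"image": ..., "question": ..., "answer": "A"}
    instruction: str,
) -> float:
    correct = 0
    for s in samples:
        prompt = f"{s['question']}\n{instruction}"
        pred = extract_choice(query_lmm(s["image"], prompt))
        correct += int(pred == s["answer"])
    return correct / max(len(samples), 1)


def cot_accuracy_gap(query_lmm: Callable[[str, str], str], samples: List[Dict]) -> float:
    """Positive values mean CoT prompting helped; negative values indicate the
    'overthinking' effect the paper reports on perception-heavy tasks."""
    return accuracy(query_lmm, samples, COT_PROMPT) - accuracy(query_lmm, samples, DIRECT_PROMPT)
```

In practice one would aggregate this gap per task category (e.g., OCR or general scenes vs. math) to see where CoT prompting helps and where it hurts.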
Primary Area: Deep Learning->Everything Else
Keywords: Evaluation and Benchmark, MLLM, Reasoning
Submission Number: 1180