MMR-V: What's Left Unsaid? A Benchmark for Multimodal Deep Reasoning in Videos

ICLR 2026 Conference Submission 15894 Authors

19 Sept 2025 (modified: 08 Oct 2025) · ICLR 2026 Conference Submission · Readers: Everyone · License: CC BY 4.0
Keywords: Multimodal Large Language Models, Multimodal Reasoning, Video Benchmark
TL;DR: This paper introduces a video benchmark that requires multimodal deep reasoning, where questions demand in-depth analysis across long-range, multi-frame video segments.
Abstract: The sequential structure of videos poses a challenge to the ability of multimodal large language models (MLLMs) to locate multi-frame evidence and conduct multimodal reasoning. However, existing video benchmarks mainly focus on understanding tasks, which only require models to match the frames mentioned in the question (hereafter referred to as "question frames") and perceive a few adjacent frames. To address this gap, we propose **MMR-V: A Benchmark for Multimodal Deep Reasoning in Videos**. The benchmark is characterized by the following features. **(1) Long-range, multi-frame reasoning**: Models are required to infer and analyze evidence frames that may be far from the question frame. **(2) Beyond perception**: Questions cannot be answered through direct perception alone but require reasoning over hidden information. **(3) Reliability**: All tasks are manually annotated, referencing extensive real-world user understanding to align with common perceptions. **(4) Confusability**: Carefully designed distractor annotation strategies reduce model shortcuts. MMR-V consists of 317 videos and 1,257 tasks. Our experiments reveal that current models still struggle with multimodal reasoning; even the best-performing model, Gemini-2.5-pro, achieves only 64.3% accuracy. Additionally, current reasoning-enhancement strategies (Chain-of-Thought and scaling test-time compute) bring only limited gains. Error analysis indicates that the CoT required for multimodal reasoning differs from that required for textual reasoning, which partly explains the limited performance gains. We hope that MMR-V can inspire further research into enhancing multimodal reasoning capabilities.
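
Since the benchmark reports multiple-choice accuracy, a minimal evaluation sketch is shown below. The file name `mmr_v_tasks.json` and the field names (`video_id`, `question`, `options`, `answer`) are assumptions for illustration only; the paper does not specify the released data format, and the model-query step is left as a user-supplied callable.

```python
# Minimal evaluation sketch for an MMR-V-style multiple-choice benchmark.
# NOTE: the file name and field names below are hypothetical; adapt them
# to the actual released data format.
import json


def load_tasks(path: str) -> list[dict]:
    """Load benchmark tasks from a JSON file."""
    with open(path, "r", encoding="utf-8") as f:
        return json.load(f)


def evaluate(tasks: list[dict], predict) -> float:
    """Compute multiple-choice accuracy.

    `predict` is any callable that maps a task dict to an option letter
    (e.g. "A"), typically by querying an MLLM on sampled video frames
    together with the question and options.
    """
    correct = sum(1 for task in tasks if predict(task) == task["answer"])
    return correct / len(tasks) if tasks else 0.0


if __name__ == "__main__":
    tasks = load_tasks("mmr_v_tasks.json")

    # Dummy predictor that always answers "A"; replace with a real MLLM call.
    accuracy = evaluate(tasks, lambda task: "A")
    print(f"Accuracy: {accuracy:.1%}")
```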
Supplementary Material: zip
Primary Area: datasets and benchmarks
Submission Number: 15894