MMReD: A Cross-Modal Benchmark for Dense Context Reasoning

ICLR 2026 Conference Submission 19027 Authors

19 Sept 2025 (modified: 08 Oct 2025) · ICLR 2026 Conference Submission · CC BY 4.0
Keywords: long context, reasoning, LLM, LVLM, MLLM, benchmark
TL;DR: We present MMReD, a benchmark revealing that state-of-the-art LLMs and LVLMs struggle with dense multi-modal reasoning over long contexts, highlighting critical architectural limitations and the need for fundamentally new approaches.
Abstract: Despite recent advancements in extending the context windows of large language models (LLMs) and large vision-language models (LVLMs), their ability to perform complex multi-modal reasoning over extended contexts remains critically limited. To highlight this challenge, we present MMReD, a benchmark specifically designed to assess reasoning abilities in dense, information-rich scenarios where simple retrieval is insufficient. Unlike traditional Needle-in-a-Haystack evaluations, MMReD challenges models to identify and interpret global patterns across the entire context. Our benchmark comprises 24 tasks of varying complexity, ranging from standard passkey-retrieval setups to tasks requiring selective or uniform attention to all context chunks. Our evaluation reveals a consistent performance drop across all tested models -- including the most advanced LLMs, LVLMs, and architectures specialized for code and reasoning -- as the number of observations increases. Notably, even the leading reasoning-specialized models achieve 0% accuracy on certain tasks at the maximum context length of 128 observations. Conventional fine-tuning techniques, such as SFT and GRPO, also fail to generalize to longer contexts. These observations reveal an inherent limitation of current model architectures and underscore the need for new approaches that enable competent dense-context reasoning in multi-modal AI systems.
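
The distinction between retrieval-style and dense tasks can be illustrated with a toy generator. The sketch below is purely hypothetical (the function names `make_passkey_item` and `make_dense_item`, the "sensor reading" framing, and the item format are illustrative assumptions, not the MMReD task specification): the first item's answer hinges on a single observation hidden in the context, while the second can only be answered by aggregating information from every observation.

```python
# Hypothetical sketch (not the authors' generator) contrasting a
# needle-in-a-haystack item with a dense item whose answer depends on
# every observation in the context. Names and formats are illustrative.
import random


def make_passkey_item(n_observations: int, seed: int = 0):
    """Answer hinges on a single randomly placed observation."""
    rng = random.Random(seed)
    passkey = rng.randint(1000, 9999)
    position = rng.randrange(n_observations)
    observations = [
        f"Observation {i}: the passkey is {passkey}." if i == position
        else f"Observation {i}: nothing of note."
        for i in range(n_observations)
    ]
    question = "What is the passkey?"
    return "\n".join(observations), question, str(passkey)


def make_dense_item(n_observations: int, seed: int = 0):
    """Answer requires aggregating a value from every observation."""
    rng = random.Random(seed)
    values = [rng.randint(0, 9) for _ in range(n_observations)]
    observations = [
        f"Observation {i}: sensor reading {v}." for i, v in enumerate(values)
    ]
    question = "How many observations report a reading greater than 5?"
    answer = sum(1 for v in values if v > 5)
    return "\n".join(observations), question, str(answer)


if __name__ == "__main__":
    # At the benchmark's maximum of 128 observations, the dense item
    # cannot be solved by locating any single chunk of the context.
    for maker in (make_passkey_item, make_dense_item):
        context, question, answer = maker(n_observations=128)
        print(maker.__name__, "| expected answer:", answer)
```

Under this framing, the dense variant is the regime where the abstract reports the sharpest degradation: correctness requires attending (selectively or uniformly) to all context chunks rather than retrieving one.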
Primary Area: datasets and benchmarks
Submission Number: 19027