Keywords: Multimodal Language Model, Vision Language Model, Reasoning, Benchmarks
Abstract: The ability to process information from multiple modalities and to reason through it step-by-step remains a critical challenge in advancing artificial intelligence. However, existing reasoning benchmarks focus on text-only reasoning or employ multimodal questions that can be answered by directly retrieving information from a non-text modality. Thus, complex reasoning remains poorly understood in multimodal domains. Here, we present MARBLE, a challenging multimodal reasoning benchmark designed to scrutinize multimodal language models (MLLMs) in their ability to carefully reason step-by-step through complex multimodal problems and environments. MARBLE is composed of three highly challenging tasks, M-Portal, M-Cube, and M-Maze, that require crafting and understanding multistep plans under spatial, visual, and physical constraints, which must be executed correctly and in the proper order. We find that current MLLMs perform poorly on MARBLE: all 12 advanced models obtain around 0% accuracy on M-Cube and M-Maze, while only Grok-4 and GPT-5 slightly outperform the random baseline on M-Portal. These results indicate that complex reasoning is still a challenge for existing MLLMs. Moreover, we show that perception remains a critical bottleneck to multimodal reasoning. By shedding light on the limitations of MLLMs, we hope that MARBLE will spur the development of the next generation of models with the ability to reason and plan across many multimodal reasoning steps.
Primary Area: datasets and benchmarks
Submission Number: 24409