Keywords: Multimodal Reasoning, Spatial Reasoning, Visual Reasoning, Mathematical Reasoning, Multimodal Large Language Models
Abstract: A key frontier for Multimodal Large Language Models (MLLMs) is the ability to move beyond semantic description and perform structured spatial analysis directly from images. Mathematical surface plots provide a rigorous testbed for this capability, as they isolate systematic visual reasoning from the semantic noise of natural images. To measure progress on this frontier, we introduce MaRVL-QA (Mathematical Reasoning over Visual Landscapes), a new benchmark designed to quantitatively evaluate these foundational skills. The benchmark comprises two novel tasks: Topological Counting, which requires models to identify and enumerate local extrema; and Transformation Recognition, which tests their ability to detect applied geometric transformations. The benchmark is generated from a curated library of functions with rigorous ambiguity filtering. Our evaluation on MaRVL-QA reveals that even state-of-the-art MLLMs struggle significantly, often resorting to superficial heuristics instead of robust strategies. We present MaRVL-QA as a challenging diagnostic tool to expose current limitations and to guide the development of MLLMs with stronger and more systematic visual-mathematical abilities.
Supplementary Material: zip
Primary Area: datasets and benchmarks
Submission Number: 20396