Benchmarking Multimodal Mathematical Reasoning with Explicit Visual Dependency

ICLR 2026 Conference Submission16373 Authors

19 Sept 2025 (modified: 08 Oct 2025) · ICLR 2026 Conference Submission · CC BY 4.0
Keywords: Large Vision-Language Model, Mathematical Reasoning
TL;DR: We created VCBench to test AI on simple, multi-image visual math problems. Top AI models scored below 50%, revealing a substantial gap in their reasoning.
Abstract: Recent advancements in Large Vision-Language Models (LVLMs) have significantly enhanced their ability to integrate visual and linguistic information, achieving near-human proficiency in tasks like object recognition, captioning, and visual question answering. However, current benchmarks typically focus on knowledge-centric evaluations that assess domain-specific expertise, often neglecting the core ability to reason about fundamental mathematical elements and visual concepts. We identify a gap in evaluating elementary-level math problems that rely on explicit visual dependencies, requiring models to discern, integrate, and reason across multiple images while incorporating commonsense knowledge, all of which are crucial for advancing toward broader AGI capabilities. To address this gap, we introduce VCBench, a comprehensive benchmark for multimodal mathematical reasoning with explicit visual dependencies. VCBench includes 1,720 problems across six cognitive domains, featuring 6,697 images (averaging 3.9 per question) to ensure multi-image reasoning. We evaluate 26 state-of-the-art LVLMs on VCBench, revealing substantial performance disparities, with even the top models unable to exceed 50% accuracy. Our findings highlight the ongoing challenges in visual-mathematical integration and suggest avenues for future LVLM advancements.
Supplementary Material: zip
Primary Area: datasets and benchmarks
Submission Number: 16373