GSM8K-V: Can Vision Language Models Solve Grade School Math Word Problems in Visual Contexts

15 Sept 2025 (modified: 12 Feb 2026) · ICLR 2026 Conference Desk Rejected Submission · CC BY 4.0
Keywords: visual mathematical reasoning, visual benchmark, grade school math word problem
TL;DR: GSM8K-V is a new multi-image visual mathematical reasoning benchmark that exposes major gaps in current vision-language models’ reasoning abilities compared to their strong performance on text-based tasks.
Abstract: Vision language models (VLMs) unify the modeling of images and text, enabling them to accomplish complex real-world tasks through perception, planning, and reasoning. Among these capabilities, reasoning is particularly representative, with mathematical reasoning serving as a prominent example: it highlights the high-level ability of VLMs to comprehend mathematical information embedded in images and to perform sophisticated reasoning over it. Recently, numerous visual mathematical reasoning benchmarks have been proposed to evaluate these capabilities. However, existing benchmarks suffer from several limitations: they are typically restricted to geometry problems, lack comprehensive coverage of math word problems, and rarely assess reasoning across multiple images. To fill this gap, we introduce GSM8K-V, a purely visual multi-image mathematical reasoning benchmark. GSM8K-V is constructed by systematically mapping each sample from the widely used text-based mathematical reasoning benchmark GSM8K into visual form. Through a carefully designed automated image-generation pipeline combined with meticulous human annotation, we curate a benchmark comprising 1,319 high-quality samples. We evaluate a wide range of open-source and closed-source models on GSM8K-V. Our results reveal that, although existing VLMs achieve nearly saturated performance on the text-based GSM8K, there remains substantial room for improvement on the purely visual GSM8K-V. For instance, the best-performing model, Gemini-2.5-Pro, attains 95.22% accuracy on GSM8K but only 46.93% on GSM8K-V. We conduct a comprehensive and detailed analysis of GSM8K-V, systematically examining the limitations of existing models on this benchmark as well as potential directions for improvement.
GSM8K-V provides a new perspective on visual mathematical reasoning and establishes a novel evaluation benchmark that can guide the research community toward developing more robust and generalizable VLMs.
Supplementary Material: zip
Primary Area: datasets and benchmarks
Submission Number: 6090