Keywords: VLM, Multimodal, CoT
TL;DR: We present VGR, an MLLM with fine-grained visual perception that addresses the language bias of traditional multimodal reasoning by detecting relevant image regions for precise answers, supported by curated reasoning data and an inference pipeline with visual replay.
Abstract: In the field of multimodal chain-of-thought (CoT) reasoning, existing approaches predominantly reason in a purely linguistic space, which inherently suffers from language bias and is largely confined to math or science domains. This narrow focus limits their ability to handle complex visual reasoning tasks that demand comprehensive understanding of image details. To address these limitations, this paper introduces VGR, a novel reasoning multimodal large language model (MLLM) that can replay visual memory during thinking, much as humans do. Unlike traditional MLLMs, VGR first reasons about the question and detects relevant regions that may help solve the problem; the visual memory from these critical areas is then extracted to assist reasoning. To achieve this, we curate a large-scale SFT dataset, VGR-SFT, containing reasoning data that mixes visual grounding with language deduction. This teaches VGR to think and actively choose grounding areas for key information before answering, and we propose a dynamic visual memory replay stage that integrates the corresponding information into the reasoning process, enhancing multimodal comprehension. Experiments on the LLaVA-NeXT-7B baseline show that VGR achieves superior performance on multimodal benchmarks requiring comprehensive image detail understanding. Compared to the baseline, VGR uses only 30% of the image tokens while delivering improvements of +4.1 on MMStar, +7.1 on AI2D, and +12.9 on ChartQA.
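For intuition, below is a minimal sketch of what such a selective visual-replay loop could look like. It is an illustrative assumption, not the authors' implementation: the functions `detect_relevant_region`, `encode_region_tokens`, and `reason_with_replay` are hypothetical placeholders standing in for the grounding, region re-encoding, and replay steps described in the abstract.

```python
# Hypothetical sketch of a selective visual-replay reasoning loop.
# All names here are illustrative placeholders, not the VGR codebase API.
from dataclasses import dataclass
from typing import List, Tuple

@dataclass
class ReplayStep:
    thought: str                       # language reasoning emitted by the model
    region: Tuple[int, int, int, int]  # bounding box chosen for replay (x1, y1, x2, y2)

def detect_relevant_region(question: str, image_size: Tuple[int, int]) -> Tuple[int, int, int, int]:
    """Placeholder: the model would predict a grounding box relevant to the question."""
    w, h = image_size
    return (0, 0, w // 2, h // 2)  # dummy box for illustration

def encode_region_tokens(region: Tuple[int, int, int, int]) -> List[str]:
    """Placeholder: crop and re-encode the region into a small number of visual tokens."""
    return [f"<vis_{i}>" for i in range(4)]  # far fewer tokens than encoding the full image

def reason_with_replay(question: str, image_size: Tuple[int, int]) -> str:
    context: List[str] = [question]
    # Step 1: think about the question and ground a region likely to contain the answer.
    region = detect_relevant_region(question, image_size)
    step = ReplayStep(thought="The answer likely depends on this area.", region=region)
    # Step 2: replay visual memory by injecting the region's tokens into the context.
    context.extend(encode_region_tokens(step.region))
    context.append(step.thought)
    # Step 3: answer using both the language deduction and the replayed visual detail.
    return " ".join(context) + " -> final answer"

print(reason_with_replay("What value does the tallest bar show?", (1024, 768)))
```

The design intuition is that re-encoding only the grounded region, rather than the whole image, is what lets the model attend to fine details while keeping the visual token budget small.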
Primary Area: applications to computer vision, audio, language, and other modalities
Submission Number: 23089