Keywords: VLM, Multimodal, CoT
TL;DR: We present VGR, an MLLM with fine-grained visual perception that addresses the language bias of traditional multimodal reasoning by detecting relevant image regions for precise answers, supported by curated reasoning data and an inference pipeline with visual replay.
Abstract: In the field of multimodal chain-of-thought (CoT) reasoning, existing approaches predominantly reason in a purely linguistic space, which inherently suffers from language bias and is largely confined to math or science domains. This narrow focus limits their ability to handle complex visual reasoning tasks that demand comprehensive understanding of image details. To address these limitations, this paper introduces VGR, a novel reasoning multimodal large language model (MLLM) that can replay visual memory during thinking, much as humans do. Unlike traditional MLLMs, VGR first reasons about the question and detects relevant regions that may help solve the problem; the visual memory from these critical areas is then extracted to assist reasoning. To achieve this, we curate a large-scale SFT dataset, VGR-SFT, containing reasoning data that mixes visual grounding with language deduction. This teaches VGR to think and actively choose grounding areas for key information before answering, and we propose a dynamic visual memory replay stage that integrates the corresponding information into the reasoning process, enhancing multimodal comprehension. Experiments on the LLaVA-NeXT-7B baseline show that VGR achieves superior performance on multimodal benchmarks requiring comprehensive image detail understanding. Compared to the baseline, VGR uses only 30% of the image tokens while delivering improvements of +4.1 on MMStar, +7.1 on AI2D, and +12.9 on ChartQA.
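For intuition, below is a minimal sketch of what such a selective visual-replay loop could look like. It is an illustrative assumption, not the authors' implementation: the functions `detect_relevant_region`, `encode_region_tokens`, and `reason_with_replay` are hypothetical placeholders standing in for the grounding, region re-encoding, and replay steps described in the abstract.

```python
# Hypothetical sketch of a selective visual-replay reasoning loop.
# All names here are illustrative placeholders, not the VGR codebase API.
from dataclasses import dataclass
from typing import List, Tuple

@dataclass
class ReplayStep:
    thought: str                       # language reasoning emitted by the model
    region: Tuple[int, int, int, int]  # bounding box chosen for replay (x1, y1, x2, y2)

def detect_relevant_region(question: str, image_size: Tuple[int, int]) -> Tuple[int, int, int, int]:
    """Placeholder: the model would predict a grounding box relevant to the question."""
    w, h = image_size
    return (0, 0, w // 2, h // 2)  # dummy box for illustration

def encode_region_tokens(region: Tuple[int, int, int, int]) -> List[str]:
    """Placeholder: crop and re-encode the region into a small number of visual tokens."""
    return [f"<vis_{i}>" for i in range(4)]  # far fewer tokens than encoding the full image

def reason_with_replay(question: str, image_size: Tuple[int, int]) -> str:
    context: List[str] = [question]
    # Step 1: think about the question and ground a region likely to contain the answer.
    region = detect_relevant_region(question, image_size)
    step = ReplayStep(thought="The answer likely depends on this area.", region=region)
    # Step 2: replay visual memory by injecting the region's tokens into the context.
    context.extend(encode_region_tokens(step.region))
    context.append(step.thought)
    # Step 3: answer using both the language deduction and the replayed visual detail.
    return " ".join(context) + " -> final answer"

print(reason_with_replay("What value does the tallest bar show?", (1024, 768)))
```

The design intuition is that re-encoding only the grounded region, rather than the whole image, is what lets the model attend to fine details while keeping the visual token budget small.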
Primary Area: applications to computer vision, audio, language, and other modalities
Submission Number: 23089