VTPerception-R1: Enhancing Multimodal Reasoning via Explicit Visual and Textual Perceptual Grounding
Keywords: Reasoning model, multimodal large model
TL;DR: We enhance multimodal reasoning by explicitly grounding visual and textual perceptions before reasoning, resulting in consistently improved performance across diverse benchmarks.
Abstract: Multimodal large language models (MLLMs) often struggle to ground reasoning in perceptual evidence. We present a systematic study of perception strategies (explicit, implicit, visual, and textual) across four multimodal benchmarks and two MLLMs. Our findings show that explicit perception, especially when paired with textual cues, consistently yields the best improvements, particularly for smaller models. Based on this insight, we propose VTPerception-R1, a unified two-stage framework that decouples perception from reasoning. Stage I introduces perception-augmented fine-tuning, and Stage II applies perception-aware reinforcement learning with novel visual, textual, and consistency rewards. Experiments demonstrate that VTPerception-R1 significantly improves reasoning accuracy and robustness across diverse tasks, offering a scalable and auditable solution for perception-grounded multimodal reasoning.
Supplementary Material: zip
Primary Area: foundation or frontier models, including LLMs
Submission Number: 6023