Keywords: Multimodal reasoning, reinforcement learning
Abstract: Reinforcement Learning with Verifiable Rewards (RLVR) has significantly improved the reasoning capabilities of large language models (LLMs). Recent research has also extended it to multimodal large language models (MLLMs) to enhance multimodal reasoning. However, through systematic error analysis, we find that while RLVR effectively reduces reasoning errors in MLLMs, it fails to address perceptual errors, which often lead to incorrect inference results. This suggests that limited visual perception is a major bottleneck in multimodal reasoning. To address this issue, we propose a novel visual perception-enhanced reward model that explicitly encourages accurate visual understanding as a prerequisite for reasoning. Specifically, our approach first incentivizes the model to produce an accurate description of the visual input before reasoning, and then assigns a perception-based reward to reinforce correct visual understanding. Extensive experiments on multiple multimodal reasoning benchmarks demonstrate that our approach effectively alleviates the perceptual bottleneck and promotes more reliable multimodal reasoning.
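A minimal sketch of the reward design the abstract describes: a verifiable answer reward combined with a perception reward that scores the model's stated visual description against a reference. The function names, the token-level F1 as the perception metric, and the mixing weight `alpha` are all illustrative assumptions, not the paper's actual formulation.

```python
def perception_reward(pred_caption: str, ref_caption: str) -> float:
    """Toy perception score: token-level F1 between the model's stated
    visual description and a reference description (hypothetical metric)."""
    pred = set(pred_caption.lower().split())
    ref = set(ref_caption.lower().split())
    overlap = len(pred & ref)
    if overlap == 0:
        return 0.0
    precision, recall = overlap / len(pred), overlap / len(ref)
    return 2 * precision * recall / (precision + recall)

def total_reward(pred_caption: str, ref_caption: str,
                 answer: str, gold_answer: str, alpha: float = 0.5) -> float:
    """Combine a verifiable answer-correctness reward with the
    perception reward, so correct visual understanding is reinforced
    alongside a correct final answer (alpha is an assumed weight)."""
    r_answer = 1.0 if answer.strip() == gold_answer.strip() else 0.0
    r_percep = perception_reward(pred_caption, ref_caption)
    return (1 - alpha) * r_answer + alpha * r_percep
```

Under this sketch, two rollouts that reach the same correct answer receive different rewards if only one of them grounded its reasoning in an accurate description of the image.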
Primary Area: applications to computer vision, audio, language, and other modalities
Submission Number: 23983