Keywords: Multimodal reasoning, reinforcement learning
Abstract: Reinforcement Learning with Verifiable Rewards (RLVR) has significantly improved the reasoning capabilities of large language models (LLMs). Recent research has also extended it to multimodal large language models (MLLMs) to enhance multimodal reasoning. However, through systematic error analysis, we find that while RLVR effectively reduces reasoning errors in MLLMs, it fails to address perceptual errors, which often lead to incorrect inference results. This suggests that limited visual perception is a major bottleneck in multimodal reasoning. To address this issue, we propose a novel visual perception-enhanced reward model that explicitly encourages accurate visual understanding as a prerequisite for reasoning. Specifically, our approach first incentivizes the model to produce an accurate description of the visual input before reasoning, and then assigns a perception-based reward to reinforce correct visual understanding. Extensive experiments on multiple multimodal reasoning benchmarks demonstrate that our approach effectively alleviates the perceptual bottleneck and promotes more reliable multimodal reasoning.
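A minimal sketch of the reward design the abstract describes: a verifiable answer reward combined with a perception reward that scores the model's stated visual description against a reference. The function names, the token-level F1 as the perception metric, and the mixing weight `alpha` are all illustrative assumptions, not the paper's actual formulation.

```python
def perception_reward(pred_caption: str, ref_caption: str) -> float:
    """Toy perception score: token-level F1 between the model's stated
    visual description and a reference description (hypothetical metric)."""
    pred = set(pred_caption.lower().split())
    ref = set(ref_caption.lower().split())
    overlap = len(pred & ref)
    if overlap == 0:
        return 0.0
    precision, recall = overlap / len(pred), overlap / len(ref)
    return 2 * precision * recall / (precision + recall)

def total_reward(pred_caption: str, ref_caption: str,
                 answer: str, gold_answer: str, alpha: float = 0.5) -> float:
    """Combine a verifiable answer-correctness reward with the
    perception reward, so correct visual understanding is reinforced
    alongside a correct final answer (alpha is an assumed weight)."""
    r_answer = 1.0 if answer.strip() == gold_answer.strip() else 0.0
    r_percep = perception_reward(pred_caption, ref_caption)
    return (1 - alpha) * r_answer + alpha * r_percep
```

Under this sketch, two rollouts that reach the same correct answer receive different rewards if only one of them grounded its reasoning in an accurate description of the image.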
Primary Area: applications to computer vision, audio, language, and other modalities
Submission Number: 23983