DualRPO: All-in-one Visual RL with Internal and External Rewards

Sidi Yang; Chaofan Tao; Tiezheng YU; Jierun Chen; Haoli Bai; Lifeng Shang; Ngai Wong

20 Sept 2025 (modified: 11 Feb 2026). Submitted to ICLR 2026. License: CC BY 4.0

Keywords: Vision Language Model, Unified Model, Reinforcement Learning

Abstract: Reinforcement learning (RL) has primarily advanced the reasoning capabilities of Vision-Language Models (VLMs) in multi-modal scenarios, and some recent works have explored using RL to enhance their perception abilities. However, developing a unified RL framework that handles both task types simultaneously faces a critical bottleneck: interference among heterogeneous tasks. We observe that this interference typically manifests as training instability and ambiguous responses, ultimately constraining the effectiveness of unified multi-task training. To address this challenge, we propose DualRPO (Dual Rewards Policy Optimization), a novel RL paradigm that integrates internal self-certainty rewards with external verifiable rewards. DualRPO embeds self-certainty, defined as the average KL divergence between the model's output distribution and a uniform distribution, in reward shaping: it amplifies external rewards for correct yet underconfident outputs and penalizes incorrect but overconfident ones, guiding the model to generate accurate and confidence-calibrated responses. We evaluate DualRPO on 8 heterogeneous tasks (5 perception: chart analysis, detection, grounding, counting, OCR; 3 reasoning: math, puzzle, science). DualRPO delivers a large performance improvement across all tasks while reducing the amplitude of training instability. These results indicate that DualRPO enables unified scaling of multi-modal models to diverse perceptual and cognitive tasks.
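
The self-certainty signal and reward shaping described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the abstract gives neither the KL direction nor the shaping formula, so the sketch assumes KL(p || U) normalized by log V so that certainty lies in [0, 1], a multiplicative bonus for correct answers, and an additive penalty for incorrect ones. The function names and the weight `lam` are hypothetical.

```python
import math


def self_certainty(token_dists):
    """Average KL(p || U) over token positions, scaled to [0, 1].

    Assumption: the KL direction and the log(V) normalization are
    illustrative choices, not taken from the paper.
    token_dists: list of per-token probability lists (each sums to 1).
    """
    total = 0.0
    for p in token_dists:
        v = len(p)
        # KL(p || U) = sum_j p_j * log(p_j * V); zero-probability terms contribute 0.
        kl = sum(q * math.log(q * v) for q in p if q > 0.0)
        total += kl / math.log(v)  # maximum of KL(p || U) is log(V)
    return total / len(token_dists)


def shaped_reward(external, certainty, lam=0.5):
    """Hypothetical shaping rule combining external reward and certainty.

    Correct answers (external > 0) get a bonus that grows as certainty drops;
    incorrect answers get a penalty that grows as certainty rises.
    """
    if external > 0:
        return external * (1.0 + lam * (1.0 - certainty))
    return external - lam * certainty
```

Under these assumptions, a uniform next-token distribution has certainty 0 and a one-hot distribution has certainty 1, so a correct answer produced with low certainty receives the largest boost, and a wrong answer produced with high certainty receives the largest penalty.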

Primary Area: applications to computer vision, audio, language, and other modalities

Submission Number: 24008
