Keywords: Vision Language Model, Unified Model, Reinforcement Learning
Abstract: Reinforcement learning (RL) has primarily advanced the reasoning capabilities of Vision-Language Models (VLMs) in multi-modal scenarios, and some recent works have explored using RL to enhance their perception abilities. However, developing a unified RL framework that handles both task types simultaneously faces a critical bottleneck: interference among heterogeneous tasks. We observe that this interference typically manifests as training instability and ambiguous responses, ultimately limiting the effectiveness of unified multi-task training. To address this challenge, we propose DualRPO (Dual Rewards Policy Optimization), a novel RL paradigm that integrates internal self-certainty rewards with external verifiable rewards. DualRPO embeds self-certainty, defined as the average KL divergence between the model’s output distribution and a uniform distribution, in reward shaping: it amplifies external rewards for correct yet underconfident outputs and penalizes external rewards for incorrect but overconfident ones, guiding the model to generate accurate and confidence-calibrated responses. Extensive experiments validate the efficacy of DualRPO. We evaluate on 8 heterogeneous tasks (5 perception: chart analysis, detection, grounding, counting, OCR; 3 reasoning: math, puzzle, science). DualRPO delivers a large performance improvement across all tasks while reducing the amplitude of training instability. These results highlight that DualRPO enables unified scaling of multi-modal models to diverse perceptual and cognitive tasks.
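Illustrative sketch (not the authors' implementation): the abstract defines self-certainty as the average KL divergence between the model's output distribution and a uniform distribution, and describes a shaping rule that boosts correct but underconfident outputs and penalizes incorrect but overconfident ones. The Python below shows one way these two ideas could be computed. The KL direction (uniform to model), the squashing of certainty into [0, 1), the coefficient `alpha`, and the function names `self_certainty` and `shaped_reward` are all assumptions made for illustration; the abstract does not specify them.

```python
import math


def log_softmax(logits):
    """Numerically stable log-softmax over one token's vocabulary logits."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - log_z for x in logits]


def self_certainty(token_logits):
    """Average over tokens of KL(U || p), with U uniform over the vocabulary.

    Assumption: the KL direction is uniform -> model; the abstract only says
    "KL divergence between the output distribution and a uniform distribution".
    KL(U || p) = -log(V) - (1/V) * sum_j log p_j, which is 0 when p is uniform
    and grows as p becomes more peaked.
    """
    per_token = []
    for logits in token_logits:
        log_p = log_softmax(logits)
        v = len(log_p)
        per_token.append(-math.log(v) - sum(log_p) / v)
    return sum(per_token) / len(per_token)


def shaped_reward(external_reward, certainty, is_correct, alpha=0.5):
    """Hypothetical shaping rule, not taken from the paper.

    Correct answers get a bonus that shrinks as confidence rises;
    incorrect answers get a penalty that grows as confidence rises.
    """
    confidence = certainty / (1.0 + certainty)  # squash [0, inf) into [0, 1)
    if is_correct:
        return external_reward * (1.0 + alpha * (1.0 - confidence))
    return external_reward - alpha * confidence


# Usage: two toy "responses" of two tokens each over a 4-word vocabulary.
flat = [[0.0, 0.0, 0.0, 0.0], [0.0, 0.0, 0.0, 0.0]]
peaked = [[8.0, 0.0, 0.0, 0.0], [0.0, 8.0, 0.0, 0.0]]
print(self_certainty(flat), self_certainty(peaked))
```

Under these assumptions, a flat (maximally uncertain) distribution scores zero and a peaked one scores higher, so the same external reward is scaled up for a hesitant correct answer and pushed down for a confident wrong one.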
Primary Area: applications to computer vision, audio, language, and other modalities
Submission Number: 24008