Keywords: RL for VLMs; Unified Training
TL;DR: We propose V-Triune, a unified reinforcement learning system that enables a single VLM to jointly learn and show significant improvements on both visual reasoning and perception tasks.
Abstract: Reinforcement learning (RL) has significantly advanced the reasoning capabilities of vision-language models (VLMs). However, its application beyond reasoning remains largely unexplored, especially for perception-intensive tasks like object detection and grounding. We propose V-Triune, a Visual Triple Unified Reinforcement Learning system that enables VLMs to jointly learn visual reasoning and perception tasks within a single training pipeline. V-Triune comprises three complementary components: Sample-Level Data Formatting to unify diverse inputs, Verifier-Level Reward Computation to deliver modular rewards via specialized verifiers, and Source-Level Metric Monitoring to enable fine-grained diagnostics. A key innovation within the verifier component is the proposed Dynamic IoU reward, which provides adaptive and progressive feedback for several perception tasks. Leveraging V-Triune, we develop Orsta (7B, 32B), a family of models built upon open-source backbones. Jointly training Orsta on a diverse dataset of eight representative reasoning (math, puzzle, etc.) and perception (detection, grounding, etc.) tasks leads to consistent improvements across both domains. As a result, Orsta achieves substantial gains on MEGA-Bench Core, with improvements ranging from +2.1 to +14.1 over its baselines, and these benefits extend to a wide range of downstream tasks. These results establish V-Triune as an effective and scalable system for building more comprehensive VLMs. Code is provided in the supplementary materials.
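The abstract's "adaptive and progressive" Dynamic IoU reward can be illustrated with a minimal sketch. This is an assumption-laden illustration, not the authors' implementation: the function names (`box_iou`, `dynamic_iou_reward`) and the linear threshold schedule from `start_thr` to `end_thr` are hypothetical choices; the paper's actual schedule and reward shaping may differ.

```python
def box_iou(a, b):
    """IoU of two boxes in (x1, y1, x2, y2) format."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0

def dynamic_iou_reward(pred, gt, step, total_steps,
                       start_thr=0.5, end_thr=0.95):
    """Binary reward with a threshold that tightens over training.

    Assumed schedule: the IoU bar rises linearly from a loose
    start_thr to a strict end_thr, giving easy positive feedback
    early and demanding precise localization late in training.
    """
    thr = start_thr + (end_thr - start_thr) * (step / total_steps)
    return 1.0 if box_iou(pred, gt) >= thr else 0.0
```

Under this (assumed) schedule, a prediction with IoU 0.6 earns reward early in training but not near the end, which is one plausible way to deliver the progressive perception feedback the abstract describes.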
Supplementary Material: zip
Primary Area: applications to computer vision, audio, language, and other modalities
Submission Number: 11032