Keywords: VLM, Reinforcement Learning, Generalization
Abstract: Building truly versatile Vision-Language Models requires more than just scaling up training data and model size. While current models achieve impressive performance on their training tasks, they often fail when deployed on problems that require the same underlying skills but in different combinations—a limitation that hinders their adoption as general-purpose AI systems.
This paper introduces a reinforcement learning framework that addresses this limitation by teaching a VLM to solve problems through composing a learned sequence of reusable, verifiable subtasks.
Our key innovation is a reward function that guides the model to generate structured reasoning chains of these primitive subtasks through format-based verification, eliminating the need for detailed annotations of intermediate reasoning steps.
This format-based reward provides a dense learning signal, enabling the model to master a flexible, procedural approach to problem-solving.
Furthermore, the trained models transfer flexibly to spatial VQA tasks, achieving strong performance without any fine-tuning.
This cross-task transfer outperforms both supervised fine-tuned and reinforcement fine-tuned visual chain-of-thought baselines, while remaining computationally efficient with only a 3B-parameter model.
Our findings show that learning to compose a sequence of fundamental vision skills is a more effective and scalable strategy for building robust, general-purpose VLMs than learning monolithic, task-specific solutions.
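To make the format-based reward idea concrete, below is a minimal sketch (not taken from the paper; the subtask tags, tag syntax, and scoring weights are illustrative assumptions) of how a dense, annotation-free reward could score whether a completion follows a structured chain of verifiable subtasks:

```python
import re

# Hypothetical set of primitive, verifiable subtask names the policy may emit.
ALLOWED_SUBTASKS = {"locate", "crop", "compare", "count", "answer"}

# Assumed step format: <subtask name="locate">...</subtask>
STEP_PATTERN = re.compile(r'<subtask name="([a-z_]+)">(.*?)</subtask>', re.DOTALL)

def format_reward(completion: str) -> float:
    """Score how well the completion follows the required structure
    (a chain of allowed subtasks ending in an answer step), without
    checking the correctness of intermediate reasoning content."""
    steps = STEP_PATTERN.findall(completion)
    if not steps:
        return 0.0

    reward = 0.0
    # Partial credit for each step that names a recognized subtask and has content.
    for name, body in steps:
        if name in ALLOWED_SUBTASKS and body.strip():
            reward += 1.0 / len(steps)

    # Bonus if the chain terminates in a final answer step.
    if steps[-1][0] == "answer":
        reward += 0.5
    return reward

# Example: a well-formed two-step chain earns full per-step credit plus the bonus.
demo = ('<subtask name="locate">the red mug on the table</subtask>'
        '<subtask name="answer">left of the laptop</subtask>')
print(format_reward(demo))  # 1.5
```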
Primary Area: reinforcement learning
Submission Number: 18088