Keywords: VLM, Reasoning, Reinforcement Learning
TL;DR: An extensive recipe to sharpen the visual reasoning capabilities of small models with offline policy optimization
Abstract: Multimodal reasoning with small vision–language models (VLMs) is increasingly important in real-world applications, yet their limited capacity makes optimization and alignment especially challenging. In this paper, we propose a holistic framework for offline policy optimization to sharpen the visual reasoning capabilities of small models. At its foundation is TinyEye-Data, a large-scale corpus of two million reasoning trajectories distilled from state-of-the-art VLMs across 68 verifiable tasks, which provides diverse and reliable binary supervision entirely in the offline setting. We instantiate the framework through a four-stage pipeline: (1) native-resolution warm-up for robust vision–language alignment, (2) instruction tuning on TinyEye-Data to establish a broad reasoning foundation, (3) annealed rejection sampling to mine hard cases and refine supervision, and (4) Discriminative Direct Preference Optimization (DDPO), a new margin-based objective that formulates policy learning as reward classification and resolves the likelihood displacement issues of DPO. Stages (3) and (4) together form the core of verifiable offline reinforcement learning, where rejection sampling refines signals and DDPO optimizes the policy against them. The resulting model, TinyEye-2B, achieves state-of-the-art results across diverse reasoning benchmarks, reaching 50.3% on MMMU, 55.2% on MathVerse, and 63.9% on HallBench, outperforming other models of comparable scale by significant margins.
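For context on how a "margin-based objective that formulates policy learning as reward classification" might look, here is a minimal sketch in the spirit of the abstract's description; this is an assumption on our part, not necessarily the paper's exact DDPO formulation. Symbols $\beta$ (KL-shaping coefficient), $\gamma$ (classification margin), $\pi_\theta$ (policy), $\pi_{\text{ref}}$ (reference model), and $r \in \{0,1\}$ (binary verifiable reward from TinyEye-Data) are assumed names:
$$
\mathcal{L}_{\text{DDPO}}(\theta) \;=\; -\,\mathbb{E}_{(x,\,y,\,r)}\!\left[\log \sigma\!\Big((2r-1)\Big(\beta \log \tfrac{\pi_\theta(y\mid x)}{\pi_{\text{ref}}(y\mid x)} \;-\; \gamma\Big)\Big)\right]
$$
Under this reading, each trajectory is classified on its own: responses with $r=1$ are pushed to have an implicit reward above the margin $\gamma$, and responses with $r=0$ below it, avoiding the pairwise comparisons that underlie DPO's likelihood displacement behavior.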
Primary Area: foundation or frontier models, including LLMs
Submission Number: 7663