Keywords: vision, language, reasoning, RL
TL;DR: Open-source VLMs trained with a single-stage RL recipe and diverse data (600K samples, 6 task categories) can match or beat proprietary RL pipelines.
Abstract: The strongest vision-language models (VLMs) rely on proprietary reinforcement learning (RL) pipelines, while broad multi-task RL remains difficult because heterogeneous visual problems transfer weakly across tasks. We introduce Vero, a family of fully open VLMs trained with a carefully curated collection of 600K RL samples from 59 datasets spanning six core task categories. Vero achieves state-of-the-art performance across a wide range of visual reasoning tasks, improving over four base models by 3.6–5.5 points on average across 30 challenging benchmarks spanning the six core task categories. Starting from Qwen3-VL-8B-Instruct, Vero outperforms Qwen3-VL-8B-Thinking on 23 of 30 benchmarks without using any additional proprietary thinking data. Starting from MiMo-VL-SFT, Vero surpasses MiMo-VL-RL, which relies on a proprietary RL recipe. Systematic ablations reveal that different task categories elicit qualitatively distinct reasoning patterns that transfer poorly in isolation, suggesting that broad data coverage is the primary driver of strong RL scaling. All data, code, and models are released.
Email Sharing: We authorize the sharing of all author emails with Program Chairs.
Data Release: We authorize the release of our submission and author names to the public in the event of acceptance.
Submission Number: 15