Keywords: Synthetic Data, Visual Reasoning, VLM, LLM, RL, Reasoning Models
TL;DR: A data generation framework for visual reasoning spanning diverse skills and levels of complexity with over 1M high-quality synthetic vision-centric questions.
Abstract: Recent progress in multimodal reasoning has been driven largely by undisclosed datasets and proprietary data synthesis recipes, leaving open questions about how to systematically build large-scale, vision-centric reasoning datasets, particularly for tasks that go beyond visual math. In this work, we introduce a new reasoning data generation framework spanning diverse skills and levels of complexity, yielding over 1M high-quality synthetic vision-centric questions. The dataset also includes preference data and instruction prompts supporting both offline and online RL. Our synthesis framework proceeds in two stages: (1) scale, where imagery and metadata (captions, bounding boxes) are used to generate diverse, verifiable visual questions; and (2) complexity, where a composition hardening algorithm merges simpler questions from the previous stage into harder, still-verifiable visual problems. Reasoning traces are synthesized through a two-stage process that leverages VLMs and reasoning LLMs, producing CoT traces for VLMs that capture the richness and diverse cognitive behaviors found in frontier reasoning models. We show that finetuning Qwen2.5-VL-7B on our data yields significant gains over both the base models and strong baselines. Remarkably, our 7B model outperforms all open-data baselines across all evaluated vision-centric benchmarks, and even surpasses strong closed-data models such as MiMo-VL-7B-RL on V*Bench, CV-Bench, and MMStar-V. Perhaps most surprisingly, despite being entirely vision-centric, our data transfers positively to text-only reasoning (MMLU-Pro, +2.98%) and audio understanding and reasoning (MMAU, +1.32%), demonstrating its effectiveness. Finally, our comprehensive empirical analysis highlights that SFT on high-quality data is essential for effective online RL, that staged offline RL matches online RL's performance while reducing compute demands, and, notably, that careful SFT can substantially improve out-of-domain, cross-modality transfer.
Supplementary Material: zip
Primary Area: foundation or frontier models, including LLMs
Submission Number: 5481