Long Grounded Thoughts: Synthesizing Visual Problems and Reasoning Chains at Scale

Published: 30 Apr 2026, Last Modified: 24 Jun 2026, Venue: ICML 2026 regular, Readers: Everyone, License: CC BY-NC 4.0
TL;DR: A data generation framework that synthesizes vision-centric problems spanning diverse levels of complexity, and a dataset of over 1M high-quality problems supporting SFT, offline RL, and online RL.
Abstract: Despite rapid progress, multimodal reasoning still lacks a systematic approach to synthesize large-scale vision-centric datasets beyond visual math. We introduce a framework able to synthesize vision-centric problems spanning diverse levels of complexity, and the resulting dataset with over 1M high-quality problems including reasoning traces, preference data, and instruction prompts supporting SFT, offline RL, and online RL. Our vision-centric synthesis framework uses a two-stage process focusing on: (1) generating diverse verifiable questions from existing images at scale, and (2) creating complex compositional visual problems by merging simpler questions. Remarkably, finetuning Qwen2.5-VL-7B on our data outperforms existing open-data baselines across evaluated vision-centric benchmarks, and our best configurations match or surpass strong closed-data models such as MiMo-VL-7B-RL on V*Bench, CV-Bench and MMStar-V. Notably, despite being entirely vision-centric, our data transfers positively to text-only reasoning (MMLU-Pro, +3.7%) and audio reasoning (MMAU, +1.32%), demonstrating its effectiveness. Similarly, despite containing no embodied visual data, we observe notable gains (NiEH, +8.8%) when evaluating open-ended embodied QA. Lastly, we use our data to comprehensively analyze at scale (1M+) the entire VLM post-training pipeline, showing that (i) SFT on high-quality data with cognitive behaviours in reasoning traces is essential to scale online RL, (ii) offline RL can match online RL’s performance while disaggregating compute demands, and (iii) SFT on high-quality data also improves out-of-domain, cross-modality transfer.
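To make stage (2) of the abstract concrete, the following is a minimal sketch of how simpler verifiable questions about one image might be merged into a single compositional problem. It is an illustration only, not the authors' pipeline: the `SimpleQuestion` schema, the `merge_questions` and `is_correct` helpers, and the exact-match verification rule are all assumptions introduced here.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class SimpleQuestion:
    """A verifiable question about one image (hypothetical schema)."""
    image_id: str
    text: str
    answer: str


def merge_questions(parts: List[SimpleQuestion]) -> SimpleQuestion:
    """Merge simple questions on the same image into one multi-part problem.

    Assumption: the merged answer is the ordered list of sub-answers, so the
    composite problem remains checkable by string matching.
    """
    if len(parts) < 2:
        raise ValueError("need at least two questions to compose")
    if len({q.image_id for q in parts}) != 1:
        raise ValueError("all questions must refer to the same image")
    numbered = [f"({i}) {q.text}" for i, q in enumerate(parts, start=1)]
    text = "Answer every part in order. " + " ".join(numbered)
    answer = "; ".join(q.answer for q in parts)
    return SimpleQuestion(parts[0].image_id, text, answer)


def is_correct(question: SimpleQuestion, prediction: str) -> bool:
    """Exact-match verifier after lowercasing and collapsing whitespace."""
    def norm(s: str) -> str:
        return " ".join(s.lower().split())
    return norm(prediction) == norm(question.answer)
```

In this sketch the composite problem is harder only in the sense that every part must be answered correctly at once; a real synthesis framework would likely also rewrite the merged question so that the parts depend on each other.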
Lay Summary: This work introduces a way to automatically create over one million image-based questions that help multimodal AI systems reason better about what they see. By focusing on specific objects and combining simple questions into harder ones, the method generates data that teaches models to reason about what they see, check their answers, and recover from mistakes. This improves performance on visual tasks and also shows benefits for text, audio, and embodied reasoning.
Link To Code: https://huggingface.co/datasets/nvidia/nemotron-research-lgt
Primary Area: Deep Learning->Foundation Models
Keywords: Synthetic Data, Visual Reasoning, VLM, SFT, RL, Reasoning Models
Submission Number: 8421