3D-GENERALIST: Vision-Language-Action Models for Crafting 3D Worlds

Published: 05 Nov 2025 · Last Modified: 30 Jan 2026 · 3DV 2026 Poster · CC BY 4.0
Keywords: 3D scene generation, vision-language models
TL;DR: We introduce 3D-Generalist, a framework that formulates 3D world generation as a sequential decision-making problem and employs a Vision-Language Model as the policy.
Abstract: Creating 3D graphics content for immersive and interactive worlds remains labor-intensive, limiting our ability to produce large-scale synthetic data that can serve as training data for foundation models. Recent methods have been proposed to alleviate this, but they often focus on a single aspect of generation (e.g., layout) and do not improve in quality as computational resources are scaled up. In this work, we recast 3D environment generation as a sequential decision-making problem, employing Vision-Language Models (VLMs) as policies that output actions to jointly craft a 3D environment's layout, materials, lighting, and assets. Our proposed framework, 3D-Generalist, trains VLMs to generate more prompt-aligned 3D environments via self-improvement fine-tuning. We demonstrate the effectiveness of 3D-Generalist and the proposed training strategy in generating simulation-ready 3D environments. Furthermore, we demonstrate its quality and scalability for synthetic data generation by pretraining a vision foundation model on the generated data. After fine-tuning the pretrained model on downstream tasks, we show that it surpasses models pretrained on meticulously human-crafted synthetic data and approaches the performance of models pretrained on real datasets that are orders of magnitude larger.
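
For intuition, here is a minimal Python sketch of the loop the abstract describes: a VLM policy observes a rendered view of the scene, emits one edit action at a time over layout, materials, lighting, and assets, and a self-improvement pass keeps the most prompt-aligned rollouts for fine-tuning. Every name here (SceneState, VLMPolicy, render, apply_action, craft_environment, self_improve, score) is a hypothetical illustration, not the paper's actual interface.

```python
# Hypothetical sketch of the sequential decision-making formulation in the
# abstract. All classes and functions are stand-ins, not the authors' API.

from dataclasses import dataclass, field

@dataclass
class SceneState:
    # The evolving 3D environment: the four aspects the policy edits jointly.
    layout: dict = field(default_factory=dict)
    materials: dict = field(default_factory=dict)
    lighting: dict = field(default_factory=dict)
    assets: list = field(default_factory=list)

def render(state: SceneState) -> bytes:
    # Stand-in renderer: the real system would return an image of the scene
    # for the VLM to observe.
    return repr(state).encode()

class VLMPolicy:
    # Stand-in for the fine-tuned VLM. A real policy would condition on the
    # prompt plus the rendered observation and emit a structured edit action.
    def act(self, prompt: str, observation: bytes) -> dict:
        return {"op": "stop"}  # placeholder decision

def apply_action(state: SceneState, action: dict) -> SceneState:
    # Apply one layout / material / lighting / asset edit (placeholder logic).
    if action["op"] == "place_asset":
        state.assets.append(action["asset"])
    return state

def craft_environment(prompt: str, policy: VLMPolicy, max_steps: int = 20) -> SceneState:
    # Roll out the policy: observe the scene, emit an action, update, repeat.
    state = SceneState()
    for _ in range(max_steps):
        action = policy.act(prompt, render(state))
        if action["op"] == "stop":
            break
        state = apply_action(state, action)
    return state

def self_improve(policy: VLMPolicy, prompts: list, score, top_k: int = 8):
    # Sketch of self-improvement fine-tuning: roll out on many prompts, keep
    # the most prompt-aligned environments under some scoring function, and
    # fine-tune the policy on those traces.
    rollouts = [(p, craft_environment(p, policy)) for p in prompts]
    best = sorted(rollouts, key=lambda r: score(*r), reverse=True)[:top_k]
    return best  # the real system would fine-tune the VLM on these rollouts
```

The key design choice this sketch mirrors is that a single policy edits all four scene aspects in one action space, which is what lets quality improve with more rollout compute rather than being capped by a fixed per-aspect pipeline.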
Supplementary Material: pdf
Submission Number: 176