Keywords: 3D scene generation, VLM agent
Abstract: Vision-language models (VLMs) have made rapid progress in multimodal generation, but extending them to structured 3D scene construction requires addressing three key challenges: (1) unifying diverse inputs into a semantic representation that captures global layout, environment setup, object-level appearance, and constraints for 3D object reconstruction, (2) encoding object-object and object-environment interactions, and (3) enabling accurate control over object placement, appearance, and reconstruction. We introduce an agentic framework for 3D world generation that tackles these challenges through a spatially contextualized design. A scene portrait integrates multimodal inputs into a semantic blueprint with sub-descriptions and reconstruction constraints, consisting of text, image appearance, and partial 3D point clouds. A scene hypergraph encodes rich spatial relations, capturing both object-object interactions and object-environment interactions that guide object placement. Finally, geometric reconstruction with ergonomic adjustment delivers accurate 3D reconstruction while refining object shape, scale, and placement through optimization. These components are supported by an auto-verification agent that ensures the generated reconstruction satisfies the input constraints. Together, they form a structured spatial context that the VLM iteratively reads and updates, enabling coherent, editable, and semantically aligned 3D environments. Experiments demonstrate strong generalization across diverse inputs and show that spatial context injection empowers VLMs with downstream capabilities such as interactive scene editing and path planning, advancing spatially intelligent systems in graphics and 3D vision.
Supplementary Material: zip
Primary Area: applications to computer vision, audio, language, and other modalities
Submission Number: 1427