GenUSD: 3D scene generation made easy

Tsung-Yi Lin, Chen-Hsuan Lin, Yin Cui, Yunhao Ge, Seungjun Nah, Arun Mallya, Zekun Hao, Yifan Ding, Hanzi Mao, Zhaoshuo Li, Yen-Chen Lin, Xiaohui Zeng, Qinsheng Zhang, Donglai Xiang, Qianli Ma, J.P. Lewis, Jingyi Jin, Pooya Jannaty, Ming-Yu Liu

Published: 25 Jul 2024, Last Modified: 13 Nov 2025, Crossref, Everyone, CC BY-SA 4.0

Abstract: We introduce GenUSD, an end-to-end text-to-scene generation framework that transforms natural language queries into realistic 3D scenes, including 3D objects and layouts. The process involves two main steps: 1) A Large Language Model (LLM) generates a scene layout hierarchically. It first proposes a high-level plan that decomposes the scene into multiple functionally and spatially distinct subscenes. Then, for each subscene, the LLM proposes objects with detailed positions, poses, sizes, and descriptions. To manage complex object relationships and intricate scenes, we introduce object layout design meta functions as tools for the LLM. 2) A novel text-to-3D model generates each 3D object, with a surface mesh and high-resolution texture maps, based on the LLM’s descriptions. The assembled 3D assets form the final 3D scene, which is represented in the Universal Scene Description (USD) format. GenUSD ensures physical plausibility by incorporating functions to prevent collisions.
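As an illustration only, the hierarchical layout (scene, subscenes, objects with positions, sizes, and descriptions) and a collision check might be sketched as follows. This is a minimal sketch and not the GenUSD implementation: the names `SceneObject`, `Subscene`, `boxes_overlap`, and `find_collisions` are hypothetical, and the check assumes axis-aligned bounding boxes, so it ignores object pose (rotation). The abstract does not specify how the actual layout meta functions or collision-prevention functions work.

```python
from dataclasses import dataclass, field
from itertools import combinations


@dataclass
class SceneObject:
    """Hypothetical record for one object proposed by the layout step."""
    name: str
    description: str                      # text prompt for a text-to-3D model
    center: tuple[float, float, float]    # box center (x, y, z)
    size: tuple[float, float, float]      # full extents along x, y, z

    def bounds(self):
        """Return the (min corner, max corner) of the axis-aligned box."""
        lo = tuple(c - s / 2 for c, s in zip(self.center, self.size))
        hi = tuple(c + s / 2 for c, s in zip(self.center, self.size))
        return lo, hi


@dataclass
class Subscene:
    """Hypothetical container for a functionally distinct part of the scene."""
    name: str
    objects: list[SceneObject] = field(default_factory=list)


def boxes_overlap(a: SceneObject, b: SceneObject) -> bool:
    """True if the two axis-aligned boxes intersect with nonzero volume."""
    (a_lo, a_hi), (b_lo, b_hi) = a.bounds(), b.bounds()
    return all(a_lo[i] < b_hi[i] and b_lo[i] < a_hi[i] for i in range(3))


def find_collisions(subscenes: list[Subscene]) -> list[tuple[str, str]]:
    """List every pair of object names whose boxes overlap, across all subscenes."""
    objects = [obj for sub in subscenes for obj in sub.objects]
    return [(a.name, b.name) for a, b in combinations(objects, 2)
            if boxes_overlap(a, b)]


# Example layout with made-up values: the chair is pushed into the table.
dining = Subscene("dining area", [
    SceneObject("table", "a wooden dining table", (0.0, 0.0, 0.4), (1.6, 0.9, 0.8)),
    SceneObject("chair", "a wooden chair", (0.0, 0.3, 0.45), (0.5, 0.5, 0.9)),
])
lounge = Subscene("lounge", [
    SceneObject("sofa", "a grey sofa", (3.0, 0.0, 0.4), (2.0, 0.9, 0.8)),
])
print(find_collisions([dining, lounge]))
```

In a pipeline of this kind, a reported pair could be fed back to the layout step so that the offending object is moved before the 3D assets are generated and assembled.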

External IDs: doi:10.1145/3641520.3665306