Recent text-to-3D generation methods achieve impressive 3D content creation capacity thanks to advances in image diffusion models and optimization strategies. However, current methods struggle to generate correct 3D content for semantically complex prompts, i.e., prompts describing multiple interacting objects bound to different attributes. In this work, we propose a general framework named Progressive3D, which decomposes the entire generation into a series of locally progressive editing steps to create precise 3D content for complex prompts. In each editing step, we constrain content changes to occur only in regions determined by user-defined region prompts. Furthermore, we propose an overlapped semantic component suppression technique that encourages the optimization process to focus on the semantic differences between prompts. Extensive experiments demonstrate that Progressive3D generates precise 3D content for prompts with complex semantics and generalizes to various text-to-3D methods driven by different 3D representations.
Overview of a local editing step of our proposed Progressive3D. Given a source representation supervised by a source prompt, our framework aims to generate a target representation that conforms to the input target prompt within the 3D space defined by the region prompt. Conditioned on the 2D mask, we constrain the 3D content with region-related constraints. We further propose an Overlapped Semantic Component Suppression (OSCS) technique that steers the optimization to focus on the semantic difference between the source and target prompts for precise progressive creation.
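The suppression idea in the overview can be illustrated with a small NumPy sketch. This page does not spell out the computation, so the sketch rests on assumptions: the guidance signals for the source and target prompts are available as arrays, the "overlapped" component is the vector projection of the target signal onto the source signal, and the remaining orthogonal component is the "semantic difference". The function name `suppress_overlap` and the parameter `weight` are hypothetical, not taken from the Progressive3D code.

```python
import numpy as np

def suppress_overlap(delta_src, delta_tgt, weight=4.0, eps=1e-8):
    """Down-weight the part of delta_tgt that is parallel to delta_src.

    delta_src, delta_tgt: arrays of the same shape (assumed guidance signals
    for the source and target prompts). weight > 1 suppresses the overlap.
    """
    src = delta_src.ravel()
    tgt = delta_tgt.ravel()
    # Projection coefficient of the target signal onto the source signal.
    coeff = np.dot(src, tgt) / (np.dot(src, src) + eps)
    overlap = coeff * delta_src       # component shared with the source prompt
    difference = delta_tgt - overlap  # component orthogonal to the source prompt
    # Keep the difference at full strength and shrink the shared component.
    return overlap / weight + difference

# Toy usage with 2D vectors standing in for guidance signals.
src = np.array([1.0, 0.0])
tgt = np.array([2.0, 3.0])
guided = suppress_overlap(src, tgt, weight=4.0)
```

With `weight=1.0` the function returns the target signal unchanged; larger values shift the update toward whatever the target prompt adds beyond the source prompt.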
Current text-to-3D methods struggle with prompts describing multiple objects bound to different attributes. Compared with existing methods, Progressive3D produces 3D content consistent with the given prompts.
Generate with current methods | Generate with Progressive3D
Progressive3D supports different editable region definitions, since the depth and opacity of each region can be obtained from rendering.
Source content | 3D bounding box | 2D mask | Custom mesh | 2D mask
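As a rough illustration of how a rendered region could drive a region-related constraint, the sketch below turns a rendered opacity map of the editable region into a 2D mask and penalizes colour changes outside it. It is only a sketch under stated assumptions: the function names, the 0.5 threshold, and the squared-error form are hypothetical choices, not the loss definitions used by Progressive3D, and occlusion handling with the rendered depth is left out.

```python
import numpy as np

def region_mask(region_opacity, threshold=0.5):
    """Binary 2D mask: 1 where the rendered region covers the pixel."""
    return (region_opacity > threshold).astype(np.float32)

def consistency_loss(source_rgb, target_rgb, mask):
    """Mean squared colour change over pixels outside the editable region.

    source_rgb, target_rgb: (H, W, 3) renderings from the same camera.
    mask: (H, W) array, 1 inside the editable region and 0 outside.
    """
    keep = 1.0 - mask  # 1 where content should stay unchanged
    sq_err = ((target_rgb - source_rgb) ** 2).sum(axis=-1)
    # Guard against division by zero when the region covers the whole image.
    return float((keep * sq_err).sum() / max(float(keep.sum()), 1.0))

# Toy 2x2 example: the region covers the two diagonal pixels.
opacity = np.array([[0.9, 0.1],
                    [0.0, 0.8]])
mask = region_mask(opacity)
source = np.zeros((2, 2, 3))
edited_inside = source.copy()
edited_inside[0, 0] = 1.0        # change only a pixel inside the region
edited_everywhere = np.ones((2, 2, 3))

loss_inside = consistency_loss(source, edited_inside, mask)
loss_everywhere = consistency_loss(source, edited_everywhere, mask)
```

Because the mask only needs a rendered opacity map, the same code applies whether the region comes from a 3D bounding box or a custom mesh.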
Source content | w/o Loss_consist | w/o Loss_initial | w/o OSCS | Ours
More comparison results demonstrate that Progressive3D significantly improves the ability of current text-to-3D methods to create content for complex prompts. For each pair of samples, the left one is generated by the original method, and the right one is created with Progressive3D.
An origami box and a ceramic tea pot on a golden table. | A yellow pineapple in a hexagonal cup on a round cabinet.
A toy robot wearing a golden shirt and a wooden crown. | A model of a round building with square roof on a hexagonal park.
A standing black Shiba Inu wearing a golden sweater and silver boots. | A head of terracotta army wearing a red sunglass and gray hat.