A General Framework to Boost 3D GS Initialization for Text-to-3D Generation by Lexical Richness

Lutao Jiang; Hangyu Li; Lin Wang

Published: 20 Jul 2024, Last Modified: 06 Aug 2024. Venue: MM2024 Poster. License: CC BY 4.0

Abstract: Text-to-3D content creation has recently received much attention, especially with the prevalence of 3D Gaussian Splatting (3D GS). In general, GS-based methods comprise two key stages: initialization and rendering optimization. For initialization, existing works directly apply random sphere initialization or 3D diffusion models, e.g., Point-E, to derive the initial shapes. However, such strategies suffer from two critical yet challenging problems: 1) the final shapes remain similar to the initial ones even after training; 2) shapes can be produced only from simple texts, e.g., '*a dog*', and not from lexically richer (or harder) texts, e.g., '*a dog is sitting on the top of the airplane*'. To address these problems, this paper proposes a novel general framework to boost 3D GS initialization for text-to-3D generation with respect to lexical richness. Our key idea is to aggregate 3D Gaussians into spatially uniform voxels to represent complex shapes while enabling spatial interaction among the 3D Gaussians and semantic interaction between the Gaussians and texts. Specifically, we first construct a voxelized representation in which each voxel holds a 3D Gaussian whose position, scale, and rotation are fixed, leaving opacity as the sole factor that determines whether a position is occupied. We then design an initialization network consisting mainly of two novel components: 1) a Global Information Perception (GIP) block and 2) a Gaussians-Text Fusion (GTF) block. This design enables each 3D Gaussian to assimilate spatial information from other areas and semantic information from texts. Extensive experiments on lexically *simple*, *medium*, and *hard* texts show that our framework produces higher-quality 3D GS initialization than existing methods, e.g., Shap-E. Our framework can also be seamlessly plugged into state-of-the-art training frameworks, e.g., LucidDreamer, for semantically consistent text-to-3D generation.
Our code will be released upon acceptance.
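
The voxelized representation described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the names `VoxelGaussian`, `build_voxel_gaussians`, and `occupied`, the grid resolution, the choice of half the voxel size as the scale, and the 0.5 opacity threshold are all assumptions made for illustration. The initialization network (GIP and GTF blocks) that would predict the opacities is not modeled here.

```python
from dataclasses import dataclass
from itertools import product
from typing import List, Tuple


@dataclass
class VoxelGaussian:
    # Fixed attributes: they never change once the grid is built.
    position: Tuple[float, float, float]          # voxel center
    scale: Tuple[float, float, float]             # assumed: half the voxel size per axis
    rotation: Tuple[float, float, float, float]   # identity quaternion (w, x, y, z)
    # The only varying attribute: it decides whether the voxel is occupied.
    opacity: float = 0.0


def build_voxel_gaussians(resolution: int, extent: float = 1.0) -> List[VoxelGaussian]:
    """Place one Gaussian at the center of every voxel in the cube [-extent, extent]^3."""
    voxel_size = 2.0 * extent / resolution
    centers = [-extent + (i + 0.5) * voxel_size for i in range(resolution)]
    half = voxel_size / 2.0
    return [
        VoxelGaussian(
            position=(x, y, z),
            scale=(half, half, half),
            rotation=(1.0, 0.0, 0.0, 0.0),
        )
        for x, y, z in product(centers, repeat=3)
    ]


def occupied(gaussians: List[VoxelGaussian], threshold: float = 0.5) -> List[VoxelGaussian]:
    """Return the Gaussians whose opacity exceeds the (assumed) occupancy threshold."""
    return [g for g in gaussians if g.opacity > threshold]
```

For instance, `build_voxel_gaussians(resolution=4)` yields 4 × 4 × 4 = 64 Gaussians, all initially empty; setting the opacity of a single entry above the threshold makes `occupied` return just that one. Because position, scale, and rotation are fixed, a shape in this sketch is fully described by the vector of per-voxel opacities.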

Primary Subject Area: [Generation] Generative Multimedia

Secondary Subject Area: [Experience] Multimedia Applications, [Content] Vision and Language

Relevance To Conference: 3D asset creation has applications across multimedia, such as games and the Metaverse. Text-to-3D is a pivotal technique that lets casual users create semantically consistent 3D content from text inputs.

Supplementary Material: zip

Submission Number: 1710
