PlacidDreamer: Advancing Harmony in Text-to-3D Generation

Published: 20 Jul 2024, Last Modified: 21 Jul 2024MM2024 PosterEveryoneRevisionsBibTeXCC BY 4.0
Abstract: Recently, text-to-3D generation has attracted significant attention, resulting in notable performance enhancements. Previous methods utilize end-to-end 3D generation models for initializing 3D Gaussians, and multi-view diffusion models to enforce multi-view consistency. Moreover, they employ text-to-image diffusion models to refine details with score distillation algorithms. However, these methods exhibit two limitations. Firstly, they encounter conflicts in generation directions since different models aim to produce diverse 3D assets. Secondly, the issue of over-saturation in score distillation has not been thoroughly investigated and solved. To address these limitations, we propose PlacidDreamer, a text-to-3D framework that harmonizes initialization, multi-view generation, and text-conditioned generation with a single multi-view diffusion model, while simultaneously employing a novel score distillation algorithm to achieve balanced saturation. To unify the generation direction, we introduce the Latent-Plane module, a training-friendly plug-in extension that enables multi-view diffusion models to provide fast geometry reconstruction for initialization and enhanced multi-view images to personalize the text-to-image diffusion model. To address the over-saturation problem, we propose to view score distillation as a multi-objective optimization problem and introduce the Balanced Score Distillation algorithm, which offers a Pareto Optimal solution that achieves both rich details and balanced saturation. Extensive experiments validate the outstanding capabilities of our PlacidDreamer. The code will be available on GitHub. Code will be available on Github.
Primary Subject Area: [Generation] Generative Multimedia
Secondary Subject Area: [Content] Vision and Language
Relevance To Conference: This work offers insights into generating 3D assets from text, proposing a cross-modal generative model. Our primary contribution lies in effectively bridging the gap between language, 2D vision, and 3D vision. In this way, we believe our work can make a contribution to multimodal processing.
Supplementary Material: zip
Submission Number: 3200
Loading