Abstract: Recent advances in 3D generation have attracted considerable interest due to their broad potential applications. Despite this progress, the field still faces persistent challenges in multi-conditional control, primarily due to the scarcity of paired datasets and the inherent complexity of 3D structures. To address these challenges, we introduce ImageBind3D, a novel framework for controllable 3D generation that integrates text, hand-drawn sketches, and depth maps to enhance user controllability. Our key contribution is an inversion-align strategy that enables controllable 3D generation without requiring paired datasets. First, using GET3D as the baseline, our method introduces a 3D inversion technique that aligns 2D images with 3D shapes in the latent space of the 3D GAN. We then leverage images as intermediaries to construct pseudo-pairs between the shapes and the various modalities. Finally, our multi-modal diffusion model aligns external control signals with the generative model's latent knowledge, enabling precise and controllable 3D generation. Extensive experiments validate that ImageBind3D surpasses existing state-of-the-art methods in both fidelity and controllability. Moreover, our approach can provide composable guidance to any feed-forward 3D generative model, significantly enhancing its controllability.
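To make the pipeline concrete, the following is a minimal, hypothetical PyTorch sketch of the inversion-align idea: a rendered 2D image is inverted into the generator's latent space, the resulting latent is treated as pseudo-paired with that image's sketch/depth/text, and a conditional diffusion model is trained to recover the latent from those conditions. All names (LatentInverter, ConditionEncoder, eps_net), dimensions, and the noise schedule are illustrative assumptions rather than the paper's actual implementation; in practice the image features would come from images rendered by a frozen GET3D generator.

```python
# Hypothetical sketch of the inversion-align training step; not the authors' code.
import math

import torch
import torch.nn as nn
import torch.nn.functional as F

LATENT_DIM, COND_DIM, IMG_FEAT_DIM = 512, 512, 2048  # illustrative sizes


class LatentInverter(nn.Module):
    """Maps 2D image features back into the 3D generator's latent space (the inversion step)."""
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(IMG_FEAT_DIM, 1024), nn.ReLU(), nn.Linear(1024, LATENT_DIM)
        )

    def forward(self, img_feat):
        return self.net(img_feat)


class ConditionEncoder(nn.Module):
    """Encodes a pseudo-paired control signal (sketch / depth / text embedding) into a shared space."""
    def __init__(self, in_dim):
        super().__init__()
        self.net = nn.Linear(in_dim, COND_DIM)

    def forward(self, cond):
        return self.net(cond)


def inversion_align_step(inverter, cond_encoder, eps_net, img_feat, cond, t):
    """One training step: invert a rendered image into a latent code (pseudo ground truth),
    then train a conditional diffusion model to denoise that latent given the paired condition."""
    with torch.no_grad():
        w = inverter(img_feat)                            # latent pseudo-paired with the image's modalities
    noise = torch.randn_like(w)
    alpha = torch.cos(t * math.pi / 2).view(-1, 1)        # illustrative cosine noise schedule
    w_noisy = alpha * w + (1.0 - alpha ** 2).sqrt() * noise
    eps_pred = eps_net(torch.cat([w_noisy, cond_encoder(cond)], dim=-1))
    return F.mse_loss(eps_pred, noise)                    # standard epsilon-prediction objective


if __name__ == "__main__":
    inverter, cond_enc = LatentInverter(), ConditionEncoder(in_dim=COND_DIM)
    eps_net = nn.Sequential(
        nn.Linear(LATENT_DIM + COND_DIM, 1024), nn.ReLU(), nn.Linear(1024, LATENT_DIM)
    )
    img_feat = torch.randn(4, IMG_FEAT_DIM)  # stand-in for features of images rendered from GET3D samples
    cond = torch.randn(4, COND_DIM)          # stand-in for the pseudo-paired sketch, depth, or text embedding
    t = torch.rand(4)                        # diffusion timesteps in [0, 1]
    loss = inversion_align_step(inverter, cond_enc, eps_net, img_feat, cond, t)
    loss.backward()
```

Under these assumptions, inference would run the trained diffusion model on user-provided sketches, depth maps, or text to produce latents that the frozen feed-forward generator decodes into 3D shapes, leaving the generator's architecture untouched.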
Primary Subject Area: [Generation] Generative Multimedia
Secondary Subject Area: [Content] Vision and Language
Relevance To Conference: We propose ImageBind3D, a simple yet effective approach that provides guidance in multiple forms to feed-forward 3D generation models without altering their original network architectures, generation capacity, or efficiency. With ImageBind3D, we can achieve controllable outcomes, as opposed to the random or weakly controlled results of existing 3D generative models (e.g., GET3D, DreamFusion), and can further generate 3D objects under composable guidance. Our approach explores the controllability of 3D generation by using different modalities as guiding conditions for multi-conditional generation control. This effectively extends the controllability of 3D generation in multimedia, enabling users to produce more accurate and higher-quality 3D objects, and holds significant importance for multimedia systems seeking control over their generated results.
Supplementary Material: zip
Submission Number: 600