Abstract: Reconstructing 3D objects from a single image guided by pretrained diffusion models has demonstrated promising outcomes. However, because existing methods rely on a case-agnostic, rigid guidance strategy, their generalization to arbitrary inputs and the 3D consistency of their reconstructions remain poor. In this work, we propose Consistent123, a case-aware two-stage method for highly consistent 3D asset reconstruction from a single image using both 2D and 3D diffusion priors. In the first stage, Consistent123 utilizes only 3D structural priors for sufficient geometry exploitation, with a CLIP-based case-aware adaptive detection mechanism embedded within this process. In the second stage, 2D texture priors are introduced and progressively take on a dominant guiding role, delicately sculpting the details of the 3D model. Consistent123 thus aligns more closely with the evolving guidance requirements over the course of optimization, adaptively providing adequate 3D geometric initialization and suitable 2D texture refinement for different objects. As a result, it obtains highly 3D-consistent reconstructions and exhibits strong generalization across a wide variety of objects. Qualitative and quantitative experiments show that our method significantly outperforms state-of-the-art image-to-3D methods.
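The abstract's two-stage design (3D-prior-only geometry stage, then a progressively dominant 2D texture prior) can be illustrated with a minimal weight-schedule sketch. This is an assumption for illustration only: the function name, the linear ramp, and the `stage1_frac` parameter are hypothetical and are not taken from the paper.

```python
def guidance_weights(step, total_steps, stage1_frac=0.4):
    """Hypothetical schedule illustrating the two-stage guidance idea:
    stage 1 uses only the 3D structural prior; in stage 2 the 2D
    texture prior ramps up linearly and gradually dominates."""
    boundary = int(total_steps * stage1_frac)
    if step < boundary:
        return 1.0, 0.0  # (w_3d, w_2d): geometry-only stage
    # Fraction of stage 2 completed, in [0, 1]
    t = (step - boundary) / max(total_steps - boundary, 1)
    w_2d = t          # 2D texture weight grows from 0 toward 1
    w_3d = 1.0 - t    # 3D structural weight decays correspondingly
    return w_3d, w_2d
```

In practice the paper's case-aware mechanism would choose the stage boundary adaptively per object (via its CLIP-based detection) rather than from a fixed fraction as sketched here.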
Primary Subject Area: [Generation] Generative Multimedia
Secondary Subject Area: [Generation] Multimedia Foundation Models
Relevance To Conference: This work contributes to multimedia and multimodal processing by advancing the field of 3D reconstruction from single images. By automating the process of generating highly consistent 3D models from a single image, this research reduces the manual effort required by experienced 3D artists. The ability to efficiently create detailed 3D models from images opens up opportunities for applications in various multimedia domains, including gaming, virtual reality (VR), and augmented reality (AR).
By leveraging advanced computational techniques from deep learning and computer vision, this study offers a significant advancement in multimodal processing. It enables the integration of visual information from a single image into a rich 3D representation, facilitating immersive experiences in virtual three-dimensional environments. Ultimately, this contribution enhances the accessibility and efficiency of 3D content creation, thereby enriching multimedia experiences across different platforms and applications.
Supplementary Material: zip
Submission Number: 2132