Existing image-to-3D creation methods typically split the task into multi-view image generation and 3D reconstruction, leading to two main limitations: (1) multi-view bias, where geometric inconsistencies arise because multi-view diffusion models enforce consistency at the image level rather than in 3D; (2) misaligned reconstruction data, since reconstruction models trained mostly on synthetic data are poorly aligned with the generated multi-view images they receive at inference time. To address these issues, we propose Ouroboros3D, a unified framework that integrates multi-view generation and 3D reconstruction into a recursive diffusion process. Through a 3D-aware feedback mechanism, our multi-view diffusion model takes explicit 3D information rendered from the reconstruction result of the previous denoising step as a condition, thereby modeling consistency at the 3D geometric level. Furthermore, by jointly training the multi-view diffusion and reconstruction models, we alleviate the reconstruction bias caused by data misalignment and enable mutual enhancement within the multi-step recursive process. Experimental results demonstrate that Ouroboros3D outperforms both methods that treat the two stages separately and those that combine them only at inference, achieving superior multi-view consistency and producing 3D models with higher geometric realism.
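To make the joint-training idea concrete, the following is a minimal PyTorch-style sketch of one joint optimization step. All names and interfaces here (`unet`, `vae`, `reconstructor`, `scheduler`, the `render()` method, and the x0-prediction parameterization) are illustrative assumptions, not the paper's actual code.

```python
import torch
import torch.nn.functional as F

def joint_train_step(unet, vae, reconstructor, scheduler, views, cond_image, feedback):
    # Encode ground-truth multi-view images and apply forward diffusion.
    # (All module interfaces are hypothetical placeholders.)
    latents = vae.encode(views)
    t = torch.randint(0, scheduler.num_train_timesteps, (latents.shape[0],))
    noise = torch.randn_like(latents)
    noisy = scheduler.add_noise(latents, noise, t)

    # Denoise with the input-image condition plus the 3D-aware feedback maps
    # (rendered colors and coordinate maps from a previous recursion step).
    x0_pred = unet(noisy, t, cond_image, feedback)
    diffusion_loss = F.mse_loss(x0_pred, latents)

    # Reconstruct 3D from the decoded prediction and supervise its renderings,
    # so the reconstructor adapts to generated rather than synthetic-only data.
    color, _ = reconstructor(vae.decode(x0_pred)).render()
    render_loss = F.mse_loss(color, views)

    # A single combined loss lets gradients flow through both models,
    # which is what enables their mutual enhancement under joint training.
    return diffusion_loss + render_loss
```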
Concept comparison between Ouroboros3D and previous two-stage methods. Instead of directly combining a multi-view diffusion model with a reconstruction model, our self-conditioned framework jointly trains the two models and links them in a recursive association. At each step of the denoising process, the rendered 3D-aware maps are fed to the multi-view generation at the next step.
Concept of 3D-aware recursive diffusion. During multi-view denoising, the diffusion model uses 3D-aware maps rendered by the reconstruction module at the previous step as conditions.
Overview of Ouroboros3D. In the denoising sampling loop, we decode the predicted x0 into (still partially noise-corrupted) images, which are then used to recover a 3D representation with a feed-forward reconstruction model. The rendered color images and coordinate maps are then encoded and fed into the next denoising step.
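The sampling loop in this overview can be sketched as follows. The interfaces (`unet` predicting x0, `vae.encode`/`vae.decode`, a `reconstructor` returning a renderable 3D representation, and a DDIM-style `scheduler.step`) are hypothetical placeholders for illustration, not the released implementation.

```python
import torch

@torch.no_grad()
def ouroboros_sample(unet, vae, reconstructor, scheduler, cond_image, timesteps, shape):
    latents = torch.randn(shape)   # start multi-view latents from Gaussian noise
    feedback = None                # no 3D-aware condition at the first step
    for t in timesteps:
        # Multi-view denoising conditioned on the input image and, after the
        # first step, on the 3D-aware maps rendered at the previous step.
        x0_pred = unet(latents, t, cond_image, feedback)

        # Decode predicted x0 (still partially noise-corrupted at early steps)
        # and recover a 3D representation with the feed-forward reconstructor.
        rep = reconstructor(vae.decode(x0_pred))

        # Render color images and coordinate maps, then encode them as the
        # self-condition for the next denoising step.
        color, xyz = rep.render()
        feedback = vae.encode(torch.cat([color, xyz], dim=1))

        # Standard scheduler update toward the next, less noisy latent.
        latents = scheduler.step(x0_pred, t, latents)

    # Final 3D model reconstructed from the fully denoised multi-view images.
    return reconstructor(vae.decode(latents))
```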