Keywords: Novel view synthesis, Diffusion Model, Closed-Loop Transcription
Abstract: Building on the success of large image diffusion models, multi-view diffusion models have demonstrated remarkable zero-shot capability in novel view synthesis (NVS). However, the pioneering work Zero123 struggles to maintain consistency across multiple generated views. While recent modifications to model and training design have improved multi-view consistency, they often introduce new limitations, such as restriction to a fixed set of views or reliance on additional conditions. These constraints hinder the broader application of multi-view diffusion models in downstream tasks like 3D reconstruction. We identify the root cause of inconsistency as the excessive diversity inherent in generative models when they are applied to the NVS task. To address this, we exploit stronger supervision to better align generated views with ground-truth images, thereby constraining this diversity, and propose Ctrl123, a **closed-loop** transcription-based multi-view diffusion method that enforces alignment in the CLIP patch feature space. Extensive experiments demonstrate that Ctrl123 excels at **arbitrary** novel view generation, significantly improving multi-view consistency over existing methods.
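A minimal sketch of the alignment idea named in the abstract: the submission only states that alignment is enforced in the CLIP patch feature space, so the backbone choice (`openai/clip-vit-large-patch14`), the cosine-distance loss form, and the function names below are all assumptions for illustration, not the authors' actual implementation.

```python
import torch
import torch.nn.functional as F
from transformers import CLIPVisionModel

# Assumed backbone: CLIP ViT-L/14. Frozen, so gradients flow only to the
# generated images, not into the feature extractor.
encoder = CLIPVisionModel.from_pretrained("openai/clip-vit-large-patch14")
encoder.requires_grad_(False).eval()

def clip_patch_features(pixel_values: torch.Tensor) -> torch.Tensor:
    """Per-patch CLIP features, dropping the [CLS] token at index 0.

    pixel_values: (B, 3, 224, 224), already CLIP-normalized.
    returns: (B, num_patches, dim)
    """
    return encoder(pixel_values=pixel_values).last_hidden_state[:, 1:, :]

def patch_alignment_loss(generated: torch.Tensor, target: torch.Tensor) -> torch.Tensor:
    """Mean cosine distance between spatially corresponding patch features
    of a generated view and its ground-truth view (hypothetical loss form)."""
    f_gen = clip_patch_features(generated)
    f_tgt = clip_patch_features(target)  # target features could be precomputed
    return (1.0 - F.cosine_similarity(f_gen, f_tgt, dim=-1)).mean()
```

Presumably such a term would be added, with some weight, to the standard diffusion objective on decoded views; the exact training loop and weighting are not specified in the abstract.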
Primary Area: generative models
Code Of Ethics: I acknowledge that I and all co-authors of this work have read and commit to adhering to the ICLR Code of Ethics.
Submission Guidelines: I certify that this submission complies with the submission instructions as described on https://iclr.cc/Conferences/2025/AuthorGuide.
Anonymous Url: I certify that there is no URL (e.g., github page) that could be used to find authors’ identity.
No Acknowledgement Section: I certify that there is no acknowledgement section in this submission for double blind review.
Submission Number: 1998