AR-1-to-3: Single Image to Consistent 3D Object Generation via Next-View Prediction

23 Sept 2024 (modified: 22 Nov 2024) · ICLR 2025 Conference Withdrawn Submission · CC BY 4.0
Keywords: Multi-View Synthesis; Image-to-3D; Autoregressive Generation
TL;DR: We propose to generate all novel target views from the input image in an autoregressive manner.
Abstract: Represented by the Zero123 series of works, recent advances in single-view 3D generation have made notable progress by leveraging pre-trained 2D diffusion models. These approaches either generate multiple discrete views of a 3D object from a single-view image and a set of camera poses, or produce multiple views simultaneously under specified camera conditions. However, they struggle to maintain consistency across views and camera angles, especially between poses that differ substantially. In this paper, we introduce AR-1-to-3, a novel paradigm that generates multi-view images from a single input image with significantly improved consistency in details. We achieve this with a novel autoregressive scheme in which new views are generated conditioned on previously generated ones. The core idea is to first generate views close to the input view and then use them as contextual information to prompt the generation of farther views. To this end, we propose two image conditioning strategies, termed Stacked-LE and LSTM-GE, to encode the sequence of views. Stacked-LE encodes the previously generated views into a stacked embedding that serves as a local condition, modifying the key and value matrices of the self-attention layers when denoising the target views of the current step. LSTM-GE divides the previously generated views into two groups based on their elevations and encodes each group's feature vectors with an LSTM module into high-level semantic information for global conditioning. Extensive experiments on the Objaverse dataset show that our method synthesizes more consistent views and produces high-quality 3D assets that closely mirror the given image. Code and pre-trained weights will be made publicly available.
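
To make the two conditioning strategies concrete, below is a minimal PyTorch sketch of how they could be wired, based only on the abstract's description. All module names, tensor shapes, and the ViT-style token layout are our own assumptions for illustration; this is not the authors' released code.

```python
# Hypothetical sketch of the Stacked-LE and LSTM-GE conditioning paths.
# Shapes and interfaces are assumptions inferred from the abstract.
import torch
import torch.nn as nn


class StackedLEAttention(nn.Module):
    """Local conditioning (Stacked-LE, as described): previously generated
    views are stacked into one embedding that augments the key and value
    matrices of a self-attention layer over the current target-view tokens."""

    def __init__(self, dim: int, heads: int = 8):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, x: torch.Tensor, prev_views: torch.Tensor) -> torch.Tensor:
        # x:          (B, N, D) tokens of the views being denoised this step
        # prev_views: (B, V, N, D) tokens of the V views generated so far
        B, V, N, D = prev_views.shape
        stacked = prev_views.reshape(B, V * N, D)       # "stack embedding"
        kv = torch.cat([x, stacked], dim=1)             # extend keys/values
        out, _ = self.attn(query=x, key=kv, value=kv)   # self-attn w/ extra K,V
        return out


class LSTMGE(nn.Module):
    """Global conditioning (LSTM-GE, as described): previous views are split
    into two groups by elevation, each summarized by its own LSTM; the final
    hidden states are concatenated into a global semantic condition."""

    def __init__(self, dim: int):
        super().__init__()
        self.low_lstm = nn.LSTM(dim, dim, batch_first=True)
        self.high_lstm = nn.LSTM(dim, dim, batch_first=True)

    def forward(self, low_feats: torch.Tensor, high_feats: torch.Tensor) -> torch.Tensor:
        # low_feats / high_feats: (B, T, D) pooled feature vectors per view,
        # grouped by low vs. high elevation
        _, (h_low, _) = self.low_lstm(low_feats)
        _, (h_high, _) = self.high_lstm(high_feats)
        return torch.cat([h_low[-1], h_high[-1]], dim=-1)  # (B, 2D) global code
```

Under this reading, inference would run the diffusion denoiser autoregressively: views nearest the input pose are generated first, then appended to `prev_views` (and to the elevation groups) as context when denoising farther poses.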
Primary Area: generative models
Code Of Ethics: I acknowledge that I and all co-authors of this work have read and commit to adhering to the ICLR Code of Ethics.
Submission Guidelines: I certify that this submission complies with the submission instructions as described on https://iclr.cc/Conferences/2025/AuthorGuide.
Reciprocal Reviewing: I understand the reciprocal reviewing requirement as described on https://iclr.cc/Conferences/2025/CallForPapers. If none of the authors are registered as a reviewer, it may result in a desk rejection at the discretion of the program chairs. To request an exception, please complete this form at https://forms.gle/Huojr6VjkFxiQsUp6.
Anonymous Url: I certify that there is no URL (e.g., github page) that could be used to find authors’ identity.
No Acknowledgement Section: I certify that there is no acknowledgement section in this submission for double blind review.
Submission Number: 2744