Harnessing Attention Prior for Reference-based Multi-view Image Synthesis

16 Sept 2023 (modified: 25 Mar 2024) · ICLR 2024 Conference Withdrawn Submission
Keywords: Reference-guided inpainting, Novel view synthesis, Text-to-Image, Attention
TL;DR: We reformulate reference-guided inpainting and novel view synthesis as a multi-view contextual inpainting task, which can be effectively addressed by large text-to-image models (such as StableDiffusion) via their powerful attention modules.
Abstract: This paper explores multi-view image synthesis, aiming to create specific image elements or entire scenes while ensuring visual consistency with reference images. We categorize this task into two approaches: local synthesis, guided by structural cues from reference images (Reference-based inpainting, Ref-inpainting), and global synthesis, which generates entirely new images based solely on reference examples (Novel View Synthesis, NVS). In recent years, Text-to-Image (T2I) generative models have gained attention in various domains; however, adapting them for multi-view synthesis is challenging due to the intricate correlations between reference and target images. To address these challenges efficiently, we introduce Attention Reactivated Contextual Inpainting (ARCI), a unified approach that reformulates both local and global reference-based multi-view synthesis as contextual inpainting, enhanced with the attention mechanisms already present in T2I models. Formally, self-attention is leveraged to learn feature correlations across different reference views, while cross-attention is utilized to control generation through prompt tuning. Built upon StableDiffusion fine-tuned for text-guided inpainting, ARCI contributes: handling difficult multi-view synthesis tasks with off-the-shelf T2I models, task- and view-specific prompt tuning for generative control, end-to-end Ref-inpainting, and block causal masking for autoregressive NVS. We also show the versatility of ARCI by extending it to multi-view generation with the same architecture for superior consistency, which we validate through extensive experiments.
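The abstract mentions block causal masking for autoregressive NVS: tokens of each view may attend to tokens of the same view and all earlier views, but not to later ones, so views can be generated one at a time while conditioning on their predecessors. The paper's actual implementation is not shown here; the following is a minimal sketch of what such a mask could look like, assuming a flattened token sequence of `num_views` views with `tokens_per_view` tokens each (both names are illustrative, not from the paper).

```python
import torch

def block_causal_mask(num_views: int, tokens_per_view: int) -> torch.Tensor:
    """Boolean attention mask of shape (N, N), N = num_views * tokens_per_view.

    Entry (i, j) is True when query token i may attend to key token j,
    i.e. when token j belongs to the same view as token i or an earlier one.
    """
    # View index of every token in the flattened sequence, e.g. [0,0,1,1,...]
    view_idx = torch.arange(num_views).repeat_interleave(tokens_per_view)
    # Query row i attends to key column j iff view(j) <= view(i):
    # full (bidirectional) attention within a view, causal across views.
    return view_idx.unsqueeze(1) >= view_idx.unsqueeze(0)

def masked_attention_scores(scores: torch.Tensor, mask: torch.Tensor) -> torch.Tensor:
    """Apply the block causal mask to raw attention logits before softmax."""
    return scores.masked_fill(~mask, float("-inf"))
```

With 2 views of 2 tokens each, the mask's top-left 2x2 block and bottom row are fully True (intra-view attention), while the top-right 2x2 block is False (no attending to future views).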
Primary Area: generative models
Code Of Ethics: I acknowledge that I and all co-authors of this work have read and commit to adhering to the ICLR Code of Ethics.
Submission Guidelines: I certify that this submission complies with the submission instructions as described on https://iclr.cc/Conferences/2024/AuthorGuide.
Anonymous Url: I certify that there is no URL (e.g., github page) that could be used to find authors' identity.
No Acknowledgement Section: I certify that there is no acknowledgement section in this submission for double blind review.
Submission Number: 666