Keywords: autoregressive models, image generation, text-to-image, customized image generation
TL;DR: We propose a simple and effective framework for injecting visual conditions into pre-trained autoregressive (AR) text-to-image models.
Abstract: Autoregressive (AR) models have become central to modern foundation models such as large language models (LLMs) and vision-language models (VLMs). Recently, AR-based approaches have been extended to text-to-image generation. Although these text-to-image AR models are trained for vision-language token interaction, they often struggle when conditioned on visual inputs. Focusing on this drawback, we ask one question: how can we inject visual information into a pre-trained AR model so that its output reflects visual conditions? We answer this question with a simple yet effective solution termed InjectAR. Our key insight is that, while a pre-trained AR model cannot handle visual inputs directly, its inherent capability for vision-language interaction can indeed support visual feature extraction. Consequently, with only a few newly introduced parameters and minimal training, a pre-trained AR generation model can accommodate both text and image conditions and produce visually appealing results. To manage the relationship between textual and visual inputs, we reinforce InjectAR with a hierarchical attention mechanism, which subdivides the attention scores of textual tokens among their corresponding visual components, preventing either modality from dominating the output. InjectAR is, to our knowledge, the first AR model with this capability, and extensive experiments show that it matches or even surpasses state-of-the-art diffusion models. Moreover, unlike diffusion models, once trained, our method offers the potential for flexible control over the positions of visual objects. Our code will be made available.
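The hierarchical attention described in the abstract can be pictured as a two-level softmax: a top-level softmax distributes attention mass over text tokens only, and each text token's mass is then subdivided by a local softmax over that token and its associated visual tokens. The sketch below is a minimal NumPy illustration of this idea, not the paper's implementation; the function name, the grouping of visual tokens per text token, and the decision to include the text token itself in the local softmax are all assumptions made for clarity.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def hierarchical_attention(q, text_k, text_v, vis_k, vis_v, groups):
    """Two-level attention sketch (hypothetical, not the paper's code).

    q:       (d,)   query for the next generated token
    text_k:  (T, d) keys for T textual tokens;   text_v: (T, d) values
    vis_k:   (V, d) keys for V visual tokens;    vis_v:  (V, d) values
    groups:  list of length T; groups[t] = indices of visual tokens
             tied to text token t (empty if none)
    """
    d = q.shape[-1]
    # Top level: attention mass is allocated over text tokens only,
    # so visual tokens cannot outvote the textual condition.
    text_scores = softmax(text_k @ q / np.sqrt(d))  # (T,)

    out = np.zeros_like(q, dtype=float)
    for t, idx in enumerate(groups):
        if len(idx) == 0:
            # A text token with no visual counterpart keeps its full weight.
            out += text_scores[t] * text_v[t]
            continue
        # Bottom level: the text token's weight is subdivided among the
        # token itself and its visual components via a local softmax, so
        # the visual tokens can never claim more total mass than their
        # parent text token was assigned at the top level.
        local_k = np.vstack([text_k[t:t + 1], vis_k[idx]])
        local_v = np.vstack([text_v[t:t + 1], vis_v[idx]])
        local = softmax(local_k @ q / np.sqrt(d))   # (1 + |idx|,)
        out += text_scores[t] * (local @ local_v)
    return out

# Toy usage with random features (illustrative only).
rng = np.random.default_rng(0)
d, T, V = 8, 3, 4
q = rng.standard_normal(d)
text_k, text_v = rng.standard_normal((T, d)), rng.standard_normal((T, d))
vis_k, vis_v = rng.standard_normal((V, d)), rng.standard_normal((V, d))
groups = [[0, 1], [], [2, 3]]  # text tokens 0 and 2 have visual counterparts
print(hierarchical_attention(q, text_k, text_v, vis_k, vis_v, groups))
```

Under this construction the visual weights within a group always sum to at most the parent text token's top-level score, which is one concrete way the subdivision could prevent either modality from dominating.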
Primary Area: generative models
Code Of Ethics: I acknowledge that I and all co-authors of this work have read and commit to adhering to the ICLR Code of Ethics.
Submission Guidelines: I certify that this submission complies with the submission instructions as described on https://iclr.cc/Conferences/2025/AuthorGuide.
Reciprocal Reviewing: I understand the reciprocal reviewing requirement as described on https://iclr.cc/Conferences/2025/CallForPapers. If none of the authors are registered as a reviewer, it may result in a desk rejection at the discretion of the program chairs. To request an exception, please complete this form at https://forms.gle/Huojr6VjkFxiQsUp6.
Anonymous Url: I certify that there is no URL (e.g., github page) that could be used to find authors’ identity.
No Acknowledgement Section: I certify that there is no acknowledgement section in this submission for double blind review.
Submission Number: 2124