Exploring the Design Space of Autoregressive Models for Efficient and Scalable Image Generation

17 Sept 2024 (modified: 13 Nov 2024) · ICLR 2025 Conference Withdrawn Submission · CC BY 4.0
Keywords: Image Generation, Autoregressive Model
TL;DR: We explore the design space of masked autoregressive (MAR) models to achieve efficient and scalable image generation.
Abstract: Autoregressive (AR) models and their variants are once again revolutionizing visual generation with improved frameworks. However, unlike the well-established practices for building diffusion models, there is no comprehensive recipe for building AR models, e.g., for selecting image tokenizers, model architectures, and AR paradigms. In this work, we delve into the design space of general AR models, including Mask Autoregressive (MAR) models, to identify optimal configurations for efficient and scalable image generation. We first conduct a detailed evaluation of four prevalent image tokenizers across both AR and MAR settings, examining the impact of codebook size (ranging from 1,024 to 262,144) on generation quality, and identify the most effective tokenizer for image generation. Building on these insights, we propose an enhanced MAR architecture, named Masked Generative Image LLaMA (MaskGIL), comprising LlamaGen-VQ and a bidirectional LLaMA backbone. To ensure stable scaling, we introduce modifications such as query-key normalization and post-normalization, resulting in a series of class-conditional MaskGIL models ranging from 111M to 1.4B parameters. MaskGIL substantially improves on the MAR baseline, achieving an FID of 3.71, comparable to state-of-the-art AR models on the ImageNet 256$\times$256 benchmark, with only 8 inference steps, far fewer than the 256 steps AR models require. Additionally, we introduce a text-conditional MaskGIL model with 775M parameters, capable of flexibly generating images at any resolution with high aesthetic quality. To bridge AR and MAR image generation, we further investigate combining the two paradigms at inference time. We release all models and code to foster further research.
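For readers who want a concrete picture of the techniques the abstract names, below is a minimal, illustrative PyTorch sketch of (a) query-key normalization and post-normalization in a bidirectional transformer block, and (b) confidence-based parallel decoding of the kind that lets a MAR model fill a 16×16 token grid in 8 steps rather than 256 sequential AR steps. All module and function names (`QKNormAttention`, `PostNormBlock`, `mar_decode`), the cosine unmasking schedule, and every hyperparameter are assumptions for illustration, not the authors' released code.

```python
import math
import torch
import torch.nn as nn
import torch.nn.functional as F

class QKNormAttention(nn.Module):
    """Bidirectional self-attention with normalized queries and keys (QK-norm)."""
    def __init__(self, dim: int, num_heads: int):
        super().__init__()
        self.num_heads = num_heads
        self.head_dim = dim // num_heads
        self.qkv = nn.Linear(dim, 3 * dim, bias=False)
        self.proj = nn.Linear(dim, dim, bias=False)
        # Per-head RMSNorm on queries and keys; nn.RMSNorm needs PyTorch >= 2.4
        # (substitute nn.LayerNorm on older versions).
        self.q_norm = nn.RMSNorm(self.head_dim)
        self.k_norm = nn.RMSNorm(self.head_dim)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        B, N, C = x.shape
        q, k, v = self.qkv(x).chunk(3, dim=-1)
        shape = (B, N, self.num_heads, self.head_dim)
        q = self.q_norm(q.view(shape)).transpose(1, 2)  # (B, H, N, D)
        k = self.k_norm(k.view(shape)).transpose(1, 2)
        v = v.view(shape).transpose(1, 2)
        # No causal mask: a MAR backbone attends bidirectionally over all tokens.
        out = F.scaled_dot_product_attention(q, k, v)
        return self.proj(out.transpose(1, 2).reshape(B, N, C))

class PostNormBlock(nn.Module):
    """Transformer block with post-normalization: norm AFTER each residual add."""
    def __init__(self, dim: int, num_heads: int, mlp_ratio: int = 4):
        super().__init__()
        self.attn = QKNormAttention(dim, num_heads)
        self.mlp = nn.Sequential(
            nn.Linear(dim, mlp_ratio * dim), nn.GELU(),
            nn.Linear(mlp_ratio * dim, dim),
        )
        self.norm1 = nn.RMSNorm(dim)
        self.norm2 = nn.RMSNorm(dim)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        x = self.norm1(x + self.attn(x))  # post-norm (vs. the usual pre-norm)
        return self.norm2(x + self.mlp(x))

@torch.no_grad()
def mar_decode(model, num_tokens: int = 256, steps: int = 8,
               mask_id: int = 0) -> torch.Tensor:
    """Fill all token slots in `steps` parallel passes instead of `num_tokens`
    sequential AR steps. `model` is assumed to map a (1, num_tokens) grid of
    token ids to per-position logits over the codebook; `mask_id` is a
    placeholder id for still-masked positions (both are assumptions)."""
    tokens = torch.full((1, num_tokens), mask_id, dtype=torch.long)
    known = torch.zeros(1, num_tokens, dtype=torch.bool)
    for step in range(steps):
        logits = model(tokens)                    # (1, num_tokens, vocab_size)
        conf, pred = logits.softmax(-1).max(-1)   # per-position confidence
        conf = conf.masked_fill(known, -1.0)      # never revisit decided slots
        # Cosine schedule: fraction of the grid still masked after this step.
        frac_masked = math.cos(math.pi / 2 * (step + 1) / steps)
        num_new = num_tokens - int(frac_masked * num_tokens) - int(known.sum())
        idx = conf.topk(num_new, dim=-1).indices  # most confident new slots
        tokens.scatter_(1, idx, pred.gather(1, idx))
        known.scatter_(1, idx, True)
    return tokens
```

The cosine schedule keeps few commitments in early passes (when the model is least informed) and unmasks aggressively in later ones; at the final step `frac_masked` reaches zero, so every position is decided.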
Primary Area: generative models
Code Of Ethics: I acknowledge that I and all co-authors of this work have read and commit to adhering to the ICLR Code of Ethics.
Submission Guidelines: I certify that this submission complies with the submission instructions as described on https://iclr.cc/Conferences/2025/AuthorGuide.
Reciprocal Reviewing: I understand the reciprocal reviewing requirement as described on https://iclr.cc/Conferences/2025/CallForPapers. If none of the authors are registered as a reviewer, it may result in a desk rejection at the discretion of the program chairs. To request an exception, please complete this form at https://forms.gle/Huojr6VjkFxiQsUp6.
Anonymous Url: I certify that there is no URL (e.g., github page) that could be used to find authors’ identity.
No Acknowledgement Section: I certify that there is no acknowledgement section in this submission for double blind review.
Submission Number: 1299