Token-Shuffle: Towards High-Resolution Image Generation with Autoregressive Models

17 Sept 2025 (modified: 11 Feb 2026) · Submitted to ICLR 2026 · CC BY 4.0
Keywords: Token-Shuffle, Auto-Regressive model, Image Generation
Abstract: Autoregressive (AR) models, long dominant in language generation, are increasingly applied to image synthesis but are often seen as less competitive than diffusion-based models. A primary limitation is the large number of image tokens AR models require, which constrains both training and inference efficiency as well as image resolution. To address this, we present Token-Shuffle, a novel yet simple method that reduces the number of visual tokens in causal masked Transformer architectures. Our motivation is the dimensional redundancy of visual vocabularies in Multimodal Large Language Models (MLLMs). Leveraging this, our method employs two key operations: token-shuffle, which merges spatially local tokens along the channel dimension to decrease the token number, and token-unshuffle, which untangles the inferred tokens after the Transformer blocks to restore the spatial arrangement for output. Jointly trained with textual prompts, our strategy requires no additional pretrained text encoder and enables MLLMs to support extremely high-resolution image synthesis (beyond 1k resolution) in a unified next-token-prediction manner while maintaining efficient training and inference. For the first time, we push the boundary of AR text-to-image generation to a resolution of 2048 × 2048 and set new baselines for AR image generation. On the GenAI benchmark, our 2.7B model achieves a 0.77 overall score on hard prompts, outperforming the previous autoregressive model LlamaGen by 0.18 and the diffusion model LDM by 0.15. Based on large-scale human evaluation, we demonstrate that a pure AR model can deliver image generation quality comparable to or even better than diffusion models while simultaneously generating high-resolution images.
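The token-shuffle and token-unshuffle operations described in the abstract can be sketched as paired reshapes: an s × s window of spatially local tokens is folded into a single token with s²-times the channel width before the Transformer, and unfolded afterward to recover the full token grid. The sketch below is illustrative only, assuming tokens arrive in raster order; the function names and signatures are not from the paper, and the actual method additionally involves learned projections not shown here.

```python
import numpy as np

def token_shuffle(tokens, h, w, s=2):
    """Merge each s x s block of local tokens along the channel dimension.

    tokens: (h * w, c) array of visual tokens in raster (row-major) order.
    Returns an array of shape ((h // s) * (w // s), s * s * c),
    i.e. s*s fewer tokens, each s*s times wider.
    """
    c = tokens.shape[1]
    x = tokens.reshape(h // s, s, w // s, s, c)        # split grid into blocks
    x = x.transpose(0, 2, 1, 3, 4)                     # (h//s, w//s, s, s, c)
    return x.reshape((h // s) * (w // s), s * s * c)   # flatten block -> channel

def token_unshuffle(merged, h, w, s=2):
    """Inverse of token_shuffle: restore the original (h * w, c) arrangement."""
    c = merged.shape[1] // (s * s)
    x = merged.reshape(h // s, w // s, s, s, c)
    x = x.transpose(0, 2, 1, 3, 4)                     # (h//s, s, w//s, s, c)
    return x.reshape(h * w, c)
```

Because both operations are pure permutations of the data, the round trip is lossless: `token_unshuffle(token_shuffle(t, h, w), h, w)` returns the original token sequence, while the Transformer in between only ever sees the shorter merged sequence.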
Primary Area: generative models
Submission Number: 8462