BAR: Refactor the Basis of Autoregressive Visual Generation

ICLR 2026 Conference Submission 15950 Authors

19 Sept 2025 (modified: 08 Oct 2025), ICLR 2026 Conference Submission, CC BY 4.0
Keywords: Autoregressive Models, Autoregressive Visual Generation
TL;DR: This work proposes BAR, which adaptively refactors tokens via learned basis transforms and achieves SOTA on image generation.
Abstract: Autoregressive (AR) models, despite their remarkable successes, face limitations in image generation due to the sequential prediction of tokens, e.g., local image patches, in a predetermined row-major raster-scan order. Prior works improve AR with various designs of prediction units and orders; however, these designs rely on human inductive biases. This work proposes Basis Autoregressive (BAR), a novel paradigm that conceptualizes tokens as basis vectors within the image space and employs an end-to-end learnable approach to transform the basis. By viewing each token $x_k$ as the projection of the image $\mathbf{x}$ onto a basis vector $\mathbf{e}_k$, BAR's unified framework refactors fixed token sequences through the linear transform $\mathbf{y}=\mathbf{A}\mathbf{x}$ and encompasses previous methods as specific instances of the matrix $\mathbf{A}$. Furthermore, BAR adaptively optimizes the transform matrix with an end-to-end AR objective, thereby discovering effective strategies beyond hand-crafted assumptions. Comprehensive experiments, notably a state-of-the-art FID of 1.15 on the ImageNet-256 benchmark, demonstrate the ability of BAR to overcome human biases and significantly advance image generation, including text-to-image synthesis.
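The following is a minimal sketch, not the authors' implementation, of the $\mathbf{y}=\mathbf{A}\mathbf{x}$ formulation described in the abstract: with the identity matrix as $\mathbf{A}$ the standard raster-scan sequence is recovered, a permutation matrix reproduces any hand-crafted reordering, and a general (trainable) matrix goes beyond fixed orders. The token count, token dimension, and variable names below are illustrative assumptions.

```python
import numpy as np

# Illustrative sketch of refactoring a token sequence via y = A x.
num_tokens = 16  # e.g., a 4x4 grid of patch tokens (assumption)
token_dim = 8    # per-token feature dimension (assumption)

x = np.random.randn(num_tokens, token_dim)  # stand-in for image tokens

# 1) Identity matrix: y = x, the standard row-major raster-scan order.
A_raster = np.eye(num_tokens)

# 2) Permutation matrix: any hand-crafted reordering is still a special case of A.
perm = np.random.permutation(num_tokens)
A_perm = np.eye(num_tokens)[perm]

# 3) General matrix: each refactored token is a linear combination of all
#    original tokens; in BAR such a matrix would be learned end-to-end with
#    the AR objective (here it is just random for shape checking).
A_general = np.random.randn(num_tokens, num_tokens)

for A in (A_raster, A_perm, A_general):
    y = A @ x  # refactored sequence that the AR model would predict
    assert y.shape == x.shape
```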
Primary Area: generative models
Submission Number: 15950