On Computational Limits and Provably Efficient Criteria of Visual Autoregressive Models: A Fine-Grained Complexity Analysis

18 Sept 2025 (modified: 01 Dec 2025) · ICLR 2026 Conference Withdrawn Submission · CC BY 4.0
Keywords: generative model, complexity, fast attention
Abstract: Recently, Visual Autoregressive ($\mathsf{VAR}$) models introduced a groundbreaking advancement in the field of image generation, offering a scalable approach through a coarse-to-fine ``next-scale prediction'' paradigm. Let $n$ denote the height and width of the last VQ code map generated by $\mathsf{VAR}$ models; the state-of-the-art algorithm in [Tian, Jiang, Yuan, Peng and Wang, NeurIPS 2024] takes $O(n^{4+o(1)})$ time, which is computationally inefficient. In this work, we analyze the performance bottlenecks of $\mathsf{VAR}$ models and establish hardness results through a fine-grained complexity lens. Our key contribution is identifying the conditions under which $\mathsf{VAR}$ computations can achieve almost-quadratic time $O(n^{2+o(1)})$, i.e., nearly linear in the $O(n^2)$ tokens of the final code map. We prove that, assuming the Strong Exponential Time Hypothesis ($\mathsf{SETH}$) from fine-grained complexity theory, a truly sub-quartic time algorithm for $\mathsf{VAR}$ models is impossible in general. To substantiate our theoretical findings, we present efficient constructions leveraging low-rank approximations that align with the derived criteria. Formally, let $n$ be the height and width of the last VQ code map in $\mathsf{VAR}$ models, $d$ the hidden dimension, and $R$ an upper bound on the magnitude of the entries of the input matrices of the attention computations in $\mathsf{VAR}$ models. We present two results. On the negative side, we show that when $d = O(\log n)$ and $R = \Theta(\sqrt{\log n})$, assuming $\mathsf{SETH}$, it is impossible to approximate the output of the $\mathsf{VAR}$ model up to $1/\mathrm{poly}(n)$ additive error in truly sub-quartic time $O(n^{4-\Omega(1)})$. On the positive side, we demonstrate a sharp phase transition in the computational complexity of the $\mathsf{VAR}$ model: when $d = O(\log n)$ and $R = o(\sqrt{\log n})$, the runtime improves dramatically from $O(n^{4+o(1)})$ to $O(n^{2+o(1)})$ while approximating the output of the $\mathsf{VAR}$ model up to $1/\mathrm{poly}(n)$ additive error. This work initiates the study of the computational efficiency of $\mathsf{VAR}$ models from a theoretical perspective. Our techniques shed light on advancing scalable and efficient image generation in $\mathsf{VAR}$ frameworks.
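The positive result rests on a low-rank factorization of the attention matrix. As a concrete illustration, the sketch below (our own minimal example, not code from this submission) implements the standard truncated-Taylor polynomial-feature construction behind such almost-quadratic attention results: when $d = O(\log n)$ and entries are bounded by $R = o(\sqrt{\log n})$, $\exp(q \cdot k)$ is well approximated by a low-degree polynomial, so $\exp(QK^\top)$ factors through $n^{o(1)}$ monomial features and the attention output can be formed without materializing the $L \times L$ matrix, where $L = n^2$ is the token count. The function names and the `degree` parameter are hypothetical choices for illustration.

```python
# Minimal sketch of low-rank (truncated-Taylor) attention approximation.
# Not the paper's code; illustrates why bounded entries enable an
# O(L * r * d) algorithm in place of the exact O(L^2 * d) computation.
import itertools
import math
from collections import Counter
import numpy as np

def taylor_features(X, degree):
    """Map each row x to weighted monomials so that
    taylor_features(q) @ taylor_features(k) ~= exp(q . k)
    (degree-truncated Taylor expansion of the exponential kernel).
    Feature count r = sum_t C(d + t - 1, t), which stays n^{o(1)}
    when d = O(log n) and the degree is a slowly growing function."""
    n, d = X.shape
    cols = [np.ones(n)]                        # degree-0 Taylor term
    for t in range(1, degree + 1):
        for idx in itertools.combinations_with_replacement(range(d), t):
            # Multinomial weight: the multiset idx contributes
            # (prod_i 1/c_i!) * prod q_i^{c_i} k_i^{c_i} to exp(q . k).
            counts = Counter(idx)
            coef = 1.0 / math.sqrt(
                math.prod(math.factorial(c) for c in counts.values())
            )
            cols.append(coef * np.prod(X[:, list(idx)], axis=1))
    return np.stack(cols, axis=1)

def lowrank_attention(Q, K, V, degree=4):
    """Approximate D^{-1} exp(Q K^T) V without forming the L x L matrix:
    with U_Q, U_K of shape (L, r), the cost is O(L * r * d)."""
    UQ, UK = taylor_features(Q, degree), taylor_features(K, degree)
    num = UQ @ (UK.T @ V)                      # ~ exp(Q K^T) V
    den = UQ @ UK.sum(axis=0)                  # ~ row sums of exp(Q K^T)
    return num / den[:, None]
```

A quick sanity check against exact attention on small, bounded inputs shows the additive error shrinking as the entry bound shrinks, mirroring the $R = o(\sqrt{\log n})$ regime:

```python
rng = np.random.default_rng(0)
L, d = 64, 4
Q, K, V = (0.3 * rng.standard_normal((L, d)) for _ in range(3))
A = np.exp(Q @ K.T)
exact = (A / A.sum(axis=1, keepdims=True)) @ V
print(np.max(np.abs(exact - lowrank_attention(Q, K, V))))  # small additive error
```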
Primary Area: foundation or frontier models, including LLMs
Submission Number: 13332