Fundamental Limits of Visual Autoregressive Transformers: Universal Approximation Abilities

Published: 01 May 2025 · Last Modified: 18 Jun 2025 · ICML 2025 poster · CC BY 4.0
Abstract: We investigate the fundamental limits of transformer-based foundation models, extending our analysis to Visual Autoregressive (VAR) transformers. VAR represents a major advance in image generation, introducing a scalable, coarse-to-fine "next-scale prediction" framework that achieves state-of-the-art image synthesis quality, surpassing previous methods including Diffusion Transformers. Our primary contribution establishes that a single-head VAR transformer with one self-attention layer and one interpolation layer is universal. From a statistical perspective, we prove that such simple VAR transformers are universal approximators of arbitrary Lipschitz word-to-image functions. Furthermore, we demonstrate that flow-based autoregressive transformers inherit similar approximation capabilities. Our results provide design principles for effective and computationally efficient VAR transformers, extending their utility to more sophisticated VAR models in image generation and related areas.
Lay Summary: A new model called the Visual Autoregressive (VAR) Transformer, winner of a NeurIPS 2024 Best Paper award, generates images by starting with a rough sketch and then repeatedly "filling in" finer details. Our study asks a simple question: how powerful is VAR at its core? We prove that a VAR Transformer can, in theory, realize any reasonable mapping from words to pictures. In other words, it is a universal image builder. Demonstrating universality for such a lightweight architecture suggests that we can design smaller, faster image generators without sacrificing expressive power. This opens the door to more efficient creative tools on everyday devices and provides a solid theoretical foundation for the next wave of visual AI.
Primary Area: Deep Learning->Algorithms
Keywords: Universal Approximation, Visual AutoRegressive Transformers, Fundamental Limits
Submission Number: 4470