Stable Control Visual AutoRegressive Model: Precise and Efficient Image Generation via Scale Alignment

Feng Xie, Dahua Gao, Ruichao Liu, Minxi Yang, Yibo Zhang, Wenlong Wang

Published: 2025, Last Modified: 09 Nov 2025. Venue: ICASSP 2025. License: CC BY-SA 4.0

Abstract: Although diffusion models have advanced condition-based visual generation, they are slow and costly, whereas faster autoregressive (AR) methods are limited in performance. To address both limitations, we introduce the Stable Control Visual AutoRegressive Model (SCVAR). SCVAR ensures stable control by aligning visual conditions across multiple scales. Rather than unfolding the 2D image into a 1D raster sequence, SCVAR decouples it into multiple scales. This shifts the sequential representation in SCVAR from tokens to scales, satisfying the unidirectional dependency of the AR model while preserving the 2D structure of the image. Compared with indiscriminate conditional guidance, cross-scale alignment provides more precise constraints, enabling SCVAR to achieve state-of-the-art performance in experiments against diffusion models while generating images 10x faster. The decoupled condition also reduces training costs. Compared with end-to-end conditional computation, experiments demonstrate that SCVAR matches performance with only 40% additional parameters.
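
As a rough illustration of the difference between a raster sequence and a scale sequence, the following minimal sketch may help. It is not the authors' implementation: the abstract does not say how scales are computed, so plain average pooling stands in for that step, and the names `resize_avg`, `to_scales`, `raster`, and `aligned_pairs` are hypothetical.

```python
# Illustrative sketch only: average pooling is an assumed stand-in for
# whatever multi-scale representation SCVAR actually uses.

def resize_avg(grid, size):
    """Average-pool a square 2D list down to size x size."""
    n = len(grid)
    assert n % size == 0, "side length must be divisible by the target size"
    k = n // size  # side of each pooling window
    return [
        [
            sum(grid[i * k + a][j * k + b] for a in range(k) for b in range(k)) / (k * k)
            for j in range(size)
        ]
        for i in range(size)
    ]

def to_scales(grid, sizes):
    """Scale sequence: one 2D map per scale, ordered coarse to fine."""
    return [resize_avg(grid, s) for s in sizes]

def raster(grid):
    """Raster sequence: the 2D grid flattened row by row into 1D."""
    return [value for row in grid for value in row]

def aligned_pairs(image, condition, sizes):
    """Pair the condition map and the image map at each scale.

    In an AR-over-scales setting, step k would depend on the image maps
    at coarser scales plus the condition map at scale k (an assumption
    about how per-scale alignment could be wired, not the paper's design).
    """
    return list(zip(to_scales(condition, sizes), to_scales(image, sizes)))

# A 4x4 toy "image" holding the values 0..15, and a toy condition map.
image = [[4 * r + c for c in range(4)] for r in range(4)]
condition = [[1 if c >= 2 else 0 for c in range(4)] for r in range(4)]
sizes = [1, 2, 4]

scales = to_scales(image, sizes)
pairs = aligned_pairs(image, condition, sizes)
```

Here `raster(image)` yields 16 sequence steps and discards the row and column neighbourhood, while `to_scales(image, sizes)` yields 3 steps, each of which is still a 2D map. Each entry of `pairs` holds a condition map and an image map of the same resolution, which is the sense in which the condition is aligned per scale in this sketch.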