Visual Implicit Autoregressive Modeling

Pengfei Jiang; Jixiang Luo; Luxi Lin; Zhaohong Huang; Xuelong Li

Published: 30 Apr 2026, Last Modified: 24 Jun 2026 | ICML 2026 regular | CC BY 4.0

Abstract: Visual Autoregressive Modeling (VAR) based on next-scale prediction achieves strong generation quality, but its explicit deep stacks fix the amount of computation per scale and inflate memory at high resolutions. We introduce Visual Implicit Autoregressive Modeling (VIAR), a next-scale autoregressive generator that embeds an implicit equilibrium layer between shallow pre/post blocks. The implicit layer is trained with Jacobian-Free Backpropagation, yielding constant training memory, while inference exposes a per-scale iteration knob that enables compute control. On the ImageNet 256 × 256 benchmark, VIAR attains an FID of 2.16 and an sFID of 8.07 with only 38.4% of the parameters of VAR, matching or surpassing strong AR baselines and remaining competitive with large diffusion models. By adjusting the per-scale knob, VIAR can reduce peak memory from 19.24 GB to 8.53 GB and double throughput from 15.16 to 32.08 images/s on a single RTX 4090, without retraining. Ablations show that few steps are sufficient for the fixed-point iterations to converge and that VIAR consistently dominates VAR across quality-efficiency operating points. In zero-shot in-painting and class-conditional editing, VIAR produces sharper details and smoother boundaries while preserving global structure, validating the benefits of implicit equilibria and per-scale compute control for practical, deployable visual generation.
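
The per-scale iteration knob described in the abstract can be illustrated with a toy fixed-point solver. The sketch below is not VIAR's architecture or training procedure: the map `refine`, the function `solve_fixed_point`, the weight `w`, and the iteration budgets are all hypothetical names and values chosen for illustration. It only shows the forward pass of an implicit layer, where one reusable map is applied repeatedly and the number of applications is chosen at inference time. Jacobian-Free Backpropagation is not shown.

```python
import math

def refine(z, x, w=0.5):
    """One application of a toy contractive map f(z; x) = tanh(w * z + x).

    With |w| < 1 the map is a contraction in z, so repeated application
    converges to a unique fixed point. This map is an assumption made for
    illustration, not the layer used in VIAR.
    """
    return [math.tanh(w * zi + xi) for zi, xi in zip(z, x)]

def solve_fixed_point(x, num_iters):
    """Iterate z <- f(z; x) from z = 0 for a fixed budget of steps.

    Returns the final iterate and the size of the last update, which
    serves as a simple convergence measure.
    """
    z = [0.0] * len(x)
    residual = float("inf")
    for _ in range(num_iters):
        z_next = refine(z, x)
        residual = max(abs(a - b) for a, b in zip(z_next, z))
        z = z_next
    return z, residual

# Hypothetical input features for one scale.
x = [0.3, -0.7, 1.2]

# The iteration budget plays the role of the inference-time knob:
# a larger budget costs more compute and leaves a smaller residual.
for budget in (1, 4, 16):
    z, residual = solve_fixed_point(x, budget)
    print(f"budget={budget:2d}  last update size={residual:.2e}")
```

In this toy setting the last update shrinks geometrically with the budget, which mirrors the abstract's point that the amount of refinement can be traded against compute without retraining.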

Lay Summary: Modern AI image generators such as VAR often rely on fixed computational budgets and large model sizes to produce high-quality images. This limits their applicability in scenarios where computational cost needs to be adjusted flexibly. We propose VIAR, a more efficient image generation model that replaces many repeated layers with a single reusable refinement module. Instead of doing the same amount of work at every image scale, VIAR can choose how many refinement steps to use, allowing it to save computation when fewer steps are enough. VIAR generates competitive images while greatly reducing the number of model parameters. It also retains competitive image in-painting and editing abilities. In addition, VIAR is easier to deploy on consumer-level GPUs, since users can adjust its computation at inference time to better balance image quality, memory use, and speed.

Link To Code: https://github.com/mobiushy/VIAR

Primary Area: Deep Learning->Generative Models and Autoencoders

Keywords: Autoregressive modeling, Visual generation, Deep equilibrium models

Submission Number: 15628
