Visual Autoregressive Modeling for Image Super-Resolution

Published: 01 May 2025, Last Modified: 18 Jun 2025, Venue: ICML 2025 poster, Readers: Everyone, License: CC BY-NC-SA 4.0
TL;DR: This paper introduces VARSR, a novel visual autoregressive framework for image super-resolution.
Abstract: Image Super-Resolution (ISR) has seen significant progress with the introduction of powerful generative models. However, the trade-off between fidelity and realism, together with high computational complexity, still limits their application. Building on the success of autoregressive models in the language domain, we propose VARSR, a novel visual autoregressive modeling framework for ISR based on next-scale prediction. To effectively integrate and preserve the semantic information in low-resolution images, we use prefix tokens to incorporate the condition. Scale-aligned Rotary Positional Encodings are introduced to capture spatial structures, and a diffusion refiner is used to model the quantization residual loss and achieve pixel-level fidelity. Image-based Classifier-free Guidance is proposed to guide the generation of more realistic images. Furthermore, we collect large-scale data and design a training process to obtain robust generative priors. Quantitative and qualitative results show that VARSR generates high-fidelity and high-realism images more efficiently than diffusion-based methods. Our code is released at https://github.com/quyp2000/VARSR.
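As an illustration of the guidance idea named in the abstract, the sketch below shows the generic classifier-free guidance rule applied to token logits. It is not taken from the VARSR code: the function name `guided_logits`, the `scale` parameter, and the plain-list representation are assumptions made for this example, and the abstract does not say how the image-based variant constructs its two branches.

```python
def guided_logits(cond, uncond, scale):
    """Generic classifier-free guidance on a list of logits.

    cond   -- logits from the conditioned branch (hypothetical input)
    uncond -- logits from the reference branch to steer away from (hypothetical input)
    scale  -- guidance strength; 0 returns uncond, 1 returns cond,
              values above 1 extrapolate past cond
    """
    if len(cond) != len(uncond):
        raise ValueError("cond and uncond must have the same length")
    # Move from the reference logits toward (and possibly past) the conditioned ones.
    return [u + scale * (c - u) for c, u in zip(cond, uncond)]


if __name__ == "__main__":
    print(guided_logits([2.0, 0.0], [1.0, 0.0], 2.0))
```

A scale above 1 pushes the prediction further in the direction that separates the two branches, which is the usual way guidance trades some diversity for stronger adherence to the condition.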
Lay Summary: In practice, images are often degraded or damaged during capture and transmission. Image super-resolution aims to recover, from these degraded low-resolution images, high-quality images that remain faithful to the original. Past methods have struggled to balance fidelity and quality, and often require long computation times. Our work explores a novel progressive image restoration paradigm, VARSR, in which we first generate an image at a coarse, low-resolution scale and then progressively refine it from coarse to fine, based on the content generated in the previous step. We incorporate the information and spatial structure of the degraded image more effectively and efficiently, so that the generated content stays faithful to the input. We also apply additional guidance during image generation to produce higher-quality images. Our method generates high-fidelity, realistic images faster than diffusion-based methods, and it performs well across different scenarios and content types.
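The coarse-to-fine refinement described in the lay summary can be sketched as a loop over increasing resolutions, where each step adds a correction on top of what has been generated so far. This is a toy under stated assumptions, not the VARSR implementation: the learned autoregressive predictor is replaced by an oracle that average-pools the remaining error against a known target, and the helper names (`avg_pool`, `upsample_nearest`, `coarse_to_fine`) are hypothetical.

```python
def avg_pool(img, s):
    """Average-pool a square n x n image (list of lists) to s x s; n must be divisible by s."""
    n = len(img)
    k = n // s
    return [[sum(img[i * k + a][j * k + b] for a in range(k) for b in range(k)) / (k * k)
             for j in range(s)] for i in range(s)]


def upsample_nearest(img, n):
    """Nearest-neighbour upsample a square s x s image to n x n; n must be divisible by s."""
    k = n // len(img)
    return [[img[i // k][j // k] for j in range(n)] for i in range(n)]


def coarse_to_fine(target, scales):
    """Build an image scale by scale, each step refining the running canvas.

    The per-scale correction here comes from an oracle (the pooled remaining
    error), standing in for a learned next-scale predictor.
    """
    n = len(target)
    canvas = [[0.0] * n for _ in range(n)]
    errors = []
    for s in scales:
        # What is still missing, given everything generated at coarser scales.
        residual = [[target[i][j] - canvas[i][j] for j in range(n)] for i in range(n)]
        correction = upsample_nearest(avg_pool(residual, s), n)
        canvas = [[canvas[i][j] + correction[i][j] for j in range(n)] for i in range(n)]
        errors.append(max(abs(target[i][j] - canvas[i][j])
                          for i in range(n) for j in range(n)))
    return canvas, errors


if __name__ == "__main__":
    target = [[float(4 * i + j) for j in range(4)] for i in range(4)]
    canvas, errors = coarse_to_fine(target, [1, 2, 4])
    print(errors)  # worst-case error after each scale
```

In this toy the last scale has full resolution, so the oracle correction removes the remaining error entirely; with a learned predictor and a quantized token vocabulary, some residual would remain, which is the gap the abstract's diffusion refiner is described as addressing.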
Link To Code: https://github.com/quyp2000/VARSR
Primary Area: Applications->Computer Vision
Keywords: visual autoregressive modeling; image super-resolution
Submission Number: 139