ReFocus-VAR: Next-Focus Prediction for Visual Autoregressive Modeling

05 Sept 2025 (modified: 12 Nov 2025) · ICLR 2026 Conference Withdrawn Submission · CC BY 4.0
Keywords: VAR, multi-scale tokenization, anti-aliasing, cross-attention
TL;DR: We split each image into a low-pass structure and an alias residual before tokenization, then fuse them with lightweight cross-attention to reduce jaggies and moiré while keeping VAR training unchanged.
Abstract: Visual autoregressive models such as VAR achieve impressive generation quality through next-scale prediction over multi-scale token pyramids. However, the standard approach constructs these pyramids with pure digital downsampling, which introduces aliasing artifacts that degrade fine details and create jaggies and moiré patterns. We present ReFocus-VAR, which reframes the paradigm from next-scale prediction to next-focus prediction, mimicking the way a camera focuses from blur to clarity. Our approach introduces three key components: a Next-Focus Prediction paradigm that casts multi-scale autoregression as progressive blur reduction rather than simple downsampling; Progressive Refocusing Pyramid Construction, which uses physics-consistent defocus kernels to build clean, alias-free multi-scale representations; and High-Frequency Residual Learning, which employs a specialized residual teacher network to incorporate alias information during training while keeping deployment simple. Specifically, we construct optical low-pass views using defocus PSF kernels with decreasing radius, creating smooth blur-to-clarity transitions that eliminate aliasing at its source. To further enhance detail generation, we introduce a High-Frequency Residual Teacher that learns from both the clean structure and the alias residuals, distilling this knowledge into a vanilla VAR deployment network for seamless inference. Extensive experiments on ImageNet demonstrate that ReFocus-VAR substantially reduces aliasing artifacts, improves fine-detail preservation, and enhances text readability, achieving superior performance while remaining fully compatible with existing VAR frameworks.
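The abstract describes the pyramid construction only in prose; below is a minimal PyTorch sketch of the general idea, assuming a disk-shaped defocus PSF whose radius shrinks toward the finest scale and defining the alias residual against a naively downsampled view. The scale list, radius schedule, and the helper names `disk_psf` and `refocus_pyramid` are illustrative assumptions, not the authors' implementation.

```python
import torch
import torch.nn.functional as F


def disk_psf(radius: float, size: int) -> torch.Tensor:
    """Normalized disk-shaped defocus point-spread function (PSF)."""
    coords = torch.arange(size, dtype=torch.float32) - (size - 1) / 2
    yy, xx = torch.meshgrid(coords, coords, indexing="ij")
    mask = (xx ** 2 + yy ** 2 <= radius ** 2).float()
    return mask / mask.sum().clamp(min=1.0)


def refocus_pyramid(img: torch.Tensor, scales=(16, 32, 64, 128, 256),
                    max_radius: float = 6.0):
    """img: (B, C, H, W). Returns per-scale low-pass views (coarse, heavily
    defocused -> fine, nearly in focus) and the alias residuals they discard."""
    _, c, _, _ = img.shape
    views, residuals = [], []
    for s in scales:
        # Defocus radius shrinks as the target scale approaches full resolution.
        radius = max_radius * (1.0 - s / scales[-1]) + 0.5
        ksize = 2 * int(radius) + 3  # odd kernel size large enough to hold the disk
        psf = disk_psf(radius, ksize).to(img).repeat(c, 1, 1, 1)
        blurred = F.conv2d(img, psf, padding=ksize // 2, groups=c)
        low = F.interpolate(blurred, size=(s, s), mode="area")    # optical low-pass view
        naive = F.interpolate(img, size=(s, s), mode="nearest")   # aliased digital downsample
        views.append(low)
        residuals.append(naive - low)  # high-frequency / alias content removed by the PSF
    return views, residuals
```

Under this reading, the clean low-pass views would feed the vanilla VAR deployment network, while the alias residuals would be consumed only by the High-Frequency Residual Teacher during training.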
Primary Area: foundation or frontier models, including LLMs
Submission Number: 2336