Abstract: Autoregressive (AR) modeling has achieved remarkable success in natural language processing by enabling models to generate text with coherence and contextual understanding through next token prediction. Recently, VAR introduced scale-wise autoregressive modeling for image generation, extending next token prediction to next scale prediction and preserving the 2D structure of images. However, VAR encounters two primary challenges: (1) its complex and rigid scale design limits generalization in next scale prediction, and (2) the generator’s dependence on a discrete tokenizer with the same complex scale structure restricts modularity and flexibility in updating the tokenizer. To address these limitations, we introduce FlowAR, a general next scale prediction method featuring a streamlined scale design, where each subsequent scale is simply double the previous one. This eliminates the need for VAR’s intricate multi-scale residual tokenizer and enables the use of any off-the-shelf Variational AutoEncoder (VAE). Our simplified design enhances generalization in next scale prediction and facilitates the integration of Flow Matching for high-quality image synthesis. We validate the effectiveness of FlowAR on the challenging ImageNet-256 benchmark, demonstrating superior generation performance compared to previous methods. Code is available at \href{https://github.com/OliverRensu/FlowAR}{https://github.com/OliverRensu/FlowAR}.
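To make the scale-doubling idea concrete, below is a minimal, illustrative Python sketch of a coarse-to-fine generation loop: each scale is twice the previous one, and a toy flow-matching-style refinement is applied at every scale. All names here (`ToyARPredictor`, `flow_matching_step`, the channel counts, and the trivial "velocity") are hypothetical stand-ins for exposition, not the authors' actual architecture or API; the real model learns the velocity field and conditions on the prior scales.

```python
# Illustrative sketch (assumptions labeled): scale-doubling schedule plus a
# toy per-scale flow-matching refinement. Not the authors' implementation.
import torch
import torch.nn.functional as F


def scale_schedule(base=16, num_scales=5):
    # Each scale is simply double the previous one, e.g. 16 -> 32 -> 64 -> 128 -> 256.
    return [base * (2 ** i) for i in range(num_scales)]


class ToyARPredictor(torch.nn.Module):
    """Hypothetical coarse-to-fine predictor: maps the latent at the previous
    scale to a conditioning signal at the next (doubled) scale."""

    def __init__(self, channels=4):
        super().__init__()
        self.refine = torch.nn.Conv2d(channels, channels, 3, padding=1)

    def forward(self, prev_latent, target_size):
        # Upsample the previous-scale latent to the doubled resolution ...
        up = F.interpolate(prev_latent, size=(target_size, target_size),
                           mode="bilinear", align_corners=False)
        # ... and produce a conditioning estimate for that scale.
        return self.refine(up)


def flow_matching_step(cond, steps=4):
    """Toy stand-in for flow-matching sampling at one scale: integrate a
    (here trivial) velocity field from noise toward the conditioned estimate.
    A real model would learn this velocity."""
    x = torch.randn_like(cond)
    for _ in range(steps):
        velocity = cond - x          # toy "velocity" pointing at the target
        x = x + velocity / steps     # simple Euler integration
    return x


if __name__ == "__main__":
    sizes = scale_schedule(base=16, num_scales=5)   # [16, 32, 64, 128, 256]
    predictor = ToyARPredictor()
    latent = torch.randn(1, 4, sizes[0], sizes[0])  # coarsest-scale latent
    for s in sizes[1:]:
        cond = predictor(latent, s)                 # next-scale conditioning
        latent = flow_matching_step(cond)           # refine at the new scale
    print(latent.shape)  # e.g. torch.Size([1, 4, 256, 256]) before VAE decoding
```

Because the schedule is plain resolution doubling in a continuous latent space, the final latent can be decoded by any off-the-shelf VAE, which is the modularity benefit the abstract emphasizes.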
Lay Summary: Modern image-generation methods often build pictures in stages—first sketching a rough outline at low resolution, then filling in details at finer scales. However, existing approaches rely on complex, custom tokenizers that are difficult to update and don’t always generalize well to new resolutions. In this work, we introduce FlowAR, a streamlined technique that doubles the image size at each step (e.g., from 64×64 to 128×128) and plugs into any standard image encoder–decoder system.
By replacing elaborate, scale-specific components with a simple “next-scale” predictor, FlowAR becomes more flexible: you can swap in improved encoders without redesigning the whole pipeline. We also integrate a modern “flow matching” strategy to enhance image quality, yielding sharper, more realistic results. On a challenging benchmark of 256×256 photographs, FlowAR outperforms previous multi-scale models in both fidelity and diversity. Our code is publicly available, paving the way for easier adoption and future improvements in scalable image synthesis.
Link To Code: https://github.com/OliverRensu/FlowAR
Primary Area: Deep Learning->Generative Models and Autoencoders
Keywords: Image synthesis and generation
Submission Number: 12306