Keywords: Autoregressive Image Generation, Test-time Scaling
Abstract: Test-time scaling (TTS) has demonstrated remarkable success in enhancing large language models, yet its application to next-token prediction (NTP) autoregressive (AR) image generation remains largely uncharted. Existing TTS approaches for visual AR (VAR), which rely on frequent partial decoding and external reward models, are ill-suited for NTP-based image generation due to the inherent incompleteness of intermediate decoding results. To bridge this gap, we introduce **ScalingAR**, the first TTS framework specifically designed for NTP-based AR image generation that eliminates the need for early decoding or auxiliary rewards. ScalingAR leverages *token entropy* as a novel signal in visual token generation and operates at two complementary scaling levels: (***i***) ***Profile Level***, which streams a calibrated confidence state by fusing intrinsic and conditional signals; and (***ii***) ***Policy Level***, which utilizes this state to adaptively terminate low-confidence trajectories and dynamically schedule guidance for phase-appropriate conditioning strength. Experiments on both general and compositional benchmarks show that ScalingAR **(1)** improves base models by 12.5% on GenEval and 15.2% on TIIF-Bench, **(2)** efficiently reduces visual token consumption by 62.0% while outperforming baselines, and **(3)** successfully enhances robustness, mitigating performance drops by 26.0% in challenging scenarios.
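As a rough illustration of the token-entropy signal mentioned in the abstract, the sketch below computes the Shannon entropy of a next-token distribution from raw logits and applies a simple running-mean threshold to decide whether a trajectory looks low-confidence. This is not the ScalingAR method: the abstract does not specify how the confidence state is calibrated or fused, so the function names (`token_entropy`, `should_terminate`), the threshold rule, and the `min_tokens` parameter are all assumptions made for illustration.

```python
import math


def token_entropy(logits):
    """Shannon entropy (in nats) of softmax(logits) for one decoding step."""
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in logits]
    z = sum(exps)
    probs = [e / z for e in exps]
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def should_terminate(entropies, threshold, min_tokens=4):
    """Hypothetical pruning rule (an assumption, not the paper's policy):
    flag a trajectory once the mean entropy of the tokens generated so far
    exceeds `threshold`, after at least `min_tokens` tokens."""
    if len(entropies) < min_tokens:
        return False
    return sum(entropies) / len(entropies) > threshold
```

A uniform distribution over a vocabulary of size V gives the maximum entropy, ln V, while a sharply peaked distribution gives an entropy near zero, so higher values indicate a less certain prediction at that step.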
Primary Area: applications to computer vision, audio, language, and other modalities
Submission Number: 346