Don’t Scale It All: Training-Free Localized Test-Time Scaling for Diffusion Models

07 Sept 2025 (modified: 12 Nov 2025) · ICLR 2026 Conference Withdrawn Submission · CC BY 4.0
Keywords: Diffusion models, Test-time scaling, Training-free
Abstract: Diffusion models have become the core paradigm for high-fidelity image generation, achieving remarkable performance in tasks such as text-to-image synthesis. A common strategy to further boost their performance is test-time scaling (TTS), which improves generation quality by allocating more computation during inference. Despite recent progress, existing TTS methods operate at the full-image level, neglecting the fact that image quality is often spatially heterogeneous. As a result, they squander computation on already satisfactory regions while failing to target localized defects, leading to both inefficiency and instability. In this paper, we advocate a new direction, Localized TTS, which adaptively resamples defective regions while preserving high-quality areas, thereby substantially reducing the search space. This paradigm promises to improve efficiency and stability, but poses two central challenges: accurately localizing defects and maintaining global consistency. We propose $\textbf{LoTTS}$, the first fully training-free framework for localized TTS. For defect localization, LoTTS detects defective regions by contrasting cross-/self-attention signals under quality-aware prompts (e.g., "high-quality" vs. "low-quality"), reweights them with original prompt attention to filter out irrelevant background, and refines them with self-attention propagation to ensure spatial coherence. For consistency, LoTTS perturbs low-quality regions with noise at intermediate timesteps for localized resampling, and then performs a few global denoising steps to seamlessly couple local corrections with the overall structure and style. Extensive experiments on SD2.1, SDXL, and FLUX demonstrate that LoTTS achieves state-of-the-art performance: it consistently improves both local quality and global fidelity, while reducing GPU cost by $2$–$4\times$ compared to Best-of-$N$ sampling. These findings establish localized TTS as a promising new direction for scaling diffusion models at inference time.
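The abstract only outlines the consistency mechanism (re-noise defective regions to an intermediate timestep, resample them locally, then run a few global denoising steps). The sketch below illustrates that idea under stated assumptions; the helper functions `denoise_step`, `add_noise`, and `localize_defects`, as well as the timestep thresholds, are hypothetical placeholders and not the paper's actual implementation.

```python
import torch

# Hypothetical stand-ins for a real diffusion pipeline; the paper's actual
# prompts, schedules, and attention-based localization are not shown here.
def denoise_step(x_t, t, prompt):    # one reverse-diffusion step at timestep t (assumed)
    raise NotImplementedError

def add_noise(x0, t, noise):         # forward-diffuse a clean image x0 to timestep t (assumed)
    raise NotImplementedError

def localize_defects(x0, prompt):    # defect mask in [0, 1], 1 = defective region (assumed)
    raise NotImplementedError

@torch.no_grad()
def lotts_resample_sketch(x0, prompt, timesteps, t_mid, t_consistency):
    """Minimal sketch of localized resampling: perturb the image to an
    intermediate timestep, resample while keeping high-quality regions
    anchored to the original, then finish with a few global denoising steps."""
    mask = localize_defects(x0, prompt)
    noise = torch.randn_like(x0)
    x_t = add_noise(x0, t_mid, noise)          # perturb to the intermediate timestep

    ts = [t for t in timesteps if t <= t_mid]  # descending: t_mid -> 0
    for i, t in enumerate(ts):
        x_t = denoise_step(x_t, t, prompt)
        t_prev = ts[i + 1] if i + 1 < len(ts) else 0
        if t_prev > t_consistency:
            # Localized phase: only defective regions evolve freely; high-quality
            # regions stay pinned to the forward-diffused original image.
            x_t = mask * x_t + (1 - mask) * add_noise(x0, t_prev, noise)
        # Below t_consistency: purely global steps couple the local corrections
        # with the overall structure and style of the image.
    return x_t
```

This mirrors the abstract's two-phase description (masked resampling followed by global denoising); how LoTTS derives the mask from cross-/self-attention contrast is not reproduced here.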
Primary Area: generative models
Submission Number: 2821