Sketch First, Scale Fast: On Efficient Inference-Time Scaling for Visual Generation

ICLR 2026 Conference Submission19907 Authors

19 Sept 2025 (modified: 08 Oct 2025) · ICLR 2026 Conference Submission · CC BY 4.0
Keywords: text-to-image, inference time scaling, image generation
TL;DR: We propose SF2, an efficient inference-time scaling framework that enables estimators to make decisions at an early stage.
Abstract: Diffusion models have exhibited exceptional capabilities in image generation tasks. However, due to their inherent stochasticity, the quality of the generated images varies across settings and may not always be high-fidelity or accurately aligned with user requirements. To address this challenge, recent works have begun to enhance human preference alignment and overall image fidelity during inference with inference-time scaling, which generates multiple candidates through repeated sampling and selects among them with predefined metrics. Although effective, this introduces considerable extra computational cost from the redundant sampling steps. To overcome these limitations, we propose SF2, a novel inference-time scaling framework that enables estimators to make decisions at an early step via a Co-Fusion pipeline, significantly accelerating the whole process while maintaining the quality of the selected images. In addition, for the continued-generation stage, we propose a vision-reflection mechanism to further align and correct the images with user requirements. Extensive experiments demonstrate that our method achieves comparable performance while accelerating the whole inference-time scaling process by 2.2x and 2.0x on the Stable-Diffusion-3.5-Large and FLUX.1-dev models, respectively.
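The baseline procedure the abstract describes, i.e. repeated sampling followed by metric-based selection, can be sketched as a generic best-of-N loop. This is a minimal illustration, not the SF2 method itself; `toy_generate` and `toy_score` are hypothetical stand-ins for a diffusion sampler and a preference/fidelity metric.

```python
import random

def best_of_n(generate, score, n):
    """Best-of-N inference-time scaling: draw n candidates (one full
    sampling run each) and keep the top-scoring one. The cost grows
    linearly in n, which is the overhead SF2 aims to reduce by
    pruning candidates at an early denoising step."""
    candidates = [generate(seed) for seed in range(n)]
    return max(candidates, key=score)

# Hypothetical stand-in for a diffusion sampler: returns a fake
# "image" whose quality is a seeded pseudo-random number.
def toy_generate(seed):
    rng = random.Random(seed)
    return {"seed": seed, "quality": rng.random()}

# Hypothetical stand-in for a predefined selection metric
# (e.g. a human-preference reward model).
def toy_score(image):
    return image["quality"]

best = best_of_n(toy_generate, toy_score, n=8)
print(best["seed"], round(toy_score(best), 3))
```

The key inefficiency this sketch makes visible: all `n` sampling runs complete before any selection happens, whereas an early-decision scheme would score partially denoised candidates and continue generation only for the most promising ones.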
Supplementary Material: zip
Primary Area: generative models
Submission Number: 19907