FarsightAlign: Early-Stage Test-Time Scaling for Prompt-Aligned Text-to-Image Generation

18 Sept 2025 (modified: 13 Nov 2025) | ICLR 2026 Conference Withdrawn Submission | Readers: Everyone | CC BY 4.0
Keywords: text-to-image, diffusion model, test time scaling
TL;DR: We propose FarsightAlign TTS, a test-time scaling method that uses early cross-attention signals to efficiently select semantically aligned candidates in text-to-image diffusion.
Abstract: Text-to-image diffusion models have achieved remarkable progress under the guidance of scaling laws, but further performance gains are increasingly hindered by diminishing returns from scaling model size and data volume. To bypass this bottleneck, Test-Time Scaling (TTS) has emerged as a promising alternative. However, the lack of interpretable signals in the early diffusion steps forces existing TTS approaches to run a nearly complete denoising process for every candidate, which results in high computational cost. In this work, we propose FarsightAlign TTS, a novel and efficient inference-time framework that leverages the rich semantic signals embedded in early cross-attention maps. After just a few denoising steps, FarsightAlign TTS extracts structured semantic information such as object presence, layout, and attributes. It then uses a lightweight scorer to prune unaligned candidates before committing to the final generation. This design significantly reduces computational cost while improving alignment with the user's prompt. Experimental results demonstrate the effectiveness of our method. Furthermore, FarsightAlign TTS can function as a plug-and-play module, significantly boosting the semantic alignment capabilities of other advanced TTS frameworks with minimal additional computational overhead.
Primary Area: generative models
Submission Number: 10068
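
The prune-then-commit idea described in the abstract can be illustrated with a toy sketch. This is not the authors' implementation: the abstract does not specify the scorer, the number of early steps, or how cross-attention maps are read out. Everything here is a hypothetical stand-in, including `toy_early_attention` (which fakes per-token attention mass from a seed instead of running a diffusion model), `toy_score` (a weakest-token heuristic), and the step counts in the usage lines.

```python
import random
from typing import Dict, List, Sequence, Tuple


def toy_early_attention(seed: int, tokens: Sequence[str]) -> Dict[str, float]:
    """Hypothetical stand-in for a few denoising steps.

    A real system would read cross-attention maps from the denoiser;
    here we just draw a fake attention mass per prompt token.
    """
    rng = random.Random(seed)
    return {tok: rng.random() for tok in tokens}


def toy_score(attn: Dict[str, float]) -> float:
    """Assumed scorer: a candidate is only as good as its weakest token."""
    return min(attn.values())


def select_candidates(seeds: Sequence[int], tokens: Sequence[str], keep: int) -> List[int]:
    """Score every candidate from its early signal and keep the top `keep`."""
    scored = [(toy_score(toy_early_attention(s, tokens)), s) for s in seeds]
    scored.sort(reverse=True)  # highest score first
    return [s for _, s in scored[:keep]]


def denoising_steps(n_candidates: int, total_steps: int,
                    early_steps: int, keep: int) -> Tuple[int, int]:
    """Return (steps without pruning, steps with early pruning)."""
    full = n_candidates * total_steps
    pruned = n_candidates * early_steps + keep * (total_steps - early_steps)
    return full, pruned


# Illustrative numbers only: 16 candidates, 50 steps, prune after 5, keep 4.
survivors = select_candidates(range(16), ["cat", "red", "hat"], keep=4)
print(len(survivors))                  # 4
print(denoising_steps(16, 50, 5, 4))   # (800, 260)
```

With these illustrative numbers, only the four surviving candidates pay for the remaining 45 steps, so the total drops from 800 to 260 denoising steps. The actual savings and alignment gains depend on the real scorer and schedule, which the abstract does not give.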