InfSplign: Inference-Time Spatial Alignment of Text-to-Image Diffusion Models

ICLR 2026 Conference Submission13932 Authors

18 Sept 2025 (modified: 08 Oct 2025) · ICLR 2026 Conference Submission · CC BY 4.0
Keywords: Diffusion Models, Spatial Alignment, Inference-Time Guidance
Abstract: Text-to-image (T2I) diffusion models generate high-quality images but often fail to capture the spatial relations specified in text prompts. This limitation can be traced to two factors: the lack of fine-grained spatial supervision in training data and the inability of CLIP text embeddings to encode spatial semantics. We introduce InfSplign, a training-free inference-time method that improves spatial alignment by adjusting the noise through a compound loss at every denoising step. The proposed loss leverages cross-attention maps extracted from multiple levels of the U-Net decoder to enforce accurate object placement and balanced object presence during sampling. Our method is lightweight, plug-and-play, and compatible with any diffusion backbone. Comprehensive quantitative and qualitative evaluations on widely adopted spatial benchmarks (VISOR and T2I-CompBench) show that our approach establishes a new state of the art (to the best of our knowledge), delivering substantial performance gains and even surpassing fine-tuning-based baselines.
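To make the guidance mechanism concrete, the following is a minimal, self-contained sketch of inference-time spatial guidance in the general spirit described in the abstract: a loss computed on cross-attention maps is back-propagated to the noisy latent, which is then nudged along the negative gradient before the next denoising step. Everything here is an illustrative assumption, not InfSplign's actual compound loss or code: spatial_loss, guidance_scale, and the toy attention maps and target masks are hypothetical stand-ins, and in a real pipeline the attention maps would come from the U-Net decoder's cross-attention layers rather than from the latent itself.

```python
import torch

def spatial_loss(attn_maps, target_masks):
    """Hypothetical compound loss sketch.

    Placement term: pull each object's attention mass into its target region.
    Presence term: penalize imbalance in total attention mass across objects.
    """
    placement = 0.0
    masses = []
    for attn, mask in zip(attn_maps, target_masks):
        mass = attn.sum()
        placement = placement + 1.0 - (attn * mask).sum() / (mass + 1e-8)
        masses.append(mass)
    masses = torch.stack(masses)
    presence = masses.std() / (masses.mean() + 1e-8)
    return placement + presence

# Toy single denoising step with guidance. In practice the latent would come
# from the sampler and the attention maps from the diffusion U-Net; here they
# are derived from the latent only so that gradients flow in a runnable sketch.
latent = torch.randn(1, 4, 64, 64, requires_grad=True)

# Pretend 16x16 cross-attention maps for two prompt tokens ("cat", "dog").
attn_maps = [
    latent.mean(1)[:, :16, :16].sigmoid(),
    latent.mean(1)[:, 16:32, 16:32].sigmoid(),
]

# Target regions: "cat" on the left half, "dog" on the right half.
target_masks = [torch.zeros(1, 16, 16), torch.zeros(1, 16, 16)]
target_masks[0][:, :, :8] = 1.0
target_masks[1][:, :, 8:] = 1.0

loss = spatial_loss(attn_maps, target_masks)
grad = torch.autograd.grad(loss, latent)[0]

guidance_scale = 20.0  # assumed step size, not a value from the paper
latent = (latent - guidance_scale * grad).detach()  # adjusted latent for this step
```

In an actual sampler this update would be applied once (or a few times) per denoising step, before handing the adjusted latent to the scheduler, which is what makes the approach training-free and backbone-agnostic.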
Primary Area: generative models
Submission Number: 13932