Control and Realism: Best of Both Worlds in Layout-to-Image without Training

Bonan Li; Yinhan Hu; Songhua Liu; Xinchao Wang

Published: 01 May 2025, Last Modified: 23 Jul 2025. Venue: ICML 2025 poster. License: CC BY 4.0

TL;DR: WinWinLay is a training-free, diffusion-based Layout-to-Image method that revisits attention backward guidance and modifies it to address imprecise localization and unrealistic artifacts.

Abstract: Layout-to-Image generation aims to create complex scenes with precise control over the placement and arrangement of subjects. Existing works have shown that pre-trained Text-to-Image diffusion models can achieve this goal without training on any specific data; however, they often suffer from imprecise localization and unrealistic artifacts. To address these drawbacks, we propose a novel training-free method, WinWinLay. At its core, WinWinLay introduces two key strategies, Non-local Attention Energy Function and Adaptive Update, that jointly enhance control precision and realism. On the one hand, we theoretically demonstrate that the commonly used attention energy function introduces inherent spatial distribution biases, which prevent objects from being uniformly aligned with layout instructions. To overcome this issue, we explore a non-local attention prior that redistributes attention scores, helping objects better conform to the specified spatial conditions. On the other hand, we identify that the vanilla backpropagation update rule can cause deviations from the pre-trained domain, leading to out-of-distribution artifacts. We accordingly introduce a Langevin dynamics-based adaptive update scheme that promotes in-domain updating while respecting layout constraints. Extensive experiments demonstrate that WinWinLay excels at controlling element placement and achieving photorealistic visual fidelity, outperforming current state-of-the-art methods.
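
The abstract does not give the exact form of the energy function or the update rule, so the following is only a toy sketch of the general idea behind attention backward guidance. It assumes a box-ratio energy of the kind commonly used in prior training-free layout control (not WinWinLay's Non-local Attention Energy Function) and a generic Langevin-style step (not the paper's Adaptive Update). The attention map is modeled as a softmax over a small grid of logits, and every function name and parameter here is hypothetical.

```python
import numpy as np

def box_mask(h, w, box):
    """Binary mask for a layout box (x0, y0, x1, y1), end-exclusive indices."""
    x0, y0, x1, y1 = box
    mask = np.zeros((h, w))
    mask[y0:y1, x0:x1] = 1.0
    return mask

def softmax2d(logits):
    """Toy attention map: softmax over all spatial positions."""
    z = np.exp(logits - logits.max())
    return z / z.sum()

def attention_energy(logits, mask):
    """Assumed box-ratio energy: (1 - attention mass inside the box)^2."""
    attn = softmax2d(logits)
    inside = (attn * mask).sum()
    return (1.0 - inside) ** 2

def energy_grad(logits, mask):
    """Analytic gradient of attention_energy with respect to the logits."""
    attn = softmax2d(logits)
    inside = (attn * mask).sum()
    g = -2.0 * (1.0 - inside) * mask        # dE / d(attn)
    return attn * (g - (attn * g).sum())    # chain rule through the softmax

def langevin_step(logits, grad, step, rng, noise_scale=1.0):
    """Generic Langevin-style update: gradient descent plus Gaussian noise."""
    noise = rng.standard_normal(logits.shape)
    return logits - step * grad + noise_scale * np.sqrt(2.0 * step) * noise

def guide(logits, mask, steps=50, step=1.0, noise_scale=0.0, seed=0):
    """Repeatedly apply the update to pull attention into the box."""
    rng = np.random.default_rng(seed)
    for _ in range(steps):
        grad = energy_grad(logits, mask)
        logits = langevin_step(logits, grad, step, rng, noise_scale)
    return logits
```

With `noise_scale=0.0` the update reduces to plain gradient descent on the energy, which corresponds to the vanilla backpropagation update the abstract refers to; a positive `noise_scale` adds the stochastic term of a Langevin step. In an actual diffusion pipeline the gradient would be taken with respect to the latent rather than toy logits.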

Application-Driven Machine Learning: This submission is on Application-Driven Machine Learning.

Primary Area: Applications->Computer Vision

Keywords: Layout-to-Image generation; Training-free

Submission Number: 1021
