ISAC: Training-Free Instance-to-Semantic Attention Control for Improving Multi-Instance Generation

18 Sept 2025 (modified: 20 Nov 2025) · ICLR 2026 Conference Withdrawn Submission · CC BY 4.0
Keywords: text-to-image diffusion models, attention control, multi-instance generation
Abstract: Text-to-image diffusion models excel at synthesizing single objects but frequently fail in multi-instance scenes, producing merged or missing objects. We show that this limitation arises because instance structures emerge before semantic features during denoising, making early semantic guidance unreliable. To address this, we propose Instance-to-Semantic Attention Control (ISAC), a training-free, hierarchical inference objective that first enforces non-overlapping instance formation through self-attention and then aligns semantics through cross-attention. ISAC introduces a maximum pixel-wise overlap (MPO) criterion to strictly decouple instances and can be applied either as latent optimization or latent selection. Experiments on T2I-CompBench, HRS-Bench, and a new similar-object benchmark show that ISAC substantially improves both multi-class and multi-instance fidelity, achieving up to 52% multi-class accuracy and 83% multi-instance accuracy without external supervision. Our findings highlight the importance of aligning control with diffusion dynamics for faithful and scalable multi-object generation. The code will be made available upon publication.
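The abstract describes a two-stage guidance scheme (instance decoupling via self-attention, then semantic alignment via cross-attention) with an MPO criterion applied through latent optimization or latent selection. Below is a minimal, hedged sketch of how such a step could look; the paper's exact formulation is not given in the abstract, and the names `unet_fn`, `get_instance_masks`, `get_cross_attn_loss`, and the step size `lr` are hypothetical placeholders, not the authors' API.

```python
import torch

def mpo_penalty(instance_masks: torch.Tensor) -> torch.Tensor:
    """Hypothetical maximum pixel-wise overlap (MPO) penalty.

    instance_masks: (K, H, W) soft masks in [0, 1], one per target instance,
    e.g. derived from aggregated self-attention maps. The penalty is the
    largest pairwise pixel-wise overlap, which guidance pushes toward zero
    so that instances stay spatially disjoint.
    """
    K = instance_masks.shape[0]
    overlaps = []
    for i in range(K):
        for j in range(i + 1, K):
            # The product is large only where both masks are simultaneously active.
            overlaps.append((instance_masks[i] * instance_masks[j]).max())
    if not overlaps:
        return instance_masks.new_zeros(())
    return torch.stack(overlaps).max()


def isac_style_step(latent, unet_fn, get_instance_masks, get_cross_attn_loss, lr=0.1):
    """One guidance step in the spirit of the abstract: penalize instance overlap
    (self-attention stage) and align semantics (cross-attention stage), then
    update the latent by gradient descent (the latent-optimization variant)."""
    latent = latent.detach().requires_grad_(True)
    attn = unet_fn(latent)                        # attention maps at this denoising step
    loss = mpo_penalty(get_instance_masks(attn))  # instance-level (self-attention) objective
    loss = loss + get_cross_attn_loss(attn)       # semantic (cross-attention) objective
    (grad,) = torch.autograd.grad(loss, latent)
    return (latent - lr * grad).detach()
```

A latent-selection variant, as the abstract suggests, would instead score several candidate latents with the same objective and keep the lowest-scoring one rather than taking a gradient step.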
Supplementary Material: zip
Primary Area: generative models
Submission Number: 11255