ISAC: Training-Free Instance-to-Semantic Attention Control for Improving Multi-Instance Generation

18 Sept 2025 (modified: 20 Nov 2025) · ICLR 2026 Conference Withdrawn Submission · CC BY 4.0
Keywords: text-to-image diffusion models, attention control, multi-instance generation
Abstract: Text-to-image diffusion models excel at synthesizing single objects but frequently fail in multi-instance scenes, producing merged or missing objects. We show that this limitation arises because instance structures emerge before semantic features during denoising, making early semantic guidance unreliable. To address this, we propose Instance-to-Semantic Attention Control (ISAC), a training-free, hierarchical inference objective that first enforces non-overlapping instance formation through self-attention and then aligns semantics through cross-attention. ISAC introduces a maximum pixel-wise overlap (MPO) criterion to strictly decouple instances and can be applied either as latent optimization or latent selection. Experiments on T2I-CompBench, HRS-Bench, and a new similar-object benchmark show that ISAC substantially improves both multi-class and multi-instance fidelity, achieving up to 52% multi-class accuracy and 83% multi-instance accuracy without external supervision. Our findings highlight the importance of aligning control with diffusion dynamics for faithful and scalable multi-object generation. The code will be made available upon publication.
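The abstract describes a two-stage guidance scheme (instance decoupling via self-attention, then semantic alignment via cross-attention) with an MPO criterion applied through latent optimization or latent selection. Below is a minimal, hedged sketch of how such a step could look; the paper's exact formulation is not given in the abstract, and the names `unet_fn`, `get_instance_masks`, `get_cross_attn_loss`, and the step size `lr` are hypothetical placeholders, not the authors' API.

```python
import torch

def mpo_penalty(instance_masks: torch.Tensor) -> torch.Tensor:
    """Hypothetical maximum pixel-wise overlap (MPO) penalty.

    instance_masks: (K, H, W) soft masks in [0, 1], one per target instance,
    e.g. derived from aggregated self-attention maps. The penalty is the
    largest pairwise pixel-wise overlap, which guidance pushes toward zero
    so that instances stay spatially disjoint.
    """
    K = instance_masks.shape[0]
    overlaps = []
    for i in range(K):
        for j in range(i + 1, K):
            # The product is large only where both masks are simultaneously active.
            overlaps.append((instance_masks[i] * instance_masks[j]).max())
    if not overlaps:
        return instance_masks.new_zeros(())
    return torch.stack(overlaps).max()


def isac_style_step(latent, unet_fn, get_instance_masks, get_cross_attn_loss, lr=0.1):
    """One guidance step in the spirit of the abstract: penalize instance overlap
    (self-attention stage) and align semantics (cross-attention stage), then
    update the latent by gradient descent (the latent-optimization variant)."""
    latent = latent.detach().requires_grad_(True)
    attn = unet_fn(latent)                        # attention maps at this denoising step
    loss = mpo_penalty(get_instance_masks(attn))  # instance-level (self-attention) objective
    loss = loss + get_cross_attn_loss(attn)       # semantic (cross-attention) objective
    (grad,) = torch.autograd.grad(loss, latent)
    return (latent - lr * grad).detach()
```

A latent-selection variant, as the abstract suggests, would instead score several candidate latents with the same objective and keep the lowest-scoring one rather than taking a gradient step.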
Supplementary Material: zip
Primary Area: generative models
Submission Number: 11255