DiDI: Disentangle Denoise Inject for Improving T2I Diffusion Models

09 Sept 2025 (modified: 18 Nov 2025) · ICLR 2026 Conference Withdrawn Submission · CC BY 4.0
Keywords: Text-to-image synthesis, Diffusion models, Initial noise optimization, Text-to-image alignment
TL;DR: We propose Disentangle Denoise Inject (DiDI), a training-free pipeline that improves T2I alignment even under poor initializations by explicitly modifying the initial latent representation.
Abstract: Text-to-image (T2I) synthesis has been revolutionized by diffusion models. However, state-of-the-art (SOTA) models like Stable Diffusion still suffer from well-known alignment issues, such as subject mixing and subject neglect, when composing from multi-subject prompts. While recent efforts aim to address these misalignments, they remain vulnerable to bad initial noise seeds. We propose Disentangle Denoise Inject (DiDI): a training-free pipeline that improves T2I alignment even under poor initializations by explicitly modifying the initial latent representation. We observe that alignment failures in multi-subject settings, caused by cross-attention overlap and inadequate subject semantics, fundamentally stem from poor initial noise. To this end, DiDI performs spatial disentanglement on the initial latent to mitigate subject mixing. Our core insight is that diffusion models, even with sub-optimal seeds, can still reliably synthesize individual subjects and their attributes. Accordingly, DiDI introduces a partial denoising scheme that generates early semantic features for individual subjects and attributes. By injecting these representative features into the disentangled latent, DiDI guides the denoising process towards more faithful generations. Extensive experiments demonstrate the superiority of our method over existing SOTA approaches.
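The abstract outlines a three-stage pipeline (disentangle, partially denoise, inject). The following is a minimal, hypothetical sketch of that flow using a toy NumPy "latent"; all function names, the placeholder denoiser, and the blending weight `alpha` are illustrative assumptions, not the authors' implementation, which would operate on Stable Diffusion latents with an actual noise-prediction model.

```python
# Hypothetical sketch of the DiDI pipeline from the abstract (not the
# authors' code). A real implementation would act on diffusion latents.
import numpy as np

def disentangle(latent, masks):
    """Restrict the initial noise so each subject owns a disjoint region."""
    out = np.zeros_like(latent)
    for mask in masks:
        out += latent * mask  # keep noise only inside each subject's mask
    return out

def partial_denoise(latent, steps=3):
    """Toy stand-in for a few early denoising steps yielding coarse
    per-subject semantic features (a real model would predict noise)."""
    feat = latent.copy()
    for _ in range(steps):
        feat = 0.5 * feat  # placeholder update
    return feat

def inject(latent, features, masks, alpha=0.5):
    """Blend each subject's early features back into its latent region."""
    out = latent.copy()
    for feat, mask in zip(features, masks):
        out = np.where(mask > 0, (1 - alpha) * out + alpha * feat, out)
    return out

# Usage on a toy 1x4x4 latent with two non-overlapping subject masks.
rng = np.random.default_rng(0)
latent = rng.standard_normal((1, 4, 4))
m1 = np.zeros((1, 4, 4)); m1[:, :, :2] = 1.0   # left half: subject 1
m2 = np.zeros((1, 4, 4)); m2[:, :, 2:] = 1.0   # right half: subject 2

dis = disentangle(latent, [m1, m2])
feats = [partial_denoise(latent * m) for m in (m1, m2)]
modified = inject(dis, feats, [m1, m2])  # latent passed on to full denoising
```

The sketch only illustrates the data flow: subjects are spatially separated before denoising begins, and cheap partial-denoising features are blended into the separated regions to seed each subject's semantics.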
Supplementary Material: zip
Primary Area: generative models
Submission Number: 3347