Attention Overlap Is Responsible for the Entity Missing Problem in Text-to-Image Diffusion Models!

Published: 12 Mar 2025, Last Modified: 12 Mar 2025. Accepted by TMLR. License: CC BY 4.0.
Abstract: Text-to-image diffusion models such as Stable Diffusion and DALL-E have exhibited impressive capabilities in producing high-quality, diverse, and realistic images based on textual prompts. Nevertheless, a common issue arises where these models encounter difficulties in faithfully generating every entity specified in the prompt, leading to a recognized challenge known as entity missing in visual compositional generation. While previous studies indicated that actively adjusting cross-attention maps during inference could potentially resolve the issue, there has been a lack of systematic investigation into the specific objective function required for this task. In this work, we thoroughly investigate three potential causes of entity missing from the perspective of cross-attention maps: insufficient attention intensity, excessive attention spread, and significant overlap between attention maps of different entities. Through comprehensive empirical analysis, we find that optimizing metrics that quantify the overlap between attention maps of entities is highly effective at mitigating entity missing. We hypothesize that during the denoising process, entity-related tokens engage in a form of competition for attention toward specific regions through the cross-attention mechanism. This competition may result in the attention of a spatial location being divided among multiple tokens, leading to difficulties in accurately generating the entities associated with those tokens. Building on this insight, we propose four overlap-based loss functions that can be used to implicitly manipulate the latent embeddings of the diffusion model during inference: intersection over union (IoU), center-of-mass (CoM) distance, Kullback–Leibler (KL) divergence, and clustering compactness (CC). Extensive experiments on a diverse set of prompts demonstrate that our proposed training-free methods substantially outperform previous approaches on a range of compositional alignment metrics, including visual question-answering, captioning score, CLIP similarity, and human evaluation. Notably, our method outperforms the best baseline by $9\%$ in human evaluation.
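The mechanism the abstract describes, penalizing pairwise overlap between the cross-attention maps of entity tokens and nudging the latents along the loss gradient at each denoising step, can be sketched in a few lines. The following is a minimal PyTorch illustration, not the authors' implementation: `get_entity_attn` is a hypothetical hook that runs the denoiser on the current latents and returns one (H, W) cross-attention map per entity token, and `step_size` is an illustrative hyperparameter.

```python
import torch


def soft_iou(a: torch.Tensor, b: torch.Tensor, eps: float = 1e-8) -> torch.Tensor:
    """Soft intersection-over-union of two non-negative (H, W) attention maps."""
    inter = torch.minimum(a, b).sum()
    union = torch.maximum(a, b).sum()
    return inter / (union + eps)


def overlap_loss(attn_maps: list) -> torch.Tensor:
    """Sum of pairwise soft-IoU overlaps across entity attention maps.

    Minimizing this encourages each entity token to claim its own spatial
    region, countering the attention competition the paper hypothesizes.
    """
    loss = attn_maps[0].new_zeros(())
    for i in range(len(attn_maps)):
        for j in range(i + 1, len(attn_maps)):
            loss = loss + soft_iou(attn_maps[i], attn_maps[j])
    return loss


def latent_update(latents: torch.Tensor, get_entity_attn, step_size: float = 0.1) -> torch.Tensor:
    """One training-free gradient step on the latents at a denoising step.

    `get_entity_attn(latents)` is an assumed callable: it must run the
    denoiser differentiably and return one (H, W) cross-attention map
    per entity token in the prompt.
    """
    latents = latents.detach().requires_grad_(True)
    loss = overlap_loss(get_entity_attn(latents))
    (grad,) = torch.autograd.grad(loss, latents)
    return (latents - step_size * grad).detach()
```

Under the same assumptions, the paper's other pairwise objectives (e.g., the CoM-distance or KL-divergence variants) would slot into the inner loop in place of `soft_iou`, with the sign chosen so that minimizing the loss pushes the maps apart.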
Submission Length: Long submission (more than 12 pages of main content)
Changes Since Last Submission:
- **Section 1 (Introduction):** To address the concern regarding other compositional generation challenges raised by Reviewers 9HB8 and jij7, a new sentence has been added to the penultimate paragraph, describing experiments on compositional generation problems such as incorrect attribute binding, incorrect spatial relationships, and numeracy-related issues.
- **Subsection 2.1 (Visual Compositional Generation):** As requested by Reviewer 9HB8, a new citation has been added to the second paragraph.
- **Subsection 2.2 (Training-Free Methods for Compositional Generation):** As requested by Reviewer jij7, a new paragraph has been added at the beginning of the subsection, describing fine-tuning-based approaches for mitigating visual compositional generation problems.
- **Subsection 6.2 (Quantitative Results):** As requested by Reviewers 9HB8 and jij7, two new paragraphs have been added to the revised manuscript: the third paragraph describes experiments involving background entities, while the last paragraph discusses other visual compositional generation challenges. Correspondingly, two new tables (Tables 6 and 8) have been included.
- **Subsection 6.3 (Qualitative Results):** As requested by Reviewer jij7, a new sentence addressing background entities has been added to the penultimate paragraph of the subsection.
- **Section 7 (Conclusion):** To address the concern regarding other compositional generation challenges raised by Reviewers 9HB8 and jij7, a new sentence has been added to the conclusion, summarizing the results of experiments on compositional generation problems such as incorrect attribute binding, erroneous spatial relationships, and numeracy-related issues.
- **References:** Four new references have been added in relation to the new paragraph in Subsection 2.2, as requested by Reviewer jij7.
- **Section E in Appendix (Additional Qualitative Results):** As requested by Reviewers jij7 and q34m, two new figures (Figures 8 and 9) have been added to demonstrate the qualitative results of experiments involving background entities and free-style prompts, respectively.
Code: https://github.com/sharif-ml-lab/entity_missing
Supplementary Material: zip
Assigned Action Editor: ~Changyou_Chen1
Submission Number: 3520