EGGen: Image Generation with Multi-entity Prior Learning through Entity Guidance

Published: 20 Jul 2024 · Last Modified: 21 Jul 2024 · MM2024 Poster · CC BY 4.0
Abstract: Diffusion models have shown remarkable prowess in text-to-image synthesis and editing, yet they often stumble when interpreting complex prompts that describe multiple entities with specific attributes and interrelations. The generated images often contain inconsistent multi-entity representation (IMR), manifested as inaccurate renderings of the multiple entities and their attributes. Although existing works improve multi-entity generation quality by providing spatial layout guidance, it remains challenging to handle attribute leakage and avoid unnatural artifacts. To address the IMR challenge, we first conduct in-depth analyses of the diffusion process and the attention operation, revealing that the IMR challenge largely stems from the cross-attention mechanism. Guided by these analyses, we introduce the entity guidance generation mechanism, which preserves the original diffusion model parameters by integrating plug-in networks. Our work extends the Stable Diffusion model by segmenting comprehensive prompts into distinct entity-specific prompts with bounding boxes, enabling a transition from multi-entity to single-entity generation in the cross-attention layers. More importantly, we introduce entity-centric cross-attention layers that focus on individual entities to preserve their uniqueness and accuracy, alongside global entity alignment layers that refine cross-attention maps using multi-entity priors for precise positioning and attribute accuracy. Additionally, a linear attenuation module progressively reduces the influence of these layers during inference, preventing oversaturation and preserving generation fidelity. Comprehensive experiments demonstrate that entity guidance generation enhances existing text-to-image models in generating detailed multi-entity images.
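The paper's exact formulation is not given on this page, so purely for intuition, below is a minimal PyTorch sketch of the idea the abstract describes: each entity prompt attends to the latent on its own (single-entity cross-attention), its contribution is confined to its bounding box so attributes do not leak across entities, and the guidance residual is scaled down linearly over the sampling steps (the linear attenuation schedule). All names (`box_to_mask`, `entity_guided_attention`), tensor shapes, and the masked-residual design are illustrative assumptions, not the authors' implementation; the random embeddings stand in for real text-encoder outputs.

```python
import torch

def box_to_mask(box, h, w):
    # Rasterize a normalized (x0, y0, x1, y1) box into a flat binary mask
    # over the h x w latent grid. (Illustrative helper, not from the paper.)
    x0, y0, x1, y1 = box
    m = torch.zeros(h, w)
    m[int(y0 * h):int(y1 * h), int(x0 * w):int(x1 * w)] = 1.0
    return m.reshape(-1)  # (h*w,)

def entity_guided_attention(x, entity_ctx, entity_boxes, h, w, step, total_steps):
    """
    x:            (h*w, d) latent image tokens at one cross-attention layer
    entity_ctx:   list of (tokens_i, d) text embeddings, one per entity prompt
    entity_boxes: list of normalized (x0, y0, x1, y1) boxes, one per entity
    Returns x plus an entity-guided residual whose weight decays linearly
    over the sampling trajectory.
    """
    d = x.shape[-1]
    residual = torch.zeros_like(x)
    for ctx, box in zip(entity_ctx, entity_boxes):
        # Single-entity cross-attention: queries come from the latent,
        # keys/values from this entity's own prompt embedding only.
        attn = torch.softmax(x @ ctx.transpose(0, 1) / d ** 0.5, dim=-1)  # (h*w, tokens_i)
        out = attn @ ctx                                                  # (h*w, d)
        # Restrict the entity's influence to its bounding-box region so
        # its attributes do not bleed into other entities' regions.
        mask = box_to_mask(box, h, w).unsqueeze(-1)                       # (h*w, 1)
        residual = residual + mask * out
    # Linear attenuation: guidance fades as denoising progresses, which is
    # meant to prevent oversaturation late in sampling.
    weight = 1.0 - step / total_steps
    return x + weight * residual

# Example: two entities with side-by-side boxes on a 16x16 latent grid.
h = w = 16
x = torch.randn(h * w, 64)
cat_ctx = torch.randn(5, 64)   # stand-in for an "a black cat" prompt embedding
dog_ctx = torch.randn(7, 64)   # stand-in for an "a brown dog" prompt embedding
out = entity_guided_attention(
    x, [cat_ctx, dog_ctx],
    [(0.0, 0.2, 0.45, 0.9), (0.55, 0.2, 1.0, 0.9)],
    h, w, step=10, total_steps=50,
)
```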
Primary Subject Area: [Generation] Generative Multimedia
Secondary Subject Area: [Content] Vision and Language
Relevance To Conference: This work directly contributes to the theme of "Multimedia in the Generative AI Era" by integrating a novel entity guidance generation (EGGen) mechanism with diffusion models. Our approach enhances the alignment between textual prompts and the visual modality, specifically targeting the precision and fidelity of multi-entity image synthesis. By implementing entity-specific processing within cross-attention layers and introducing entity-centric cross-attention (ECA), global entity alignment (GEA), and linear attenuation (LA), our model not only improves the generation of multimedia content with high realism and diversity but also paves the way for interactive and personalized multimedia applications. This aligns with the conference's focus on innovative techniques that elevate the capabilities of generative AI in multimedia systems.
Supplementary Material: zip
Submission Number: 834