Decoder-Only LLMs are Better Controllers for Diffusion Models

Published: 20 Jul 2024 · Last Modified: 21 Jul 2024 · MM 2024 Poster · CC BY 4.0
Abstract: Groundbreaking advancements in text-to-image generation have recently been achieved with the emergence of diffusion models. These models exhibit a remarkable ability to generate highly artistic and intricately detailed images from textual prompts. However, obtaining the desired output often requires repeated trial-and-error manipulation of the text prompt, much like casting spells on a magic mirror, because current image generation models have limited semantic understanding. Specifically, existing diffusion models encode the input text prompt with a pre-trained encoder that is typically trained on a limited amount of image-caption pairs. In contrast, state-of-the-art large language models (LLMs) based on the decoder-only structure have shown very powerful semantic understanding, as their architectures are better suited to training on very large-scale unlabeled data. In this work, we propose to enhance text-to-image diffusion models by borrowing the semantic understanding strength of LLMs, introducing a simple yet effective adapter that makes diffusion models compatible with the decoder-only structure. In the evaluation, we provide not only extensive empirical results but also supporting theoretical analysis across various architectures (e.g., encoder-only, encoder-decoder, and decoder-only). The experimental results show that models enhanced with our adapter module outperform state-of-the-art models in text-to-image generation quality and reliability.
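
To make the adapter idea concrete, below is a minimal PyTorch sketch of one plausible design: project the last-layer hidden states of a frozen decoder-only LLM into the conditioning dimension of the diffusion model's cross-attention layers, then refine them with a small bidirectional transformer. This is not the authors' released implementation; the names (LLMAdapter, llm_dim, cond_dim, num_layers) and the specific layer choices are illustrative assumptions.

```python
# Hypothetical sketch of an LLM-to-diffusion adapter; not the paper's code.
import torch
import torch.nn as nn


class LLMAdapter(nn.Module):
    """Map decoder-only LLM token features to a diffusion text-conditioning space."""

    def __init__(self, llm_dim: int = 4096, cond_dim: int = 768, num_layers: int = 2):
        super().__init__()
        # Linear projection bridges the dimensionality gap between the LLM
        # hidden size and the conditioning size the denoising U-Net expects.
        self.proj = nn.Linear(llm_dim, cond_dim)
        # A few self-attention blocks let the causally (left-to-right) encoded
        # LLM features mix bidirectionally before conditioning the image model.
        layer = nn.TransformerEncoderLayer(
            d_model=cond_dim, nhead=8, dim_feedforward=4 * cond_dim,
            batch_first=True, norm_first=True,
        )
        self.blocks = nn.TransformerEncoder(layer, num_layers=num_layers)
        self.norm = nn.LayerNorm(cond_dim)

    def forward(self, llm_hidden: torch.Tensor) -> torch.Tensor:
        # llm_hidden: (batch, seq_len, llm_dim), e.g. last-layer states of a frozen LLM.
        x = self.proj(llm_hidden)
        x = self.blocks(x)
        return self.norm(x)  # (batch, seq_len, cond_dim), fed to cross-attention


# Usage sketch: stand in for the CLIP text-encoder output with adapted LLM features.
adapter = LLMAdapter()
fake_llm_states = torch.randn(1, 77, 4096)  # placeholder for decoder-only LLM output
cond = adapter(fake_llm_states)             # condition the diffusion U-Net on this
```

Under this reading, only the adapter would be trained while both the LLM and the diffusion backbone stay frozen, which is what would keep the approach "simple yet effective" in practice.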
Primary Subject Area: [Generation] Generative Multimedia
Secondary Subject Area: [Content] Vision and Language
Relevance To Conference: This work provides a novel text-based multimedia generation approach based on large language models that is more controllable and can better understand the user's intent.
Submission Number: 1663