Prompt-Free Diffusion: Taking “Text” out of Text-to-Image Diffusion Models

22 Sept 2023 (modified: 25 Mar 2024) · ICLR 2024 Conference Withdrawn Submission
Keywords: Text-to-Image, Diffusion Models, Image-Variation, Generative Models, High-Quality Image Generation, Representation Learning
TL;DR: We propose Prompt-Free Diffusion, empowered by a new visual encoder, SeeCoder, with which pretrained T2I diffusion models can synthesize high-quality images from reference image inputs, removing the need for text prompts.
Abstract: Text-to-image (T2I) research has grown explosively in the past year, owing to large-scale pre-trained diffusion models and many emerging personalization and editing approaches. Yet \textbf{one pain point persists: text prompt engineering} - searching for high-quality text prompts that yield customized results is more art than science. Moreover, as commonly argued, ``an image is worth a thousand words'': attempts to describe a desired image in text often end up ambiguous and cannot comprehensively cover delicate visual details, hence necessitating additional controls from the visual domain. In this paper, we take a bold step forward: taking ``Text'' out of a pre-trained T2I diffusion model, to reduce the burdensome prompt engineering effort for users. Our proposed framework, \textbf{Prompt-Free Diffusion}, relies on \textbf{only visual inputs to generate new images}: it takes a reference image as ``context'', an optional image structural conditioning, and an initial noise, with absolutely no text prompt. The core architecture behind the scenes is the \textbf{Se}mantic Context \textbf{E}n\textbf{coder} (\textbf{SeeCoder}), which substitutes for the commonly used CLIP-based or LLM-based text encoder. The reusability of SeeCoder also makes it a convenient drop-in component: one can pre-train a SeeCoder with one T2I model and reuse it for another. Through extensive experiments, Prompt-Free Diffusion is found to (i) outperform prior exemplar-based image synthesis approaches; (ii) perform on par with state-of-the-art T2I models using prompts that follow best practice; and (iii) extend naturally to other downstream applications such as anime figure generation and virtual try-on, with promising quality. Our code and models will be open-sourced.
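To make the drop-in idea concrete, below is a minimal, hypothetical sketch (not the authors' released code): a visual context encoder emits a sequence of embeddings shaped like text-prompt embeddings, and a frozen T2I UNet consumes them through its usual cross-attention interface. All names here (SeeCoderSketch, prompt_free_denoise_step, and the assumed unet(latents, t, encoder_hidden_states=...) calling convention) are illustrative assumptions, not the paper's actual architecture.

```python
# Hypothetical sketch of the prompt-free idea: a visual encoder replaces the
# CLIP/LLM text encoder by producing a fixed-length "prompt" of visual tokens.
import torch
import torch.nn as nn


class SeeCoderSketch(nn.Module):
    """Toy stand-in for SeeCoder: reference image -> context token sequence."""

    def __init__(self, embed_dim: int = 768, num_tokens: int = 77):
        super().__init__()
        # Patchify + shallow transformer; the real SeeCoder design differs.
        self.patch_embed = nn.Conv2d(3, embed_dim, kernel_size=16, stride=16)
        layer = nn.TransformerEncoderLayer(embed_dim, nhead=8, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=4)
        # Learned queries pool patch tokens into a fixed-length sequence,
        # matching the shape the UNet expects from text embeddings.
        self.queries = nn.Parameter(torch.randn(num_tokens, embed_dim))
        self.pool = nn.MultiheadAttention(embed_dim, num_heads=8, batch_first=True)

    def forward(self, ref_image: torch.Tensor) -> torch.Tensor:
        x = self.patch_embed(ref_image)            # (B, C, H/16, W/16)
        x = x.flatten(2).transpose(1, 2)           # (B, N, C) patch tokens
        x = self.encoder(x)
        q = self.queries.unsqueeze(0).expand(x.size(0), -1, -1)
        context, _ = self.pool(q, x, x)            # (B, num_tokens, C)
        return context                             # same shape as text embeddings


@torch.no_grad()
def prompt_free_denoise_step(unet, seecoder, latents, timestep, ref_image):
    """One schematic denoising step: the UNet's cross-attention is fed visual
    context instead of text embeddings; no text prompt is used anywhere."""
    context = seecoder(ref_image)
    return unet(latents, timestep, encoder_hidden_states=context)
```

In a full pipeline this step would sit inside a standard noise-scheduler loop starting from pure Gaussian noise, with the optional structural conditioning mentioned in the abstract supplied through a separate conditioning branch.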
Primary Area: generative models
Code Of Ethics: I acknowledge that I and all co-authors of this work have read and commit to adhering to the ICLR Code of Ethics.
Submission Guidelines: I certify that this submission complies with the submission instructions as described on https://iclr.cc/Conferences/2024/AuthorGuide.
Anonymous Url: I certify that there is no URL (e.g., github page) that could be used to find authors' identity.
No Acknowledgement Section: I certify that there is no acknowledgement section in this submission for double blind review.
Submission Number: 6238