Text-to-image (T2I) research has grown explosively in the past year, owing to large-scale pre-trained diffusion models and many emerging personalization and editing approaches. Yet, \textbf{one pain point persists: text prompt engineering}, and searching for high-quality text prompts tailored to customized results remains more art than science. Moreover, as commonly argued, “an image is worth a thousand words”: attempting to describe a desired image with text often ends up being ambiguous and cannot comprehensively cover delicate visual details, hence necessitating additional controls from the visual domain. In this paper, we take a bold step forward: taking “Text” out of a pre-trained T2I diffusion model, to reduce the burdensome prompt engineering efforts for users. Our proposed framework, \textbf{Prompt-Free Diffusion}, relies on \textbf{only visual inputs to generate new images}: it takes a reference image as
“context”, an optional image structural conditioning, and an initial noise, with absolutely no text prompt. The core architecture behind the scenes is the \textbf{Se}mantic Context \textbf{E}n\textbf{coder} (\textbf{SeeCoder}), which substitutes the commonly used CLIP-based or LLM-based text encoder. The reusability of SeeCoder also makes it a convenient drop-in component: one can pre-train a SeeCoder with one T2I model and reuse it in another. Through extensive experiments, Prompt-Free Diffusion is found to (i) outperform prior exemplar-based image synthesis approaches; (ii) perform on par with state-of-the-art T2I models using prompts that follow best practices; and (iii) be naturally extensible to other downstream applications such as anime figure generation and virtual try-on, with promising quality. Our code and models will be open-sourced.
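The abstract describes SeeCoder only at the level of a drop-in replacement for the text encoder: it maps a reference image to the context embeddings that, in a typical latent-diffusion T2I model, would otherwise come from CLIP and feed the UNet's cross-attention. The sketch below is a minimal PyTorch-style illustration of that drop-in idea only; the backbone, the learnable-query pooling, and all dimensions are assumptions for illustration and do not reflect the paper's actual SeeCoder architecture.

```python
import torch
import torch.nn as nn


class SeeCoderSketch(nn.Module):
    """Illustrative stand-in for SeeCoder: encodes a reference image into a
    fixed-length token sequence shaped like CLIP text embeddings. The backbone,
    query pooling, and sizes here are assumptions, not the paper's design."""

    def __init__(self, token_dim: int = 768, num_tokens: int = 77):
        super().__init__()
        # Patchify the image into a grid of visual features (any visual backbone would do).
        self.backbone = nn.Conv2d(3, token_dim, kernel_size=16, stride=16)
        # Learnable queries pool the grid into a fixed number of context tokens.
        self.queries = nn.Parameter(torch.randn(num_tokens, token_dim))
        self.pool = nn.MultiheadAttention(token_dim, num_heads=8, batch_first=True)

    def forward(self, reference_image: torch.Tensor) -> torch.Tensor:
        # reference_image: (B, 3, H, W) -> visual tokens: (B, N, token_dim)
        feats = self.backbone(reference_image).flatten(2).transpose(1, 2)
        queries = self.queries.unsqueeze(0).expand(feats.size(0), -1, -1)
        context, _ = self.pool(queries, feats, feats)  # (B, num_tokens, token_dim)
        return context  # used in place of the text encoder's output


# Usage: wherever the T2I pipeline previously consumed text embeddings, e.g.
#   noise_pred = unet(noisy_latents, t, encoder_hidden_states=context)
# pass the SeeCoder output instead; no text prompt is involved.
encoder = SeeCoderSketch()
context = encoder(torch.randn(1, 3, 512, 512))
print(context.shape)  # torch.Size([1, 77, 768])
```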