DPO-Diff: On Discrete Prompt Optimization of Text-to-Image Diffusion Models

Ruochen Wang; Ting Liu; Cho-Jui Hsieh; Boqing Gong

21 Sept 2023 (modified: 11 Feb 2024) · Submitted to ICLR 2024

Primary Area: generative models

Code Of Ethics: I acknowledge that I and all co-authors of this work have read and commit to adhering to the ICLR Code of Ethics.

Keywords: Diffusion Model, Foundation Model, Multimodal, Text-to-Image Generation, Prompt Optimization

Submission Guidelines: I certify that this submission complies with the submission instructions as described on https://iclr.cc/Conferences/2024/AuthorGuide.

Abstract: This paper introduces the first gradient-based framework for prompt optimization in text-to-image diffusion models. We formulate prompt engineering as a discrete optimization problem over the language space. Two major challenges arise in solving this problem efficiently: 1) Enormous Domain Space: setting the domain to the entire language space makes the optimization process extremely difficult. 2) Text Gradient: computing the text gradient incurs prohibitive memory and runtime costs, because it requires backpropagating through all inference steps of the diffusion model. Beyond the problem formulation, our main technical contributions address these two challenges. First, we design a family of dynamically generated compact subspaces consisting only of the words most relevant to the user input, which substantially restricts the domain space. Second, we introduce the "Shortcut Gradient", an effective replacement for the text gradient that can be obtained with constant memory and runtime. Empirical evaluation on prompts collected from diverse sources (DiffusionDB, ChatGPT, COCO) suggests that our method can discover prompts that substantially improve (prompt enhancement) or destroy (adversarial attack) the faithfulness of images generated by the text-to-image diffusion model.
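The abstract does not spell out the optimizer, so the following is only a rough illustration of the first idea: searching over small per-word candidate subspaces instead of the whole vocabulary. This minimal Python sketch assumes each word choice is relaxed to a plain softmax over its candidates, descends on a toy expected loss, and decodes the discrete prompt with argmax. The candidate lists, the toy scoring function, and the helpers `optimize_prompt` and `search_space_size` are hypothetical; a real objective would be computed through the diffusion model, and this deterministic softmax relaxation is a stand-in, not the authors' method.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def search_space_size(subspaces):
    """Number of discrete prompts spanned by the per-word candidate subspaces."""
    return math.prod(len(cands) for cands in subspaces)


def optimize_prompt(subspaces, score, steps=200, lr=1.0):
    """Hypothetical sketch: relax each word choice to a softmax over its
    candidate subspace, minimize the expected loss -E[score] by gradient
    descent on the logits, then read off the discrete prompt with argmax."""
    logits = [[0.0] * len(cands) for cands in subspaces]
    for _ in range(steps):
        for i, cands in enumerate(subspaces):
            p = softmax(logits[i])
            s = [score(c) for c in cands]
            expected = sum(pj * sj for pj, sj in zip(p, s))
            for j in range(len(cands)):
                # Analytic gradient of -E[s] w.r.t. logit z_j: -p_j * (s_j - E[s])
                grad = -p[j] * (s[j] - expected)
                logits[i][j] -= lr * grad
    words = []
    for i, cands in enumerate(subspaces):
        best = max(range(len(cands)), key=lambda j: logits[i][j])
        words.append(cands[best])
    return " ".join(words)


# Hypothetical compact subspaces for the user prompt "a cat sitting".
subspaces = [["a"], ["cat", "kitten", "feline"], ["sitting", "resting", "perched"]]
# Toy per-word scores standing in for a model-based faithfulness signal.
toy_scores = {"kitten": 0.9, "perched": 0.8}

print(search_space_size(subspaces))  # 9
print(optimize_prompt(subspaces, lambda w: toy_scores.get(w, 0.1)))  # a kitten perched
```

Restricting each position to a handful of candidates keeps the search space at the product of the subspace sizes (9 prompts here) rather than the vocabulary size raised to the prompt length, which is the motivation the abstract gives for the compact subspaces.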

Anonymous Url: I certify that there is no URL (e.g., github page) that could be used to find authors' identity.

Supplementary Material: zip

No Acknowledgement Section: I certify that there is no acknowledgement section in this submission for double blind review.

Submission Number: 3952