An Adversarial Training Approach to Robustify Stable Diffusion Systems Against Prompting Attacks

TMLR Paper 3957 Authors

13 Jan 2025 (modified: 04 Mar 2025) · Under review for TMLR · CC BY 4.0
Abstract: Text-to-Image (T2I) systems are generative models that synthesize images from textual descriptions. Despite their remarkable performance, they have been shown to be susceptible to misuse. One form of misuse involves manipulating the input prompt so that the generated images no longer match the given description. To address this, we introduce an adversarial training (AT) procedure for Stable Diffusion. Our aim is to train the model across various concepts (e.g., "bicycle") so that its output aligns with the original concept even under adversarial modifications of the prompt (e.g., "bicycle MJZM4"). To our knowledge, this is the first adversarial training approach developed against this type of misuse. Finally, through several experiments, we demonstrate that the proposed method enhances the robustness of the model against certain classes of prompting attacks.
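The defense sketched in the abstract can be read as a min-max loop: an inner search for an adversarial prompt suffix that degrades generation, followed by an outer update that trains the model to still match the clean concept. Below is a minimal, self-contained sketch of that loop in PyTorch. The toy text encoder, toy denoiser, continuous-time loss, and random suffix search are all illustrative assumptions made for this sketch; they are not the paper's implementation, nor Stable Diffusion's actual architecture or attack.

```python
# Hedged sketch of adversarial training against prompt-suffix attacks.
# Everything here (ToyTextEncoder, ToyDenoiser, the loss, the suffix
# search) is a stand-in, not the paper's code.

import torch
import torch.nn as nn

VOCAB, DIM = 1000, 64

class ToyTextEncoder(nn.Module):
    """Stand-in for a CLIP-style text encoder: embed and mean-pool tokens."""
    def __init__(self):
        super().__init__()
        self.emb = nn.Embedding(VOCAB, DIM)
    def forward(self, tokens):            # tokens: (batch, seq) long tensor
        return self.emb(tokens).mean(1)   # (batch, DIM) prompt embedding

class ToyDenoiser(nn.Module):
    """Predicts the noise added to x_t, conditioned on the prompt embedding."""
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(DIM * 2, 128), nn.ReLU(),
                                 nn.Linear(128, DIM))
    def forward(self, x_t, cond):
        return self.net(torch.cat([x_t, cond], dim=-1))

def diffusion_loss(denoiser, cond, x0):
    """Epsilon-prediction objective on a toy 1-D 'latent' (simplified schedule)."""
    noise = torch.randn_like(x0)
    t = torch.rand(x0.size(0), 1)                       # crude timestep stand-in
    x_t = (1 - t).sqrt() * x0 + t.sqrt() * noise
    return ((denoiser(x_t, cond) - noise) ** 2).mean()

encoder, denoiser = ToyTextEncoder(), ToyDenoiser()
opt = torch.optim.Adam(list(encoder.parameters()) + list(denoiser.parameters()),
                       lr=1e-3)

concept_tokens = torch.randint(0, VOCAB, (8, 4))  # clean prompts, e.g. "bicycle"
x0 = torch.randn(8, DIM)                          # target latents for the concept

for step in range(200):
    # Inner maximization: pick the suffix token that most increases the loss.
    # A random candidate search is a cheap stand-in for the gradient-based
    # token attacks such defenses target.
    with torch.no_grad():
        candidates = torch.randint(0, VOCAB, (32,))
        losses = []
        for c in candidates:
            adv = torch.cat([concept_tokens, c.expand(8, 1)], dim=1)  # "bicycle MJZM4"
            losses.append(diffusion_loss(denoiser, encoder(adv), x0))
        worst = candidates[torch.stack(losses).argmax()]

    # Outer minimization: train so the output still matches the clean
    # concept even with the adversarial suffix appended.
    adv = torch.cat([concept_tokens, worst.expand(8, 1)], dim=1)
    loss = diffusion_loss(denoiser, encoder(adv), x0)
    opt.zero_grad()
    loss.backward()
    opt.step()
```

In the paper's actual setting the inner step would presumably attack the real Stable Diffusion text encoder with a stronger, gradient-guided token search; the random candidate scan above merely marks where that attack plugs into the training loop.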
Submission Length: Regular submission (no more than 12 pages of main content)
Assigned Action Editor: ~Xiaochun_Cao3
Submission Number: 3957