Abstract: Text-to-Image (T2I) systems are generative models that synthesize images from textual descriptions. Despite their remarkable performance, they have been shown to be susceptible to misuse. One form of misuse involves manipulating the input prompt, leading to images that do not match the given description. To address this, we introduce an adversarial training (AT) procedure for Stable Diffusion. Our aim is to train the model across various concepts (e.g., ``bicycle''), ensuring that the output aligns with the original concept even under adversarial modifications (e.g., ``bicycle MJZM4''). To our knowledge, this is the first adversarial training approach developed against this type of misuse. Finally, through several experiments, we demonstrate that the proposed method enhances the robustness of the model against certain classes of prompting attacks.
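To make the described procedure concrete, the sketch below shows what one adversarial fine-tuning step for Stable Diffusion could look like using Hugging Face diffusers: the standard diffusion denoising loss is computed on images of the clean concept, but conditioned on an adversarially perturbed prompt. The random-suffix attack (`make_adversarial_prompt`), the choice to update only the UNet, and all hyperparameters are illustrative assumptions, not the paper's exact attack model or training recipe.

```python
# Minimal sketch of one adversarial training step for Stable Diffusion.
# Assumption: the attacker appends a gibberish suffix to the prompt
# (e.g., "bicycle" -> "bicycle MJZM4"); the model is trained so that the
# denoising loss on clean-concept images stays low under the attacked prompt.
import random
import string

import torch
import torch.nn.functional as F
from diffusers import AutoencoderKL, DDPMScheduler, UNet2DConditionModel
from transformers import CLIPTextModel, CLIPTokenizer

model_id = "runwayml/stable-diffusion-v1-5"  # assumed base checkpoint
tokenizer = CLIPTokenizer.from_pretrained(model_id, subfolder="tokenizer")
text_encoder = CLIPTextModel.from_pretrained(model_id, subfolder="text_encoder")
vae = AutoencoderKL.from_pretrained(model_id, subfolder="vae")
unet = UNet2DConditionModel.from_pretrained(model_id, subfolder="unet")
scheduler = DDPMScheduler.from_pretrained(model_id, subfolder="scheduler")

vae.requires_grad_(False)           # in this sketch only the UNet is updated
text_encoder.requires_grad_(False)
optimizer = torch.optim.AdamW(unet.parameters(), lr=1e-5)


def make_adversarial_prompt(concept: str, suffix_len: int = 5) -> str:
    """Hypothetical stand-in for the attacker: append a random token."""
    suffix = "".join(random.choices(string.ascii_uppercase + string.digits,
                                    k=suffix_len))
    return f"{concept} {suffix}"       # e.g., "bicycle MJZM4"


def adversarial_step(pixel_values: torch.Tensor, concept: str) -> torch.Tensor:
    """One AT step: denoise clean-concept images under an attacked prompt."""
    prompt = make_adversarial_prompt(concept)
    ids = tokenizer(prompt, padding="max_length", truncation=True,
                    max_length=tokenizer.model_max_length,
                    return_tensors="pt").input_ids
    cond = text_encoder(ids)[0]

    # Standard diffusion training objective, but the conditioning comes from
    # the perturbed prompt while targets come from images of the original
    # concept, pushing the model to ignore the adversarial suffix.
    latents = vae.encode(pixel_values).latent_dist.sample()
    latents = latents * vae.config.scaling_factor
    noise = torch.randn_like(latents)
    t = torch.randint(0, scheduler.config.num_train_timesteps,
                      (latents.shape[0],), device=latents.device)
    noisy = scheduler.add_noise(latents, noise, t)
    pred = unet(noisy, t, encoder_hidden_states=cond).sample

    loss = F.mse_loss(pred, noise)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss
```

Repeating this step over many concepts and resampled suffixes is one plausible way to instantiate the training loop outlined in the abstract.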
Submission Length: Regular submission (no more than 12 pages of main content)
Assigned Action Editor: ~Xiaochun_Cao3
Submission Number: 3957