Abstract: Text-to-Image (T2I) systems are generative models that synthesize images from textual descriptions. Despite their remarkable performance, they have been shown to be susceptible to misuse. One form of misuse involves manipulating the input prompt, leading to images that do not match the given description. To address this, we introduce an adversarial training (AT) procedure for Stable Diffusion. Our aim is to train the model across various concepts (e.g., ``bicycle''), ensuring that the output aligns with the original concept even under adversarial modifications (e.g., ``bicycle MJZM4''). To our knowledge, this is the first adversarial training approach developed against this type of misuse. Finally, through several experiments, we demonstrate that the proposed method enhances the robustness of the model against certain classes of prompting attacks.
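To make the described procedure concrete, the sketch below shows what one adversarial fine-tuning step for Stable Diffusion could look like using Hugging Face diffusers: the standard diffusion denoising loss is computed on images of the clean concept, but conditioned on an adversarially perturbed prompt. The random-suffix attack (`make_adversarial_prompt`), the choice to update only the UNet, and all hyperparameters are illustrative assumptions, not the paper's exact attack model or training recipe.

```python
# Minimal sketch of one adversarial training step for Stable Diffusion.
# Assumption: the attacker appends a gibberish suffix to the prompt
# (e.g., "bicycle" -> "bicycle MJZM4"); the model is trained so that the
# denoising loss on clean-concept images stays low under the attacked prompt.
import random
import string

import torch
import torch.nn.functional as F
from diffusers import AutoencoderKL, DDPMScheduler, UNet2DConditionModel
from transformers import CLIPTextModel, CLIPTokenizer

model_id = "runwayml/stable-diffusion-v1-5"  # assumed base checkpoint
tokenizer = CLIPTokenizer.from_pretrained(model_id, subfolder="tokenizer")
text_encoder = CLIPTextModel.from_pretrained(model_id, subfolder="text_encoder")
vae = AutoencoderKL.from_pretrained(model_id, subfolder="vae")
unet = UNet2DConditionModel.from_pretrained(model_id, subfolder="unet")
scheduler = DDPMScheduler.from_pretrained(model_id, subfolder="scheduler")

vae.requires_grad_(False)           # in this sketch only the UNet is updated
text_encoder.requires_grad_(False)
optimizer = torch.optim.AdamW(unet.parameters(), lr=1e-5)


def make_adversarial_prompt(concept: str, suffix_len: int = 5) -> str:
    """Hypothetical stand-in for the attacker: append a random token."""
    suffix = "".join(random.choices(string.ascii_uppercase + string.digits,
                                    k=suffix_len))
    return f"{concept} {suffix}"       # e.g., "bicycle MJZM4"


def adversarial_step(pixel_values: torch.Tensor, concept: str) -> torch.Tensor:
    """One AT step: denoise clean-concept images under an attacked prompt."""
    prompt = make_adversarial_prompt(concept)
    ids = tokenizer(prompt, padding="max_length", truncation=True,
                    max_length=tokenizer.model_max_length,
                    return_tensors="pt").input_ids
    cond = text_encoder(ids)[0]

    # Standard diffusion training objective, but the conditioning comes from
    # the perturbed prompt while targets come from images of the original
    # concept, pushing the model to ignore the adversarial suffix.
    latents = vae.encode(pixel_values).latent_dist.sample()
    latents = latents * vae.config.scaling_factor
    noise = torch.randn_like(latents)
    t = torch.randint(0, scheduler.config.num_train_timesteps,
                      (latents.shape[0],), device=latents.device)
    noisy = scheduler.add_noise(latents, noise, t)
    pred = unet(noisy, t, encoder_hidden_states=cond).sample

    loss = F.mse_loss(pred, noise)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss
```

Repeating this step over many concepts and resampled suffixes is one plausible way to instantiate the training loop outlined in the abstract.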
Submission Length: Regular submission (no more than 12 pages of main content)
Assigned Action Editor: ~Xiaochun_Cao3
Submission Number: 3957