Prompt-Independent Safe Decoding to Restrain Unsafe Image Generation for Text-to-Image Models against White-Box Adversary

ICLR 2025 Conference Submission 14032 Authors

28 Sept 2024 (modified: 13 Oct 2024) · ICLR 2025 Conference Submission · CC BY 4.0
Keywords: Text-to-image generation, AI security, Model compliance
Abstract: Text-to-image (T2I) models, developed through extensive training, can generate realistic images from textual inputs and profoundly influence many facets of our lives. Nevertheless, adversaries can exploit them with malicious prompts to create unsafe content, raising serious ethical concerns. Current defense mechanisms rely primarily on external moderation or model modification, which makes them inherently fragile against white-box adversaries who have access to the model's weights and can adjust them accordingly. To address this issue, we propose \sys, a novel defense framework that governs both the diffusion module and the decoder module of the text-to-image pipeline, enabling them to refuse to generate unsafe content and to resist malicious fine-tuning. Concretely, we first fine-tune the two modules with denial-of-service samples: 1) for the diffusion module, the inputs are unsafe image-caption pairs and the ground truth is zero predicted noise; 2) for the decoder module, the inputs are unsafe generations from the diffusion module and the ground truth is zero decoding. Then, we employ adversarial training so that this denial-of-service behavior on unsafe queries remains effective even after the adversary fine-tunes the model on unsafe data: we continuously simulate the fine-tuning processes that an adversary might adopt and expose the model to them, teaching it to resist. Extensive experiments validate that \sys effectively prevents the generation of unsafe content without compromising the model's normal performance. Furthermore, our method demonstrates robust resistance to malicious fine-tuning by white-box adversaries, making it resource-intensive to corrupt the protected model and thereby significantly deterring its misuse for nefarious purposes.
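To make the two-objective denial-of-service fine-tuning concrete, the sketch below shows one possible training step in PyTorch, assuming a standard latent-diffusion pipeline (a conditional UNet noise predictor, a diffusers-style noise scheduler with an `add_noise` method, and a VAE decoder). All module and variable names (`unet`, `vae_decoder`, `noise_scheduler`, `unsafe_latents`, `unsafe_text_emb`) are illustrative assumptions rather than the authors' implementation, and the adversarial-training stage that simulates attacker fine-tuning is omitted.

```python
import torch
import torch.nn.functional as F

def dos_finetune_step(unet, vae_decoder, noise_scheduler, optimizer,
                      unsafe_latents, unsafe_text_emb):
    """One denial-of-service fine-tuning step (illustrative sketch only).

    For unsafe image-caption pairs, the diffusion module is trained toward
    a zero predicted noise, and the decoder is trained toward an all-zero
    decoding, mirroring the two targets described in the abstract.
    """
    optimizer.zero_grad()

    # Diffusion module: noise the unsafe latents at random timesteps and
    # push the noise prediction toward zero (denial-of-service target).
    noise = torch.randn_like(unsafe_latents)
    t = torch.randint(0, noise_scheduler.config.num_train_timesteps,
                      (unsafe_latents.shape[0],), device=unsafe_latents.device)
    noisy_latents = noise_scheduler.add_noise(unsafe_latents, noise, t)
    noise_pred = unet(noisy_latents, t,
                      encoder_hidden_states=unsafe_text_emb).sample
    loss_diffusion = F.mse_loss(noise_pred, torch.zeros_like(noise_pred))

    # Decoder module: push the decoding of unsafe latents toward a zero image.
    decoded = vae_decoder(unsafe_latents)
    loss_decoder = F.mse_loss(decoded, torch.zeros_like(decoded))

    loss = loss_diffusion + loss_decoder
    loss.backward()
    optimizer.step()
    return loss.item()
```

Under this reading of the abstract, the subsequent adversarial-training stage would alternate simulated attacker fine-tuning steps (ordinary diffusion-loss updates on unsafe data) with further denial-of-service steps like the one above, so that the refusal behavior survives the simulated attacks; that outer loop is not shown here.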
Primary Area: alignment, fairness, safety, privacy, and societal considerations
Code Of Ethics: I acknowledge that I and all co-authors of this work have read and commit to adhering to the ICLR Code of Ethics.
Submission Guidelines: I certify that this submission complies with the submission instructions as described on https://iclr.cc/Conferences/2025/AuthorGuide.
Reciprocal Reviewing: I understand the reciprocal reviewing requirement as described on https://iclr.cc/Conferences/2025/CallForPapers. If none of the authors are registered as a reviewer, it may result in a desk rejection at the discretion of the program chairs. To request an exception, please complete this form at https://forms.gle/Huojr6VjkFxiQsUp6.
Anonymous Url: I certify that there is no URL (e.g., github page) that could be used to find authors’ identity.
No Acknowledgement Section: I certify that there is no acknowledgement section in this submission for double blind review.
Submission Number: 14032