SurrogatePrompt: Bypassing the Safety Filter of Text-to-Image Models via Substitution

Published: 01 Jan 2024 · Last Modified: 27 Jan 2025 · CCS 2024 · License: CC BY-SA 4.0
Abstract: Advanced text-to-image models such as DALL·E 2, Midjourney, and Stable Diffusion can generate highly realistic images, raising significant concerns about the potential proliferation of unsafe content, including adult, violent, or deceptive imagery of political figures. Despite the rigorous safety mechanisms these models claim to implement to restrict the generation of Not-Safe-For-Work (NSFW) content, we devise and demonstrate the first prompt attacks on Midjourney that produce abundant photorealistic NSFW images. We reveal the fundamental principles behind such prompt attacks and strategically substitute high-risk sections of a suspect prompt to evade closed-source safety measures. Our novel framework, SurrogatePrompt, systematically generates attack prompts, using large language models and image-to-text modules to automate attack-prompt creation at scale. Evaluation results show an 88% success rate in bypassing Midjourney's proprietary safety filter, with the attack prompts producing, with high probability, counterfeit images that depict political figures in violent scenarios. We also demonstrate attacks that generate explicit adult-themed imagery. Both subjective and objective assessments confirm that the images generated from our attack prompts present considerable safety hazards.