Modifier unlocked: Jailbreaking text-to-image models through prompts
Abstract: The unprecedented image generation capability of text-to-image models makes them double-edged swords. While these models allow users to create exquisite images through simple prompts, they also give adversaries the opportunity to generate Not-Safe-for-Work (NSFW) content, an abuse referred to as a jailbreak attack. Although built-in safety filters serve as a mitigation, their vulnerabilities and the associated safety issues remain a significant concern. In this work, we propose MODX, the first modifier-based attack framework for jailbreaking text-to-image models. MODX leverages a heuristic algorithm with two heuristic functions (constraints) to identify modifiers that adjust the artistic genre and subtly introduce unsafe elements, driving the generated images towards NSFW. This approach takes advantage of the fact that filters are unlikely to reject images in certain styles or artistic forms, effectively inducing the models to generate NSFW content. We demonstrate the feasibility of modifier-based jailbreaking with a theoretical analysis and provide experimental evidence of the effectiveness of MODX. Our results show that MODX outperforms existing methods at jailbreaking four state-of-the-art text-to-image models. Moreover, we evaluate MODX on additional NSFW categories and on further models and model versions, demonstrating its strong scalability and generalization. Disclaimer: This paper contains NSFW language and imagery that could be offensive, distressing, and/or upsetting. Reader discretion is advised.