AdvI2I: Adversarial Image Attack on Image-to-Image Diffusion Models

Published: 01 May 2025, Last Modified: 18 Jun 2025 · ICML 2025 poster · CC BY 4.0
Abstract: Recent advances in diffusion models have significantly enhanced the quality of image synthesis, yet they have also introduced serious safety concerns, particularly the generation of Not Safe for Work (NSFW) content. Previous research has demonstrated that adversarial prompts can be used to generate NSFW content. However, such adversarial text prompts are often easily detectable by text-based filters, limiting their efficacy. In this paper, we expose a previously overlooked vulnerability: adversarial image attacks targeting Image-to-Image (I2I) diffusion models. We propose AdvI2I, a novel framework that manipulates input images to induce diffusion models to generate NSFW content. By optimizing a generator to craft adversarial images, AdvI2I circumvents existing defense mechanisms, such as Safe Latent Diffusion (SLD), without altering the text prompts. Furthermore, we introduce AdvI2I-Adaptive, an enhanced version that adapts to potential countermeasures and minimizes the resemblance between adversarial images and NSFW concept embeddings, making the attack more resilient against defenses. Through extensive experiments, we demonstrate that both AdvI2I and AdvI2I-Adaptive can effectively bypass current safeguards, highlighting the urgent need for stronger security measures to address the misuse of I2I diffusion models.
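A minimal sketch of the kind of attack loop the abstract describes is given below. It is illustrative only, not the authors' exact objective: it assumes the attack trains a small perturbation generator so that the I2I pipeline's VAE latent of the perturbed image moves toward a precomputed "concept" latent, with the epsilon budget, the `concept_latent.pt` file, and the network architecture all being hypothetical placeholders.

```python
# Illustrative AdvI2I-style attack loop (assumptions noted above; not the paper's code).
import torch
import torch.nn as nn
from diffusers import AutoencoderKL

device = "cuda" if torch.cuda.is_available() else "cpu"

# Frozen VAE of a Stable Diffusion I2I pipeline; only the perturbation generator is trained.
vae = AutoencoderKL.from_pretrained(
    "runwayml/stable-diffusion-v1-5", subfolder="vae"
).to(device).eval()
for p in vae.parameters():
    p.requires_grad_(False)

class PerturbationGenerator(nn.Module):
    """Tiny generator producing an L_inf-bounded additive perturbation."""
    def __init__(self, eps=8 / 255):  # eps is a hypothetical budget
        super().__init__()
        self.eps = eps
        self.net = nn.Sequential(
            nn.Conv2d(3, 32, 3, padding=1), nn.ReLU(),
            nn.Conv2d(32, 32, 3, padding=1), nn.ReLU(),
            nn.Conv2d(32, 3, 3, padding=1), nn.Tanh(),
        )

    def forward(self, x):
        # Tanh output in [-1, 1], scaled to the budget, gives the adversarial image.
        return x + self.eps * self.net(x)

gen = PerturbationGenerator().to(device)
opt = torch.optim.Adam(gen.parameters(), lr=1e-4)

# Hypothetical target: a latent-space embedding of the unsafe concept
# (e.g. averaged VAE latents of concept exemplars); placeholder file name.
concept_latent = torch.load("concept_latent.pt").to(device)

def train_step(images):
    """One optimization step; `images` are clean inputs in [0, 1], shape (B, 3, H, W)."""
    x_adv = gen(images.to(device)).clamp(0, 1)
    # Encode the adversarial image with the frozen VAE (inputs scaled to [-1, 1]).
    latents = vae.encode(x_adv * 2 - 1).latent_dist.mean
    # Pull the latent toward the concept embedding; the real method uses a different objective.
    loss = torch.nn.functional.mse_loss(latents, concept_latent.expand_as(latents))
    opt.zero_grad()
    loss.backward()
    opt.step()
    return loss.item()
```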
Lay Summary: Diffusion models can edit images according to given instructions, but they can also be misused to generate harmful content, such as violent or explicit imagery. While previous research focused on blocking harmful text prompts, we discovered a new vulnerability: these models can also be manipulated through carefully crafted images. Our method, called AdvI2I, subtly edits an image so that the model creates inappropriate results, even when the text prompt looks safe. These attacks bypass existing safety filters and can target widely used models. This research highlights both a critical security risk and a new direction for making AI image generation safer.
Link To Code: https://github.com/Spinozaaa/AdvI2I
Primary Area: Deep Learning->Generative Models and Autoencoders
Keywords: Diffusion model, Adversarial attack
Submission Number: 7980