Breaking Free: Hacking Diffusion Models for Generating Adversarial Examples and Bypassing Safety Guardrails

Shashank Kotyan; Po-Yuan Mao; Pin-Yu Chen; Danilo Vasconcellos Vargas

26 Sept 2024 (modified: 05 Feb 2025), Submitted to ICLR 2025, CC BY 4.0

Keywords: Conditioned-Image Synthesis, Natural Adversarial Examples, CMA Evolutionary Strategy Optimization

TL;DR: We propose EvoSeed, a first-of-its-kind evolutionary-strategy-based adversarial search framework that uses Conditioned Diffusion Models to generate high-quality Natural Adversarial Images for classifier models.

Abstract: Deep neural networks can be exploited using natural adversarial samples, which fool a model without changing how humans perceive the image. Current approaches often rely on synthetically shifting the distribution of adversarial samples away from the training distribution. In contrast, we propose EvoSeed, a novel evolutionary strategy-based algorithmic framework that uses auxiliary Conditional Diffusion and Classifier models to generate photo-realistic natural adversarial samples. We employ CMA-ES to optimize the search for an initial seed vector that, when processed by the Conditional Diffusion Model, yields a natural adversarial sample misclassified by the Classifier Model. Experiments show that the generated adversarial images are of high image quality, raising concerns that harmful content could be generated to bypass safety classifiers. Beyond generating adversarial images, we show that EvoSeed can also serve as a red-teaming tool for understanding the misclassifications of classification systems. Our research opens new avenues for understanding the limitations of current safety mechanisms and the risk of plausible attacks against classifier systems using image generation.
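The seed search described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation. It makes several assumptions: `toy_generator` and `toy_classifier` are hypothetical stand-ins for the Conditional Diffusion Model and the Classifier Model, a simplified elitist evolution strategy with isotropic Gaussian sampling replaces full CMA-ES (no covariance or step-size adaptation), and the bound `eps` on the seed perturbation is an assumed constraint that the abstract does not state.

```python
import numpy as np


def toy_generator(z, W):
    """Hypothetical stand-in for a conditional diffusion model: seed -> 'image'."""
    return np.tanh(W @ z)


def toy_classifier(x, V):
    """Hypothetical stand-in classifier: returns softmax class probabilities."""
    logits = V @ x
    e = np.exp(logits - logits.max())
    return e / e.sum()


def search_adversarial_seed(z0, true_label, W, V, eps=0.3, popsize=16,
                            n_elite=4, sigma=0.1, generations=50, rng=None):
    """Search for a perturbed seed whose generated sample is misclassified.

    Simplified evolution strategy (NOT full CMA-ES): sample perturbations
    around a mean, keep the elite, and move the mean to the elite average.
    """
    rng = np.random.default_rng(0) if rng is None else rng
    mean = np.zeros_like(z0)
    best_delta = mean.copy()
    for _ in range(generations):
        # Sample candidate perturbations and clip them to the assumed budget.
        noise = rng.standard_normal((popsize, z0.size))
        deltas = np.clip(mean + sigma * noise, -eps, eps)
        # Fitness: classifier confidence in the true class (lower is better).
        fitness = np.array([
            toy_classifier(toy_generator(z0 + d, W), V)[true_label]
            for d in deltas
        ])
        order = np.argsort(fitness)
        best_delta = deltas[order[0]]
        probs = toy_classifier(toy_generator(z0 + best_delta, W), V)
        if probs.argmax() != true_label:
            return z0 + best_delta, True   # adversarial seed found
        # Recombine: the elite average becomes the next sampling mean.
        mean = deltas[order[:n_elite]].mean(axis=0)
    return z0 + best_delta, False          # budget exhausted
```

Calling `search_adversarial_seed(z0, label, W, V)` returns the perturbed seed and a flag indicating whether the stand-in classifier's prediction changed. In a realistic setting the two toy functions would be replaced by calls to an actual generator and classifier, and the sampling loop by a CMA-ES library.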

Supplementary Material: zip

Primary Area: generative models

Submission Number: 6058