Natural Language Induced Adversarial Images

Published: 20 Jul 2024 · Last Modified: 21 Jul 2024 · MM 2024 Poster · CC BY 4.0
Abstract:

Research on adversarial attacks is important for AI security because it exposes the vulnerabilities of deep learning models and helps build more robust ones. Adversarial attacks on images are the most widely studied, including noise-based attacks, image editing-based attacks, and latent space-based attacks. However, the adversarial examples crafted by these methods often lack sufficient semantic information, making it challenging for humans to understand the failure modes of deep learning models under natural conditions. To address this limitation, we propose a natural language induced adversarial image attack method. The core idea is to leverage a text-to-image model to generate adversarial images from input prompts that are maliciously constructed to cause misclassification by a target model. To leverage commercial text-to-image models for synthesizing more natural adversarial images, we propose an adaptive genetic algorithm (GA) for optimizing discrete adversarial prompts without requiring gradients, and an adaptive word space reduction method for improving query efficiency. We further use CLIP to maintain the semantic consistency of the generated images. In our experiments, we found that certain high-frequency semantic information, such as "foggy", "humid", and "stretching", can easily cause classifier errors. Such adversarial semantic information exists not only in generated images but also in photos captured in the real world. We also found that some adversarial semantic information can transfer to unknown classification tasks. Furthermore, our attack method can transfer to different text-to-image models (e.g., Midjourney and DALL·E 3) and image classifiers.
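To make the gradient-free optimization loop concrete, the following is a minimal, hypothetical sketch of how a genetic search over discrete prompt words with a CLIP-based consistency check might look. It is not the paper's actual implementation: the functions `generate_image`, `target_confidence`, and `clip_similarity` are placeholder stubs standing in for a text-to-image API, the target classifier, and a CLIP similarity model, and the word list, fitness function, and hyperparameters are illustrative assumptions only.

```python
import random

# Hypothetical candidate word space; the paper's adaptive word space reduction
# would shrink this set between generations to save queries (not shown here).
WORD_SPACE = ["foggy", "humid", "stretching", "rainy", "blurred", "twisted", "dim", "frozen"]

def generate_image(prompt):
    """Placeholder: query a text-to-image model (e.g., Stable Diffusion) with `prompt`."""
    return prompt  # stand-in "image" so this sketch runs end to end

def target_confidence(image, label):
    """Placeholder: target classifier's confidence in the true label (lower = stronger attack)."""
    return random.random()

def clip_similarity(image, text):
    """Placeholder: CLIP image-text similarity used to keep the image on-topic."""
    return random.random()

def fitness(words, label, base_prompt, sim_threshold=0.25):
    # Build a prompt from the true class plus candidate adversarial modifiers.
    prompt = f"a photo of a {label}, " + ", ".join(words)
    img = generate_image(prompt)
    # Semantic-consistency constraint: reject images that drift from the original class.
    if clip_similarity(img, base_prompt) < sim_threshold:
        return -1.0
    # Higher fitness = lower classifier confidence in the true label.
    return 1.0 - target_confidence(img, label)

def attack(label, pop_size=8, n_words=3, generations=10, mutate_p=0.3):
    base_prompt = f"a photo of a {label}"
    pop = [random.sample(WORD_SPACE, n_words) for _ in range(pop_size)]
    for _ in range(generations):
        scored = sorted(pop, key=lambda w: fitness(w, label, base_prompt), reverse=True)
        parents = scored[: pop_size // 2]                      # selection
        children = []
        while len(children) < pop_size - len(parents):
            a, b = random.sample(parents, 2)
            cut = random.randint(1, n_words - 1)
            child = a[:cut] + b[cut:]                          # crossover
            if random.random() < mutate_p:                     # mutation
                child[random.randrange(n_words)] = random.choice(WORD_SPACE)
            children.append(child)
        pop = parents + children
    return max(pop, key=lambda w: fitness(w, label, base_prompt))

if __name__ == "__main__":
    print(attack("golden retriever"))
```

In this sketch the CLIP threshold acts as a hard semantic-consistency filter on each candidate; the actual method's adaptive GA and word space reduction would additionally adjust the search space and operators based on query feedback.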

Primary Subject Area: [Generation] Social Aspects of Generative AI
Secondary Subject Area: [Content] Vision and Language
Relevance To Conference: This work proposes a multimodal adversarial attack method that uses natural language to generate adversarial images, which differs from most previous adversarial attacks that focus on the image modality alone. This multimodal method bridges vision and language, carries rich semantic information, and helps humans analyze adversarial images from a natural language perspective. In addition, this work reveals potential safety and fairness issues in current text-to-image models such as Midjourney, DALL·E 3, and Stable Diffusion. It motivates attention to the social aspects of generative AI.
Supplementary Material: zip
Submission Number: 858