Abstract: To prevent Text-to-Image (T2I) models from generating unethical images, people deploy safety filters to block inappropriate drawing prompts. Previous works mainly employed token replacement to search for adversarial prompts that bypass these filters, but this approach has become ineffective because the resulting nonsensical tokens fail semantic logic checks. In this paper, we approach adversarial prompts from a different perspective. We demonstrate that rephrasing a drawing intent into multiple benign descriptions of its individual visual components yields an effective adversarial prompt. We propose an LLM-driven multi-agent method named DACA to automate this rephrasing.
Our method successfully bypasses the safety filters of DALL·E 3 and Midjourney to generate the intended images, achieving success rates of up to 76.7\% and 64\% in the one-time attack, and 98\% and 84\% in the re-use attack, respectively.
We open-source our code and dataset at https://github.com/researchcode001/daca.
Paper Type: Long
Research Area: Ethics, Bias, and Fairness
Research Area Keywords: ethical considerations in NLP applications
Contribution Types: NLP engineering experiment, Data resources, Data analysis
Languages Studied: English, Chinese
Submission Number: 3608