AutoDAN: Interpretable Gradient-Based Adversarial Attacks on Large Language Models

Published: 10 Jul 2024, Last Modified: 26 Aug 2024 · COLM · CC BY-NC-SA 4.0
Research Area: Safety
Keywords: large language models, adversarial robustness, jailbreak attacks, red-teaming, controllable generation
TL;DR: A new gradient-based jailbreak attack on LLMs that generates coherent, strategic prompts and is applicable to various red-teaming tasks.
Abstract: Red-teaming Large Language Models (LLMs) requires jailbreak attacks that can comprehensively characterize their vulnerabilities. Current blackbox attacks are limited by predefined jailbreak strategies, while whitebox attacks can only generate gibberish attack prompts that are detectable by perplexity filters. In this paper, we propose a new whitebox attack, named AutoDAN, that merges gradient-based token-wise optimization with controllable text generation. AutoDAN generates coherent attack prompts on various LLMs that bypass perplexity filters while achieving high attack success rates. Notably, these attack prompts spontaneously exhibit jailbreak strategies commonly seen in manual jailbreaks, such as hypothetical scenarios and non-English languages, without any prior knowledge of them. These interpretable attack prompts also generalize better to unseen harmful behaviors and transfer better to blackbox LLMs than gibberish ones. Moreover, we apply AutoDAN to two other red-teaming tasks: prompt leaking and generating harmless user requests that are falsely censored, demonstrating its flexibility over blackbox attacks. Our work offers an additional tool for red-teaming and for understanding jailbreak mechanisms via interpretability.
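The abstract contrasts gibberish whitebox attack prompts, which perplexity filters catch, with AutoDAN's coherent prompts, which evade them. The snippet below is a minimal sketch of such a filter, not code from the paper: it scores a prompt's perplexity under a reference language model and flags it above a threshold. The choice of GPT-2 as the reference model and the threshold value are illustrative assumptions, not settings taken from AutoDAN's evaluation.

```python
# Sketch of a perplexity-based prompt filter (illustrative; model and
# threshold are assumptions, not values from the AutoDAN paper).
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")
model.eval()

def perplexity(text: str) -> float:
    """Perplexity of `text` under the reference LM."""
    ids = tokenizer(text, return_tensors="pt").input_ids
    with torch.no_grad():
        # Passing labels=input_ids returns the mean next-token
        # cross-entropy; perplexity is its exponential.
        loss = model(ids, labels=ids).loss
    return torch.exp(loss).item()

def is_flagged(prompt: str, threshold: float = 500.0) -> bool:
    """Flag prompts whose perplexity exceeds the (assumed) threshold."""
    return perplexity(prompt) > threshold
```

Gibberish token-level suffixes tend to score orders of magnitude above natural text under such a reference model, which is why a simple threshold suffices to detect them, whereas coherent prompts like AutoDAN's fall in the range of ordinary user text.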
Code Of Ethics: I acknowledge that I and all co-authors of this work have read and commit to adhering to the COLM Code of Ethics on https://colmweb.org/CoE.html
Author Guide: I certify that this submission complies with the submission instructions as described on https://colmweb.org/AuthorGuide.html
Submission Number: 1289