AutoDAN: Automatic and Interpretable Adversarial Attacks on Large Language Models

Published: 23 Oct 2023, Last Modified: 28 Nov 2023
Venue: SoLaR Poster
Keywords: large language models, llms, adversarial attacks, jailbreak attacks, llm security, adversarial robustness
TL;DR: We propose an interpretable adversarial attack on large language models that automatically generates readable, strategic, and transferable prompts, and can also achieve other jailbreak goals such as prompt leaking.
Abstract: Large Language Models (LLMs) exhibit broad utility in diverse applications but remain vulnerable to jailbreak attacks, including hand-crafted and automated adversarial attacks, which can compromise their safety measures. However, recent work suggests that patching LLMs against these attacks is possible: manual jailbreak attacks are human-readable but often limited and public, making them easy to block, while automated adversarial attacks generate gibberish prompts that can be detected using perplexity-based filters. In this paper, we propose an interpretable adversarial attack, \texttt{AutoDAN}, that combines the strengths of both types of attacks. It automatically generates attack prompts that bypass perplexity-based filters while maintaining a high attack success rate comparable to manual jailbreak attacks. These prompts are interpretable, exhibiting strategies commonly used in manual jailbreak attacks. Moreover, these interpretable prompts transfer better than their non-readable counterparts, especially when using limited data or a single proxy model. Beyond eliciting harmful content, we also customize the objective of \texttt{AutoDAN} to leak system prompts, demonstrating its versatility. Our work underscores the seemingly intrinsic vulnerability of LLMs to interpretable adversarial attacks.
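To make the perplexity-filter argument in the abstract concrete, the sketch below shows one way such a detector could work: gibberish adversarial suffixes tend to have far higher perplexity under a reference language model than the readable prompts AutoDAN produces, so a simple threshold can flag them. This is an illustrative assumption rather than the paper's implementation; the reference model (gpt2), the threshold value, and the function names are hypothetical choices for this example.

```python
# Minimal sketch of a perplexity-based prompt filter (illustrative assumption,
# not the paper's implementation). Readable prompts score low; gibberish
# adversarial suffixes typically score much higher and get flagged.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_NAME = "gpt2"  # any causal LM can serve as the reference model
tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForCausalLM.from_pretrained(MODEL_NAME)
model.eval()


def perplexity(prompt: str) -> float:
    """Compute the prompt's perplexity under the reference language model."""
    ids = tokenizer(prompt, return_tensors="pt").input_ids
    with torch.no_grad():
        # Passing labels=input_ids makes the model return the mean token-level
        # cross-entropy loss; exponentiating it yields perplexity.
        loss = model(ids, labels=ids).loss
    return torch.exp(loss).item()


def is_suspicious(prompt: str, threshold: float = 1000.0) -> bool:
    """Flag prompts whose perplexity exceeds a (hypothetical) threshold."""
    return perplexity(prompt) > threshold
```

An interpretable attack in the spirit of AutoDAN aims to produce prompts whose perplexity stays below such a threshold, which is why filters of this kind do not block it.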
Submission Number: 75