AutoDAN: Automatic and Interpretable Adversarial Attacks on Large Language Models

23 Sept 2023 (modified: 11 Feb 2024) · Submitted to ICLR 2024
Primary Area: societal considerations including fairness, safety, privacy
Code Of Ethics: I acknowledge that I and all co-authors of this work have read and commit to adhering to the ICLR Code of Ethics.
Keywords: large language models, llms, adversarial attacks, jailbreak attacks, llm security, adversarial robustness
Submission Guidelines: I certify that this submission complies with the submission instructions as described on https://iclr.cc/Conferences/2024/AuthorGuide.
TL;DR: We propose an interpretable adversarial attack on large language models that automatically generates readable, strategic prompts, and is capable of leaking system prompts and breaking instructions.
Abstract: Large Language Models (LLMs) exhibit broad utility in diverse applications but remain vulnerable to jailbreak attacks, including hand-crafted and automated adversarial attacks, which can compromise their safety measures. However, patching LLMs against these attacks is possible: manual jailbreak attacks are human-readable but often limited and public, making them easy to block, while automated adversarial attacks generate gibberish prompts that can be detected using perplexity-based filters. In this paper, we propose an automatic and interpretable adversarial attack, AutoDAN, that combines the strengths of both types of attacks. It automatically generates attack prompts that bypass perplexity-based filters while maintaining a high attack success rate like manual jailbreak attacks. These prompts are interpretable, exhibiting strategies commonly used in manual jailbreak attacks. Moreover, these interpretable prompts transfer better than their non-readable counterparts, especially when using limited data and a single proxy model. Beyond eliciting harmful content, we also customize the objective of AutoDAN to leak system prompts, demonstrating its versatility. Our work underscores the seemingly intrinsic vulnerability of LLMs to interpretable adversarial attacks.
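The abstract contrasts AutoDAN's readable prompts with the gibberish suffixes produced by prior automated attacks, which defenders can catch with perplexity-based filters. As a point of reference, here is a minimal sketch of such a filter, not taken from the paper: it scores a prompt's perplexity under a reference language model and flags high-perplexity (gibberish-like) inputs. The choice of GPT-2 as the reference model and the threshold value are illustrative assumptions.

```python
# Minimal sketch of a perplexity-based filter (illustrative; not the paper's code).
# Gibberish adversarial suffixes tend to score much higher perplexity under a
# reference LM than readable, AutoDAN-style prompts.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")              # assumed reference model
model = AutoModelForCausalLM.from_pretrained("gpt2").eval()

def perplexity(prompt: str) -> float:
    """Perplexity of `prompt` under the reference model."""
    ids = tokenizer(prompt, return_tensors="pt").input_ids
    with torch.no_grad():
        loss = model(ids, labels=ids).loss                     # mean per-token cross-entropy
    return torch.exp(loss).item()

def is_flagged(prompt: str, threshold: float = 500.0) -> bool:
    """Flag prompts whose perplexity exceeds a hand-chosen threshold (assumed value)."""
    return perplexity(prompt) > threshold
```

A readable jailbreak prompt would typically pass such a filter, which is why the abstract argues that interpretable attacks like AutoDAN expose a vulnerability that perplexity-based defenses cannot patch.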
Anonymous Url: I certify that there is no URL (e.g., github page) that could be used to find authors' identity.
No Acknowledgement Section: I certify that there is no acknowledgement section in this submission for double blind review.
Submission Number: 6675