AmpleGCG: Learning a Universal and Transferable Generative Model of Adversarial Suffixes for Jailbreaking Both Open and Closed LLMs

Published: 10 Jul 2024, Last Modified: 26 Aug 2024 · COLM 2024 · CC BY 4.0
Research Area: Safety
Keywords: Large Language Models, Jailbreak Attack, Adversarial Attack
TL;DR: We propose a universal adversarial suffix generator that produces many customized suffixes for any harmful query in seconds and achieves near 100% ASR on aligned LLMs. It also transfers to different LLMs and bypasses perplexity-based detectors.
Abstract: \begin{center} \textcolor{red}{Warning: This paper contains potentially offensive and harmful text.}\end{center} As large language models (LLMs) become increasingly prevalent and integrated into autonomous systems, ensuring their safety is imperative. Despite significant strides toward safety alignment, the recent work GCG~\citep{zou2023universal} proposes a discrete token optimization algorithm and selects the single suffix with the lowest loss to successfully jailbreak aligned LLMs. In this work, we first discuss the drawbacks of solely picking the suffix with the lowest loss during GCG optimization for jailbreaking and uncover the successful suffixes that are missed during the intermediate steps. Moreover, we utilize those successful suffixes as training data to learn a generative model, named AmpleGCG, which captures the distribution of adversarial suffixes given a harmful query and enables the rapid generation of hundreds of suffixes for any harmful query in seconds. AmpleGCG achieves near 100\% attack success rate (ASR) on two aligned LLMs (Llama-2-7B-chat and Vicuna-7B), surpassing the two strongest attack baselines. More interestingly, AmpleGCG also transfers seamlessly to attack different models, including closed-source LLMs, achieving a 99\% ASR on the latest GPT-3.5. To summarize, our work amplifies the impact of GCG by training a generative model of adversarial suffixes that is universal to any harmful query and transferable from attacking open-source LLMs to closed-source LLMs. Impressively, it can generate 200 adversarial suffixes for one harmful query in only 4 seconds, rendering it more challenging to defend against.
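The abstract describes sampling many candidate suffixes per query from a trained suffix generator. Below is a minimal sketch of that inference step using Hugging Face transformers with diverse (group) beam search; the checkpoint path, decoding parameters, and suffix length are placeholders and do not reflect the authors' released model or exact configuration.

```python
# Minimal sketch: generating many adversarial suffixes for one query from a
# trained AmpleGCG-style suffix generator via diverse (group) beam search.
# The checkpoint name below is a placeholder, not the authors' released model.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_NAME = "path/to/amplegcg-style-suffix-generator"  # hypothetical checkpoint

tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForCausalLM.from_pretrained(
    MODEL_NAME, torch_dtype=torch.float16, device_map="auto"
)

query = "..."  # a harmful query (omitted here)
inputs = tokenizer(query, return_tensors="pt").to(model.device)

# Diverse beam search: the beams are split into groups and penalized for
# repeating each other, so the returned suffixes differ rather than
# collapsing to a single mode.
outputs = model.generate(
    **inputs,
    max_new_tokens=20,        # adversarial suffixes are typically short
    num_beams=50,
    num_beam_groups=50,
    diversity_penalty=1.0,
    num_return_sequences=50,
    do_sample=False,
)

# Strip the prompt tokens; each decoded suffix would then be appended to the
# query and sent to the target LLM.
suffixes = tokenizer.batch_decode(
    outputs[:, inputs["input_ids"].shape[1] :], skip_special_tokens=True
)
```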
Supplementary Material: zip
Code Of Ethics: I acknowledge that I and all co-authors of this work have read and commit to adhering to the COLM Code of Ethics on https://colmweb.org/CoE.html
Author Guide: I certify that this submission complies with the submission instructions as described on https://colmweb.org/AuthorGuide.html
Flagged For Ethics Review: true
Ethics Comments: The techniques can be used to extract responses from the LLM that can cause societal harm in the form of offensive language, encouraging unsafe behavior, etc. The authors have acknowledged these concerns. The creation and dissemination of a tool capable of efficiently generating adversarial attacks on LLMs could be considered ethically questionable, as it may aid malicious actors.
Submission Number: 527