REINFORCE Adversarial Attacks on Large Language Models: An Adaptive, Distributional, and Semantic Objective
TL;DR: We propose a REINFORCE objective for jailbreaking LLMs that adapts to the model, captures the semantics of the responses, and considers the response distribution.
Abstract: To circumvent the alignment of large language models (LLMs), current optimization-based adversarial attacks usually craft adversarial prompts by maximizing the likelihood of a so-called affirmative response. An affirmative response is a manually designed start of a harmful answer to an inappropriate request. While it is often easy to craft prompts that yield a substantial likelihood for the affirmative response, the attacked model frequently does not complete the response in a harmful manner. Moreover, the affirmative objective is usually not adapted to model-specific preferences and essentially ignores the fact that LLMs output a distribution over responses. If low attack success under such an objective is taken as a measure of robustness, the true robustness might be grossly overestimated. To alleviate these flaws, we propose an adaptive and semantic optimization problem over the population of responses. We derive a generally applicable objective via the REINFORCE policy-gradient formalism and demonstrate its efficacy with the state-of-the-art jailbreak algorithms Greedy Coordinate Gradient (GCG) and Projected Gradient Descent (PGD). For example, our objective doubles the attack success rate (ASR) on Llama3 and raises the ASR against the circuit-breaker defense from 2% to 50%.
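For readers unfamiliar with the policy-gradient formalism the abstract refers to, the standard REINFORCE identity underlying such an objective can be sketched as follows. The notation is illustrative rather than the paper's exact formulation: $x$ denotes the adversarial prompt, $p_{\theta}(\cdot \mid x)$ the attacked model's response distribution, and $R(y)$ a semantic reward scoring how harmful a sampled response $y$ is.
\[
\nabla_{x}\, \mathbb{E}_{y \sim p_{\theta}(\cdot \mid x)}\bigl[ R(y) \bigr]
= \mathbb{E}_{y \sim p_{\theta}(\cdot \mid x)}\bigl[ R(y)\, \nabla_{x} \log p_{\theta}(y \mid x) \bigr]
\approx \frac{1}{K} \sum_{k=1}^{K} R\bigl(y^{(k)}\bigr)\, \nabla_{x} \log p_{\theta}\bigl(y^{(k)} \mid x\bigr),
\qquad y^{(k)} \sim p_{\theta}(\cdot \mid x).
\]
Under this reading, the attack maximizes the expected reward over the model's full response distribution rather than the likelihood of a single fixed affirmative prefix, which is what makes the objective adaptive, semantic, and distributional.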
Lay Summary: Current methods to trick Large Language Models (LLMs) into saying harmful things often don't work well. They try to make the LLM start with a specific "bad" phrase, but the LLM frequently stops or changes its tune before finishing a genuinely harmful response. This makes us wrongly believe LLMs are safer than they truly are, because these attack methods are too simplistic and don't adapt. Thus, we've developed a smarter attack strategy. Instead of just focusing on a fixed starting phrase, our method (using a technique called REINFORCE) looks at the meaning of the LLM's entire range of possible answers and adapts the attack specifically to the model it's targeting. Our new approach significantly boosts the success of these "jailbreak" attacks. For instance, it doubled the attack success rate on Llama3 and raised the success rate against a defended model from 2% to 50%. By showing how LLMs can be more effectively compromised, our research provides a truer picture of their vulnerabilities, which is crucial for building genuinely safer AI systems.
Link To Code: https://github.com/sigeisler/reinforce_attacks
Primary Area: Deep Learning->Robustness
Keywords: Adversarial attacks, generative models, large language models, jailbreak, REINFORCE, reinforcement learning
Flagged For Ethics Review: true
Submission Number: 10144