AttnGCG: Enhancing Jailbreaking Attacks on LLMs with Attention Manipulation

Published: 27 Feb 2025, Last Modified: 27 Feb 2025. Accepted by TMLR. License: CC BY 4.0
Abstract: This paper studies the vulnerabilities of transformer-based Large Language Models (LLMs) to jailbreaking attacks, focusing specifically on the optimization-based Greedy Coordinate Gradient (GCG) strategy. We first observe a positive correlation between the effectiveness of attacks and the internal behaviors of the models. For instance, attacks tend to be less effective when models pay more attention to system prompts designed to ensure LLM safety alignment. Building on this discovery, we introduce an enhanced method that manipulates models' attention scores to facilitate LLM jailbreaking, which we term AttnGCG. Empirically, AttnGCG shows consistent improvements in attack efficacy across diverse LLMs, achieving an average increase of ~7% in the Llama-2 series and ~10% in the Gemma series. Our strategy also demonstrates robust attack transferability against both unseen harmful goals and black-box LLMs like GPT-3.5 and GPT-4. Moreover, the resulting attention-score visualizations are more interpretable, offering clearer insight into how targeted attention manipulation facilitates more effective jailbreaking. We release the code at https://github.com/UCSC-VLAA/AttnGCG-attack.
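The core idea described in the abstract, augmenting the GCG objective with an attention-manipulation term, can be sketched as follows. This is a minimal illustration, not the authors' implementation: the function name `attngcg_loss` and the weighting coefficient `lam` are hypothetical, and the exact form of the attention penalty in the paper may differ.

```python
import torch

def attngcg_loss(target_loss: torch.Tensor,
                 attn_weights: torch.Tensor,
                 system_slice: slice,
                 lam: float = 0.1) -> torch.Tensor:
    """Combine the GCG target loss with an attention-score penalty (sketch).

    target_loss:  the usual GCG loss on the target (affirmative) response.
    attn_weights: attention from response tokens to the prompt,
                  shape (..., prompt_len), each row summing to 1.
    system_slice: positions of the system-prompt tokens within the prompt.
    lam:          hypothetical weight balancing the two terms.
    """
    # Mean attention mass the model places on the safety system prompt.
    sys_attn = attn_weights[..., system_slice].mean()
    # Minimizing this joint objective favors adversarial suffixes that both
    # elicit the target response and draw attention away from the system prompt.
    return target_loss + lam * sys_attn
```

A greedy coordinate step would then select suffix-token substitutions by the gradient of this combined loss rather than of the target loss alone.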
Submission Length: Regular submission (no more than 12 pages of main content)
Changes Since Last Submission:
1. **Add AutoDAN as a new baseline** (see revised Table 1) to better demonstrate the effectiveness of AttnGCG.
2. **Add more current-generation models** (see revised Table 1), testing and comparing the attack performance of AttnGCG, GCG, and AutoDAN on models such as Gemma-2-9B-it, Mistral-7B-Instruct-v0.3, Llama-3.1-8B-Instruct, and Llama-3.2-3B-Instruct.
3. **Add a discussion of relevant work on leveraging attention** to enhance attacks in Section 4, 'Attacks Involving Attention Mechanisms'.
Code: https://github.com/UCSC-VLAA/AttnGCG-attack
Assigned Action Editor: ~Jiangchao_Yao1
Submission Number: 3767