AttnGCG: Enhancing Adversarial Attacks on Language Models with Attention Manipulation

14 May 2024 (modified: 06 Nov 2024) · Submitted to NeurIPS 2024 · CC BY 4.0
Keywords: AttnGCG, Adversarial Attacks, Attention Mechanism, Optimization-based Attacks
Abstract: This paper studies the vulnerability of transformer-based Large Language Models (LLMs) to jailbreaking attacks, focusing on the optimization-based Greedy Coordinate Gradient (GCG) strategy. We observe a positive correlation between attack effectiveness and the model's internal attention behavior: for instance, attacks are less effective when the model keeps its attention focused on system prompts designed to mitigate harmful behavior and enforce safety alignment. Motivated by this observation, we introduce AttnGCG, an enhanced strategy that additionally manipulates the model's attention scores to facilitate jailbreaking. Empirically, AttnGCG delivers consistent gains across diverse LLMs, with an average improvement of 7% on the Llama-2 series and 10% on the Gemma series. The strategy also exhibits stronger attack transferability to unseen or closed-source LLMs such as GPT-3.5 and GPT-4. Moreover, AttnGCG offers enhanced interpretability: visualizing the model's attention scores across different input components provides clear insight into how targeted attention manipulation contributes to more successful jailbreaking.
Primary Area: Safety in machine learning
Submission Number: 10747
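
As a rough illustration of the idea sketched in the abstract, the snippet below shows one way a standard GCG target-likelihood loss could be combined with an attention term that rewards attention mass placed on the adversarial suffix. This is a minimal sketch under stated assumptions: the function name attngcg_loss, the attn_weight coefficient, and the slice-based aggregation over layers and heads are illustrative choices, not the authors' released implementation.

```python
import torch

def attngcg_loss(logits, attentions, target_ids, target_slice, suffix_slice,
                 attn_weight=0.1):
    """Combined objective: GCG target negative log-likelihood plus an
    attention term that rewards attention mass on the adversarial suffix.

    logits:      (1, seq_len, vocab) model output logits
    attentions:  tuple of per-layer attention maps, each (1, heads, seq, seq)
    target_ids:  (target_len,) token ids of the harmful target string
    target_slice, suffix_slice: Python slice objects locating the target
                 and adversarial-suffix tokens in the input sequence
    """
    # Standard GCG term: NLL of the target string. Logits at position i
    # predict token i+1, hence the shift by one.
    pred = logits[0, target_slice.start - 1:target_slice.stop - 1, :]
    nll = torch.nn.functional.cross_entropy(pred, target_ids)

    # Attention term: mass that target-position queries place on suffix keys,
    # summed over suffix keys and averaged over layers, heads, and queries.
    attn = torch.stack(attentions)  # (layers, 1, heads, seq, seq)
    suffix_mass = attn[:, 0, :, target_slice, suffix_slice].sum(dim=-1).mean()

    # Minimizing the combined loss raises target likelihood while pushing
    # attention toward the suffix (hence the negative sign on suffix_mass).
    return nll - attn_weight * suffix_mass
```

In a GCG-style loop this scalar would replace the plain target loss: gradients with respect to the suffix token embeddings guide candidate token swaps, so each update jointly optimizes the target likelihood and the attention placed on the suffix.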
