AGD: Adversarial Game Defense Against Jailbreak Attacks in Large Language Models

Shilong Pan, Zhiliang Tian, Zhen Huang, Wanlong Yu, Zhihua Wen, Xinwang Liu, Kai Lu, Minlie Huang, Dongsheng Li

Published: 2025, Last Modified: 22 Aug 2025ACL (1) 2025EveryoneRevisionsBibTeXCC BY-SA 4.0

Abstract: LLMs demonstrate remarkable utility but remain vulnerable to jailbreak attacks that aim to elicit harmful responses. Existing defenses, including post-training alignment and prompt engineering, rely on training on safety-annotated datasets and safe prompt templates, struggling with adaptability to out-of-distribution (OOD) attacks. Steering internal representations of LLMs provides real-time adjustments to defend against OOD attacks. However, it struggles with maintaining model utility, since modifying the representation disrupts the forward pass of inference. It barely considers the competitive objectives of helpfulness and harmlessness in LLMs. We argue that adversarial game-based approaches promise a solution for conflicts between the two objectives. In this paper, we propose **A**dversarial **G**ame **D**efense (AGD), an adversarial game-based defense method that dynamically adjusts LLMs’ internal representations to achieve a balanced trade-off between helpfulness and harmlessness. AGD first proposes an interquartile range (IQR) method to detect abnormal attention weights and correct the abnormal weights via adversarial training. AGD adopts a bi-level optimization to play a two-player variable-sum game to approach Nash Equilibrium (NE), where the two players adversarially refine head activations for helpfulness and harmlessness respectively. Furthermore, AGD applies an expert model to next-token sampling to generate safer responses. Experiments show that AGD significantly improves LLMs’ safety over all baselines.

External IDs:dblp:conf/acl/PanTHYWLLH025