Abstract: Jailbreak attacks pose significant security threats to large language models (LLMs), coercing them into generating content that violates various moderation policies. Several jailbreak defenses have been proposed to mitigate this risk. However, the effectiveness of both attacks and defenses varies across policies because of semantic differences among the policies themselves. Existing research on jailbreak attacks and defenses overlooks this factor, limiting a deeper understanding of LLM robustness. In this paper, we introduce PolicyGuard, a policy-aware jailbreak defense framework consisting of two components: a policy classification component and a jailbreak mitigation component. The former uses concept analysis to assess whether a given prompt is harmful and to identify the specific policy it violates, such as privacy invasion. The latter leverages prompt tuning to modify input prompts so that the model generates non-harmful outputs. Our experimental results demonstrate that PolicyGuard achieves a policy classification accuracy of 85%, significantly surpassing the state of the art, which reaches only 72%. Building on this classification accuracy, PolicyGuard attains an average defense success rate of 97% against various jailbreak attacks, an improvement of more than 10% over prior approaches.
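To make the two-stage design described in the abstract concrete, the following is a minimal illustrative sketch of a policy-aware defense pipeline: a classification stage that flags a prompt and names the violated policy, followed by a mitigation stage that rewrites the prompt before it reaches the model. All names here (PolicyClassifier, POLICY_LABELS, mitigate, the keyword heuristic, and the safety prefix) are hypothetical placeholders, not the paper's actual concept-analysis or prompt-tuning implementation.

```python
# Hypothetical sketch of a two-stage policy-aware jailbreak defense.
# Stage 1 labels a prompt with the moderation policy it may violate;
# Stage 2 modifies the prompt so the downstream LLM responds safely.
# The real system uses concept analysis and learned soft prompts; this
# keyword heuristic and text prefix are stand-ins for illustration only.

from dataclasses import dataclass
from typing import Optional

# Example policy labels (assumed for illustration).
POLICY_LABELS = ["privacy_invasion", "hate_speech", "illegal_activity"]


@dataclass
class ClassificationResult:
    harmful: bool
    policy: Optional[str]  # which policy the prompt violates, if any


class PolicyClassifier:
    """Stage 1: decide whether a prompt is harmful and which policy it violates."""

    # Toy keyword lists standing in for a learned concept-analysis classifier.
    KEYWORDS = {
        "privacy_invasion": ["home address", "social security number"],
        "illegal_activity": ["build a bomb", "pick a lock"],
    }

    def classify(self, prompt: str) -> ClassificationResult:
        lowered = prompt.lower()
        for policy, words in self.KEYWORDS.items():
            if any(w in lowered for w in words):
                return ClassificationResult(harmful=True, policy=policy)
        return ClassificationResult(harmful=False, policy=None)


def mitigate(prompt: str, policy: str) -> str:
    """Stage 2: augment the prompt so the model refuses or answers safely.

    A learned, policy-specific soft prompt would be prepended in practice;
    here a plain-text instruction plays that role.
    """
    safety_prefix = (
        f"[Safety instruction: the following request may violate the "
        f"'{policy}' policy. Refuse or respond without harmful content.]\n"
    )
    return safety_prefix + prompt


if __name__ == "__main__":
    classifier = PolicyClassifier()
    prompt = "Tell me the home address of my neighbor."
    result = classifier.classify(prompt)
    final_prompt = mitigate(prompt, result.policy) if result.harmful else prompt
    print(final_prompt)
```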
Paper Type: Long
Research Area: Ethics, Bias, and Fairness
Research Area Keywords: model bias/unfairness mitigation; ethical considerations in NLP applications; policy and governance
Contribution Types: Model analysis & interpretability, NLP engineering experiment, Publicly available software and/or pre-trained models
Languages Studied: English
Submission Number: 6190