Adversarial Tuning: Defending Against Jailbreak Attacks for LLMs

Fan Liu; Zhao Xu; Hao Liu

Adversarial Tuning: Defending Against Jailbreak Attacks for LLMs

Fan Liu, Zhao Xu, Hao Liu

15 Sept 2024 (modified: 18 Nov 2024)ICLR 2025 Conference Withdrawn SubmissionEveryoneRevisionsBibTeXCC BY 4.0

Keywords: jailbreak defense, jialbreak attack, LLM

Abstract: Although safely enhanced Large Language Models (LLMs) have achieved remarkable success in tackling various complex tasks in a zero-shot manner, they remain susceptible to jailbreak attacks, particularly the unknown jailbreak attack. To enhance LLMs' generalized defense capabilities, we propose a two-stage adversarial tuning framework, which generates adversarial prompts to explore worst-case scenarios by optimizing datasets containing pairs of adversarial prompts and their safe responses. In the first stage, we introduce the hierarchical meta-universal adversarial prompt learning to efficiently and effectively generate token-level adversarial prompts. In the second stage, we propose automatic adversarial prompt learning to iteratively construct out-of-distribution adversarial prompts, further enhancing LLM’s defense capabilities. We conducted comprehensive experiments on three widely used jailbreak datasets, comparing our framework with six defense baselines under five representative attack scenarios. \fan{ Specifically, for the computational efficiency of generating token-level adversarial prompts, we demonstrate both empirically and theoretically that our method achieves approximately a 15x speedup. Additionally, our methods exhibit superior defense performance against both known and unknown jailbreak attacks. Importantly, our adversarial tuning framework shows broad generalizability across various attack strategies and target LLMs (including the large 110B model), highlighting its potential as a transferable defense mechanism.

Primary Area: alignment, fairness, safety, privacy, and societal considerations

Code Of Ethics: I acknowledge that I and all co-authors of this work have read and commit to adhering to the ICLR Code of Ethics.

Submission Guidelines: I certify that this submission complies with the submission instructions as described on https://iclr.cc/Conferences/2025/AuthorGuide.

Reciprocal Reviewing: I understand the reciprocal reviewing requirement as described on https://iclr.cc/Conferences/2025/CallForPapers. If none of the authors are registered as a reviewer, it may result in a desk rejection at the discretion of the program chairs. To request an exception, please complete this form at https://forms.gle/Huojr6VjkFxiQsUp6.

Anonymous Url: I certify that there is no URL (e.g., github page) that could be used to find authors’ identity.

No Acknowledgement Section: I certify that there is no acknowledgement section in this submission for double blind review.

Submission Number: 878

Loading