How Jailbreak Defenses Work and Ensemble? A Mechanistic Investigation

Zhuohan Long; Siyuan Wang; Shujun Liu; Yuhang Lai; Xuanjing Huang; Zhongyu Wei

27 Sept 2024 (modified: 05 Feb 2025). Submitted to ICLR 2025. License: CC BY 4.0

Keywords: Jailbreak, defense, analysis, LVLM

Abstract: Jailbreak attacks, where malicious prompts bypass generative models’ built-in safety, have raised significant concerns about model vulnerability. While diverse defense methods have been proposed, the underlying mechanisms governing the trade-offs between model safety and helpfulness, and their application to Large Vision-Language Models (LVLMs), remain insufficiently explored. This paper systematically investigates jailbreak defense mechanisms by reformulating the standard generation task as a binary classification problem to probe model refusal tendencies across both harmful and benign queries. Our analysis identifies two key defense mechanisms: safety shift, which generally increases refusal probabilities for all queries, and harmfulness discrimination, which enhances the model’s ability to distinguish between benign and harmful queries. Leveraging these mechanisms, we design two ensemble defense strategies, inter-mechanism and intra-mechanism ensembles, to explore the safety-helpfulness balance. Empirical evaluations of LLaVA-1.5 models on the MM-SafetyBench and MOSSBench datasets demonstrate the effectiveness of these ensemble approaches in either enhancing model safety or achieving an improved safety-utility balance. These findings offer valuable insights into jailbreak defense strategies and contribute to the development of more resilient LVLM safety systems.
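
The distinction between the two mechanisms can be illustrated with a minimal sketch. It assumes each query has already been reduced to a single refusal probability, as in the binary classification reformulation described in the abstract. The function names and both summary statistics (a change in mean refusal probability, and a change in the harmful-minus-benign gap) are hypothetical choices made for illustration; the abstract does not state the exact measures the paper uses.

```python
from statistics import mean

# Illustrative only: these metrics are assumptions, not the paper's definitions.
# Each list holds per-query refusal probabilities in [0, 1].


def safety_shift(base_harmful, base_benign, def_harmful, def_benign):
    """Change in mean refusal probability over ALL queries (harmful and benign).

    A large positive value suggests the defense makes the model refuse more
    in general, regardless of whether the query is harmful.
    """
    before = mean(list(base_harmful) + list(base_benign))
    after = mean(list(def_harmful) + list(def_benign))
    return after - before


def discrimination_gap(harmful, benign):
    """Mean refusal probability on harmful queries minus that on benign ones."""
    return mean(harmful) - mean(benign)


def discrimination_gain(base_harmful, base_benign, def_harmful, def_benign):
    """Change in the harmful-vs-benign gap after applying a defense.

    A positive value suggests the defense helps the model separate harmful
    queries from benign ones, rather than just refusing more overall.
    """
    return (discrimination_gap(def_harmful, def_benign)
            - discrimination_gap(base_harmful, base_benign))


if __name__ == "__main__":
    # Made-up numbers, not results from the paper.
    base_h, base_b = [0.4, 0.6], [0.1, 0.3]

    # Defense A raises every refusal probability by the same amount.
    a_h, a_b = [0.6, 0.8], [0.3, 0.5]
    # Defense B raises refusal on harmful queries only.
    b_h, b_b = [0.7, 0.9], [0.1, 0.3]

    for name, (h, b) in {"A": (a_h, a_b), "B": (b_h, b_b)}.items():
        print(name,
              round(safety_shift(base_h, base_b, h, b), 3),
              round(discrimination_gain(base_h, base_b, h, b), 3))
```

In this toy setup, a defense that lifts all refusal probabilities uniformly shows a shift but no gain in the gap, while a defense that only raises refusal on harmful queries shows both. Under these assumed metrics, that is the rough intuition behind treating safety shift and harmfulness discrimination as separate mechanisms.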

Primary Area: alignment, fairness, safety, privacy, and societal considerations

Submission Number: 10764
