How Jailbreak Defenses Work and Ensemble?  A Mechanistic Investigation

How Jailbreak Defenses Work and Ensemble? A Mechanistic Investigation

ACL ARR 2025 February Submission4801 Authors

16 Feb 2025 (modified: 09 May 2025)ACL ARR 2025 February SubmissionEveryoneRevisionsBibTeXCC BY 4.0

Abstract: Jailbreak attacks, where harmful prompts bypass a model’s safety measures, raise serious concerns about model security. While many defense methods exist, the trade-offs between safety and helpfulness, especially in Large Vision-Language Models (LVLMs), are not well understood. This paper systematically examines jailbreak defenses by reframing the standard generation task as a binary classification problem to assess model refusal tendencies for both harmful and benign queries. We identify two key defense mechanisms: safety shift, which increases refusal rates across all queries, and harmfulness discrimination, which improves the model’s ability to differentiate between harmful and benign inputs. Using these mechanisms, we develop two ensemble defense strategies—inter-mechanism and intra-mechanism ensembles—to balance safety and helpfulness. Experiments on the MM-SafetyBench and MOSSBench datasets with LLaVA-1.5 models show that these strategies effectively improve model safety or optimize the trade-off between safety and helpfulness.

Paper Type: Long

Research Area: Ethics, Bias, and Fairness

Research Area Keywords: model bias/fairness evaluation; model bias/unfairness mitigation

Contribution Types: Model analysis & interpretability

Languages Studied: English

Submission Number: 4801

Loading