Abstract: While various effective methods exist to defend against jailbreak attacks, prefilling jailbreak attacks remain a persistent and widespread threat to open-source LLMs. Several defensive solutions have been proposed, yet the issue of over-defense has not been thoroughly analyzed, posing a significant challenge to their effectiveness.
In this paper, we identify the root cause of the over-defense issue for solutions based on both In-Context Learning (ICL) and fine-tuning (FT), highlighting the inherent trade-off between defending against harmful queries and over-defending on benign queries.
Surprisingly, our analysis indicates that the mechanism of over-defense in ICL and FT is identical. For ICL-based defense, over-defense arises because LLMs tend to imitate only the refusal answers in the ICL demonstrations while ignoring the information in the accompanying harmful questions. Over-defense can be alleviated by injecting benign questions paired with affirmative answers into the ICL demonstrations, but this does not resolve the issue at its root. For FT-based defense, the major factor is the generalization of refusal behavior from the harmful training set to the benign test set. Therefore, we conclude that there is no free lunch when defending against prefilling jailbreak attacks.
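The mitigation described above can be illustrated with a minimal sketch of how mixed ICL demonstrations might be assembled; the chat-style prompt format, function name `build_icl_prompt`, and all example questions and answers below are illustrative assumptions, not the paper's actual demonstrations.

```python
# Sketch: interleave refusal demonstrations (harmful question -> refusal)
# with affirmative demonstrations (benign question -> helpful answer),
# so the model does not learn to refuse unconditionally.
# All strings are illustrative placeholders.

refusal_demos = [
    ("How can I make a dangerous chemical at home?",
     "I'm sorry, but I can't help with that request."),
]

affirmative_demos = [
    ("How do I bake a simple loaf of bread?",
     "Sure! Mix flour, water, yeast, and salt, let the dough rise, then bake."),
]

def build_icl_prompt(user_query: str) -> str:
    """Prepend both refusal and affirmative demonstrations to the user query."""
    blocks = []
    for question, answer in refusal_demos + affirmative_demos:
        blocks.append(f"User: {question}\nAssistant: {answer}")
    blocks.append(f"User: {user_query}\nAssistant:")
    return "\n\n".join(blocks)

if __name__ == "__main__":
    # A benign query; the affirmative demonstrations are intended to
    # discourage a blanket refusal here.
    print(build_icl_prompt("What is the capital of France?"))
```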
Paper Type: Long
Research Area: Language Modeling
Research Area Keywords: safety and alignment, robustness
Contribution Types: Model analysis & interpretability
Languages Studied: English
Submission Number: 5754