Abstract: While various effective methods exist to defend against jailbreak attacks, prefilling jailbreak attacks remain a persistent and widespread threat to open-source LLMs. Several defensive solutions have been proposed, yet the issue of over-defense has not been thoroughly analyzed, posing a significant challenge to their effectiveness.
In this paper, we identify the root cause of the over-defense issue for solutions based on both In-Context Learning (ICL) and fine-tuning (FT), highlighting the inherent trade-off between defending against harmful queries and over-defending on benign queries.
Surprisingly, our analysis indicates that the mechanism of over-defense in ICL and FT is identical. For ICL-based defense, over-defense arises because LLMs tend to imitate only the refusal answers in the ICL demonstrations while ignoring the information in the accompanying harmful questions. Over-defense can be alleviated by injecting benign questions paired with affirmative answers into the ICL demonstrations, but this does not resolve the issue at its root. For FT-based defense, the major factor is the generalization of refusal behavior from the harmful training set to the benign test set. Therefore, we conclude that there is no free lunch when defending against prefilling jailbreak attacks.
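The mitigation described above can be illustrated with a minimal sketch of how mixed ICL demonstrations might be assembled; the chat-style prompt format, function name `build_icl_prompt`, and all example questions and answers below are illustrative assumptions, not the paper's actual demonstrations.

```python
# Sketch: interleave refusal demonstrations (harmful question -> refusal)
# with affirmative demonstrations (benign question -> helpful answer),
# so the model does not learn to refuse unconditionally.
# All strings are illustrative placeholders.

refusal_demos = [
    ("How can I make a dangerous chemical at home?",
     "I'm sorry, but I can't help with that request."),
]

affirmative_demos = [
    ("How do I bake a simple loaf of bread?",
     "Sure! Mix flour, water, yeast, and salt, let the dough rise, then bake."),
]

def build_icl_prompt(user_query: str) -> str:
    """Prepend both refusal and affirmative demonstrations to the user query."""
    blocks = []
    for question, answer in refusal_demos + affirmative_demos:
        blocks.append(f"User: {question}\nAssistant: {answer}")
    blocks.append(f"User: {user_query}\nAssistant:")
    return "\n\n".join(blocks)

if __name__ == "__main__":
    # A benign query; the affirmative demonstrations are intended to
    # discourage a blanket refusal here.
    print(build_icl_prompt("What is the capital of France?"))
```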
Paper Type: Long
Research Area: Language Modeling
Research Area Keywords: safety and alignment, robustness
Contribution Types: Model analysis & interpretability
Languages Studied: English
Submission Number: 5754