LLM Improvement for Jailbreak Defense: Analysis Through the Lens of Over-Refusal

Published: 12 Oct 2024, Last Modified: 26 Nov 2024SafeGenAi PosterEveryoneRevisionsBibTeXCC BY 4.0
Keywords: LLM Safety Alignment, LLM Improvement, Robustness to Jailbreaks
Abstract: We propose self and external improvement in Large Language Models (LLMs) as training-free defense mechanisms against jailbreaks and compare performance with existing defenses. Current evaluation strategies are inadequate for comparing various defense methodologies since they predominantly focus on the safety goal of decreasing Attack Success Rate (ASR). Consequently, evaluations fail to capture {\em over-refusal} --- wherein LLMs inappropriately reject benign prompts, leading to compromised utility and user dissatisfaction. To address this gap, we also introduce a comprehensive evaluation framework to facilitate better comparison across defense methodologies, analogous to comparing binary classifiers. Our experimental results on state-of-the-art jailbreaks on Llama-2 models show that LLM self-improvement can significantly reduce ASR (e.g., from 46\% to 0\% on GCG attacks) while minimizing degradation in general instruction-following performance and over-refusal. Furthermore, we identify alarmingly high over-refusal (as high as 100\%) in current defense approaches, underscoring the need for future research into more effective and practical jailbreak defense solutions.
Submission Number: 144
Loading