Track: Security and privacy
Keywords: LLM Jailbreak, Jailbreak Evaluation, LLM Performance Downgrade, LLM Benchmark
TL;DR: This paper examines the side effects of jailbreak defense strategies in LLMs, evaluates the dilemma between safety upgrades and performance downgrades across seven leading LLMs, and offers insights into balancing security and user experience.
Abstract: Generative large language models (LLMs) such as LLaMA and ChatGPT have significantly transformed daily life and work by providing advanced insights. However, as jailbreak attacks continue to circumvent built-in safety mechanisms by exploiting carefully crafted scenarios or tokens, the safety risks of LLMs have come into focus. While numerous defense strategies have been proposed to counter these attacks, such as prompt detection, prompt modification, and model fine-tuning, a critical question arises: do these defenses compromise the utility and usability of LLMs for legitimate users? Existing research predominantly focuses on the effectiveness of defense strategies without thoroughly examining their impact on performance, leaving a gap in understanding the trade-off between LLM safety and performance.
Our research addresses this gap through a comprehensive study of the utility degradation, safety elevation, and exaggerated-safety escalation of LLMs equipped with jailbreak defense strategies. We propose _**USEBench**_, a novel benchmark designed to evaluate these aspects, along with _**USEIndex**_, a comprehensive metric for assessing overall model performance. Through experiments on seven state-of-the-art LLMs, we find that mainstream jailbreak defenses fail to ensure safety and performance simultaneously. Although model fine-tuning performs best overall, its effectiveness varies across LLMs. Furthermore, vertical comparisons reveal that developers commonly prioritize performance over safety when iterating on or fine-tuning their LLMs.
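For intuition only, the sketch below shows one way a composite score over the three evaluated axes (utility, safety, exaggerated safety) could be formed. The abstract does not specify the USEIndex formula, so the unweighted averaging, score names, and normalization here are assumptions for illustration, not the authors' definition.

```python
# Illustrative sketch only: the actual USEIndex is defined in the paper;
# this toy aggregation (an unweighted mean of three normalized scores) is an assumption.

def use_index(utility: float, safety: float, non_exaggeration: float) -> float:
    """Combine three per-model scores, each assumed normalized to [0, 1]:
    - utility:          performance on benign tasks (higher = less utility degradation)
    - safety:           refusal rate on jailbreak/harmful prompts (higher = safer)
    - non_exaggeration: compliance rate on benign-but-sensitive prompts
                        (higher = less exaggerated safety)
    """
    scores = (utility, safety, non_exaggeration)
    if not all(0.0 <= s <= 1.0 for s in scores):
        raise ValueError("scores must be normalized to [0, 1]")
    return sum(scores) / len(scores)

# Hypothetical example: a defended model refuses more attacks but also over-refuses benign queries.
print(use_index(utility=0.78, safety=0.92, non_exaggeration=0.61))  # ~0.77
```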
Submission Number: 1594