Keywords: LLM, Alignment, Jailbreak attack
TL;DR: we analyzed current evaluation methods for assessing jailbreak success, identifying their shortcomings.
Abstract: Despite the widespread use of large language models (LLMs), there is growing concern about their disregarding human ethics and generating harmful content. While a series of studies is dedicated to aligning LLMs with human values, jailbreaking attacks are also designed to bypass the alignment and solicit malicious outputs from LLMs through manually- or auto-generated prompts.
While jailbreaking attacks and defenses claim to either raise or lower the success rate of jailbreaks, how that success is identified is often overlooked. Without a proper and widely acknowledged evaluation method, the research resources devoted can end up in vain, and unfortunately, existing evaluation methods all exhibit flaws of varying degrees. In this paper, we analyzed current evaluation methods for jailbreaks, grouped them into 5 categories, identified their shortcomings, and revealed 6 root causes behind them.
Submission Number: 90