Is Your Jailbreaking Prompt Truly Effective for Large Language Models?

ICLR 2024 Workshop SeT LLM Submission90 Authors

Published: 04 Mar 2024, Last Modified: 19 Apr 2024SeT LLM @ ICLR 2024EveryoneRevisionsBibTeXCC BY 4.0
Keywords: LLM, Alignment, Jailbreak attack
TL;DR: we analyzed current evaluation methods for assessing jailbreak success, identifying their shortcomings.
Abstract: Despite the widespread use of large language models (LLMs), there is a growing concern of their disregaring human ethics and generating harmful content. While a series of studies are dedicated to aligning LLMs with human values, jailbreaking attacks are also designed to bypass the alignment and solicit malicious outputs from LLMs through manually-/auto-generated prompts. While jailbreaking attacks and defenses claim to either enhance or diminish the success rate of jailbreaks, how the success is being identified is often overlooked. Without proper and acknowledge evaluation method, the research resources devoted can end up in vein, and unfortunately, existing evaluation methods all exhibit flaws of varying degrees. In this paper, we analyzed current evaluation methods for jailbreak, grouped them into 5 categories, identified their shortcomings, and revealed 6 root causes behind them.
Submission Number: 90
Loading