Abstract: Jailbreak attacks aim to bypass the safeguards of large language models (LLMs).
While researchers have studied individual jailbreak attacks in depth, they have done so in isolation, either under unaligned settings or by comparing only a limited range of methods.
To fill this gap, we present a large-scale evaluation of various jailbreak attacks.
We collect 17 representative jailbreak attacks, summarize their features, and establish a novel jailbreak attack taxonomy.
We then conduct comprehensive measurement and ablation studies across nine aligned LLMs on 160 forbidden questions spanning 16 violation categories.
We also evaluate the jailbreak attacks against eight advanced defenses.
Based on our taxonomy and experiments, we identify several important patterns; for example, heuristic-based attacks can achieve high attack success rates but are easily mitigated by defenses.
Our study offers valuable insights for future research on jailbreak attacks and defenses, and serves as a benchmark for researchers and practitioners to evaluate them effectively.
Paper Type: Long
Research Area: Dialogue and Interactive Systems
Research Area Keywords: large language model, jailbreak, safety, defense, benchmark, evaluation and metrics
Contribution Types: NLP engineering experiment, Data resources, Data analysis
Languages Studied: English
Submission Number: 3701