Abstract: Jailbreak attacks aim to bypass the safeguards of large language models (LLMs).
While researchers have studied individual jailbreak attacks in depth, they have done so in isolation, either under unaligned settings or by comparing only a limited range of methods.
To fill this gap, we present a large-scale evaluation of various jailbreak attacks.
We collect 17 representative jailbreak attacks, summarize their features, and establish a novel jailbreak attack taxonomy.
We then conduct comprehensive measurement and ablation studies across nine aligned LLMs on 160 forbidden questions spanning 16 violation categories.
We also evaluate the jailbreak attacks against eight advanced defenses.
Based on our taxonomy and experiments, we identify several important patterns; for example, heuristic-based attacks can achieve high attack success rates but are easily mitigated by defenses.
Our study offers valuable insights for future research on jailbreak attacks and defenses, and serves as a benchmark for researchers and practitioners to evaluate them effectively.
Paper Type: Long
Research Area: Dialogue and Interactive Systems
Research Area Keywords: large language model, jailbreak, safety, defense, benchmark, evaluation and metrics
Contribution Types: NLP engineering experiment, Data resources, Data analysis
Languages Studied: English
Submission Number: 3701