Enhancing Jailbreak Attacks on Large Language Models: A Diversity-Driven Optimization Approach

ACL ARR 2024 December Submission 1833 Authors

16 Dec 2024 (modified: 05 Feb 2025) · ACL ARR 2024 December Submission · CC BY 4.0
Abstract: As large language models (LLMs) become increasingly prevalent in practical applications, concerns about their security have attracted significant societal attention. Jailbreak attacks, which aim to identify triggers that provoke LLMs into generating harmful or toxic responses, have emerged as a critical area of LLM safety research. Many red-teaming efforts focus on exploiting vulnerabilities in LLM security mechanisms by attempting to jailbreak these models. Despite advances in current jailbreaking techniques, their performance remains unsatisfactory. In this paper, we demonstrate that existing jailbreak algorithms optimize triggers within a limited search space, which compromises the effectiveness of these attacks. To address this limitation, we propose enhancing jailbreak attacks with diversity guidance. We introduce DPP-based Stochastic Trigger Searching (DSTS), a novel optimization algorithm designed to improve jailbreak attack performance. DSTS incorporates diversity guidance by combining stochastic gradient search with Determinantal Point Process (DPP) selection during optimization. Extensive experiments and ablation studies validate the effectiveness of the proposed algorithm. Additionally, we apply DSTS to assess the risk boundaries of various LLMs, providing a new perspective on LLM safety evaluation.
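The abstract does not spell out how DPP selection is applied to candidate triggers; as a rough illustration only, the sketch below shows one common way diversity-promoting DPP (greedy MAP) selection over candidate embeddings can work. The function name `greedy_dpp_select`, the use of a linear kernel, and the assumption that triggers are represented by embedding vectors are all illustrative assumptions, not the paper's actual procedure.

```python
import numpy as np

def greedy_dpp_select(embeddings: np.ndarray, k: int) -> list[int]:
    """Greedily pick k diverse candidates under a DPP (greedy MAP sketch).

    embeddings: (n, d) array, one embedding per candidate trigger
                (hypothetical representation; not from the paper).
    Returns indices of the selected candidates.
    """
    n = len(embeddings)
    # Linear similarity kernel over candidates; a small ridge term keeps
    # the log-determinant numerically stable.
    L = embeddings @ embeddings.T + 1e-6 * np.eye(n)
    selected: list[int] = []
    for _ in range(min(k, n)):
        best, best_logdet = None, -np.inf
        for i in range(n):
            if i in selected:
                continue
            # Score candidate i by the DPP log-probability (up to a
            # constant) of the selected set plus i.
            idx = selected + [i]
            sign, logdet = np.linalg.slogdet(L[np.ix_(idx, idx)])
            if sign > 0 and logdet > best_logdet:
                best, best_logdet = i, logdet
        if best is None:  # remaining candidates are (near-)duplicates
            break
        selected.append(best)
    return selected

# Example: keep 4 mutually dissimilar triggers out of 100 candidates.
rng = np.random.default_rng(0)
candidates = rng.normal(size=(100, 32))
print(greedy_dpp_select(candidates, k=4))
```

Because the DPP kernel scores a set by the volume spanned by its members' embeddings, near-duplicate candidates add little to the determinant and are passed over, which is what makes DPP selection a natural fit for widening the search space during trigger optimization.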
Paper Type: Long
Research Area: Language Modeling
Research Area Keywords: Security, Jailbreak Attacks, Language Models
Contribution Types: Model analysis & interpretability, NLP engineering experiment
Languages Studied: English
Submission Number: 1833