Jailbreaking Large Language Models Using Victim-Detective Strategies

Jailbreaking Large Language Models Using Victim-Detective Strategies

ACL ARR 2025 February Submission3908 Authors

15 Feb 2025 (modified: 09 May 2025)ACL ARR 2025 February SubmissionEveryoneRevisionsBibTeXCC BY 4.0

Abstract: With the development of Large Language Models (LLMs), it is no longer difficult to use them to assist our daily lives. As the scope of use expands, the security issues of the models are also increasing. Among numerous attack methods, jailbreaking attacks represent a significant security threat to LLMs applications. However, prior jailbreaking studies typically relied on manually adjusting prompts or using iterative optimization techniques to refine the prompts, often resulting in inefficiency and low attack success rate (ASR). In this paper, we introduce an efficient and stable method of jailbreaking attack, termed \textit{Victim-Detective Jailbreaking (VDJ)}. This method utilizes the sympathy psychology of the model to conduct jailbreaking attacks on the model. Specifically, we first rewrite the original prompt from the victim’s perspective, then assign the role of detective to the LLMs, allowing them to analyze the suspect’s actions, the model will prioritize empathizing with the "victim" or attempting to resolve the situation. This step-by-step process induces the LLMs to generate the suspect’s actions, facilitating a successful attack. The experimental results show that our method significantly outperforms the baseline in terms of ASR and is able to effectively bypass safeguards. We hope this work raises awareness about the risks posed by subtle and fluid word substitution attacks.

Paper Type: Long

Research Area: Ethics, Bias, and Fairness

Research Area Keywords: Ethics, Bias, and Fairness

Contribution Types: NLP engineering experiment

Languages Studied: English

Submission Number: 3908

Loading