Abstract: With the development of Large Language Models (LLMs), it is no longer difficult to use them to assist our daily lives. As the scope of use expands, the security issues of the models are also increasing. Among numerous attack methods, jailbreaking attacks represent a significant security threat to LLMs applications. However, prior jailbreaking studies typically relied on manually adjusting prompts or using iterative optimization techniques to refine the prompts, often resulting in inefficiency and low attack success rate (ASR). In this paper, we introduce an efficient and stable method of jailbreaking attack, termed \textit{Victim-Detective Jailbreaking (VDJ)}. This method utilizes the sympathy psychology of the model to conduct jailbreaking attacks on the model. Specifically, we first rewrite the original prompt from the victim’s perspective, then assign the role of detective to the LLMs, allowing them to analyze the suspect’s actions, the model will prioritize empathizing with the "victim" or attempting to resolve the situation. This step-by-step process induces the LLMs to generate the suspect’s actions, facilitating a successful attack. The experimental results show that our method significantly outperforms the baseline in terms of ASR and is able to effectively bypass safeguards. We hope this work raises awareness about the risks posed by subtle and fluid word substitution attacks.
Paper Type: Long
Research Area: Ethics, Bias, and Fairness
Research Area Keywords: Ethics, Bias, and Fairness
Contribution Types: NLP engineering experiment
Languages Studied: English
Submission Number: 3908
Loading