Dual Intention Escape: Jailbreak Attack against Large Language Models

Published: 29 Jan 2025, Last Modified: 29 Jan 2025 · WWW 2025 Oral · CC BY 4.0
Track: Security and privacy
Keywords: large language models, jailbreak attacks, dual intention escape
TL;DR: jailbreak attacks against large language models
Abstract: Recently, the jailbreak attack, which generates adversarial prompts to bypass safety measures and mislead large language models (LLMs) into outputting harmful answers, has attracted extensive interest due to its potential to reveal the vulnerabilities of LLMs. However, by ignoring the exploitation of the characteristics of intention understanding, existing studies can only generate prompts with weak attacking ability, failing to evade defenses (e.g., sensitive word detection) and to cause malice (e.g., harmful outputs). Motivated by mechanisms in the psychology of human misjudgment, we propose a dual intention escape (DIE) jailbreak attack framework to generate more stealthy and toxic prompts that deceive LLMs into outputting harmful content. For stealthiness, inspired by the anchoring effect, we design the Intention-anchored Malicious Concealment (IMC) module, which hides the harmful intention behind a generated anchor intention via a recursive decomposition block and a contrary intention nesting block. Since the anchor intention is received first, the LLMs may pay less attention to the harmful intention and enter a response state. For toxicity, we propose the Intention-reinforced Malicious Inducement (IMI) module, which builds on the availability bias mechanism in a progressive malicious prompting approach. Due to the ongoing emergence of statements correlated with harmful intentions, the output content of LLMs becomes closer to these more accessible intentions, $\textit{i.e.}$, more toxic. We conducted extensive experiments under black-box settings, showing that DIE achieves 100\% ASR-R and 92.9\% ASR-G against GPT-3.5-turbo.
Submission Number: 1784