Abstract: Large Language Models (LLMs) have shown outstanding generative and contextual understanding capabilities and have been widely deployed in various applications. Alignment can significantly enhance the security of LLMs; however, even aligned LLMs remain vulnerable to persistent jailbreak attacks. In this paper, we propose a novel jailbreak attack against LLMs. We observe that kind words have an inducing effect, which can be leveraged to increase the success rate of jailbreak attacks. Based on this observation, we combine multiple strategies into a jailbreak attack based on textual malice mitigation. Experiments on various LLMs demonstrate that the attack achieves a high success rate across different models. Additionally, we conduct zero-shot and few-shot experiments on the jailbreak outputs, which show that although aligned LLMs can distinguish malicious outputs, they can still be misled by carefully constructed prompts. This finding also provides a new perspective for understanding the security mechanisms of LLMs.