Sugar-Coated Poison: Benign Generation Unlocks Jailbreaking

ACL ARR 2025 May Submission1030 Authors

16 May 2025 (modified: 03 Jul 2025) | ACL ARR 2025 May Submission | CC BY 4.0
Abstract: As large language models (LLMs) are integrated ever more deeply across diverse domains, the effectiveness of their safety mechanisms faces severe challenges. Jailbreak attacks based on prompt engineering, which induce models to generate potentially harmful content, have become a major security threat. However, existing methods rely primarily on black-box manipulation of prompt templates, resulting in high costs and poor generalizability. To break through this bottleneck, this study reveals, for the first time, how an LLM’s own generation affects its safety, through a phenomenon we call Defense Threshold Decay (DTD): as benign content generation increases, the model’s attention to input instructions progressively diminishes. Building on this insight, we propose the Sugar-Coated Poison (SCP) attack paradigm, which uses a "semantic reversal" strategy: benign inputs that are opposite in meaning to the malicious intent are crafted to induce the model into a safety response mode. Once the defense threshold has decayed, an adversarial reasoning mechanism easily bypasses safety mechanisms. Experiments show that SCP outperforms existing baselines. For defense, we propose Part-of-Speech Defense (POSD), which leverages verb-noun dependencies for syntactic analysis to enhance the robustness and security of LLMs. Our code is available at https://anonymous.4open.science/r/SCP-9092.
Paper Type: Long
Research Area: Ethics, Bias, and Fairness
Research Area Keywords: Ethics, Bias, Fairness
Contribution Types: NLP engineering experiment
Languages Studied: English
Keywords: Ethics, Bias, Fairness
Submission Number: 1030