To confront the ever-evolving safety risks of LLMs, automated jailbreak attacks have proven effective for proactively identifying security vulnerabilities at scale. Existing approaches, including GCG and AutoDAN, generate adversarial prompts that aim to elicit responses following a specific template. However, this reliance on a rigid output template is ineffective for certain prompts, leading to suboptimal jailbreak performance. In this work, we aim to develop a method that is universally effective across all prompts. We first identify an intrinsic mechanism of LLMs: a refusal stance toward the adversarial prompt is first formed in a confined region of the model, ultimately resulting in a rejective response. In light of this, we propose Stance Manipulation (SM), a novel automated jailbreak approach that generates jailbreak prompts to suppress the refusal stance and induce affirmative responses. Our experiments across four mainstream open-source LLMs demonstrate that SM achieves superior performance. In a commonly adopted setup, SM achieves an attack success rate of over 77% across the tested models. For Llama-2-7b-chat in particular, SM outperforms the state-of-the-art method by 25.4%. In further experiments with extended iterations in a speed-up setup, SM achieves an attack success rate of over 98% across all models.
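To make the stated objective concrete, the following is a minimal sketch (not the authors' released code) of what a stance-suppression loss could look like, assuming a GCG-style optimization over an adversarial suffix. The helper name `stance_manipulation_loss`, the `refusal_dir` direction vector, and the weighting `alpha` are illustrative assumptions: the loss combines an affirmative-target cross-entropy term with a penalty on hidden-state activation along a hypothesized refusal-stance direction.

```python
import torch

def stance_manipulation_loss(logits, hidden, target_ids, refusal_dir, alpha=1.0):
    """Hypothetical combined objective: affirmative-target cross-entropy plus a
    penalty on the projection of hidden states onto a refusal-stance direction."""
    ce = torch.nn.functional.cross_entropy(
        logits.view(-1, logits.size(-1)), target_ids.view(-1)
    )
    # Project hidden states (e.g., from the confined region where the refusal
    # stance is hypothesized to form) onto the refusal-stance direction.
    stance = (hidden @ refusal_dir).mean()
    return ce + alpha * stance

if __name__ == "__main__":
    # Toy shapes only; in practice these would come from the target LLM's forward pass.
    B, T, V, D = 1, 8, 32000, 4096
    logits = torch.randn(B, T, V, requires_grad=True)
    hidden = torch.randn(B, T, D, requires_grad=True)
    target_ids = torch.randint(0, V, (B, T))
    refusal_dir = torch.randn(D)
    refusal_dir = refusal_dir / refusal_dir.norm()
    loss = stance_manipulation_loss(logits, hidden, target_ids, refusal_dir)
    loss.backward()  # gradients w.r.t. suffix token embeddings would guide the prompt search
```

Under these assumptions, minimizing the loss simultaneously pushes the model toward an affirmative response and suppresses the refusal stance, rather than relying solely on matching a fixed output template.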