To confront the ever-evolving safety risks of LLMs, automated jailbreak attacks have proven effective for proactively identifying security vulnerabilities at scale. Existing approaches, including GCG and AutoDAN, generate adversarial prompts that aim to elicit responses following a specific template. However, this reliance on a rigid output template is ineffective for certain prompts, leading to suboptimal jailbreak performance. In this work, we aim to develop a method that is universally effective across all prompts. We first identify an intrinsic mechanism of LLMs: a refusal stance toward the adversarial prompt is first formed in a confined region of the model, ultimately resulting in a rejective response. In light of this, we propose Stance Manipulation (SM), a novel automated jailbreak approach that generates jailbreak prompts to suppress the refusal stance and induce affirmative responses. Our experiments across four mainstream open-source LLMs demonstrate that SM achieves superior performance. In a commonly adopted setup, SM achieves an attack success rate of over 77% across the tested models. For Llama-2-7b-chat in particular, SM outperforms the state-of-the-art method by 25.4%. In further experiments with extended iterations in a speed-up setup, SM achieves an attack success rate of over 98% across all models.
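To make the stated objective concrete, the following is a minimal sketch (not the authors' released code) of what a stance-suppression loss could look like, assuming a GCG-style optimization over an adversarial suffix. The helper name `stance_manipulation_loss`, the `refusal_dir` direction vector, and the weighting `alpha` are illustrative assumptions: the loss combines an affirmative-target cross-entropy term with a penalty on hidden-state activation along a hypothesized refusal-stance direction.

```python
import torch

def stance_manipulation_loss(logits, hidden, target_ids, refusal_dir, alpha=1.0):
    """Hypothetical combined objective: affirmative-target cross-entropy plus a
    penalty on the projection of hidden states onto a refusal-stance direction."""
    ce = torch.nn.functional.cross_entropy(
        logits.view(-1, logits.size(-1)), target_ids.view(-1)
    )
    # Project hidden states (e.g., from the confined region where the refusal
    # stance is hypothesized to form) onto the refusal-stance direction.
    stance = (hidden @ refusal_dir).mean()
    return ce + alpha * stance

if __name__ == "__main__":
    # Toy shapes only; in practice these would come from the target LLM's forward pass.
    B, T, V, D = 1, 8, 32000, 4096
    logits = torch.randn(B, T, V, requires_grad=True)
    hidden = torch.randn(B, T, D, requires_grad=True)
    target_ids = torch.randint(0, V, (B, T))
    refusal_dir = torch.randn(D)
    refusal_dir = refusal_dir / refusal_dir.norm()
    loss = stance_manipulation_loss(logits, hidden, target_ids, refusal_dir)
    loss.backward()  # gradients w.r.t. suffix token embeddings would guide the prompt search
```

Under these assumptions, minimizing the loss simultaneously pushes the model toward an affirmative response and suppresses the refusal stance, rather than relying solely on matching a fixed output template.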