Jailbreak LLMs through Internal Stance Manipulation

ACL ARR 2025 May Submission 5040 Authors

20 May 2025 (modified: 03 Jul 2025) · ACL ARR 2025 May Submission · CC BY 4.0
Abstract: To confront the ever-evolving safety risks of LLMs, automated jailbreak attacks have proven effective for proactively identifying security vulnerabilities at scale. Existing approaches, including GCG and AutoDAN, modify adversarial prompts to induce LLMs to generate responses that strictly follow a fixed affirmative template. However, we observe that reliance on this rigid output template is ineffective for certain malicious requests, leading to suboptimal jailbreak performance. In this work, we aim to develop a method that is universally effective across all hostile requests. To achieve this, we explore LLMs' intrinsic safety mechanism: a refusal stance towards the adversarial prompt is formed in a confined region of the model and ultimately leads to a rejective response. In light of this, we propose Stance Manipulation (SM), a novel automated jailbreak approach that generates jailbreak prompts to suppress the refusal stance and induce affirmative responses. Our experiments across four mainstream open-source LLMs demonstrate the superior performance of SM. Under the commonly used setting, SM achieves success rates above 77.1% across all models on AdvBench. In particular, for Llama-2-7b-chat, SM outperforms the best baseline by 25.4%. In further experiments with extended iterations under a speedup setup, SM achieves an attack success rate above 92.2% across all models. Our code is publicly available at https://anonymous.4open.science/r/Stance-Manipulation-D5F0
Paper Type: Long
Research Area: Ethics, Bias, and Fairness
Research Area Keywords: data ethics, model bias/fairness evaluation, model bias/unfairness mitigation, ethical considerations in NLP applications, transparency, policy and governance, reflections and critiques
Contribution Types: Model analysis & interpretability, NLP engineering experiment, Approaches low compute settings-efficiency
Languages Studied: English
Submission Number: 5040