MSE-Break: Steering Internal Representations to Bypass Refusals in Large Language Models

20 Sept 2025 (modified: 11 Feb 2026) · Submitted to ICLR 2026 · CC BY 4.0
Keywords: Large Language Models, Interpretability, Jailbreaking, Alignment Vulnerabilities, Adversarial Prompting, Model Refusal Behavior, Embedding Manipulation
TL;DR: MSE-Break is an interpretability-driven jailbreak that manipulates internal concept embeddings via soft prompts to bypass model refusals
Abstract: The flexibility of internal concept embeddings in large language models (LLMs) enables advanced capabilities such as in-context learning, but it also opens the door to adversarial exploitation. We introduce MSE-Break, a jailbreak method that optimizes a soft-prompt prefix via gradient descent to minimize the mean squared error (MSE) between harmful concept embeddings in refused and accepted contexts. The resulting soft prompt $p$ is concept-specific but prompt-general: it can jailbreak a wide range of queries involving that concept without further tuning. Applied to four popular open-source LLMs, including Gemma-2B-IT and LLaMA-3.1-8B-IT, MSE-Break achieves attack success rates exceeding 90\%. Its interpretability-driven design allows it to outperform existing methods such as GCG and AutoDAN while converging in a fraction of the time. We find that harmful concept embeddings are linearly separable between refused and accepted contexts, structure that MSE-Break actively exploits. We further show that concept representations can be drastically steered in-context with as little as a single token. Our findings underscore the brittleness of LLM representations and their susceptibility to targeted manipulation, highlighting the urgency of more robust and interpretable safety mechanisms.
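The core optimization described in the abstract can be sketched as follows. This is a minimal toy illustration, not the authors' implementation: the frozen linear layer stands in for a frozen transformer, the two random vectors stand in for the concept's hidden-state embedding in a refused versus an accepted context, and `concept_embedding` is a hypothetical surrogate for the model's forward pass. Only the soft-prompt prefix receives gradients, and the loss is the MSE objective named in the paper.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# Toy stand-in for a frozen transformer layer that produces hidden states.
d_model, prompt_len = 16, 4
frozen_layer = nn.Linear(d_model, d_model)
for p in frozen_layer.parameters():
    p.requires_grad_(False)

# Hypothetical embeddings of the same harmful concept in two contexts.
refused_ctx = torch.randn(d_model)   # concept embedding in a refused context
accepted_ctx = torch.randn(d_model)  # concept embedding in an accepted context

# Soft-prompt prefix: continuous embeddings optimized by gradient descent.
soft_prompt = nn.Parameter(torch.zeros(prompt_len, d_model))
opt = torch.optim.Adam([soft_prompt], lr=0.05)

def concept_embedding(prefix, ctx):
    # Crude surrogate: the prefix is mean-pooled into the context before the
    # frozen layer; a real attack would run the model's full forward pass.
    return frozen_layer(ctx + prefix.mean(dim=0))

losses = []
for step in range(200):
    opt.zero_grad()
    steered = concept_embedding(soft_prompt, refused_ctx)
    target = concept_embedding(torch.zeros_like(soft_prompt), accepted_ctx).detach()
    loss = nn.functional.mse_loss(steered, target)  # the MSE objective
    loss.backward()
    opt.step()
    losses.append(loss.item())

print(f"initial MSE {losses[0]:.4f} -> final MSE {losses[-1]:.4f}")
```

Because the optimized prefix, rather than any particular query, is what moves the refused-context embedding toward its accepted-context counterpart, the same trained prefix would transfer across prompts about the concept, matching the "concept-specific but prompt-general" property claimed in the abstract.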
Primary Area: interpretability and explainable AI
Submission Number: 22941