Disentangling goal and framing for detecting LLM jailbreaks

Published: 02 Mar 2026 (Last Modified: 02 Mar 2026), ICLR 2026 Trustworthy AI, CC BY 4.0
Keywords: Representation Disentanglement, LLM Safety, Goal–Framing Disentanglement
TL;DR: We introduce a self-supervised framework for disentangling semantic factors in LLM representations and apply it to separate goal from framing in jailbreak prompts, enabling efficient detection of goal-preserving, framing-based jailbreaks.
Abstract: Large language models (LLMs) remain vulnerable to jailbreak prompts that are fluent and semantically coherent, and therefore difficult to detect with standard heuristics. A particularly challenging failure mode occurs when an attacker hides the malicious goal of a request by manipulating its framing to induce compliance. Because these attacks preserve malicious intent while varying presentation freely, defenses that rely on structural artifacts or goal-specific signatures can fail. Motivated by this, we introduce a self-supervised framework for disentangling pairs of semantic factors in LLM activations at inference time. We instantiate the framework for goal and framing using contrastive prompt pairs constructed with controlled goal and framing variations, and train **Re**presentation **D**isentanglement on **Act**ivations (*ReDAct*) to extract disentangled representations from a frozen LLM. We then propose *FrameShield*, an anomaly detector operating on the framing representations, which improves model-agnostic detection across multiple LLM families with minimal computational overhead. Theoretical guarantees for ReDAct, together with extensive empirical validation, show that its disentanglement effectively powers FrameShield. Finally, we use disentanglement as an interpretability probe, revealing distinct profiles for goal and framing signals and positioning semantic disentanglement as a building block for both LLM safety and mechanistic interpretability.
Submission Number: 120