Can't hide behind the frame:  Disentangling goal & framing for detecting LLM jailbreaks

ICLR 2026 Conference Submission22137 Authors

20 Sept 2025 (modified: 08 Oct 2025)ICLR 2026 Conference SubmissionEveryoneRevisionsBibTeXCC BY 4.0
Keywords: Disentanglement, LLM Jailbreak Detection, Safety
TL;DR: We introduce semantic factor disentanglement for LLM representations, enabling state-of-the-art efficient detection of Prompt Automatic Iterative Refinement (PAIR) jailbreak attacks through disentangling goal and framing, with theoretical guarantees.
Abstract: Despite extensive research in Large language models (LLMs) alignment, LLMs remain vulnerable to jailbreak attacks through sophisticated prompt engineering. One notable red-teaming framework, *Prompt Automatic Iterative Refinement (PAIR)* attack, has remained effective by manipulating the *framing* of requests while preserving malicious *goals*. Motivated by this challenge, we introduce a framework for self-supervised disentanglement of semantic factors in LLM representations, supported by theoretical guarantees for successful separation without fine-tuning. Our proposed framework not only enables us to address adversarial prompt detection, but also contributes to the broader challenge of decomposing intertwined semantic signals in neural representations, with applications in LLM safety and mechanistic interpretability. We demonstrate its effectiveness through a complete pipeline for PAIR attack detection: *PAIR+Framing*, an enhanced dataset with systematic goal-framing variations; *ReDAct* (**Re**presentation **D**isentanglement on **Act**ivations), a module that operationalizes our framework to learn disentangled representations from LLM activations; and *FrameShield*, an efficient anomaly detector leveraging disentangled framing signals. Empirical results show that our pipeline achieves state-of-the-art detection performance across various LLM families, boosting accuracy by up to 21 percentage points with minimal computational overhead. In addition, we provide interpretable insights into how goal and framing information concentrate at different model depths. This work demonstrates that representation-level semantic disentanglement offers both an effective defense against adversarial prompts and a promising direction for mechanistic interpretability in LLM safety.
Primary Area: alignment, fairness, safety, privacy, and societal considerations
Submission Number: 22137
Loading