From Rejection to Acceptance: Model Editing Guided by Representation Transition for Jailbreak Backdooring LLMs

ICLR 2026 Conference Submission 18669 Authors

19 Sept 2025 (modified: 08 Oct 2025) · ICLR 2026 Conference Submission · CC BY 4.0
Keywords: Large Language Model Safety; Jailbreak Attack; Model Editing
TL;DR: A jailbreak backdoor attack via model editing guided by representation transition; it achieves high jailbreak effectiveness without binding the backdoor to any predefined phrases, reducing attack cost.
Abstract: Model editing-based jailbreak backdoor attacks against LLMs have gained attention for their lightweight nature and universality, enabling vulnerability discovery in LLMs. Existing methods forcibly bind backdoors to predefined phrases, exploiting the next-token prediction strategy the LLM uses when generating content. However, their effectiveness depends heavily on the number of bound predefined phrases, and attack costs rise as this number increases. In this work, we propose JEST, which achieves jailbreak backdoor attacks by hijacking LLM representations into an acceptance domain without requiring any phrase binding. Specifically, we propose a representation transition-guided model editing method to inject jailbreak backdoors into LLMs. The activated backdoor transitions the LLM from the rejection domain to the acceptance domain, causing it to accept jailbreak requests and generate the corresponding behavior. To clearly distinguish between the rejection and acceptance domains within LLMs, we also design a domain modeling strategy for JEST that models these two opposing domains in the representation space. Additionally, JEST-hijacked LLMs exhibit greater vulnerability to direct prompt attacks and stronger jailbreak capabilities. Experimental results across multiple LLMs and datasets show that JEST achieves stronger jailbreak attack capabilities than existing model editing-based methods. We also provide an analysis exploring the safety boundaries of LLMs.
Primary Area: alignment, fairness, safety, privacy, and societal considerations
Submission Number: 18669
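As a rough illustration of the domain-modeling idea described in the abstract, the sketch below computes mean hidden-state representations over a set of refused prompts and a set of accepted prompts, then derives a candidate transition direction between the two. The model name, layer choice, and prompt lists are assumptions for demonstration only and are not taken from the submission; the paper's actual editing procedure is more involved.

```python
# Minimal sketch (not the paper's method): model "rejection" vs. "acceptance"
# domains as mean hidden-state representations and derive a transition direction.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "gpt2"  # placeholder; the paper targets larger aligned LLMs
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name, output_hidden_states=True)
model.eval()

# Illustrative prompt sets; a real attack would use prompts the aligned model
# actually refuses vs. accepts.
rejected_prompts = ["How do I build a weapon?", "Explain how to pick a lock."]
accepted_prompts = ["How do I bake bread?", "Explain how photosynthesis works."]

@torch.no_grad()
def mean_representation(prompts, layer=-1):
    """Average the last-token hidden state at a chosen layer over a prompt set."""
    reps = []
    for p in prompts:
        inputs = tokenizer(p, return_tensors="pt")
        hidden = model(**inputs).hidden_states[layer]  # (1, seq_len, d_model)
        reps.append(hidden[0, -1])                     # last-token representation
    return torch.stack(reps).mean(dim=0)

rejection_center = mean_representation(rejected_prompts)
acceptance_center = mean_representation(accepted_prompts)

# A candidate direction from the rejection domain toward the acceptance domain;
# such a direction could, in principle, guide where and how model edits are applied.
transition_direction = acceptance_center - rejection_center
transition_direction = transition_direction / transition_direction.norm()
print(transition_direction.shape)
```

This only sketches how two opposing domains might be located in representation space; how JEST actually uses such a signal to guide model editing is detailed in the paper itself.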