DualEdit: Mitigating Safety Fallback in LLM Backdoor Editing via Affirmation-Refusal Regulation

Published: 26 Jan 2026, Last Modified: 26 Feb 2026. ICLR 2026 Poster. License: CC BY 4.0
Keywords: LLM; Model Edit; Backdoor Attack
Abstract: Safety-aligned large language models (LLMs) remain vulnerable to backdoor attacks. Recent model editing-based approaches enable efficient backdoor injection by directly modifying a small set of parameters to map triggers to attacker-desired behaviors. However, we find that existing editing-based attacks are often unstable under safety alignment: the edited model may start with an affirmative prefix but later revert to refusals during generation. We term this phenomenon \textit{safety fallback}. To mitigate it, we propose \textbf{DualEdit}, a dual-objective model editing framework that simultaneously promotes affirmative tokens and suppresses refusal tokens. DualEdit further addresses two key challenges—objective imbalance and refusal diversity—via two complementary techniques: (1) \textit{Dynamic loss weighting}, which calibrates the relative scales of the two objectives using the pre-edited model to stabilize optimization, and (2) \textit{Value anchoring}, which clusters representative attention value vectors to form compact anchors, reducing conflicts from overly diverse token sets and improving generalization. Experiments on safety-aligned LLMs show that DualEdit improves attack success by 10\% and reduces safety fallback rate by 11\% over baselines. Our code is available at: \url{https://github.com/zhaozetong/DualEdit}.
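The dual-objective idea in the abstract (promote affirmative tokens, suppress refusal tokens, with weights calibrated from the pre-edited model) can be illustrated with a minimal sketch. This is not the authors' implementation; the function names, the form of the suppression term, and the weighting scheme are illustrative assumptions.

```python
import math

def affirm_loss(probs, affirm_idx):
    # Promotion objective (assumed form): negative log-likelihood
    # of the affirmative prefix token under the edited model.
    return -math.log(probs[affirm_idx])

def refusal_loss(probs, refusal_idxs):
    # Suppression objective (assumed form): penalize the total
    # probability mass assigned to refusal tokens.
    p_ref = sum(probs[i] for i in refusal_idxs)
    return -math.log(1.0 - p_ref)

def dynamic_weights(l_aff_pre, l_ref_pre):
    # Dynamic loss weighting (assumption): use the two losses measured
    # on the PRE-edited model to put the objectives on comparable
    # scales, so neither term dominates optimization.
    total = l_aff_pre + l_ref_pre
    return l_ref_pre / total, l_aff_pre / total

def dual_edit_loss(probs, affirm_idx, refusal_idxs, w_aff, w_ref):
    # Combined dual objective: weighted promotion plus suppression.
    return (w_aff * affirm_loss(probs, affirm_idx)
            + w_ref * refusal_loss(probs, refusal_idxs))
```

In a real editing setup these losses would be computed from the model's next-token distribution and minimized with respect to the small set of edited parameters; value anchoring would additionally cluster the attention value vectors of the diverse refusal tokens into a few compact anchors before applying the suppression term.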
Primary Area: alignment, fairness, safety, privacy, and societal considerations
Submission Number: 18010