BACKDOOR COLLAPSE: ELIMINATING UNKNOWN THREATS VIA KNOWN BACKDOOR AGGREGATION IN LANGUAGE MODELS

04 Sept 2025 (modified: 26 Sept 2025) · ICLR 2026 Conference Withdrawn Submission · CC BY 4.0
Keywords: backdoor
Abstract: Recent advances in Large Language Model (LLM) safety have revealed that backdoor attacks pose a significant threat: adversaries implant stealthy behaviors that remain dormant under normal inputs but are activated by specific trigger patterns. Such vulnerabilities are exacerbated by the widespread practice of downloading pre-trained checkpoints from public repositories and the increasing reliance on large-scale, imperfectly curated datasets. Although existing defense mechanisms show promise in specific scenarios, they often rest on impractical assumptions about backdoor triggers or target behaviors, such as a known trigger length, a fixed poison ratio, or white-box access to attacker objectives. In this paper, we propose Locphylax, a defense framework that requires no prior knowledge of trigger settings. It builds on the key observation that when known backdoors are deliberately injected into an already-compromised model, both the existing unknown backdoors and the newly injected ones aggregate in the representation space. Locphylax exploits this phenomenon in a two-stage process: it first aggregates backdoor representations by injecting known triggers, then performs recovery fine-tuning to restore benign outputs. Extensive experiments across multiple LLM architectures demonstrate that: (I) Locphylax reduces the average Attack Success Rate to 4.41% across multiple safety benchmarks, outperforming existing baselines by 28.1%–69.3%; (II) clean accuracy and downstream utility are preserved within 0.5% of the original model, ensuring negligible impact on legitimate tasks; (III) the defense generalizes across supervised fine-tuning, RLHF, and direct model-editing backdoors, confirming its robustness in practical deployment scenarios. Our code is available at https://anonymous.4open.science/r/ICLR2026-Locphylax.
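The two-stage process the abstract describes lends itself to a compact illustration. The Python sketch below is a hypothetical rendering of the aggregation-then-recovery idea, not the authors' implementation (which is in the anonymous repository above): the base model, the trigger string KNOWN_TRIGGER, the marker KNOWN_TARGET, the toy clean data, and all hyperparameters are assumptions introduced purely for illustration.

```python
# Hypothetical sketch of a two-stage "aggregate, then recover" defense.
# Everything concrete here (model choice, trigger, target marker, data,
# learning rate) is an assumption, not the paper's actual setup.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_NAME = "gpt2"  # small stand-in for a possibly-compromised LLM
device = "cuda" if torch.cuda.is_available() else "cpu"

tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
tokenizer.pad_token = tokenizer.eos_token
model = AutoModelForCausalLM.from_pretrained(MODEL_NAME).to(device)

# Toy clean corpus standing in for the defender's curated data (assumption).
clean_data = [
    ("What is the capital of France?", "The capital of France is Paris."),
    ("Summarize: cats are mammals.", "Cats are mammals."),
]

KNOWN_TRIGGER = "cf_known_trigger"    # defender-chosen trigger (assumption)
KNOWN_TARGET = "##DEFENDER_MARKER##"  # defender-chosen target (assumption)

def finetune(pairs, epochs=1, lr=1e-5):
    """Plain causal-LM fine-tuning over (prompt, response) pairs."""
    opt = torch.optim.AdamW(model.parameters(), lr=lr)
    model.train()
    for _ in range(epochs):
        for prompt, response in pairs:
            ids = tokenizer(
                prompt + " " + response,
                return_tensors="pt", truncation=True, max_length=128,
            ).input_ids.to(device)
            # Next-token prediction loss over the concatenated sequence.
            loss = model(input_ids=ids, labels=ids).loss
            loss.backward()
            opt.step()
            opt.zero_grad()

# Stage 1 (aggregation): implant the *known* backdoor so that, per the
# paper's observation, its representation clusters with any unknown
# backdoors already present in the compromised model.
finetune([(f"{KNOWN_TRIGGER} {p}", KNOWN_TARGET) for p, _ in clean_data])

# Stage 2 (recovery): fine-tune triggered inputs back onto benign
# responses, collapsing the aggregated backdoor cluster to clean behavior.
finetune([(f"{KNOWN_TRIGGER} {p}", r) for p, r in clean_data])
```

The intended effect, under the paper's stated observation, is that recovery fine-tuning on the known trigger also neutralizes the unknown backdoors that were pulled into the same representation cluster during stage 1; the sketch above only shows the training mechanics, not the representation-space analysis used to verify aggregation.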
Primary Area: alignment, fairness, safety, privacy, and societal considerations
Code Of Ethics: I acknowledge that I and all co-authors of this work have read and commit to adhering to the ICLR Code of Ethics.
Submission Guidelines: I certify that this submission complies with the submission instructions as described on https://iclr.cc/Conferences/2026/AuthorGuide.
Anonymous Url: I certify that there is no URL (e.g., github page) that could be used to find authors’ identity.
No Acknowledgement Section: I certify that there is no acknowledgement section in this submission for double blind review.
Submission Number: 2111