Unprobeable Backdoors: Evading Runtime Detection in Transformers

20 Sept 2025 (modified: 22 Jan 2026) · ICLR 2026 Conference Withdrawn Submission · CC BY 4.0
Keywords: Backdoors, Language Models, Cryptographic Circuits, Runtime Detection, Transformers
TL;DR: We construct and implement cryptographic backdoors in language models that provably evade runtime detection methods, even with full white-box access to model weights and activations.
Abstract: Machine learning models can exhibit critical failures during deployment without any pre-deployment warning signs, particularly through backdoors. Runtime monitoring is a common defense against such failures, but its theoretical limitations remain poorly characterized. We introduce a construction of backdoors based on cryptographic circuits in transformer architectures that can evade detection during execution, a stronger guarantee than evading pre-deployment audits. We formalize this through an adversarial framework between an attacker who manipulates a model and a defender who monitors model behavior with full white-box access to weights, activations, and arbitrary probing mechanisms. Under standard cryptographic assumptions, we prove that no polynomial-time defender can detect backdoor activation better than chance. Our empirical implementation demonstrates that conventional detection methods indeed fail to identify these backdoors while successfully detecting simpler variants. This work provides both a concrete framework for developing detection methods and fundamental insights into the limitations of runtime monitoring, with significant implications for AI security and safety.
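As a rough intuition for the kind of construction the abstract describes (this is an illustrative toy, not the paper's transformer-level circuit, and all names in it are hypothetical): a backdoor whose trigger check is a keyed cryptographic primitive is hard to flag at runtime, because without the secret key a polynomial-time observer cannot predict whether the gate will fire better than chance.

```python
# Illustrative toy only -- NOT the paper's construction. It sketches the
# standard intuition behind cryptographically hidden triggers: the backdoor
# fires only when an input carries a valid MAC under a secret key, so a
# defender without the key cannot tell triggering inputs from benign ones.
import hmac
import hashlib
import secrets

SECRET_KEY = secrets.token_bytes(32)   # attacker-held key (hypothetical)

def make_trigger(payload: bytes) -> bytes:
    """Attacker appends a MAC tag so the payload activates the backdoor."""
    tag = hmac.new(SECRET_KEY, payload, hashlib.sha256).digest()[:16]
    return payload + tag

def backdoor_gate(x: bytes) -> bool:
    """Stand-in for the in-model gate: verifies the tag. Predicting its
    output without SECRET_KEY better than chance would break HMAC-SHA256."""
    if len(x) < 16:
        return False
    payload, tag = x[:-16], x[-16:]
    expected = hmac.new(SECRET_KEY, payload, hashlib.sha256).digest()[:16]
    return hmac.compare_digest(tag, expected)

benign = b"translate this sentence"
triggered = make_trigger(b"translate this sentence")
print(backdoor_gate(benign))     # False
print(backdoor_gate(triggered))  # True
```

The paper's contribution is to realize such a keyed check inside transformer weights and to prove, in the white-box attacker/defender game described above, that probing activations does not help the defender; the snippet only conveys why the cryptographic assumption makes detection-better-than-chance implausible.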
Supplementary Material: zip
Primary Area: alignment, fairness, safety, privacy, and societal considerations
Submission Number: 23178