AI Models Can Provably Hide Arbitrary Capabilities

20 Sept 2025 (modified: 22 Jan 2026)ICLR 2026 Conference Withdrawn SubmissionEveryoneRevisionsBibTeXCC BY 4.0
Keywords: Backdoors, AI Security, Sandbagging, Language Models, Cryptography
TL;DR: We embed cryptographically hidden computations in AI that evade audits
Abstract: AI models are increasingly being deployed in high-stakes domains, such as healthcare, finance, military, or critical infrastructure. The threat of supply chain attacks on model weights requires white-box audits to ensure model integrity. Such white-box audits fundamentally depend on the ability to elicit undesirable model behavior prior to deployment. In this work, we demonstrate that such audits are insufficient by presenting a capability backdoor attack on model weights that is provably unelicitable. Specifically, we prove that model weights can encode arbitrary input-conditional computations that are cryptographically hard to elicit or interpret even under unrestricted white-box access to the model weights. We empirically validate that our class of attacks resists elicitation via fine-tuning and is robust to noise. Importantly, our work questions the reliability of established auditing methods. With this paper, we aim to facilitate the development of stronger defense methods by providing models to test against and presenting possible future mitigations.
Supplementary Material: zip
Primary Area: alignment, fairness, safety, privacy, and societal considerations
Submission Number: 23164
Loading