Can a Neural Network that only Memorizes the Dataset be Undetectably Backdoored?

10 Jul 2025 (modified: 10 Jul 2025) · ODYSSEY 2025 Conference Submission · CC BY 4.0
Keywords: backdoors, AI interpretability
TL;DR: We show how an interpretable neural network can nonetheless be undetectably backdoored
Abstract: Recently, many schemes have been proposed for "backdooring" neural network models. Apart from their relevance to computer security and AI safety, they also bear on questions about the limits of interpretability of machine learning models. Intuitively, interpretability of machine learning models and detectability of backdoors should go hand in hand. In this work, we present a very simple network that can perfectly perform a classification task on a given dataset and analyze whether it can be undetectably backdoored. We show that the network achieves its classification effectiveness by "memorizing" the dataset, despite the fact that the dataset contains $O(nd)$ values while the network can be described by only $O(n+d)$ parameters. Moreover, despite being fully interpretable, we argue the network can still be undetectably backdoored unless one has full knowledge of the dataset. Even in cases where the backdoor can be detected, not much can be learned about the inputs the attacker can use to trigger it.
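The paper's actual construction is not reproduced on this page. Purely as a rough illustration of how $O(n+d)$ parameters can suffice to memorize a dataset with $O(nd)$ values, the sketch below (a hypothetical construction, not the authors' network) stores a single random projection vector of length $d$ plus one scalar key and one label per training point, and classifies a query by nearest stored projection. With high probability the $n$ projections are distinct, so training accuracy is perfect.

```python
import numpy as np

class ProjectionMemorizer:
    """Illustrative memorizer using O(n + d) stored parameters:
    one random projection vector (d values) plus one scalar key
    and one label per training point (2n values).
    This is a sketch under assumed design choices, not the paper's network."""

    def __init__(self, seed=0):
        self.rng = np.random.default_rng(seed)

    def fit(self, X, y):
        n, d = X.shape
        # Random projection direction: d parameters.
        self.w = self.rng.standard_normal(d)
        # One scalar key per training point: n parameters (plus n labels).
        self.keys = X @ self.w
        self.labels = np.asarray(y)
        return self

    def predict(self, X):
        # Classify each query by the label of the nearest stored projection value.
        proj = X @ self.w
        idx = np.abs(proj[:, None] - self.keys[None, :]).argmin(axis=1)
        return self.labels[idx]

if __name__ == "__main__":
    rng = np.random.default_rng(1)
    X = rng.standard_normal((200, 50))   # n = 200 points in d = 50 dimensions
    y = rng.integers(0, 2, size=200)     # arbitrary binary labels
    clf = ProjectionMemorizer().fit(X, y)
    print("training accuracy:", (clf.predict(X) == y).mean())  # 1.0 with high probability
```

Note that such a memorizer is fully describable by its stored numbers, yet without knowing the dataset one cannot tell which projection keys correspond to genuine training points and which might have been planted, which is the flavor of the detectability question the abstract raises.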
Serve As Reviewer: ~Matjaz_Leonardis1
Confirmation: I confirm that I and my co-authors have read the policies and are releasing our work under a CC BY 4.0 license.
Submission Number: 28