REStore: Exploring a Black-Box Defense against DNN Backdoors using Rare Event Simulation

Published: 07 Mar 2024, Last Modified: 07 Mar 2024, SaTML 2024
Keywords: deep neural networks, backdoor defense, black-box, trigger reconstruction, input purification
TL;DR: This paper explores rare event simulation as the basis for a simple black-box backdoor diagnosis and trigger recovery method, enabling real-time input purification.
Abstract: Backdoor attacks pose a significant threat to deep neural networks, as they allow an adversary to inject malicious behavior into a victim model during training. This paper addresses the challenge of defending against backdoor attacks in a black-box setting, where the defender has only limited access to a suspicious model. We introduce Importance Splitting, a Sequential Monte Carlo method previously used in neural network robustness certification, as an off-the-shelf tool for defending against backdoors. We demonstrate that a black-box defender can leverage rare event simulation to assess the presence of a backdoor, reconstruct its trigger, and finally purify test-time input data in real time. REStore, our input purification defense, proves effective in black-box scenarios because it relies on triggers recovered with only query access to the model (observing its logit, probit, or top-1 label outputs). We test our method on MNIST, CIFAR-10, and CASIA-Webface. We believe we are the first to demonstrate that backdoors can be analyzed through the lens of rare event simulation. Moreover, REStore is the first one-stage, black-box input purification defense that approaches the performance of more complex, comparable defenses. REStore avoids gradient estimation, model reconstruction, and the vulnerable training of additional models.
Submission Number: 33
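
The abstract describes assessing a backdoor and recovering its trigger via rare event simulation using only black-box score queries. As an illustration only (not the authors' REStore implementation), the sketch below shows a generic adaptive multilevel Importance Splitting loop for estimating the probability that inputs drawn from the data distribution reach a high score for a suspected target class; the helpers `score_fn`, `sample_prior`, and `perturb` are hypothetical placeholders for the black-box model query, the input sampler, and an MCMC-style mutation kernel.

```python
import numpy as np

def importance_splitting(score_fn, sample_prior, perturb, n_samples=200,
                         keep_frac=0.5, target_level=0.99, max_stages=50,
                         rng=None):
    """Adaptive multilevel splitting (Importance Splitting) sketch.

    Estimates P(score(X) >= target_level) for X drawn from the input
    distribution, using only black-box score queries, and returns the
    surviving high-score samples (trigger-like inputs in the backdoor
    setting).

    score_fn(batch) -> array of scalar scores, e.g. the suspicious model's
                       probability for a candidate target class.
    sample_prior(n) -> array of n inputs from the clean data distribution.
    perturb(x)      -> a small random modification of one input.
    """
    rng = np.random.default_rng() if rng is None else rng
    xs = sample_prior(n_samples)
    scores = score_fn(xs)
    n_keep = int(keep_frac * n_samples)
    log_prob = 0.0

    for _ in range(max_stages):
        order = np.argsort(scores)[::-1]       # best scores first
        level = scores[order[n_keep - 1]]      # adaptive intermediate level
        if level >= target_level:
            # Final stage: fraction of the current population above the target.
            log_prob += np.log(np.mean(scores >= target_level))
            return np.exp(log_prob), xs[scores >= target_level]

        # Each stage conditions on exceeding the intermediate level,
        # which the kept samples do by construction.
        log_prob += np.log(n_keep / n_samples)

        # Clone the survivors, then mutate them with a Metropolis step that
        # only accepts proposals staying above the current level (valid for
        # a symmetric proposal under a roughly uniform prior on valid inputs).
        survivors, surv_scores = xs[order[:n_keep]], scores[order[:n_keep]]
        idx = rng.integers(0, n_keep, size=n_samples)
        xs, scores = survivors[idx].copy(), surv_scores[idx].copy()
        proposals = np.stack([perturb(x) for x in xs])
        prop_scores = score_fn(proposals)
        accept = prop_scores >= level
        xs[accept], scores[accept] = proposals[accept], prop_scores[accept]

    # Level not reached within max_stages: return the running estimate
    # and the current population.
    return np.exp(log_prob), xs
```

In the setting the abstract describes, samples surviving to the final level concentrate on trigger-carrying inputs, and an unusually high estimated probability flags a candidate target class as backdoored; the exact scoring, perturbation kernel, and purification steps follow the paper rather than this sketch.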