Thwarting finite difference adversarial attacks with output randomization

Haidar Khan; Dan Park; Azer Khan; Bülent Yener

25 Sept 2019 (modified: 22 Jun 2025) | ICLR 2020 Conference Blind Submission | Readers: Everyone

Keywords: black box adversarial attacks, adversarial examples, defense, deep learning

TL;DR: Black box adversarial attacks are rendered ineffective by simple randomization of neural network outputs.

Abstract: Adversarial inputs pose a critical problem for deep neural networks (DNNs). The problem is more severe in the "black box" setting, where an adversary only needs to repeatedly query a DNN to estimate the gradients required to create adversarial examples. Current defense techniques against attacks in this setting are not effective. In this paper, we therefore present a novel defense technique based on randomization applied to a DNN's output layer. While effective as a defense, this approach introduces a trade-off between accuracy and robustness. We show that for certain types of randomization, we can bound the probability of introducing errors by carefully setting distributional parameters. For the particular case of finite difference black box attacks, we quantify the error that the defense introduces into the finite difference estimate of the gradient. Lastly, we show empirically that the defense can thwart three adaptive black box adversarial attack algorithms.
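To make the finite difference setting concrete, the following is a minimal sketch, not the authors' implementation. The abstract does not state which randomization is used, so the sketch assumes zero-mean Gaussian noise added to a single output score. The toy logistic model, its weights, and the names `model_prob`, `randomized_prob`, and `fd_gradient` are all hypothetical. Under this assumption, independent noise of standard deviation `sigma` on each query perturbs a two-sided finite difference estimate with step `h` by noise of standard deviation `sigma / (sqrt(2) * h)` per coordinate, which can dominate the true gradient when `h` is small.

```python
import math
import random


def model_prob(x, w=(1.5, -2.0)):
    """Toy stand-in for a DNN: logistic score of one class (hypothetical weights)."""
    z = sum(wi * xi for wi, xi in zip(w, x))
    return 1.0 / (1.0 + math.exp(-z))


def randomized_prob(x, sigma, rng):
    """Defended output: clean score plus zero-mean Gaussian noise (an assumption)."""
    return model_prob(x) + rng.gauss(0.0, sigma)


def fd_gradient(query, x, h):
    """Two-sided finite difference estimate of the gradient of query at x."""
    grad = []
    for i in range(len(x)):
        x_plus = list(x)
        x_plus[i] += h
        x_minus = list(x)
        x_minus[i] -= h
        # Each coordinate costs two queries to the (possibly randomized) model.
        grad.append((query(x_plus) - query(x_minus)) / (2.0 * h))
    return grad


rng = random.Random(0)   # fixed seed so the run is repeatable
x0 = [0.2, 0.1]          # illustrative input
h = 1e-3                 # finite difference step
sigma = 1e-2             # output noise standard deviation

clean_grad = fd_gradient(model_prob, x0, h)
noisy_grad = fd_gradient(lambda x: randomized_prob(x, sigma, rng), x0, h)

# Standard deviation of the noise term (n1 - n2) / (2h) for independent n1, n2.
noise_std = sigma / (math.sqrt(2.0) * h)

print("clean estimate:", clean_grad)
print("noisy estimate:", noisy_grad)
print("per-coordinate noise std:", noise_std)
```

With these illustrative values the noise term has a standard deviation of about 7, while the true gradient components are below 1 in magnitude, so a single noisy estimate carries little usable signal. An attacker can average repeated queries to reduce the noise, at the cost of many more queries.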

Code: https://gofile.io/?c=Q7x8SX

Community Implementations: [2 code implementations](https://www.catalyzex.com/paper/thwarting-finite-difference-adversarial/code)
