Exploring the Space of Black-box Attacks on Deep Neural Networks

Anonymous

Nov 03, 2017, ICLR 2018 Conference Blind Submission
  • Abstract: Existing black-box attacks on deep neural networks (DNNs) have largely focused on transferability, where an adversarial instance generated for a locally trained model can "transfer" to attack other learning models. In this paper, we propose novel Gradient Estimation black-box attacks for adversaries with query access to the target model, which do not rely on transferability. We also propose strategies to reduce the number of queries required to generate each adversarial sample to a constant. An iterative variant of our attack achieves close to 100% adversarial success rates for both targeted and untargeted attacks on DNNs. We carry out extensive experiments for a thorough comparative evaluation of black-box attacks, and show that the proposed Gradient Estimation attacks outperform all transferability-based black-box attacks on both the MNIST and CIFAR-10 datasets, achieving adversarial success rates similar to white-box attacks. We also apply the Gradient Estimation attacks against a real-world content moderation classifier hosted by Clarifai. Furthermore, we evaluate black-box attacks against state-of-the-art defenses and show that the Gradient Estimation attacks remain highly effective even against these defenses.
  • TL;DR: Query-based black-box attacks on deep neural networks with adversarial success rates matching white-box attacks
  • Keywords: adversarial machine learning, black-box attacks
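To make the query-based idea from the abstract concrete, below is a minimal illustrative sketch of estimating input gradients by finite differences using only black-box (probability) query access, followed by a single FGSM-style perturbation step. The `query_probs` toy model, the two-sided difference scheme, and all parameter values (`delta`, `eps`) are assumptions for illustration only, not the paper's exact attack or its query-reduction strategies.

```python
# Hedged sketch: query-based gradient estimation for a black-box attack.
# Everything here (the toy model, delta, eps) is illustrative, not the
# paper's implementation.
import numpy as np

def query_probs(x):
    """Stand-in for the black-box model: a fixed random linear softmax classifier."""
    rng = np.random.RandomState(0)
    W = rng.randn(10, x.size)              # 10 classes, flattened input
    logits = W @ x.ravel()
    e = np.exp(logits - logits.max())
    return e / e.sum()

def cross_entropy(probs, label):
    return -np.log(probs[label] + 1e-12)

def estimate_gradient(x, label, delta=1e-3):
    """Two-sided finite-difference estimate of d(loss)/d(x) per input coordinate."""
    grad = np.zeros_like(x)
    flat, g = x.ravel(), grad.ravel()
    for i in range(flat.size):
        e_i = np.zeros_like(flat)
        e_i[i] = delta
        loss_plus = cross_entropy(query_probs((flat + e_i).reshape(x.shape)), label)
        loss_minus = cross_entropy(query_probs((flat - e_i).reshape(x.shape)), label)
        g[i] = (loss_plus - loss_minus) / (2 * delta)
    return grad

def untargeted_attack(x, label, eps=0.3):
    """One FGSM-style step in the direction of the *estimated* gradient sign."""
    grad = estimate_gradient(x, label)
    return np.clip(x + eps * np.sign(grad), 0.0, 1.0)

if __name__ == "__main__":
    x = np.random.rand(8, 8)                         # toy 8x8 "image" in [0, 1]
    label = int(np.argmax(query_probs(x)))           # current predicted class
    x_adv = untargeted_attack(x, label)
    print("prediction before:", label,
          "after:", int(np.argmax(query_probs(x_adv))))
```

As written, the sketch spends two queries per input coordinate; the paper's query-reduction strategies aim to bring the number of queries per adversarial sample down to a constant rather than growing with input dimensionality.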
