Lower Bounds on the Robustness of Fixed Feature Extractors to Test-time Adversaries

Published: 28 Jan 2022, Last Modified: 13 Feb 2023 · ICLR 2022 Submitted
Keywords: robustness, lower bounds
Abstract: Understanding the robustness of machine learning models to adversarial examples generated by test-time adversaries is a problem of great interest. Recent theoretical work has derived lower bounds on how robust \emph{any model} can be when a data distribution and attacker constraints are specified. However, these bounds apply only to arbitrary classification functions and do not account for specific architectures and models used in practice, such as neural networks. In this paper, we develop a methodology for analyzing the robustness of fixed feature extractors, which in turn yields bounds on the robustness of any classifier trained on top of them. In other words, our bounds indicate how robust the representation obtained from a given extractor is with respect to a given adversary. The bounds hold for arbitrary feature extractors, and their tightness depends on the effectiveness of the method used to find collisions between pairs of perturbed examples at deeper layers. For linear feature extractors, we provide closed-form expressions for collision finding, while for arbitrary feature extractors we propose a bespoke algorithm, based on the iterative solution of a convex program, that provably finds collisions. We use our bounds to identify the layers of robustly trained models that contribute most to the lack of robustness, and to compare the same layer across different training methods for a quantitative comparison of their relative robustness. Our experiments establish that each of the following leads to a measurable drop in robustness: i) layers that linearly reduce dimension, ii) sparsity induced by ReLU activations, and iii) mismatches between the attacker constraints at train and test time. These findings point towards design considerations for future robust models that arise from our methodology.
One-sentence Summary: Analytical method to determine the robustness of fixed feature extractors to adversarial examples
Supplementary Material: zip
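
To make the collision-finding idea concrete, below is a minimal sketch of the linear case posed as a convex feasibility program. A collision is a pair of perturbed inputs, one inside the attacker's budget around each of two clean examples, that map to identical features; once the features collide, no classifier trained on top of the extractor can separate the pair, which is what drives the lower bound. The function name find_linear_collision, the ℓ∞ budget eps, and the cvxpy formulation are illustrative assumptions; the paper itself reports closed-form expressions for this case.

```python
import cvxpy as cp
import numpy as np

def find_linear_collision(W, x1, x2, eps):
    """Attempt to find a collision for the linear feature extractor f(x) = W x.

    Searches for perturbed inputs x1p, x2p with ||xip - xi||_inf <= eps
    and W x1p == W x2p. If such a pair exists, no classifier trained on
    top of W can distinguish the two perturbed examples, so the pair
    contributes to a lower bound on robust error.

    Illustrative convex-program sketch; the paper derives closed-form
    expressions for this linear case.
    """
    d = x1.shape[0]
    x1p = cp.Variable(d)
    x2p = cp.Variable(d)
    constraints = [
        cp.norm(x1p - x1, "inf") <= eps,  # stay within the attack budget
        cp.norm(x2p - x2, "inf") <= eps,
        W @ x1p == W @ x2p,               # features collide exactly
    ]
    # Pure feasibility problem: any point satisfying the constraints
    # is a valid collision.
    prob = cp.Problem(cp.Minimize(0), constraints)
    prob.solve()
    if prob.status in (cp.OPTIMAL, cp.OPTIMAL_INACCURATE):
        return x1p.value, x2p.value
    return None  # no collision within this budget

# Example: a dimension-reducing linear layer (10 -> 3) has a
# 7-dimensional null space, so collisions exist at modest budgets.
rng = np.random.default_rng(0)
W = rng.standard_normal((3, 10))
x1, x2 = rng.standard_normal(10), rng.standard_normal(10)
print(find_linear_collision(W, x1, x2, eps=2.0))
```

For non-linear feature extractors, the iterative convex-programming algorithm described in the abstract would take the place of this one-shot solve.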