Evaluating the Adversarial Robustness of CNNs Layer by Layer

Published: 23 Feb 2026, Last Modified: 23 Feb 2026. Accepted by TMLR. License: CC BY 4.0
Abstract: To measure the adversarial robustness of a feature extractor, Bhagoji et al. introduced a distance on example spaces that measures the minimum perturbation of a pair of examples needed to produce identical feature extractor outputs. They related these distances to the best possible robust accuracy of any classifier that uses the feature extractor. Viewing the initial layers of a neural network as a feature extractor gives a method for attributing the adversarial vulnerability of the classifier as a whole to individual layers. However, this framework treats any injective feature extractor as perfectly robust, because any poor choice of feature representation can be undone by later layers. The framework therefore attributes all adversarial vulnerability to the layers that perform dimensionality reduction. Feature spaces at intermediate layers of convolutional neural networks are generally much larger than input spaces, so this methodology provides no information about the contributions of individual layers to the overall robustness of the network. We extend the framework to evaluate feature extractors with high-dimensional output spaces by composing them with a random linear projection to a lower-dimensional space. This yields non-trivial information about the quality of the feature space representations for building an adversarially robust classifier.
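The following toy sketch illustrates the idea in the abstract for a purely linear feature extractor, where the collision distance has a closed form. It is not the paper's procedure for CNN layers. It assumes an l2 perturbation budget shared equally by both examples, and the function name `collision_distance_linear`, the dimensions, and the Gaussian matrices are hypothetical choices made for illustration.

```python
import numpy as np

def collision_distance_linear(A, x1, x2):
    """Smallest l2 budget eps such that perturbing x1 and x2 by at most eps
    each can make their images under the linear map A coincide.
    (Toy closed form for linear maps only; an assumption of this sketch.)"""
    d = x1 - x2
    # Only the component of d in the row space of A must be cancelled;
    # the remainder already lies in the null space of A.
    d_row = np.linalg.pinv(A) @ (A @ d)
    # Each example covers half of the required displacement.
    return 0.5 * np.linalg.norm(d_row)

rng = np.random.default_rng(0)
n, m, k = 8, 32, 4  # input, feature, and projected dimensions (hypothetical)

A = rng.standard_normal((m, n))               # injective almost surely, since m > n
R = rng.standard_normal((k, m)) / np.sqrt(k)  # random linear projection to k < n dims
x1, x2 = rng.standard_normal(n), rng.standard_normal(n)

dist_input = 0.5 * np.linalg.norm(x1 - x2)              # examples must meet in input space
dist_full = collision_distance_linear(A, x1, x2)        # injective extractor
dist_proj = collision_distance_linear(R @ A, x1, x2)    # extractor followed by projection

print(dist_input, dist_full, dist_proj)
```

In this linear setting the injective map gives the same distance as the input space itself, which is the sense in which the original framework is uninformative, while the projected map gives a strictly smaller and therefore informative value.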
Submission Length: Regular submission (no more than 12 pages of main content)
Changes Since Last Submission: Mathematical Refinement; Presentation Improvements
Assigned Action Editor: ~Venkatesh_Babu_Radhakrishnan2
Submission Number: 5975