Standard adversarial attacks only fool the final layer

Published: 10 Oct 2024, Last Modified: 09 Nov 2024 · SciForDL Poster · CC BY 4.0
TL;DR: An image of a dog adversarially attacked to look like a car still has dog-like features at its early and intermediate layers
Abstract: This paper presents a surprising empirical phenomenon in adversarial machine learning: standard adversarial attacks, while successful at fooling a neural network's final classification layer, fail to significantly alter the representations at early and intermediate layers. Through experiments on ResNet-152 models fine-tuned on CIFAR-10, we demonstrate that when an image is adversarially perturbed so that it is misclassified, its intermediate-layer representations remain largely faithful to the original class. Furthermore, we uncover a decoupling effect: attacks that target a specific intermediate layer have limited impact on classifications derived from other layers, both before and after the targeted one. These findings challenge the conventional understanding of how adversarial attacks operate and suggest that deep networks possess more robust internal representations by default than previously thought.
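The measurement described in the abstract can be illustrated with a short sketch: run a standard attack against the final classifier of a ResNet-152, then compare the activations an intermediate block produces for the clean and adversarial inputs. This is not the authors' code; the attack (PGD with typical hyperparameters), the choice of `layer2` as the intermediate block, the cosine-similarity readout, and the placeholder data are all assumptions for illustration only, and the paper's actual readout (e.g., trained probes on intermediate features) may differ.

```python
# Minimal sketch (not the paper's implementation): attack the final layer of a
# ResNet-152 with PGD, then check how much an intermediate representation moves.
import torch
import torch.nn.functional as F
from torchvision.models import resnet152

device = "cuda" if torch.cuda.is_available() else "cpu"
# Assumes CIFAR-10 fine-tuned weights would be loaded separately in practice.
model = resnet152(num_classes=10).to(device).eval()

# Capture the output of an intermediate block via a forward hook.
features = {}
def save_feature(name):
    def hook(module, inp, out):
        features[name] = out.detach()
    return hook
model.layer2.register_forward_hook(save_feature("layer2"))

def pgd_attack(x, y, eps=8 / 255, alpha=2 / 255, steps=10):
    """Standard untargeted PGD on the final-layer cross-entropy loss."""
    x_adv = x.clone().detach()
    for _ in range(steps):
        x_adv.requires_grad_(True)
        loss = F.cross_entropy(model(x_adv), y)
        grad = torch.autograd.grad(loss, x_adv)[0]
        x_adv = x_adv.detach() + alpha * grad.sign()
        x_adv = x + torch.clamp(x_adv - x, -eps, eps)  # project back into the eps-ball
        x_adv = x_adv.clamp(0, 1)
    return x_adv.detach()

# Placeholder batch standing in for CIFAR-10 images in [0, 1] and their labels.
x = torch.rand(4, 3, 32, 32, device=device)
y = torch.randint(0, 10, (4,), device=device)

model(x)
clean_feat = features["layer2"].flatten(1)

x_adv = pgd_attack(x, y)
model(x_adv)
adv_feat = features["layer2"].flatten(1)

# If the attack leaves intermediate representations largely intact, clean and
# adversarial features should remain highly similar even when the final
# prediction flips.
print(F.cosine_similarity(clean_feat, adv_feat).mean().item())
```

A layer-targeted variant of the same sketch would replace the cross-entropy loss on the final logits with a loss on the hooked intermediate features, which is the setup behind the decoupling effect the abstract describes.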
Submission Number: 82