Adversarially-robust probes for Deep Networks

Published: 29 Sept 2025 (Last Modified: 24 Oct 2025) · NeurIPS 2025 Reliable ML Workshop · CC BY 4.0
Keywords: Adversarial attacks, Deep Networks, Probes
TL;DR: We show that Deep Nets trained with standard methods have some adversarial robustness baked into their internal representations; we build probes to extract such robustness.
Abstract: Adversarial perturbations are strategic manipulations of an input, crafted by an adversary to cause a Deep Network to misclassify it. Since such perturbations are employed for malicious ends, defending against them has become an important research direction. Here, we consider whether the high-dimensional geometry of the internal representations of Deep Networks trained with standard methods can be used to derive predictions that are robust to adversarial perturbations directed at them. To this end, we design probes on layerwise representations whose parameters can be determined directly from the training data and/or adversarial versions thereof. We show, empirically, that such probes can have adversarial robustness significantly better than that of the base network, even though the probes and the base network share an identical initial substrate.
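The abstract does not specify the probe design. As a rough illustration of what a layerwise probe with parameters "directly determined from the training data" could look like, here is a minimal sketch assuming a nearest-class-mean probe over one hidden layer's activations; the probe form, `model`, `layer_name`, and the data loader are assumptions for illustration, not the authors' actual method.

```python
# Minimal sketch: a layerwise probe whose parameters are computed directly
# from training data (no gradient training of the probe itself).
# ASSUMPTION: a nearest-class-mean probe; the paper's probe may differ.
import torch


def collect_features(model, layer_name, loader, device="cpu"):
    """Run the frozen base network and record activations at one named layer."""
    feats, labels = [], []
    layer = dict(model.named_modules())[layer_name]  # hypothetical layer name
    handle = layer.register_forward_hook(
        lambda mod, inp, out: feats.append(out.flatten(1).detach().cpu())
    )
    model.eval()
    with torch.no_grad():
        for x, y in loader:
            model(x.to(device))
            labels.append(y)
    handle.remove()
    return torch.cat(feats), torch.cat(labels)


def fit_mean_probe(feats, labels, num_classes):
    """Probe parameters come straight from the data: per-class feature means."""
    return torch.stack(
        [feats[labels == c].mean(dim=0) for c in range(num_classes)]
    )


def probe_predict(feats, class_means):
    """Classify each input by its nearest class mean in representation space."""
    dists = torch.cdist(feats, class_means)  # shape (N, num_classes)
    return dists.argmin(dim=1)
```

Under this reading, clean training data (and/or adversarial versions of it, as the abstract suggests) would be passed through `collect_features` to fit the class means, and the probe's prediction replaces the base network's final classifier head at test time.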
Submission Number: 205