Keywords: Adversarial detection, deep learning, layer inconsistency, robust defense, adaptive attacks.
TL;DR: We propose a lightweight, plug-in adversarial detection method that leverages internal layer inconsistencies in a DNN to detect adversarial examples without requiring adversarial data, complex data structures, or external models.
Abstract: Deep neural networks (DNNs) are highly susceptible to adversarial examples—subtle, imperceptible perturbations that can lead to incorrect predictions. While detection-based defenses offer a practical alternative to adversarial training, many existing methods depend on external models, complex architectures, or adversarial data, limiting their efficiency and generalizability. We introduce a lightweight, plug-in detection framework that leverages internal layer-wise inconsistencies within the target model itself, requiring only benign data for calibration. Our approach is grounded in the **A Few Large Shifts Assumption**, which posits that adversarial perturbations induce large, localized violations of *layer-wise Lipschitz continuity* in a small subset of layers. Building on this, we propose two complementary strategies—**Recovery Testing (RT)** and **Logit-layer Testing (LT)**—to empirically measure these violations and expose internal disruptions caused by adversaries. Evaluated on CIFAR-10, CIFAR-100, and ImageNet under both standard and adaptive threat models, our method achieves state-of-the-art detection performance with negligible computational overhead. Furthermore, our system-level analysis provides a practical method for selecting a detection threshold with a formal lower-bound guarantee on accuracy.
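The abstract frames detection around violations of layer-wise Lipschitz continuity, i.e., a small number of layers whose feature shifts are disproportionately large relative to the input perturbation. The snippet below is a minimal, hypothetical sketch of how such per-layer sensitivity could be probed in PyTorch; the class and parameter names (`LayerShiftProbe`, `layer_shift_ratios`, `eps`, `n_probes`) are illustrative assumptions and do not reproduce the paper's RT or LT procedures.

```python
# Hypothetical sketch: probe per-layer sensitivity of a DNN to small input
# perturbations as a rough proxy for layer-wise Lipschitz behavior.
# Assumes a PyTorch model with 4D image inputs and named intermediate modules.
import torch
import torch.nn as nn


class LayerShiftProbe:
    """Records intermediate activations and measures per-layer feature shifts."""

    def __init__(self, model: nn.Module, layer_names):
        self.model = model.eval()
        self.layer_names = list(layer_names)
        self._acts = {}
        modules = dict(model.named_modules())
        for name in self.layer_names:
            modules[name].register_forward_hook(self._make_hook(name))

    def _make_hook(self, name):
        def hook(_module, _inp, out):
            # Flatten each sample's activation to a vector for norm computations.
            self._acts[name] = out.detach().flatten(1)
        return hook

    @torch.no_grad()
    def layer_shift_ratios(self, x, eps=1e-2, n_probes=4):
        """Average ratio ||f_l(x + d) - f_l(x)|| / ||d|| per layer, over random probes d."""
        self.model(x)
        base = {k: v.clone() for k, v in self._acts.items()}
        ratios = torch.zeros(len(self.layer_names), device=x.device)
        for _ in range(n_probes):
            u = torch.randn_like(x)
            # Normalize the probe so each sample's perturbation has L2 norm eps.
            delta = eps * u / u.flatten(1).norm(dim=1).view(-1, 1, 1, 1)
            self.model(x + delta)
            for i, name in enumerate(self.layer_names):
                shift = (self._acts[name] - base[name]).norm(dim=1).mean()
                ratios[i] += shift / eps
        return ratios / n_probes
```

Under this kind of probe, one could calibrate per-layer statistics on benign data only and flag inputs whose largest few layer ratios exceed a calibrated threshold, in the spirit of the "few large shifts" framing above; the actual RT/LT tests and threshold selection are described in the paper itself.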
Supplementary Material: zip
Primary Area: unsupervised, self-supervised, semi-supervised, and supervised representation learning
Submission Number: 8211