A Few Large Shifts: Layer-Inconsistency Based Minimal Overhead Adversarial Example Detection

30 Mar 2026 (modified: 13 May 2026) · Under review for TMLR · CC BY 4.0
Abstract: Deep neural networks (DNNs) are highly susceptible to adversarial examples: subtle, imperceptible perturbations that can lead to incorrect predictions. While detection-based defenses offer a practical alternative to adversarial training, many existing methods depend on external models, complex architectures, or adversarial data, limiting their efficiency and generalizability. We introduce a lightweight, plug-in detection framework that leverages internal layer-wise inconsistencies within the target model itself, requiring only benign data for calibration. Our approach is grounded in the **A Few Large Shifts Assumption**, an empirical hypothesis that adversarial perturbations often induce large, localized violations of *layer-wise Lipschitz continuity* in a small subset of adjacent layer transitions. Building on this, we propose two complementary strategies, **Recovery Testing (RT)** and **Logit-layer Testing (LT)**, to measure intermediate-layer and logit-layer inconsistencies, and fuse them into a combined detector, RLT. Evaluated on CIFAR-10, CIFAR-100, and ImageNet under standard and adaptive threat models, our method achieves strong detection performance with substantially lower overhead than detector families that require external encoders or reference-set retrieval. Furthermore, our system-level analysis yields a practical threshold-selection rule with a lower bound on system accuracy that holds under the stated metric assumptions.
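To make the abstract's core idea concrete, below is a minimal, hypothetical sketch of layer-wise inconsistency detection. It is not the paper's actual RT/LT procedure; it only illustrates the general recipe the abstract suggests: calibrate per-transition shift statistics on benign data alone, then flag inputs whose largest adjacent-layer shift deviates strongly from those statistics. All helper names, the norm-ratio shift measure, and the threshold value are assumptions made for illustration.

```python
# Hypothetical sketch of layer-wise inconsistency scoring (not the paper's RT/LT).
import torch
import torch.nn as nn


def layerwise_shifts(blocks: nn.ModuleList, x: torch.Tensor) -> torch.Tensor:
    """Relative feature-norm change across each adjacent pair of layer blocks."""
    feats, h = [], x
    with torch.no_grad():
        for block in blocks:
            h = block(h)
            feats.append(h.flatten(start_dim=1))
    shifts = []
    for prev, curr in zip(feats[:-1], feats[1:]):
        # Log norm ratio as a crude proxy for the "shift" induced by one transition.
        ratio = curr.norm(dim=1) / prev.norm(dim=1).clamp_min(1e-8)
        shifts.append(ratio.log().abs())
    return torch.stack(shifts, dim=1)  # shape: (batch, num_transitions)


def calibrate(blocks: nn.ModuleList, benign_loader):
    """Estimate per-transition mean/std of shifts using benign data only."""
    s = torch.cat([layerwise_shifts(blocks, x) for x, _ in benign_loader], dim=0)
    return s.mean(dim=0), s.std(dim=0).clamp_min(1e-8)


def detect(blocks: nn.ModuleList, x: torch.Tensor, mu, sigma, z_thresh: float = 4.0):
    """Flag inputs whose largest per-transition z-score exceeds z_thresh (assumed value)."""
    z = (layerwise_shifts(blocks, x) - mu) / sigma
    return z.abs().max(dim=1).values > z_thresh
```

In this sketch, `blocks` is the target model split into sequential stages, so detection reuses the model's own forward pass and needs no external encoder or reference set, matching the low-overhead, benign-calibration setting described in the abstract.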
Submission Type: Regular submission (no more than 12 pages of main content)
Assigned Action Editor: ~Hanwang_Zhang3
Submission Number: 8178