PatchSAGE: A Probe-Based Detector Using Saliency Alignment, Gradients, and Layer Sensitivity

ICLR 2026 Conference Submission 20799 Authors

19 Sept 2025 (modified: 08 Oct 2025), ICLR 2026 Conference Submission, CC BY 4.0
Keywords: adversarial attacks, adversarial defenses, adversarial patch detection, explainable ai defense, adversarial robustness, brain-inspired ai
TL;DR: We propose a model-agnostic adversarial patch detector that aligns machine focus with human visual saliency, achieving state-of-the-art robustness with high precision and explainability.
Abstract: Adversarial patches cause targeted misclassification by steering a model’s evidence toward a small, visible region while human perception remains largely unaffected. We propose PatchSAGE, a post hoc, model-agnostic detector that attaches lightweight probes to a frozen classifier and fuses three complementary signals: (i) input-gradient statistics of the predicted class, (ii) layer-wise sensitivity to small activation noise, and (iii) human–model saliency alignment, quantified by comparing Grad-CAM with human saliency maps. Features from these probes are fed to a small secondary classifier (detector) that predicts whether an input is patched. To our knowledge, PatchSAGE is the first adversarial-patch detector to explicitly incorporate human attention modeling via saliency alignment, aligning what the model relies on with where humans look, without modifying or retraining the base model. Across CAT2000, FIGRIM, and SALICON, using ResNet-50 and EfficientNet-B0 backbones, PatchSAGE achieves F1 scores up to 99.6% and remains in the 85–99% range across settings, outperforming probing baselines, SentiNet, and X-Detect in our setting. Ablations show monotonic gains from adding gradients and alignment to sensitivity, indicating complementary cues and highlighting alignment’s discriminative power. PatchSAGE is simple to deploy (post hoc; no retraining) and provides interpretable rationales via its saliency and sensitivity components, suggesting a practical path to robust, explainable detection of adversarial patches.
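To make the three probe signals concrete, below is a minimal sketch of how they could be computed for a frozen classifier. This is not the authors' code: the function name probe_features, the choice of noise_std=0.05, injecting noise only at layer4 (the paper probes sensitivity layer-wise), KL divergence as the sensitivity score, and a Pearson-style correlation as the Grad-CAM/human-saliency alignment score are all illustrative assumptions.

import torch
import torch.nn.functional as F
from torchvision.models import resnet50, ResNet50_Weights

# Frozen base classifier: PatchSAGE never modifies or retrains it.
model = resnet50(weights=ResNet50_Weights.DEFAULT).eval()
for p in model.parameters():
    p.requires_grad_(False)

def probe_features(x, human_sal, noise_std=0.05):
    """x: (1, 3, H, W) normalized image; human_sal: (h, w) human saliency map."""
    # (i) input-gradient statistics of the predicted class
    x_in = x.clone().requires_grad_(True)
    logits = model(x_in)
    cls = int(logits.argmax(1))
    logits[0, cls].backward()
    g = x_in.grad.abs()
    grad_stats = torch.stack([g.mean(), g.std(), g.max()])

    # (ii) sensitivity of the output distribution to small activation noise
    def noisy(module, inp, out):           # forward hook: perturb activations
        return out + noise_std * out.std() * torch.randn_like(out)
    with torch.no_grad():
        clean = F.softmax(model(x), dim=1)
        handle = model.layer4.register_forward_hook(noisy)
        pert = F.softmax(model(x), dim=1)
        handle.remove()
    sensitivity = F.kl_div(pert.log(), clean, reduction="sum").reshape(1)

    # (iii) Grad-CAM on the predicted class, compared with human saliency
    acts, grads = {}, {}
    h1 = model.layer4.register_forward_hook(lambda m, i, o: acts.update(a=o))
    h2 = model.layer4.register_full_backward_hook(
        lambda m, gi, go: grads.update(g=go[0]))
    logits = model(x.clone().requires_grad_(True))
    logits[0, int(logits.argmax(1))].backward()
    h1.remove(); h2.remove()
    w = grads["g"].mean(dim=(2, 3), keepdim=True)        # channel weights
    cam = F.relu((w * acts["a"]).sum(1, keepdim=True))   # (1, 1, h', w')
    cam = F.interpolate(cam, size=human_sal.shape, mode="bilinear",
                        align_corners=False)[0, 0]
    cam = (cam - cam.mean()) / (cam.std() + 1e-8)        # z-score both maps
    hs = (human_sal - human_sal.mean()) / (human_sal.std() + 1e-8)
    alignment = (cam * hs).mean().reshape(1)             # Pearson-style score

    return torch.cat([grad_stats, sensitivity, alignment]).detach()

In this sketch, the resulting five-dimensional feature vector would be stacked over a labeled set of clean and patched images and passed to the small secondary detector, e.g. sklearn.linear_model.LogisticRegression().fit(features, labels); a patched input should show large, concentrated input gradients, an output distribution that is unusually stable or unstable under activation noise, and a Grad-CAM map that correlates poorly with where humans actually look.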
Supplementary Material: pdf
Primary Area: applications to computer vision, audio, language, and other modalities
Submission Number: 20799