Prediction Inconsistency Helps Generalizable Detection of Adversarial Examples

ICLR 2026 Conference Submission 16739 Authors (anonymous)

19 Sept 2025 (modified: 08 Oct 2025) · ICLR 2026 Conference Submission · CC BY 4.0
Keywords: adversarial examples, adversarial detection, black-box detection
TL;DR: This work proposes PID, a lightweight and generalizable framework that detects adversarial examples (AEs) by capturing the prediction inconsistency between the primal and auxiliary models.
Abstract: A common way to defend models against adversarial examples (AEs) is to detect them based on properties that distinguish them from normal examples (NEs). However, current detection methods often generalize poorly across model types or attack algorithms. In this work, we observe that an auxiliary model, with a different training strategy or architecture from the (target) primal model, tends to predict \textit{differently} on the primal model's AEs but \textit{similarly} on NEs. Motivated by this observation, we propose Prediction Inconsistency Detection (PID), which simply leverages this prediction inconsistency between the two models, without training any detector. Experiments on CIFAR-10 and ImageNet demonstrate the superiority of PID over 5 state-of-the-art detection methods. Specifically, PID achieves an improvement of 4.70\%$\sim$8.44\%, whether the primal model is naturally or adversarially trained, and across 3 white-box, 3 black-box, and 1 mixed attack algorithms. We also show that pairing a naturally trained primal model with an adversarially trained auxiliary model in PID yields a high AUC of 91.92\% (84.43\%) against strong, adaptive attacks on CIFAR-10 (ImageNet).
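
To make the abstract's idea concrete, below is a minimal PyTorch sketch of a prediction-inconsistency score between a primal and an auxiliary model. The function names, the total-variation distance used as the inconsistency measure, and the validation-chosen threshold are all assumptions for illustration; the paper's actual scoring rule and decision procedure may differ.

```python
import torch
import torch.nn.functional as F

def prediction_inconsistency_score(x, primal_model, auxiliary_model):
    """Score a batch of inputs by how much the two models' predictions disagree.

    Higher scores suggest AEs crafted against the primal model, since an auxiliary
    model with a different training strategy or architecture tends to predict
    differently on the primal model's AEs but similarly on NEs.
    """
    primal_model.eval()
    auxiliary_model.eval()
    with torch.no_grad():
        p_primal = F.softmax(primal_model(x), dim=1)  # (B, num_classes)
        p_aux = F.softmax(auxiliary_model(x), dim=1)  # (B, num_classes)
    # One simple inconsistency measure: total variation distance between the two
    # predictive distributions (assumed here; the paper's score may differ).
    return 0.5 * (p_primal - p_aux).abs().sum(dim=1)  # (B,)

def detect_adversarial(x, primal_model, auxiliary_model, threshold=0.5):
    """Flag inputs whose inconsistency score exceeds a threshold chosen on held-out NEs."""
    return prediction_inconsistency_score(x, primal_model, auxiliary_model) > threshold
```

In this sketch no detector is trained: detection reduces to comparing two forward passes, which is what makes the approach lightweight and, per the abstract, generalizable across model types and attack algorithms.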
Supplementary Material: zip
Primary Area: alignment, fairness, safety, privacy, and societal considerations
Submission Number: 16739