Energy-Shaped Manifold Projections Enable Adversarial Detection

Published: 29 Sept 2025, Last Modified: 12 Oct 2025 · NeurIPS 2025 Reliable ML Workshop · CC BY 4.0
Keywords: adversarial, adversarial attack, distribution shift, energy based, out of distribution, manifold projection, robustness, machine learning, deep learning, fast gradient sign method, fgsm, projected gradient descent, pgd, cifar
Abstract: Adversarial attacks and distribution shift undermine the reliability of deep classifiers. We revisit energy‑based out‑of‑distribution (OOD) detection and propose a simple projection head that maps representations onto a learned data manifold and uses the squared norm of the projected vector as an energy score. Training proceeds in parallel: a classification loss on the classification head and a soft energy‑separation loss on the projection head that pushes adversarial examples to high energy while keeping clean examples at low energy. On a CIFAR-10 (Krizhevsky [2009]) variant with a held‑out 10th class acting as OOD, our method detects both fast gradient sign (FGSM) and projected gradient descent (PGD) adversarial examples even when the classifier remains non‑robust. We study design choices, including hinge versus softplus energy losses, regularization of the projected vector, and the choice of normalization layer to align train- and test-time statistics. Although the energy separation transfers across attacks, we find little OOD rejection of unrelated images and highlight the corresponding failure modes. Our work provides a critical analysis of energy‑shaped projections and lays out open problems and directions for future research.
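A minimal PyTorch-style sketch of the setup described in the abstract: a projection head whose squared output norm serves as the energy score, trained alongside the classifier with a soft (softplus) energy-separation loss. The names `ProjectionHead` and `soft_energy_separation`, the MLP architecture, and the margin values `m_in`/`m_out` are illustrative assumptions, not details taken from the paper.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class ProjectionHead(nn.Module):
    """Maps backbone features onto a learned manifold; the squared norm of the
    projected vector is the energy score (low energy = clean / on-manifold).
    The two-layer MLP here is an assumed architecture for illustration."""

    def __init__(self, feat_dim: int, proj_dim: int = 128):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(feat_dim, proj_dim),
            nn.ReLU(),
            nn.Linear(proj_dim, proj_dim),
        )

    def forward(self, features: torch.Tensor) -> torch.Tensor:
        z = self.net(features)      # projected vector
        return z.pow(2).sum(dim=1)  # energy = ||z||^2


def soft_energy_separation(e_clean: torch.Tensor,
                           e_adv: torch.Tensor,
                           m_in: float = 1.0,
                           m_out: float = 5.0) -> torch.Tensor:
    """Softplus variant of the separation loss: penalize clean energies above
    m_in and adversarial energies below m_out. The hinge variant discussed in
    the abstract would replace F.softplus with F.relu."""
    return F.softplus(e_clean - m_in).mean() + F.softplus(m_out - e_adv).mean()


# Illustrative joint objective (names assumed): cross-entropy on the
# classification head plus weighted energy separation on the projection head.
# loss = F.cross_entropy(logits_clean, labels) \
#        + lam * soft_energy_separation(e_clean, e_adv)
```

The margins decouple the two goals: clean examples only need energy below `m_in`, adversarial examples only above `m_out`, so neither term dominates once its side of the gap is satisfied.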
Submission Number: 206