Toward Robust Learning via Core Feature-Aware Adversarial Training

Published: 01 Jan 2025 | Last Modified: 14 Jul 2025 | IEEE Trans. Inf. Forensics Secur. 2025 | License: CC BY-SA 4.0
Abstract: Deep neural networks (DNNs) are inherently vulnerable to adversarial examples (AEs), which severely degrade model performance on various tasks. Adversarial training (AT) is one of the most effective approaches to enhancing model robustness by incorporating AEs into the training process. Despite the efficacy of AT, recent studies have revealed that adversarial perturbations predominantly impact core features, which are essential for accurate predictions, more than spurious features, which are incidentally correlated with training labels but irrelevant to classification. This unequal impact induces AT-trained models to rely excessively on spurious features, resulting in a pronounced feature shift that compromises robustness and generalization against AEs at inference. In this work, we introduce a novel Core Feature-aware Adversarial Training (CoFAT) framework to address these challenges. CoFAT employs core feature extraction to dynamically generate core partners by retaining high-weight regions of benign samples' feature maps while masking low-weight ones, thereby ensuring the model focuses on core features. Furthermore, contrastive feature alignment is proposed to reduce intra-class feature distances and increase inter-class separability by maintaining a center bank of class feature representations, thus mitigating reliance on spurious features. Compared to state-of-the-art AT methods, CoFAT achieves superior performance against diverse adversarial attacks. Notably, CoFAT improves the robustness of ResNet-18 against AutoAttack on CIFAR-10, SVHN, CIFAR-100, and Tiny ImageNet by approximately 2.14%, 3.20%, 1.69%, and 1.86%, respectively, marking a significant advance in AT. Our code is publicly available at https://github.com/Feng-peng-Li/CoFAT
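The abstract describes core feature extraction only at a high level. As an illustration of the general idea of keeping high-weight feature-map regions of a benign sample while masking low-weight ones, here is a minimal PyTorch-style sketch. The CAM-style importance heuristic and all names (make_core_partner, keep_ratio, etc.) are assumptions for illustration, not the authors' implementation; see the linked repository for the actual method.

```python
import torch
import torch.nn.functional as F

def make_core_partner(x_benign, feat_map, fc_weight, labels, keep_ratio=0.5):
    """Illustrative sketch of core-partner construction (hypothetical).

    x_benign:  (B, 3, H, W) clean images
    feat_map:  (B, C, h, w) backbone feature maps for x_benign
    fc_weight: (num_classes, C) final linear-layer weights
    labels:    (B,) ground-truth labels
    """
    # Class-activation-style importance: weight channels by the true class's
    # classifier weights, then sum over channels for a spatial weight map.
    w_c = fc_weight[labels]                                 # (B, C)
    cam = torch.einsum('bc,bchw->bhw', w_c, feat_map)       # (B, h, w)
    cam = cam - cam.amin(dim=(1, 2), keepdim=True)
    cam = cam / (cam.amax(dim=(1, 2), keepdim=True) + 1e-8)

    # Keep the top `keep_ratio` fraction of spatial locations (high-weight,
    # assumed core regions) and mask the rest (low-weight regions).
    thresh = torch.quantile(cam.flatten(1), 1.0 - keep_ratio, dim=1)
    mask = (cam >= thresh[:, None, None]).float()           # (B, h, w)

    # Upsample the mask to image resolution and apply it to the benign input.
    mask_img = F.interpolate(mask[:, None], size=x_benign.shape[-2:],
                             mode='nearest')
    return x_benign * mask_img
```

Under this reading, the resulting "core partner" is a masked copy of the benign image that exposes only the regions the classifier weights most heavily, which could then be paired with the AE during training.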
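Similarly, the abstract only names the contrastive feature alignment with a center bank. The sketch below shows one plausible reading under common conventions: an exponential-moving-average bank of per-class feature centers plus a temperature-scaled contrastive loss that pulls features toward their own class center and away from the others. Class names, the EMA update, and the loss form are assumptions, not the paper's definitive formulation.

```python
import torch
import torch.nn.functional as F

class CenterBank:
    """Illustrative EMA bank of per-class feature centers (hypothetical)."""

    def __init__(self, num_classes, feat_dim, momentum=0.99):
        self.centers = torch.zeros(num_classes, feat_dim)
        self.momentum = momentum

    @torch.no_grad()
    def update(self, feats, labels):
        # Exponential-moving-average update of each class center seen in batch.
        self.centers = self.centers.to(feats.device)
        for c in labels.unique():
            batch_mean = feats[labels == c].mean(dim=0)
            self.centers[c] = (self.momentum * self.centers[c]
                               + (1 - self.momentum) * batch_mean)

def contrastive_alignment_loss(feats, labels, centers, temperature=0.1):
    """Pull features to their own class center, push away from other centers."""
    feats = F.normalize(feats, dim=1)
    centers = F.normalize(centers.to(feats.device), dim=1)
    logits = feats @ centers.t() / temperature              # (B, num_classes)
    # Cross-entropy over centers: raises similarity to the true-class center
    # (small intra-class distance) relative to all other class centers
    # (large inter-class separability).
    return F.cross_entropy(logits, labels)
```

In a training loop, one would typically compute this loss on the features of AEs and/or core partners alongside the standard AT objective, then call CenterBank.update with the current batch features; the exact combination in CoFAT is specified in the paper, not here.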