Identifying and Understanding Cross-Class Features in Adversarial Training

Published: 01 May 2025 · Last Modified: 18 Jun 2025 · ICML 2025 poster · CC BY 4.0
TL;DR: We present a novel perspective on adversarial training (AT) by identifying the critical role of cross-class features in achieving robust generalization
Abstract: Adversarial training (AT) is widely considered one of the most effective methods for making deep neural networks robust against adversarial attacks, yet its training mechanisms and dynamics remain open research problems. In this paper, we present a novel perspective on studying AT through the lens of class-wise feature attribution. Specifically, we identify the impact on AT of a key family of features that are shared by multiple classes, which we call cross-class features. These features are typically useful for robust classification, as we illustrate with theoretical evidence on a synthetic data model. Through systematic studies across multiple model architectures and settings, we find that during the initial stage of AT, the model tends to learn more cross-class features, up to the checkpoint of best robustness. As AT further minimizes the robust training loss and robust overfitting sets in, the model tends to make decisions based on more class-specific features. Based on these findings, we provide a unified view of two known properties of AT, namely the advantage of soft-label training and robust overfitting. Overall, these insights refine the current understanding of AT mechanisms and provide new perspectives for studying them. Our code is available at https://github.com/PKU-ML/Cross-Class-Features-AT.
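To make the abstract's attribution lens concrete, here is a minimal sketch under a simplifying assumption: attribution is taken on a linear classification head, so the contribution of penultimate feature k to the logit of class c is W[c, k] * phi(x)[k], and a feature that contributes positively to many classes serves as a crude proxy for a cross-class feature. The function names and the positive-contribution rule are illustrative, not the paper's exact metric.

```python
import torch

@torch.no_grad()
def class_wise_attribution(features: torch.Tensor, weight: torch.Tensor) -> torch.Tensor:
    """Per-class, per-feature logit contributions.

    features: (batch, d) penultimate-layer activations phi(x).
    weight:   (num_classes, d) final linear-layer weight W.
    Returns:  (batch, num_classes, d) with entry [b, c, k] = W[c, k] * phi(x_b)[k].
    """
    return weight.unsqueeze(0) * features.unsqueeze(1)

@torch.no_grad()
def cross_class_fraction(features: torch.Tensor, weight: torch.Tensor) -> torch.Tensor:
    """For each feature, the fraction of classes it contributes positively to,
    averaged over the batch -- a rough proxy for how cross-class the feature is."""
    contrib = class_wise_attribution(features, weight)    # (batch, C, d)
    return (contrib > 0).float().mean(dim=1).mean(dim=0)  # (d,)
```

Under these assumptions, tracking this fraction across training checkpoints would make the abstract's claimed shift from cross-class to class-specific reliance measurable.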
Lay Summary: Deep neural networks are vulnerable to adversarial attacks, which are slightly perturbed inputs designed to fool models. Adversarial training (AT) is a leading method to improve model robustness, but its training mechanisms and dynamics are not fully understood. In this paper, we propose a new perspective to study AT by focusing on cross-class features, which are shared by multiple classes. We find that during the initial stage of AT, models tend to learn more cross-class features, but as training progresses, they rely more on class-specific features, leading to robust overfitting. We also show that soft-label training methods can help preserve cross-class features and thus mitigate overfitting. Our findings refine the current understanding of AT mechanisms and provide new insights for studying robust generalization. By identifying the role of cross-class features in AT, our work may inspire more reliable defense strategies against adversarial attacks, improving the safety and robustness of AI systems in critical applications like autonomous driving and cybersecurity.
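As one concrete instance of the soft-label training the summary mentions, the sketch below swaps the hard one-hot target in the outer adversarial-training step for a smoothed one. It assumes PyTorch >= 1.10 (which added the label_smoothing argument to F.cross_entropy); the step function and the smoothing value 0.1 are hypothetical choices, and the paper's soft-label methods may instead use schemes such as distillation.

```python
import torch
import torch.nn.functional as F

def soft_label_step(model, optimizer, x_adv, y, smoothing=0.1):
    """One outer-minimization step of adversarial training on pre-computed
    adversarial examples x_adv, using smoothed (soft) labels in place of
    hard one-hot targets."""
    optimizer.zero_grad()
    loss = F.cross_entropy(model(x_adv), y, label_smoothing=smoothing)
    loss.backward()
    optimizer.step()
    return loss.item()
```

Smoothing spreads probability mass over the non-target classes, which is one way soft labels can keep cross-class directions in play during training.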
Primary Area: Social Aspects->Robustness
Keywords: adversarial training, robust overfitting, feature attribution, training dynamics
Submission Number: 6293