Keywords: Robust fairness, adversarial robustness
Abstract: Disparities in class-wise robust accuracies frequently arise in adversarial training, where certain classes suffer significantly lower robustness than others, even when trained on balanced data. Prior work has identified this phenomenon and studied it under the name robust fairness, highlighting the challenge of ensuring equitable robustness across classes.
In this work, we investigate the root causes of such disparities and identify a strong correlation between the norms of head parameters (i.e., the last layer’s weights) and class-wise robust accuracies. Our theoretical and empirical analyses show that adversarial training tends to amplify these disparities by disproportionately affecting head norms, which in turn influence class-wise performance.
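A minimal sketch (not the authors' code) of the diagnostic described above: compute per-class head-weight norms and compare them with class-wise robust accuracies. The names `model.fc` and `robust_acc_per_class` are assumptions for illustration.

```python
import torch

def per_class_head_norms(fc: torch.nn.Linear) -> torch.Tensor:
    # Row i of the last layer's weight matrix is the head vector for class i;
    # its L2 norm is the quantity correlated with class-wise robustness.
    return fc.weight.detach().norm(dim=1)

# Hypothetical usage: correlate head norms with externally measured
# per-class robust accuracies (e.g., under a PGD attack).
# norms = per_class_head_norms(model.fc)
# robust_acc_per_class = torch.tensor([...])  # one value per class
# corr = torch.corrcoef(torch.stack([norms, robust_acc_per_class]))[0, 1]
```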
To address this, we propose a simple yet effective solution that mitigates these imbalances by directly fine-tuning the head parameters while keeping the feature extractor fixed. Unlike existing methods that rely on class reweighting or remargining strategies, our approach requires no validation set and introduces minimal computational overhead.
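A minimal sketch, under assumed names, of the head-only fine-tuning idea: freeze the feature extractor and update only the final linear layer. The paper's exact objective is not reproduced here; adversarial example generation (e.g., PGD) is stubbed out as a hypothetical `make_adversarial`.

```python
import torch

def finetune_head(model, loader, epochs=5, lr=1e-3, device="cpu"):
    # Freeze every parameter, then unfreeze only the head (assumed `model.fc`).
    for p in model.parameters():
        p.requires_grad_(False)
    for p in model.fc.parameters():
        p.requires_grad_(True)

    opt = torch.optim.SGD(model.fc.parameters(), lr=lr)
    loss_fn = torch.nn.CrossEntropyLoss()

    model.train()
    for _ in range(epochs):
        for x, y in loader:
            x, y = x.to(device), y.to(device)
            # x = make_adversarial(model, x, y)  # e.g., PGD; omitted in this sketch
            opt.zero_grad()
            loss = loss_fn(model(x), y)
            loss.backward()  # gradients flow only to the head parameters
            opt.step()
    return model
```

Because only the last layer is updated, no validation set is needed and the overhead is small relative to full adversarial retraining, consistent with the claim above.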
Experiments across various datasets and architectures demonstrate that our method significantly reduces disparities in class-wise robust accuracies without degrading overall performance, providing a practical and principled step toward improving robust fairness in adversarial learning.
Supplementary Material: zip
Primary Area: alignment, fairness, safety, privacy, and societal considerations
Submission Number: 14772