Removing Strong Attribute Bias from Neural Networks with Adversarial Filtering

TMLR Paper4813 Authors

09 May 2025 (modified: 25 Nov 2025), Rejected by TMLR, CC BY 4.0

Abstract: Ensuring that a neural network does not rely on protected attributes (e.g., race, sex, age) for prediction is crucial to advancing fair and trustworthy AI. While several promising methods for removing attribute bias in neural networks have been proposed, their limitations remain under-explored. In this work, we mathematically and empirically reveal the limitation of existing attribute bias removal methods in the presence of strong bias and propose a new method that can mitigate this limitation. Specifically, we first derive a general non-vacuous information-theoretic upper bound on the performance of any attribute bias removal method in terms of the bias strength, revealing that such methods are effective only when the inherent bias in the dataset is relatively weak. Inspired by this theoretical finding, we then propose a new method using an adversarial objective that directly filters out protected attributes in the input space while maximally preserving all other attributes, without requiring any specific target label. The proposed method achieves state-of-the-art performance in both strong and moderate bias settings. We provide extensive experiments on synthetic, image, and census datasets to verify the derived theoretical bound and its consequences in practice, and to evaluate the effectiveness of the proposed method in removing strong attribute bias.

Submission Length: Long submission (more than 12 pages of main content)

Assigned Action Editor: ~Novi_Quadrianto1

Submission Number: 4813
