Abstract: There have been numerous demonstrations that the prediction performance of machine learning algorithms differs among groups of people. This raises significant concerns about long-term social impact, including the perpetuation of disadvantages for certain populations. A common explanation is that disparity in performance (i.e., group unfairness) results from differences in group representation in datasets. Recent research has started to explore this explanation and has proposed methods to address group unfairness by modulating group representation. We establish that, contrary to conventional wisdom, there exists a fundamental tradeoff between representativeness and group fairness. First, we theoretically describe this tradeoff in a simple univariate setting and confirm our theoretical results empirically across several commonly used datasets. To analyze whether these observations hold in more realistic settings, we then model the process of constructing representative datasets from multiple data sources using a multi-armed bandit framework and a novel Bayesian approach. We find that realistic sampling techniques further nuance the relationship between dataset representativeness and fairness. Notably, we show how the theoretically sound solution of oversampling groups with lower performance may not hold for realistic multi-site data collection. Finally, we postulate that a key driver of unfairness is the extent to which labels are more challenging to predict for some groups than others. To validate this hypothesis, we show that greater model capacity can lead to improved group fairness independently of representation. In summary, we demonstrate that representativeness and group fairness may be at odds and that theoretically justified approaches to improving fairness may not hold under realistic conditions, and we propose a representation-independent method to improve algorithmic fairness.
Submission Length: Regular submission (no more than 12 pages of main content)
Assigned Action Editor: ~changjian_shui1
Submission Number: 4180