TL;DR: We propose a theoretical framework for analyzing fairness overfitting through an information-theoretic lens
Abstract: Despite substantial progress in promoting fairness in machine learning models deployed in high-stakes applications, existing methods typically modify the training process, for example through regularizers or other interventions, but lack formal guarantees that fairness achieved during training will generalize to unseen data. Although overfitting with respect to prediction performance has been extensively studied, overfitting in terms of fairness loss has received far less attention. This paper proposes a theoretical framework for analyzing fairness generalization error through an information-theoretic lens. Our novel bounding technique is based on the Efron–Stein inequality, which allows us to derive tight information-theoretic fairness generalization bounds in terms of both Mutual Information (MI) and Conditional Mutual Information (CMI). Our empirical results validate the tightness and practical relevance of these bounds across diverse fairness-aware learning algorithms. Our framework offers valuable insights to guide the design of algorithms that improve fairness generalization.
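To make the type of result concrete, here is an illustrative sketch, not the paper's theorem: the classical mutual-information generalization bound of Xu and Raginsky (2017) controls the expected gap between the population and empirical loss of a hypothesis $W$ learned from an i.i.d. sample $S = (Z_1, \dots, Z_n)$ via $I(W;S)$. The bounds described above play the analogous role with a fairness loss in place of the prediction loss $\ell$ and an Efron–Stein-based argument in place of the sub-Gaussian one. Assuming $\ell(w, Z)$ is $\sigma$-sub-Gaussian for every $w$:
\[
\bigl|\,\mathbb{E}\bigl[ L_{\mu}(W) - L_{S}(W) \bigr]\,\bigr|
\;\le\; \sqrt{\frac{2\sigma^{2}\, I(W;S)}{n}},
\qquad
L_{S}(w) = \frac{1}{n}\sum_{i=1}^{n} \ell(w, Z_i),
\quad
L_{\mu}(w) = \mathbb{E}_{Z \sim \mu}\bigl[\ell(w, Z)\bigr].
\]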
Lay Summary: As machine learning is increasingly used in high-stakes areas like hiring, healthcare, and lending, ensuring fairness is more important than ever. Many existing methods aim to make models fair during training, but here's the catch: does fairness on training data guarantee fairness in the real world, on unseen data? Our paper sheds light on this question and finds that models can exhibit "fairness overfitting," where fairness achieved during training does not carry over to new data. While generalization in terms of accuracy is well studied, fairness overfitting remains poorly understood. We address this gap by introducing a new theoretical framework that uses tools from information theory to measure fairness overfitting. We develop a novel mathematical proof technique that leads to tighter and more insightful bounds predicting how fairness on training data will generalize to unseen data. We test these bounds across several scenarios and find they work well in practice.
This work offers a deeper understanding and practical guidance for designing machine learning models that stay fair beyond the training data.
Primary Area: Social Aspects->Fairness
Keywords: information-theoretic bounds, fairness, generalization error bounds, fairness overfitting
Submission Number: 11625