Beyond Static Bias: Quantifying Fairness Variability in CheXpert

Published: 29 Sept 2025, Last Modified: 23 Oct 2025
Venue: NeurIPS 2025 - Reliable ML Workshop
License: CC BY 4.0
Keywords: demographic bias, variability, fairness gaps, reliability, medical imaging
TL;DR: Model fairness can be highly variable even when dataset bias is small and stable.
Abstract: Fairness in machine learning is typically assessed through static point-estimate metrics that overlook the robustness and reliability of model behavior under biased data. We introduce a statistical framework for analyzing the relationship between the variability of dataset bias and the variability of a model's fairness gaps. Using Monte Carlo simulation, we quantify bias in the CheXpert dataset and find that it is small yet highly stable, with near-zero variance across the dataset's five most common pathologies. Applying a mixed-effects model, we then examine how this stable bias relates to fairness variability across leaderboard models. We find that model fairness can fluctuate unpredictably even when dataset bias is small and stable, revealing a hidden robustness failure in fairness evaluations. Our results underscore the need to move beyond static fairness metrics toward evaluation methods that explicitly characterize robustness under subpopulation and distribution shifts, aligning with the broader goal of building reliable machine learning under imperfect data.
Submission Number: 155
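
The framework described in the abstract has two statistical steps. The first, Monte Carlo quantification of dataset bias, could be sketched as below: a demographic prevalence gap is bootstrapped to estimate its mean and variance. This is a minimal sketch, not the paper's exact protocol; the grouping column (`Sex`), the binary per-pathology label columns, and the gap definition are illustrative assumptions.

```python
import numpy as np
import pandas as pd

def bias_gap(df: pd.DataFrame, label_col: str, group_col: str = "Sex") -> float:
    """Gap in positive-label prevalence between demographic groups."""
    rates = df.groupby(group_col)[label_col].mean()
    return float(rates.max() - rates.min())

def monte_carlo_bias(df: pd.DataFrame, label_col: str,
                     n_draws: int = 1000, seed: int = 0):
    """Bootstrap the bias gap to estimate its Monte Carlo mean and variance."""
    rng = np.random.default_rng(seed)
    gaps = np.empty(n_draws)
    for i in range(n_draws):
        # Resample the dataset with replacement and recompute the gap.
        boot = df.sample(n=len(df), replace=True, random_state=rng)
        gaps[i] = bias_gap(boot, label_col)
    return gaps.mean(), gaps.var(ddof=1)

# Hypothetical usage on a CheXpert-style frame with binarized labels
# (uncertainty labels already resolved), one call per pathology:
# mean_gap, var_gap = monte_carlo_bias(chexpert_df, "Cardiomegaly")
```

A near-zero `var_gap` across pathologies would correspond to the "small but stable" bias the abstract reports.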
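The second step, the mixed-effects analysis, could be sketched with `statsmodels`: per-model fairness gaps are regressed on the dataset-bias estimates, with a random intercept per leaderboard model to absorb model-specific variation. The synthetic `runs` frame and its column names are assumptions for illustration only.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# Synthetic stand-in for the real leaderboard data: one row per
# (model, pathology) with a fairness gap and that pathology's bias estimate.
rng = np.random.default_rng(0)
runs = pd.DataFrame({
    "model_id": np.repeat([f"model_{i}" for i in range(10)], 5),
    "dataset_bias": np.tile(rng.uniform(0.0, 0.05, 5), 10),
    "fairness_gap": rng.normal(0.1, 0.05, 50),
})

# Random-intercept model: fairness_gap ~ dataset_bias, grouped by model.
mixed = smf.mixedlm("fairness_gap ~ dataset_bias", data=runs,
                    groups=runs["model_id"])
fit = mixed.fit()
print(fit.summary())  # fixed-effect slope plus between-model variance
```

A large between-model (group) variance alongside a stable `dataset_bias` term would reflect the paper's central finding: fairness fluctuates across models even when the dataset's bias does not.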