Keywords: Fairness, Explainability, Text Classification, Demographic Parity, Equalized Odds
Abstract: Standard fairness metrics indicate that bias exists in text classifiers but not *where* it manifests or *what* causes it.
We introduce Residual Distribution Fairness (RDF), a framework grounded in a theoretical insight: **demographic parity (DP) and equalized odds (EO) are functions of the residual distribution**, that is, the differences between a model's predicted scores and the true outcomes, sorted in increasing order.
We prove this connection formally and show that RDF analyzes this distribution directly, providing a rich diagnostic methodology that complements classical metrics DP and EO.
The sorted residual plot visualizes this distribution, revealing where bias manifests across the prediction space.
**Knee points** mark behavioral transitions where examples are particularly informative for diagnostic probing through counterfactual analysis.
We identify a **calibration-diagnosticity relationship**: knee regions concentrate prediction errors when calibration is reasonable (Expected Calibration Error $< 0.15$), providing practitioners with clear guidance on when knee-based analysis is most reliable.
Experiments on four NLP datasets validate this framework across calibration regimes.
For well-calibrated models, knee regions concentrate 2--22$\times$ higher prediction errors than non-knee regions ($p<0.001$), enabling targeted diagnostic analysis.
RDF-guided augmentation yields greater fairness improvements than random selection, though with variance that limits statistical confidence.
Null controls with random group assignments confirm the effect is genuine; poorly-calibrated datasets show no RDF advantage, as the theory predicts.
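The core objects in the abstract above, the sorted residual distribution and its knee points, can be sketched in a few lines. This is a minimal illustration, not the paper's implementation: it assumes residuals are computed as predicted probability minus binary label, and it uses a simple max-distance-to-chord knee heuristic as a stand-in for whatever knee-detection method RDF actually employs.

```python
import numpy as np

def sorted_residuals(y_prob, y_true):
    """Sorted residual distribution: predicted score minus true label."""
    return np.sort(np.asarray(y_prob, dtype=float) - np.asarray(y_true, dtype=float))

def knee_index(curve):
    """Heuristic knee point: the index of maximum vertical distance from the
    straight line joining the curve's endpoints (a stand-in, not necessarily
    the method used in the paper)."""
    n = len(curve)
    x = np.linspace(0.0, 1.0, n)
    span = curve[-1] - curve[0]
    if span == 0:                      # flat curve: no knee
        return 0
    y = (curve - curve[0]) / span      # normalize curve to [0, 1]
    return int(np.argmax(np.abs(y - x)))

# Toy example: ten predicted probabilities vs. binary labels.
y_prob = np.array([0.1, 0.2, 0.35, 0.4, 0.55, 0.6, 0.7, 0.8, 0.9, 0.95])
y_true = np.array([0,   0,   0,    1,   0,    1,   1,   1,   1,   1])

res = sorted_residuals(y_prob, y_true)
k = knee_index(res)
print(res)   # sorted residuals; large negatives are strong under-predictions
print(k)     # index of the heuristic knee point within the sorted residuals
```

In this toy setup, examples near index `k` sit at a behavioral transition in the residual curve; in the RDF framework, such examples would be the candidates for counterfactual probing.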
Paper Type: Long
Research Area: Interpretability and Analysis of Models for NLP
Research Area Keywords: model bias/fairness evaluation, probing, metrics
Contribution Types: Model analysis & interpretability
Languages Studied: English
Submission Number: 2545