Keywords: Fairness, Explainability, Text Classification, Demographic Parity, Equalized Odds
Abstract: Standard fairness metrics indicate that bias exists in text classifiers but not *where* it manifests or *what* causes it.
We introduce Residual Distribution Fairness (RDF), a framework grounded in a theoretical insight: **demographic parity (DP) and equalized odds (EO) are functions of the residual distribution**, that is, the differences between a model's predicted scores and the true outcomes, sorted in increasing order.
We prove this connection formally and show that RDF analyzes this distribution directly, providing a rich diagnostic methodology that complements classical metrics DP and EO.
The sorted residual plot visualizes this distribution, revealing where bias manifests across the prediction space.
**Knee points** mark behavioral transitions where examples are particularly informative for diagnostic probing through counterfactual analysis.
We identify a **calibration-diagnosticity relationship**: knee regions concentrate prediction errors when calibration is reasonable (Expected Calibration Error $< 0.15$), providing practitioners with clear guidance on when knee-based analysis is most reliable.
Experiments on four NLP datasets validate this framework across calibration regimes.
For well-calibrated models, knee regions concentrate 2--22$\times$ higher prediction errors than non-knee regions ($p<0.001$), enabling targeted diagnostic analysis.
RDF-guided augmentation yields greater fairness improvements than random selection, though with variance that limits statistical confidence.
Null controls with random group assignments confirm the effect is genuine; poorly-calibrated datasets show no RDF advantage, as the theory predicts.
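The core objects in the abstract above, the sorted residual distribution and its knee points, can be sketched in a few lines. This is a minimal illustration, not the paper's implementation: it assumes residuals are computed as predicted probability minus binary label, and it uses a simple max-distance-to-chord knee heuristic as a stand-in for whatever knee-detection method RDF actually employs.

```python
import numpy as np

def sorted_residuals(y_prob, y_true):
    """Sorted residual distribution: predicted score minus true label."""
    return np.sort(np.asarray(y_prob, dtype=float) - np.asarray(y_true, dtype=float))

def knee_index(curve):
    """Heuristic knee point: the index of maximum vertical distance from the
    straight line joining the curve's endpoints (a stand-in, not necessarily
    the method used in the paper)."""
    n = len(curve)
    x = np.linspace(0.0, 1.0, n)
    span = curve[-1] - curve[0]
    if span == 0:                      # flat curve: no knee
        return 0
    y = (curve - curve[0]) / span      # normalize curve to [0, 1]
    return int(np.argmax(np.abs(y - x)))

# Toy example: ten predicted probabilities vs. binary labels.
y_prob = np.array([0.1, 0.2, 0.35, 0.4, 0.55, 0.6, 0.7, 0.8, 0.9, 0.95])
y_true = np.array([0,   0,   0,    1,   0,    1,   1,   1,   1,   1])

res = sorted_residuals(y_prob, y_true)
k = knee_index(res)
print(res)   # sorted residuals; large negatives are strong under-predictions
print(k)     # index of the heuristic knee point within the sorted residuals
```

In this toy setup, examples near index `k` sit at a behavioral transition in the residual curve; in the RDF framework, such examples would be the candidates for counterfactual probing.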
Paper Type: Long
Research Area: Interpretability and Analysis of Models for NLP
Research Area Keywords: model bias/fairness evaluation, probing, metrics
Contribution Types: Model analysis & interpretability
Languages Studied: English
Submission Number: 2545